

An Exploration of AI Hardware Accelerators using HLS4ML











#### Giuseppe Di Guglielmo

Senior ASIC Engineer – Fermilab

Giuseppe Di Guglielmo is a Senior Engineer at Fermilab focused on system-level design and AI/ML hardware acceleration. He develops intelligent, ultra-low-latency detectors for harsh environments, including ML-enabled, radiation-resistant chips for the LHC and quantum hardware for cryogenic systems. With a Ph.D. in Computer Science and over a decade of experience in high-level synthesis for ASIC/FPGA design, he previously held research roles at Columbia University and Tokyo University. He is an active contributor to open-source projects like ESP and hls4ml.







**SPONSORED BY** 

## Inferencing Will Be Everywhere

#### AI can make embedded devices:

- More capable
- More secure
- Safer
- Faster

















Automotive



## Deploying AI in Edge Systems





#### **Hardware vs Software**

| Approach                                   | Pros                                  | Cons                                                                                      |
|--------------------------------------------|---------------------------------------|-------------------------------------------------------------------------------------------|
| Pure software implementation               | Very flexible; easy to update         | Performance and timing issues for real-time applications                                  |
| Software with generic hardware accelerator | Retains moderate flexibility          | Relies on standard hardware; power consumption and timing issues                          |
| Software with bespoke hardware accelerator | Very low power and predictable timing | Requires development of custom hardware; fixed to a limited set of network architectures |



#### **More and More Models**





Yolo v1 - v8



MobileNet

ResNet

#### Many Many More....

- DenseNet
- AlexNet
- EfficientNet
- SqueezeNet
- VGG
- Inception
- ResNeXt
- More and More.....

## Model Size of Best ImageNet Algorithm





#### Inference Execution



## **Complexity Drives Need for Customization**



## Drivers for ASIC Inferencing on the Edge

#### **Drivers to the edge:**

- Latency
- Security
- Privacy







#### **Drivers to ASIC:**

- Performance
- Efficiency







#### Inferencing on the Edge

- As AI algorithms get more complex, processors, software, and off-the-shelf accelerators will struggle to meet design requirements
- Technology trends are driving edge inferencing to be done on device
- Designing a bespoke accelerator can deliver the highest performance and efficiency
- High-Level Synthesis delivers the fastest path from machine learning framework to RTL







## What is High-Level Synthesis (HLS)?

Automated path from C/C++ or SystemC into technology optimized synthesizable RTL

C/C++ or SystemC

High-Level Synthesis



Synthesizable RTL



#### Generate Synthesizable RTL from C++

The C++ function `void add(int a, int b, int &dout) { dout = a + b; }` is synthesized into an RTL module (`add_core`) with the following characteristics:

- Clock and reset inputs (`clk`, `rst`) are added by HLS
- The 32-bit arguments `a` and `b` become input ports `a_rsc_dat[31:0]` and `b_rsc_dat[31:0]`; the sum is registered on the rising clock edge and driven on the output port `dout_rsc_dat[31:0]`
- The addition operator is optimized for a specific target technology or **FPGA** device
- Output is in either VHDL or Verilog

#### **Analysis of C++ Descriptions**

High-Level Synthesis analyzes the data dependencies between operations in the algorithm. This analysis produces a Data Flow Graph (DFG).

Each node on the DFG represents an operation in the algorithm

Connections between nodes represent data dependencies and indicate order of operations
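The dependency analysis can be illustrated with a small Python sketch. It assumes a hypothetical expression, `dout = (a + b) * (c + d)`, and the names `dfg` and `asap_levels` are made up for this illustration; they are not part of any HLS tool. Each node is an operation, each edge is a data dependency, and the computed level is the earliest step at which the operation could run.

```python
# Toy data flow graph for the hypothetical expression dout = (a + b) * (c + d).
# Each key is an operation; its list holds the operations whose results it consumes.
dfg = {
    "add1": [],                # a + b depends only on primary inputs
    "add2": [],                # c + d depends only on primary inputs
    "mul1": ["add1", "add2"],  # the product needs both sums first
}


def asap_levels(graph):
    """Return the earliest step at which each operation can execute."""
    levels = {}

    def level(node):
        if node not in levels:
            deps = graph[node]
            # An operation runs one step after its latest-finishing dependency.
            levels[node] = 0 if not deps else 1 + max(level(d) for d in deps)
        return levels[node]

    for node in graph:
        level(node)
    return levels


levels = asap_levels(dfg)
print(levels)
```

The two additions share level 0 because neither depends on the other, so they could execute in parallel; the multiplication must wait for both.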










#### **Parallelism**

Parallelism is introduced using loop transformations

Unrolling and pipelining

Unrolling drives parallelism

Pipelining also increases throughput and  $F_{\text{max}}$ 

```
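// Multiply-accumulate over four samples.
// data_t, coef_t, and accu_t are fixed-width types defined elsewhere.
data_t MAC (
   data_t data_in[4],
   coef_t coef_in[4]
) {
   accu_t acc = 0 ;
   for (int i=0;i<4;i++) {
      acc += data_in[i] * coef_in[i] ;
   }
   return acc ;
}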
```





## **Loop Unrolling**





Fully Unrolled - 4x

Loop unrolling provides a way to explore several micro-architectures for a given design

Loops can be fully or partially unrolled
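As a behavioral illustration only, the sketch below writes out the 4-tap MAC loop in rolled, partially unrolled (by 2), and fully unrolled form. It is plain Python, not HLS input, and the function names are hypothetical. In an HLS flow the unrolling would normally be requested through a constraint or directive rather than by rewriting the source; the point here is that all three forms compute the same result while exposing different numbers of multiplications per iteration.

```python
def mac_rolled(data_in, coef_in):
    """One multiply-accumulate per loop iteration (4 iterations)."""
    acc = 0
    for i in range(4):
        acc += data_in[i] * coef_in[i]
    return acc


def mac_unrolled_by_2(data_in, coef_in):
    """Two multiplications per loop iteration (2 iterations)."""
    acc = 0
    for i in range(0, 4, 2):
        acc += data_in[i] * coef_in[i] + data_in[i + 1] * coef_in[i + 1]
    return acc


def mac_fully_unrolled(d, c):
    """All four multiplications exposed at once; no loop remains."""
    return d[0] * c[0] + d[1] * c[1] + d[2] * c[2] + d[3] * c[3]


data = [1, 2, 3, 4]
coef = [5, 6, 7, 8]
results = (
    mac_rolled(data, coef),
    mac_unrolled_by_2(data, coef),
    mac_fully_unrolled(data, coef),
)
print(results)
```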



## **Loop Pipelining**

- A single stage pipeline, i.e. no pipelining, has no overlap between loop executions
- Results in data being written every 4 clock cycles
- With no overlap, the resources (the adder) can be shared between all C-Steps
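The effect of the initiation interval (II) on throughput can be estimated with a standard first-order formula: for `N` iterations, each taking `L` cycles, started every `II` cycles, the total is `(N - 1) * II + L` cycles. The sketch below assumes a 4-cycle iteration to match the "every 4 clock cycles" case above; the function name is hypothetical and the model ignores stalls and memory conflicts.

```python
def pipelined_cycles(iterations, latency, ii):
    """Total cycles for a loop whose iterations start every `ii` cycles.

    `latency` is the number of cycles one iteration needs from start to finish.
    """
    if iterations < 1 or latency < 1 or ii < 1:
        raise ValueError("iterations, latency, and ii must all be >= 1")
    return (iterations - 1) * ii + latency


# Four iterations, each 4 cycles long.
no_overlap = pipelined_cycles(4, latency=4, ii=4)  # II equal to latency: no overlap
ii_2 = pipelined_cycles(4, latency=4, ii=2)        # a new iteration every 2 cycles
ii_1 = pipelined_cycles(4, latency=4, ii=1)        # a new iteration every cycle
print(no_overlap, ii_2, ii_1)
```

A smaller II raises throughput but removes the idle cycles that let a single resource, such as the adder, be shared across steps.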





Pipelining with II=2




## **Pipelining or Loop Unrolling**

What is the optimal architecture? What makes the most sense for your design?

#### Considerations:

- Data arrival and departure rates
  - Do not create more compute capacity than the communication channels can support
- Throughput vs. latency
  - Is lower latency or greater throughput more important?
- Performance vs. area
  - Smaller usually means slower
- HLS can give the data needed to make these decisions
  - Gantt Chart
  - Reports





#### **Modeling Arbitrary Precision**

Hardware design requires being able to specify any bit-width for variables, registers, etc.

Need to model true hardware behavior and precision to meet specification and save power/area

- Not limited to power-of-two bit-widths (1, 8, 16, 32, 64 bits)
- Integer, fixed-point, and floating-point support

Algorithmic C (AC) data types are C++ classes defined to provide storage for precise hardware mapping in HLS
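To make the link between bit-width and representable values concrete, here is a small Python model of the range and resolution of a fixed-point type parameterized like `ac_fixed<W, I, S>` (total width, integer bits including the sign bit, signedness). The function name is hypothetical and the sketch only models the numeric range; it does not reproduce the AC library's rounding or overflow modes.

```python
def fixed_point_range(width, integer_bits, signed=True):
    """Return (minimum, maximum, step) for a fixed-point type.

    `integer_bits` counts the sign bit when `signed` is True.
    """
    frac_bits = width - integer_bits
    step = 2.0 ** -frac_bits  # value of one least-significant bit
    if signed:
        lo = -(2.0 ** (integer_bits - 1))
        hi = 2.0 ** (integer_bits - 1) - step
    else:
        lo = 0.0
        hi = 2.0 ** integer_bits - step
    return lo, hi, step


print(fixed_point_range(16, 6))         # 16 bits total, 6 integer bits, signed
print(fixed_point_range(10, 7))         # 10 bits total, 7 integer bits, signed
print(fixed_point_range(8, 8, False))   # plain unsigned 8-bit integer
```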



# **Saturating Math**



#### Overflow:

| Value   | Binary      |
|---------|-------------|
| 62.5    | 0111110.100 |
| + 2.0   | 0000010.000 |
| = -63.5 | 1000000.100 |

## REALLY WRONG!

#### Saturation:

| Value    | Binary      |
|----------|-------------|
| 62.5     | 0111110.100 |
| + 2.0    | 0000010.000 |
| = 63.875 | 0111111.111 |

Close to correct
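The difference between wrapping and saturating can be reproduced with a short Python sketch. It assumes a 10-bit signed fixed-point format with 7 integer bits and 3 fractional bits (range -64 to 63.875), which is the format implied by the 63.875 saturation value. The function name is hypothetical, and truncation toward negative infinity is an assumption chosen for simplicity.

```python
import math


def to_fixed(value, width=10, integer_bits=7, saturate=False):
    """Store `value` in a signed fixed-point format, wrapping or saturating on overflow."""
    frac_bits = width - integer_bits
    scale = 1 << frac_bits
    raw = math.floor(value * scale)  # integer code, truncating extra fraction bits
    lo = -(1 << (width - 1))         # most negative code
    hi = (1 << (width - 1)) - 1      # most positive code
    if saturate:
        raw = max(lo, min(hi, raw))  # clamp to the representable range
    else:
        raw = (raw - lo) % (1 << width) + lo  # two's-complement wraparound
    return raw / scale


total = 62.5 + 2.0  # 64.5 does not fit in the range [-64, 63.875]
wrapped = to_fixed(total, saturate=False)
saturated = to_fixed(total, saturate=True)
print(wrapped, saturated)
```

Under these assumptions the wrapped result flips sign and lands far from the true sum, while the saturated result stays at the largest representable value.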



#### **Smaller is Better**







A one-bit integer multiplier is an "and" gate



## **Data Sizes and Operators**





## **Energy and Operators**





#### **Benefits of High-Level Synthesis**

High-Level Synthesis can help make this process easier, quicker, and more flexible



Exploration through design constraints and synthesis settings, not manual recoding

- Evaluate more options than possible with a manual RTL design process
- Automated path from C/C++ or SystemC into technology optimized synthesizable RTL







Synthesizable RTL

**Custom Hardware** 






## History of AI/ML Designs w/HLS

Customers have been using HLS for AI/ML designs since 2017

Mostly for **Convolutional Neural Networks** customized in ASIC for **Inferencing** at the edge

Manually optimized bit-widths for lowest area and power

Manually designed custom C++ IP for HLS and adjusted constraints to meet PPA target

Mixture of pure dataflow layer connections and PE-Array architectures





#### Meeting designers where they are

#### Motivation

- A Python env is the de facto standard development platform for AI/ML neural network models
- Generating an efficient hardware implementation from a Python model is tedious and error-prone
- Validation of the accuracy and PPA at the end is often too late
- Recent advances have allowed quantization-aware training using the Python model...
  - ... but those precision details must be manually (re)coded into the HDL model



#### **HLS4ML**



#### Introduction

Provide an efficient and fast translation of machine learning models, from open-source packages for training machine learning algorithms, to High-Level Synthesis

#### Inspiration

- Originally inspired by the CERN Large Hadron Collider (LHC)
- ML applications have proven extremely useful for large dataset analysis.
- Taking data offline will allow for data to be calculated faster along with sorting data for storage
- Lower Latency, Realtime Detections



#### HLS4ML



#### Solution:

- ASICs and FPGAs have specialized architectures compared to CPUs and GPUs
- Specialized hardware can help meet design constraints
- Specialized hardware tends to deliver lower power and faster results



#### Frontends & Backends





#### The Full Flow

[Code listing: MNIST CNN definition in Python. Its comments note that the model can be modified to increase or decrease the number of layers and channels.]

#### Python



C/C++

High-Level Synthesis

Synthesizable RTL





#### **MNIST** Dataset

The MNIST dataset is included in several popular machine learning packages

Contains 70,000 images:

- Images are 28 x 28 pixels
- Pixels are 8-bit greyscale (1 color plane)

Typically split into training and validation sets:

- 60,000 images for training
- 10,000 images for verification







#### **MNIST Neural Network**





#### **Accelerator Development**

\* System performance and power measured for 64-bit Rocket Core RISC-V

Profile the execution to determine functions that need acceleration

| Weight     | Weight (%) | Self Weight | Symbol Name                                                                     |
|------------|------------|-------------|---------------------------------------------------------------------------------|
| 1995.00 ms | 100.0%     | 0 s         | mnist(85781)                                                                    |
| 1995.00 ms |            | 33.00 ms    | Main Thread 0x1af672                                                            |
| 1962.00 ms |            | 0 s         | start                                                                           |
| 1962.00 ms |            | 0 s         | main(int, char *)                                                               |
| 1962.00 ms |            | 210.00 ms   | test_mnist(int, float*, float*)                                                 |
| 1752.00 ms | 100.0%     | 70.00 ms    | sw_inference(float*, float*, float*)                                            |
| 1682.00 ms | 100.0%     | 4.00 ms     | load_image(int, float*)                                                         |
| 1678.00 ms | 100.0%     | 1.00 ms     | load_weights(int, float*)                                                       |
| 1677.00 ms | 100.0%     | 18.00 ms    | sw_auto_infer(int, float*, float*)                                              |
| 922.00 ms  | 55.3%      | 922.00 ms   | dense_sw(float*, float*, float*, float*, int, int, int, int)                    |
| 738.00 ms  | 44.2%      | 738.00 ms   | conv2d_sw(float*, float*, float*, float*, float*, int, int, int, int, int, int) |
| 35.00 ms   | 0.5%       | 3.00 ms     | softmax(int, float*)                                                            |
| 17.00 ms   |            | 2.00 ms     | check_results(int, float*, float*)                                              |
| 15.00 ms   |            | 15.0 ms     | exit()                                                                          |

Convolution and dense layers consume 99.5% of the computational load (excluding test overhead). These will benefit from acceleration.
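One way to reason about what that 99.5% figure buys is Amdahl's law, which is not named in the slides but is the usual first-order model: if a fraction `p` of the run time is accelerated by a factor `s`, the overall speedup is `1 / ((1 - p) + p / s)`. The sketch below applies it with `p = 0.995`; the accelerator speedups are illustrative values, not measurements.

```python
def amdahl_speedup(accelerated_fraction, accelerator_speedup):
    """Overall speedup when only part of the workload is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / accelerator_speedup)


p = 0.995  # share of compute spent in the convolution and dense layers

speedup_10x = amdahl_speedup(p, 10)              # hypothetical 10x accelerator
speedup_limit = amdahl_speedup(p, float("inf"))  # upper bound: accelerated part takes no time
print(round(speedup_10x, 2), round(speedup_limit, 1))
```

Even an infinitely fast accelerator is bounded at roughly 200x overall, because the remaining 0.5% of the work still runs in software.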



# Feature and Weight Quantization

#### **Post Training Quantization**



#### **Quantization-Aware Training**





#### Higher levels of abstraction

Catapult AI NN has a simplified Python API for configuring the project and generating the RTL

- Use config\_for\_dataflow to configure the project using only the model and dataset variables
- Use generate\_dataflow to generate the Catapult HLS C++ model, C++ testbench and build scripts
- Use build to generate the RTL

This example is available using the Catapult AI/NN Frontend for HLS4ML



# Reports

#### Layer Report:

- HLS4ML Layer Summary report shows python description of each layer
- nnet layer results report shows PPA for each network layer

| Layer Name | Layer Class | Input Type          | Input Shape | Output Type         | Output Shape | Weight Type         | Bias Type           |
|------------|-------------|---------------------|-------------|---------------------|--------------|---------------------|---------------------|
| conv2d1    | Conv2D      | ac_fixed<8,1,true>  | [14][14][1] | ac_fixed<16,6,true> | [4][4][5]    | ac_fixed<16,6,true> | ac_fixed<16,6,true> |
| relu1      | relu        | ac_fixed<16,6,true> | [4][4][5]   | ac_fixed<16,6,true> | [4][4][5]    |                     |                     |
| flatten1   | Reshape     | ac_fixed<16,6,true> | [4][4][5]   | ac_fixed<16,6,true> | [80]         |                     |                     |
| dense1     | Dense       | ac_fixed<16,6,true> | [80]        | ac_fixed<16,6,true> | [10]         | ac_fixed<16,6,true> | ac_fixed<16,6,true> |
| softmax1   | Softmax     | ac_fixed<16,6,true> | [10]        | ac_fixed<16,6,true> | [10]         |                     |                     |

This report is available using the Catapult AI/NN Frontend for HLS4ML



# **Understanding Precision**



High-water mark of data and intermediate values showed range of values was -37 to 56

Float32 (range of roughly ±10<sup>38</sup>) is excessive

Sensitivity analysis performed across varying fixed-point representations
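A quick way to turn an observed value range into a starting point for the integer bit count is sketched below. It assumes a signed two's-complement format in which the integer bit count includes the sign bit (as in `ac_fixed`); the function name is hypothetical.

```python
def signed_integer_bits(lo, hi):
    """Smallest integer-bit count I (sign bit included) with -2**(I-1) <= lo and hi < 2**(I-1)."""
    bits = 1
    while not (-(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1)):
        bits += 1
    return bits


# High-water marks quoted above: values ranged from -37 to 56.
needed = signed_integer_bits(-37, 56)
print(needed)
```

A type with fewer integer bits than this would wrap or saturate at the extremes of the observed range, which is one reason a sensitivity analysis across several fixed-point formats is useful.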



# Value Range Analysis

For this example, a fixed-point precision of ac\_fixed<16,6> resulted in 3 numerically different results compared to the floating-point Python output (after quantization)

```
catapult_ai_nn.run_testbench(hls_model_ccs, 0.005)
 Weights directory: ./firmware/weights
 Test Feature Data: ./tb_data/tb_input_features.dat
 Test Predictions : ./tb_data/tb_output_predictions.dat
Processing input 0
Predictions
0 0 1.5e-05 9.2e-05 0 1e-06 0 0.99989 1e-06 1e-06
Quantized predictions
0 0 0 0 0 0 0 .9990234375 0 0
Ref 0.885848 Ref(quantized) .8857421875 DUT 0.879883   <- MISMATCH
Ref 0.379353 Ref(quantized) .37890625 DUT 0.366211   <- MISMATCH
Ref 0.616644 Ref(quantized) .6162109375 DUT 0.628906   <- MISMATCH
INFO: Saved inference results to file: tb_data/csim_results.log
Error: A total of 3 differences detected between golden Python prediction and C++ prediction using threshold of 0.005
```



This tool is available using the Catapult AI/NN Frontend for HLS4ML
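The threshold comparison reported in the log can be approximated in a few lines of Python. This is a sketch of the idea only, not the Catapult testbench implementation; it assumes the comparison is between the floating-point reference and the DUT value, which the log does not state explicitly. The three value pairs are the mismatches printed above.

```python
def count_mismatches(reference, dut, threshold):
    """Count positions where the two outputs differ by more than `threshold`."""
    return sum(1 for r, d in zip(reference, dut) if abs(r - d) > threshold)


# "Ref" and "DUT" values from the three MISMATCH lines in the log.
ref = [0.885848, 0.379353, 0.616644]
dut = [0.879883, 0.366211, 0.628906]

mismatches = count_mismatches(ref, dut, 0.005)
print(mismatches)

# Loosening the threshold shows how sensitive the count is to the tolerance.
print(count_mismatches(ref, dut, 0.01), count_mismatches(ref, dut, 0.02))
```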

#### Customization

#### Add refinements by layer

Measuring the accuracy of this model shows a slight improvement

| Metric                | Before refinements | After refinements |
|-----------------------|--------------------|-------------------|
| Python model accuracy | 0.9576             | 0.9576            |
| C++ model accuracy    | 0.9497             | 0.9498            |
| Area score            | 70125              | 72275             |

Does the accuracy increase of 0.0001 warrant an increase in size?



## **Rethinking the Approach - QAT**

Going back to the Python model, you can use QKeras to model the quantization effects at the interfaces of the layers during training.

Note that even though QKeras applies quantization at the interfaces (features, weights, and biases), the internal math operations are still performed in double precision, whereas the fixed-point C++ model uses bit-precise fixed-point operations.





## **Transferring Your Network**



```
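import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from qkeras import QConv2D, QDense, quantized_bits

Fw = 14  # input feature width; assumed here from the 14x14x1 input shape in the layer report

model = Sequential()
model.add(layers.Input(shape=(Fw, Fw, 1), name='input1'))
# Quantized 2D convolution: 5 filters, 5x5 kernel, stride 3
model.add(QConv2D(filters=5, kernel_size=5, strides=3,
          kernel_quantizer=quantized_bits(8, 1, 1, alpha=1),
          bias_quantizer=quantized_bits(8, 1, alpha=1),
          name='conv2d1'))
model.add(layers.BatchNormalization(name='batchnorm1'))
model.add(layers.Activation('relu', name='relu1'))
model.add(layers.Flatten(name='flatten1'))
# Quantized dense layer with 10 outputs (one per digit class)
model.add(QDense(
          units=10,
          kernel_quantizer=quantized_bits(8, 1, alpha=1),
          bias_quantizer=quantized_bits(8, 1, alpha=1),
          kernel_regularizer=tf.keras.regularizers.L1L2(0.0001),
          activity_regularizer=tf.keras.regularizers.L2(0.0001),
          name='dense1',
))
model.add(layers.Activation('softmax', name='softmax1'))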
```



# **Model Accuracy – Quantizer Bits**

#### **Integer Bits** (columns) vs. **Fractional Bits** (rows)

| Fractional bits | 8      | 7      | 6      | 5      | 4      | 3      | 2      | 1      | 0      |
|-----------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 8               | 0.9557 | 0.9537 | 0.9583 | 0.9509 | 0.953  | 0.9421 | 0.907  | 0.8966 | 0.098  |
| 7               | 0.9565 | 0.9552 | 0.9569 | 0.9576 | 0.9552 | 0.9459 | 0.941  | 0.9308 | 0.098  |
| 6               | 0.9497 | 0.952  | 0.9556 | 0.9496 | 0.9579 | 0.9495 | 0.9469 | 0.9133 | 0.2298 |
| 5               | 0.9608 | 0.957  | 0.9565 | 0.9532 | 0.952  | 0.9405 | 0.9238 | 0.9211 | 0.098  |
| 4               | 0.9537 | 0.9567 | 0.9519 | 0.9605 | 0.9539 | 0.9492 | 0.9344 | 0.9016 | 0.5703 |
| 3               | 0.9512 | 0.9549 | 0.9553 | 0.951  | 0.9513 | 0.9515 | 0.9408 | 0.9212 | 0.8202 |
| 2               | 0.953  | 0.915  | 0.9559 | 0.9576 | 0.9555 | 0.9501 | 0.9413 | 0.9099 | 0.7048 |



# **Design Exploration and Optimizing**

| Conv 2D<br>5x5 filter | Model<br>Accuracy | Area (µm²) | Bias bits<br>in ROM | Weight bits<br>in ROM |
|-----------------------|-------------------|-----------|---------------------|-----------------------|
| 8int 5p               | 0.9608            | 133255    | 65                  | 1625                  |
| 7int 4p               | 0.9567            | 115933    | 55                  | 1375                  |
| 7int 2p               | 0.915             | 99520     | 45                  | 1125                  |
| 4int 6p               | 0.9579            | 99550     | 50                  | 1250                  |
| 0int 3p               | 0.8202            | 37591     | 15                  | 375                   |

#### Discover the optimal design

- Make informed choices
- Find the smallest design with an optimal accuracy

#### **Key Points**

- As the number of bits decreases, the area decreases
- The fewer bits moving through ROM, the less energy used

| Dense<br>10 Ch | Model<br>Accuracy | Area (µm²) | Bias bits<br>in ROM | Weight bits<br>in ROM |
|----------------|-------------------|-----------|--------------------|-----------------------|
| 8int 5p        | 0.9608            | 813888    | 130                | 10400                 |
| 7int 4p        | 0.9567            | 703025    | 110                | 8800                  |
| 7int 2p        | 0.915             | 597973    | 90                 | 7200                  |
| 4int 6p        | 0.9579            | 596609    | 100                | 8000                  |
| 0int 3p        | 0.8202            | 200393    | 30                 | 2400                  |
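The ROM columns in both tables follow directly from the layer shapes and the chosen bit-widths. The sketch below assumes that a label such as "8int 5p" means 8 integer bits plus 5 fractional bits (13 bits per value), and takes the layer sizes from the layer report: the convolution has 5 filters of 5x5x1 weights, and the dense layer maps 80 inputs to 10 outputs. The function name is hypothetical.

```python
def rom_bits(num_weights, num_biases, integer_bits, fractional_bits):
    """Return (weight_bits, bias_bits) stored in ROM for one layer."""
    width = integer_bits + fractional_bits  # bits per stored value
    return num_weights * width, num_biases * width


CONV_WEIGHTS, CONV_BIASES = 5 * 5 * 1 * 5, 5  # 5x5 kernel, 1 input channel, 5 filters
DENSE_WEIGHTS, DENSE_BIASES = 80 * 10, 10     # 80 inputs, 10 outputs

conv_8_5 = rom_bits(CONV_WEIGHTS, CONV_BIASES, 8, 5)
dense_8_5 = rom_bits(DENSE_WEIGHTS, DENSE_BIASES, 8, 5)
conv_0_3 = rom_bits(CONV_WEIGHTS, CONV_BIASES, 0, 3)
print(conv_8_5, dense_8_5, conv_0_3)
```

Under these assumptions the computed values match the "Weight bits in ROM" and "Bias bits in ROM" columns, which makes it easy to estimate storage for bit-width combinations that were not synthesized.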



## Meeting designers where they are

#### Ease of Use and Optimization

- High-Performance C++ IP Libraries for better hardware
- Enhanced analysis and reporting
- Complete low-power design w/power estimation and optimization
- Integrated Value-Range Analysis (VRA) for detection of quantization/overflow in C++
- C++ Testbench options to measure numerical differences vs Python



This example is available using the Catapult AI/NN Frontend for HLS4ML













#### What is hls4ml?

- hls4ml is a Python package for machine learning inference as custom hardware
  - Translate traditional open-source ML models into an HLS project
- Easy to install
  - pip install hls4ml
- Open source
  - https://github.com/fastmachinelearning/hls4ml
  - https://fastmachinelearning.org/hls4ml
- Community
  - Research laboratories, universities, and companies



Software Version v2024.1 February 2024

Support for HLS4ML flow (beta)



# Co-design with hls4ml

- Co-design = development loop between algorithm design, data collection, training, and hardware implementation
  - Large design search space
  - Scientists and engineers with different expertise





# hls4ml origins

- High energy physics
  - Large Hadron Collider (LHC) at CERN
  - Extreme collision frequency of 40 MHz → extreme data rates O(100 TB/s)
    - Most collision "events" don't produce interesting physics
    - "Triggering" = filter events to reduce data rates to manageable levels





# hls4ml has grown

- To a large variety of scientific applications
  - Low latencies (ms → ns)
  - High throughput O(100TB/s)
- ... including teaching material





Neural learning for control, Institute of neuroinformatics, ETH Zurich



# hls4ml community

#### **GitHub**



There are several ways to run the tutorial notebooks:

Online: launch binder





#### hls4ml architecture

- Converts from ML frameworks
- Internal representation
- Configuration to tune latency vs. resources, bit precision
  - $|\text{hls4ml knobs}| \ll |\text{HLS knobs}|$
- Optimizers, e.g. merging layers
- Backends to HLS tools
- nnet\_utils = C++ library of ML functionalities optimized for HLS





## hls4ml supports Catapult HLS





#### hls4ml - Parallelization

- Trade-off between latency and resource usage determined by the parallelization of the logic in each layer
- ReuseFactor = number of times a multiplier is used to do a computation





## Design space exploration via reuse factor

- ReuseFactor = 1, 2, 4
- Other configurations (ignore for now)
  - Streaming Input, On-chip Weights, 32nm ASIC, 10ns Clock, Latency mode






#### hls4ml – Quantization

- As "customary" in custom hardware, we use quantized representation
  - Floating-point computation is too resource intensive
- Precision = fixed point types
  - ac\_fixed, Algorithmic C Datatypes
  - https://github.com/hlslibs/ac\_types
- Operations are integer ops, but we can represent fractional values
- But we have to make sure we've used the correct data types!
  - Post training quantization
  - Quantization aware training



ac\_fixed<width bits, integer bits, signed>



1.1 computing's energy problem (and what we can do about it), M. Horowitz 2014

High-performance hardware for machine learning, W. Dally 2015



# Design space exploration via (post-training) quantization

Post-training quantization (PTQ) = turning weights from float to fixed (or other quantized format)
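As a minimal sketch of what PTQ does to individual weights, the code below rounds floating-point values to a signed fixed-point grid and saturates anything out of range. The weight values and the 8-bit, 2-integer-bit format are made up for illustration, the function name is hypothetical, and round-to-nearest with saturation is an assumed policy rather than the default of any particular tool.

```python
import math


def quantize(value, width, integer_bits):
    """Round `value` to a signed fixed-point grid and saturate at the range limits."""
    frac_bits = width - integer_bits
    scale = 2.0 ** frac_bits
    raw = math.floor(value * scale + 0.5)  # round to nearest code
    lo = -(2 ** (width - 1))               # most negative code
    hi = 2 ** (width - 1) - 1              # most positive code
    raw = max(lo, min(hi, raw))            # saturate out-of-range values
    return raw / scale


# Hypothetical trained weights; 8 bits total with 2 integer bits covers [-2, 1.984375].
weights = [0.7312, -0.1049, 1.9, -2.6]
quantized = [quantize(w, 8, 2) for w in weights]
errors = [abs(w - q) for w, q in zip(weights, quantized)]
print(quantized)
print(max(errors))
```

In-range weights move by at most half a step (1/128 here), while the out-of-range weight saturates and carries a much larger error, which is what a scan over integer and fractional bits exposes.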

Scan integer bits
Fractional bits fixed to 8



Scan fractional bits Integer bits fixed to 6





## **Quantization-aware training (QAT)**

- QAT improves on PTQ
  - Taking into account quantization numerics and learning around them
  - More compact bit representation → reduction in area, power, and latency
  - QKeras (https://github.com/google/qkeras), Brevitas (https://github.com/Xilinx/brevitas)
    - Easy to use, e.g. drop-in replacements for Keras layers
      - Dense → QDense
      - Conv2D → QConv2D







# hls4ml – Layer implementations and interfaces

- **hls4ml** is a specialized compiler or transpiler
  - Translates a high-level specification of a model into HLS-ready code that implements the same algorithms
- User can choose
  - Strategy for the implementation of the layers
    - "Latency" for smaller models, where the goal is likely high parallelism, i.e. a low reuse factor
    - "Resource" for larger models and higher reuse factors
  - IOType for the interfaces of layers and overall module
    - "io\_parallel" for data passed as arrays
    - "io\_stream" for data passed as latency-insensitive channels, e.g. ac\_channel from the Algorithmic C Datatypes





## hls4ml configuration in summary

- ReuseFactor: <integer value>
  - Controls the level of parallelism: 1 is the most parallel (smallest latency), 2 is half that...
- Precision: <fixed-point data type>
  - Global or per-layer option configuring the precision for feature, weight and bias values
- Strategy: "latency" or "resource"
  - Selects different C++ architectures for the layer implementations
- IOType: "io\_parallel" or "io\_stream"
  - Passes data either as arrays or latency-insensitive channels, e.g.
- Part: <FPGA part>
  - Identifies the specific FPGA family/part used in downstream RTL synthesis
- ClockPeriod: <period in ns>
  - Specifies the clock period for HLS



# hls4ml – Heterogeneous dataflow architecture

hls4ml instantiates and configures layers of a model in a data flow architecture





## hls4ml – Example

```
from keras import Sequential
from keras.layers import Dense, Activation

# Model definition
model = Sequential()
model.add(Dense(64, input_shape=(16,), name='fc1'))
model.add(Activation(activation='relu', name='relu1'))

# Model training
# X_train, y_train, and X_test are assumed to be defined elsewhere,
# and the model is assumed to be compiled before fit() is called.
model.fit(X_train, y_train)
y = model.predict(X_test)

# Configuration
import hls4ml
config = hls4ml.utils.config_from_keras_model(
             model,
             granularity='name')
config['Model']['ReuseFactor'] = 2
config['Model']['Precision'] = 'ap_fixed<16,6>'
config['Model']['Strategy'] = 'Latency'

# Creation of an HLS model
hls4ml_model = hls4ml.converters.convert_from_keras_model(
             model,
             hls_config=config,
             io_type='io_parallel')

# Creation of an HLS project
hls4ml_model.compile()

# Prediction
y_hls4ml = hls4ml_model.predict(X_test)

# Synthesis
hls4ml_model.build(csim=True, synth=True)
```





# Survey of Big Data sizes in 2021





https://arxiv.org/abs/2202.07659

## Silicon pixel detectors

- Experiments at colliders typically have a silicon pixel detector at the center
  - Concentric rings tiled with sensors
- Silicon sensors are depleted of charge carriers by high voltage
- When a charged particle from a collision passes through, it creates e/h pairs
- Charge is read out and transferred off-detector
  - Charge cluster information is used for physics analysis offline

https://cms.cern/detector







#### Particle tracks and vertices

- Connecting the dots between charge collected in different pixel layers creates a particle track
  - Detector should be low-mass so interactions in inactive material don't disrupt this trajectory
- Solenoid magnet immerses the pixel detector in a magnetic field, causing tracks to curve
  - Very curved → low transverse momentum (low-p<sub>T</sub>)
  - Almost straight → high transverse momentum (high-p<sub>T</sub>)
- Reconstructing vertices is critical
  - Secondary vertices help identify particles: long, short, medium lifetime?
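The link between curvature and transverse momentum can be made concrete with the standard relation p<sub>T</sub> [GeV/c] ≈ 0.3 · |q| · B [T] · R [m]. The sketch below assumes a 3.8 T field (the CMS solenoid value, which is not stated on the slide) and unit charge; the function name is hypothetical.

```python
# Speed of light factor that converts T*m to GeV/c for a unit charge.
GEV_PER_TESLA_METER = 0.299792458

CMS_FIELD_T = 3.8  # assumption: nominal CMS solenoid field, not from the slide


def radius_of_curvature_m(pt_gev, b_tesla=CMS_FIELD_T, charge=1):
    """Radius of the track's circle in the transverse plane, in meters."""
    return pt_gev / (GEV_PER_TESLA_METER * abs(charge) * b_tesla)


for pt in (0.2, 1.0, 10.0):
    print(f"pT = {pt:5.1f} GeV -> R = {radius_of_curvature_m(pt):.2f} m")
```

A low-p<sub>T</sub> track bends on a radius of tens of centimeters, while a high-p<sub>T</sub> track bends on a radius of several meters and looks almost straight across the pixel layers.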









#### Designing hardware for the LHC is challenging

- LHC/CMS produces a lot of data
  - New data every 25 ns (p-p collision)
  - Physicists have to throw most of it away
    - Physically and financially challenging
    - Risk of throwing away significant information
- Detector is continuously being sprayed with particles
  - Need radiation tolerant on-detector electronics
- High voltage and low temperature requirements
  - Up to -800 V, -35 °C
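The 25 ns bunch spacing fixes the rate that any on-detector circuit has to sustain. A small arithmetic sketch, assuming every bunch slot is filled (the real LHC filling scheme has gaps); the function names are hypothetical.

```python
NS_PER_S = 1_000_000_000


def crossing_rate_hz(spacing_ns):
    """Bunch-crossing rate in Hz for a given bunch spacing in nanoseconds."""
    return NS_PER_S / spacing_ns


def crossings_in(seconds, spacing_ns=25):
    """Bunch crossings in a time window, assuming every slot is filled."""
    return int(seconds * crossing_rate_hz(spacing_ns))


print(crossing_rate_hz(25) / 1e6, "MHz")
print(crossings_in(1), "crossings per second")
```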







#### **Goal of the Smart Pixel team**

- On-chip data filtering at rate (40 MHz)
- AI algorithms
- Reconfigurable algorithms
- Hybrid pixel detector
  - Silicon sensor
  - Pixelated ROIC
    - Analog front-end + ADC
    - AI in digital logic





## Neural network classifier (filter)

- Inputs are cluster images projected onto y-axis and the associated y<sub>0</sub>
- Three output categories
  - high-momentum (> 200 MeV)
  - low-momentum, negatively charged
  - low-momentum, positively charged
- Simulated dataset of 800,000 clusters
- Classical training and testing set split 80%-20%
- TensorFlow/Keras, 200 epochs for training, 20 epochs of early stopping, 1024 batch size, Adam optimizer
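The training setup above implies some simple bookkeeping. The sketch below is plain Python, not the team's training script: it derives the split sizes and steps per epoch from the listed numbers, and it reads "20 epochs of early stopping" as a patience of 20 epochs, which is an assumption. `stopping_epoch` is a hypothetical helper and the loss curve is made up.

```python
import math

N_CLUSTERS = 800_000
BATCH_SIZE = 1024
MAX_EPOCHS = 200
PATIENCE = 20  # assumption: "20 epochs of early stopping" means patience

n_train = N_CLUSTERS * 80 // 100          # 80% training split
n_test = N_CLUSTERS - n_train             # remaining 20%
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)


def stopping_epoch(val_losses, patience=PATIENCE):
    """1-based epoch at which training stops for a validation-loss history."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses)


# Made-up loss curve: improves for 30 epochs, then plateaus.
losses = [1.0 / e for e in range(1, 31)] + [0.5] * (MAX_EPOCHS - 30)

print(n_train, n_test, steps_per_epoch)
print(stopping_epoch(losses))
```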





# Filtering in ASIC at LHC

- On-chip data reduction at BX rate
  - R&D for phase III CMS experiments
  - pp-collision 40 MHz
- Integration of the ML algorithm as digital logic with the analog front-end in the pixelated area
- Low-power 28nm CMOS
- Total power < 1 W/cm²
  - Analog ~5 µW/pixel
  - Digital ~1 µW/pixel
- Bandwidth saving
  - 54.4% to 75.4%

Smart pixel sensors: towards on-sensor filtering of pixel clusters with deep learning, J. Yoo et al. 2023
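The power and bandwidth figures above can be cross-checked with a few lines of arithmetic. This sketch assumes the per-pixel analog and digital budgets simply add, and that the bandwidth saving is the fraction of clusters not sent off-chip; the pixel size is not given on the slide, so only the minimum pixel area implied by the power density limit is computed.

```python
ANALOG_UW_PER_PIXEL = 5.0
DIGITAL_UW_PER_PIXEL = 1.0
POWER_LIMIT_W_PER_CM2 = 1.0
UM2_PER_CM2 = 1e8

total_uw_per_pixel = ANALOG_UW_PER_PIXEL + DIGITAL_UW_PER_PIXEL

# Smallest pixel area that keeps the power density under the limit.
min_pixel_area_um2 = total_uw_per_pixel * 1e-6 / POWER_LIMIT_W_PER_CM2 * UM2_PER_CM2

# Fraction of the original bandwidth still needed after filtering.
savings = (0.544, 0.754)
remaining = [1.0 - s for s in savings]

print(f"{total_uw_per_pixel:.0f} uW/pixel -> at least {min_pixel_area_um2:.0f} um^2 per pixel")
print([f"{r:.1%}" for r in remaining])
```

Under these assumptions a pixel needs an area of roughly 600 µm² or more to stay within 1 W/cm², and the filter leaves between about a quarter and just under half of the original data to be read out.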





## Data compression in ASIC at LHC

- Autoencoder (ML) on the detector front-end for data compression
  - ASIC required due to radiation tolerance, handled through triple modular redundancy, and power requirements
- Reconfigurable ASIC to address: evolving LHC conditions (beam related), detector performance (noise, dead channels), and updated performance metric (resolution, new physics signatures)
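Triple modular redundancy keeps three copies of a value and takes a bitwise majority vote, so a radiation-induced flip in any single copy is masked. The sketch below models the voter in Python for illustration only; it is not the ASIC's implementation, and `majority_vote` is a hypothetical name.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three redundant copies of the same register."""
    return (a & b) | (a & c) | (b & c)


stored = 0b1011_0010
upset = stored ^ 0b0000_1000   # one bit flipped in one copy

print(bin(majority_vote(stored, stored, upset)))
```

The voter recovers the stored value as long as no two copies are corrupted in the same bit position.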

8" hexagonal silicon module (1 out of ~27,000)



| Metric / requirement    | Value                            |
|-------------------------|----------------------------------|
| Rate                    | 40 MHz                           |
| Total ionizing dose     | 200 Mrad                         |
| High-energy hadron flux | 10<sup>7</sup> cm<sup>-2</sup>/s |
| Tech. node              | 65 nm LP CMOS                    |
| Power                   | 48 mW                            |
| Energy / inf.           | 1.2 nJ                           |
| Area                    | 2.88 mm<sup>2</sup>              |
| Gates                   | 780k                             |
| Latency                 | 50 ns                            |
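The table entries are mutually consistent, which a short calculation shows. The sketch assumes one inference per bunch crossing at the 40 MHz rate; the derived quantities (crossings in flight, gate density) are illustrative and do not appear in the source.

```python
POWER_W = 48e-3
RATE_HZ = 40e6
LATENCY_NS = 50
AREA_MM2 = 2.88
GATES = 780_000

bunch_spacing_ns = 1e9 / RATE_HZ

# Energy per inference is average power divided by inference rate.
energy_per_inference_nj = POWER_W / RATE_HZ * 1e9

# Inferences overlapping in the pipeline at any time.
crossings_in_flight = LATENCY_NS / bunch_spacing_ns

gate_density_per_mm2 = GATES / AREA_MM2

print(f"{energy_per_inference_nj:.1f} nJ per inference")
print(f"{crossings_in_flight:.0f} crossings in flight")
print(f"{gate_density_per_mm2 / 1e3:.0f}k gates per mm^2")
```

The 50 ns latency spans two bunch crossings, so the design has to be pipelined to accept a new input every 25 ns.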

Using QKeras, hls4ml, and Catapult HLS

- Reduced power by 50% and area by 80%, and achieved 2x better performance than reference solutions by optimizing compression and quantization
- Faster design cycle!



#### More ASIC applications with hls4ml and Catapult HLS

- Data compression for X-ray microscopy (ptychography)
- Testing chip at GF 65nm
- Evaluation of algorithms
- PCA vs. Autoencoder

Up to **70x data compression** at source with a **20% increase** in pixel **area** 







- Testing chip at GF 22nm
- SoC with ML accelerator
- Under testing







#### A recent application for FPGA: Plasma control

- Plasma instabilities arise when magnetic field lines become distorted
  - Microsecond-scale latency constraints
- Confinement loss → damage to the reactor
- One of the major roadblocks preventing lasting thermonuclear fusion

| Model Name          | PPCF23 Baseline        | QAT+Pruning        | -           |
|---------------------|------------------------|--------------------|-------------|
| Image Resolution    | 128×64                 | 128×64             | 32×32       |
| Conv layer filters  | {8,8,16}               | {8,8,16}           | {16,16,24}  |
| Dense layer widths  | {256,64}               | {256,64}           | {42,64}     |
| Total parameters    | 362,730                | 362,730            | 12,910      |
| Parameter precision | PTQ, 18 bits           | QAT, 8 bits        | QAT, 7 bits |
| Sparsity            | none                   | 80%                | 50%         |
| Bit Operations      | 6.74e13                | x                  | 4.52e11     |
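The combined effect of quantization and pruning on weight storage can be estimated from the table. This sketch assumes pruned weights need no storage and ignores any index overhead for sparse layouts, so it gives a lower bound; the label `small` for the unnamed third column is hypothetical.

```python
# (total parameters, bits per parameter, sparsity in percent)
MODELS = {
    "PPCF23 Baseline": (362_730, 18, 0),
    "QAT+Pruning": (362_730, 8, 80),
    "small": (12_910, 7, 50),  # unnamed third column in the table
}


def weight_bits(params, bits, sparsity_pct):
    """Bits needed to store only the non-zero weights."""
    nonzero = params * (100 - sparsity_pct) // 100
    return nonzero * bits


baseline_bits = weight_bits(*MODELS["PPCF23 Baseline"])
for name, spec in MODELS.items():
    bits = weight_bits(*spec)
    print(f"{name:16s} {bits:>9,d} bits  ({baseline_bits / bits:6.2f}x smaller)")
```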



http://sites.apam.columbia.edu/HBT-EP











#### hls4ml in summary

- Open source + community
- Python ML package
  - Reads and optimizes ML networks
  - Library of optimized HLS-ready ML functions
  - Dataflow pipeline of hardened layers
  - Easier design-space exploration for ML implementations
  - Support for Catapult HLS
- Successful for both ASIC and FPGA applications





