# DAC Tutorial:

# Introduction to Foundation AI Model and Its EDA Applications



#### Speakers:

Prof. Ang Li, University of Maryland, College Park

Dr. Wei Wen, Meta

Prof. Zhiyao Xie, HKUST

#### Host:

Prof. Xiaoxuan Yang, University of Virginia













# **Opportunities from Foundation Models**

- Emergence of large foundation models in many fields
  - Unprecedented ability to *understand*, *predict*, and *generate* content



Language model: GPT, Llama

Q: Image (A potato king)



Image model: DALL-E

Q: Video (A family of monsters)





Video model: Sora

#### **Overview of This Tutorial**

#### A 3-hour tutorial about **foundation AI models and EDA applications**

- 1. Basic Large Language Model (LLM) Knowledge
  - Ang Li (University of Maryland), 1-hour session
- 2. Multimodal Foundation Model + Efficiency of Foundation Model
  - Wei Wen (Meta), 1-hour session
- 3. Using Foundation Models in **EDA Applications** 
  - Zhiyao Xie (HKUST), 1-hour session



# Basic Large Language Model (LLM) Techniques

Ang Li, Assistant Professor, University of Maryland

Duration: ~1 hour







#### **Outline of Session 1**

- Attention Models and Transformers
- Large Language Model Training
- Large Language Model Inference






#### Issue with recurrent models

- Recurrent models (e.g., LSTM, GRU) are unrolled from left to right
  - Word pairs will have linear interaction distance



#### **Problems:**

- Hard to learn long-distance dependencies
  - Vanishing gradient problem
- Hard to parallelize
  - Forward and backward passes have O(sequence length) unparallelizable operations

## Problems with classic Seq2Seq models

- Traditional encoder-decoder systems suffer from information bottleneck:
  - The last hidden state needs to capture all the information about the source sentence





#### Solution: attention mechanism

- Attention mechanism provides a solution to the problem
- Core idea: at each decoding step, focus on a different part of the source sequence.





# How to compute attention?

- Suppose we have encoder hidden states  $e_1, \dots, e_N \in \mathbb{R}^h$  and the step-$t$ decoder hidden state  $d_t \in \mathbb{R}^h$
- At decoding step $t$:
  - 1. Compute the attention scores

$$s^t = [d_t^T e_1, \dots, d_t^T e_N] \in \mathbb{R}^N$$

  - 2. Apply softmax to get the attention distribution over source tokens:  $w^t = \mathrm{softmax}(s^t) \in \mathbb{R}^N$
  - 3. Compute the weighted sum of the encoder hidden states

$$a_t = \sum_{i=1}^N w_i^t e_i \in \mathbb{R}^h$$

  - 4. Concatenate  $a_t$  with  $d_t$ , and feed  $[a_t; d_t] \in \mathbb{R}^{2h}$  to the decoder
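The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full decoder: the function name `attention_step`, the random inputs, and the sizes N = 5 and h = 4 are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability without changing the result.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(E, d_t):
    """E: (N, h) encoder hidden states; d_t: (h,) decoder hidden state at step t."""
    s_t = E @ d_t                  # step 1: scores d_t^T e_i, a vector in R^N
    w_t = softmax(s_t)             # step 2: attention distribution over source tokens
    a_t = w_t @ E                  # step 3: weighted sum of encoder states, in R^h
    return np.concatenate([a_t, d_t]), w_t   # step 4: [a_t; d_t] in R^{2h}

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))        # hypothetical N = 5 source tokens, h = 4
d_t = rng.normal(size=4)
out, w_t = attention_step(E, d_t)
print(out.shape, w_t.sum())
```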

# Why is attention so powerful?

- Attention can significantly improve neural machine translation (NMT) performance
  - Allows the decoder to focus on different parts of the source
  - Solves the information bottleneck problem
- Attention helps with the vanishing gradient issue
  - Provides a shortcut to early source tokens
- Attention provides interpretability
  - Implicitly learns a soft alignment between the source and target sequences
  - Check the attention distribution for each output token





# Attention as a general technique

- Attention is also used in computer vision:
  - Attend to different parts of the input image when generating a caption





- Attention can also be a basic building block for sequence modeling
  - New sequence models: Transformers, BERT, GPT etc.

## Replace recurrent with self-attention

Recall that attention was introduced in Seq2Seq systems to attend to different parts of the source sentence



- Self-attention: apply attention within a single sentence
  - All words attend to all words in previous layer (most arrows are omitted)





# Self-attention computation

- To compute attention we need queries, keys, and values:
  - Queries:  $q_1, q_2, \dots q_T$ . Each  $q_i \in \mathbb{R}^d$
  - Keys:  $k_1, k_2, \dots k_T$ . Each  $k_i \in \mathbb{R}^d$
  - Values:  $v_1, v_2, ... v_T$ . Each  $v_i \in \mathbb{R}^d$
- In self-attention, the queries, keys and values come from the same source
  - $k_i = Kx_i$ ,  $q_i = Qx_i$ ,  $v_i = Vx_i$ , where  $K, Q, V \in \mathbb{R}^{d \times d}$  are linear transformations shared by all  $x_i$
- Self-attention generates new representations as follows:
  - score:  $s_{ij} = q_i^T k_j$ , attention:  $a_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}$ , output<sub>i</sub> =  $\sum_j a_{ij} v_j$
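A minimal NumPy sketch of the score, attention, and output formulas follows. The function names and the toy sizes (T = 3, d = 4) are assumptions for illustration. The sketch uses the unscaled scores written on the slide; the original Transformer additionally divides the scores by the square root of the key dimension.

```python
import numpy as np

def softmax_rows(S):
    Z = np.exp(S - S.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def self_attention(X, Q, K, V):
    """X: (T, d) inputs, one row per token; Q, K, V: (d, d) projection matrices."""
    q, k, v = X @ Q.T, X @ K.T, X @ V.T   # row i holds q_i = Q x_i, k_i = K x_i, v_i = V x_i
    S = q @ k.T                           # S[i, j] = q_i^T k_j
    A = softmax_rows(S)                   # A[i, j] = a_ij; each row sums to 1
    return A @ v, A                       # output_i = sum_j a_ij v_j

rng = np.random.default_rng(0)
T, d = 3, 4
X = rng.normal(size=(T, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, Q, K, V)
print(out.shape, A.sum(axis=-1))
```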





#### **Transformer**

- Transformer structure:
  - Two parts: encoder & decoder (Seq2Seq model)
  - Basic block: self-attention + feed-forward
  - Stacked multiple blocks
  - Bunch of fixes/tricks



#### Multi-head self-attention

- Previously for each word *i*, we compute (**one**) attention over the words:
  - $k_i = Kx_i$ ,  $q_i = Qx_i$ ,  $v_i = Vx_i$  where  $K, Q, V \in \mathbb{R}^{d \times d}$
  - score:  $s_{ij} = q_i^T k_j$ , attention:  $a_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}$ , output<sub>i</sub> =  $\sum_j a_{ij} v_j$
- What if we want multiple attentions for each word?
  - We can define multiple attention "heads" by multiple K, Q, V matrices
  - Each head will look at different things and combine values differently!
- Define  $K^l$ ,  $Q^l$ ,  $V^l \in \mathbb{R}^{\frac{d}{h} \times d}$ , where h is the number of attention heads
  - For each head  $l$:  $k_i^l = K^l x_i$ ,  $q_i^l = Q^l x_i$ ,  $v_i^l = V^l x_i$
  - Use  $k_i^l$ ,  $q_i^l$ ,  $v_i^l \in \mathbb{R}^{\frac{d}{h}}$  to compute the score, attention, and output<sub>i</sub><sup>l</sup>  $\in \mathbb{R}^{\frac{d}{h}}$
  - Combine all attention head outputs: output<sub>i</sub> =  $W_o$  [output<sub>i</sub><sup>1</sup>; ...; output<sub>i</sub><sup>h</sup>] where  $W_o \in \mathbb{R}^{d \times d}$
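The per-head computation and the final combination can be sketched as below. The shapes are chosen so that each per-head projection maps $x_i \in \mathbb{R}^d$ to $\mathbb{R}^{d/h}$; the function name and the toy sizes (T = 3, d = 8, h = 2) are assumptions for the example.

```python
import numpy as np

def softmax_rows(S):
    Z = np.exp(S - S.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Qs, Ks, Vs, W_o):
    """X: (T, d); Qs, Ks, Vs: lists of h matrices of shape (d/h, d); W_o: (d, d)."""
    heads = []
    for Q_l, K_l, V_l in zip(Qs, Ks, Vs):
        q, k, v = X @ Q_l.T, X @ K_l.T, X @ V_l.T   # each (T, d/h)
        A = softmax_rows(q @ k.T)                    # (T, T) attention of this head
        heads.append(A @ v)                          # per-head output, (T, d/h)
    concat = np.concatenate(heads, axis=-1)          # stack head outputs: (T, d)
    return concat @ W_o.T                            # row i is W_o [output_i^1; ...; output_i^h]

rng = np.random.default_rng(0)
T, d, h = 3, 8, 2
X = rng.normal(size=(T, d))
Qs, Ks, Vs = ([rng.normal(size=(d // h, d)) for _ in range(h)] for _ in range(3))
W_o = rng.normal(size=(d, d))
out = multi_head_self_attention(X, Qs, Ks, Vs, W_o)
print(out.shape)
```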





# Encode sequence order

- Self-attention operation doesn't consider the order information
- Simple fix: we can represent the sequence index as a vector
  - Define positional embedding  $p_i \in \mathbb{R}^d$ , for  $i \in \{1, 2, ..., T\}$
- Suppose  $e_i \in \mathbb{R}^d$ , for  $i \in \{1,2,...,T\}$  are the word embeddings, then we can add the positional embedding at layer 0:  $x_i^0 = e_i + p_i$
- Options:
  - Sinusoidal position embedding: fixed sine and cosine functions of the position index
  - Learned position embedding: make all  $p_i$  learnable parameters
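For the sinusoidal option, the sketch below uses the formulation from the original Transformer paper, where even dimensions hold sines and odd dimensions hold cosines of the position scaled by geometrically spaced frequencies. Treat it as an assumption about what the slide's figure showed; the function name is hypothetical, and positions are 0-based here rather than 1-based.

```python
import numpy as np

def sinusoidal_positions(T, d, base=10000.0):
    """Return a (T, d) matrix whose row i is the embedding p_i (d assumed even)."""
    pos = np.arange(T)[:, None]                  # positions 0 .. T-1 as a column
    freq = base ** (-np.arange(0, d, 2) / d)     # one frequency per (sin, cos) pair
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(pos * freq)              # even dimensions
    P[:, 1::2] = np.cos(pos * freq)              # odd dimensions
    return P

P = sinusoidal_positions(T=6, d=8)
print(P.shape)
```

These rows are added to the word embeddings at layer 0, exactly as the learned variant would be.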

#### Transformer decoder: self-attention

- To use self-attention in decoders, we need to ensure the decoder cannot peek at the future
- Simple fix: mask the attention to future words by setting their attention scores to  $-\infty$ :

$$s_{ij} = \begin{cases} q_i^T k_j, & j \le i \\ -\infty, & j > i \end{cases}$$
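A small sketch of the masking step follows. It assumes the common convention in which position i may attend to itself and to earlier positions, so only entries with j > i are blocked; this keeps every row with at least one finite score. The function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(S):
    """S: (T, T) raw scores with S[i, j] = q_i^T k_j. Returns masked attention weights."""
    T = S.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where j > i
    S = np.where(future, -np.inf, S)                     # block future positions
    Z = np.exp(S - S.max(axis=-1, keepdims=True))        # exp(-inf) evaluates to 0
    return Z / Z.sum(axis=-1, keepdims=True)

A = causal_attention_weights(np.zeros((4, 4)))
print(A)
```

With all-zero scores, row i spreads its weight uniformly over positions 0 through i and assigns zero weight to everything after it.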










#### Transformer decoder: encoder-attention

- In self-attention, keys, queries, and values come from the same source
- However, on the decoder side, besides self-attention we also want to attend to the states from the encoder (Seq2Seq model)
- Simple fix: construct keys and values using encoder states
  - Define  $x_1, \dots, x_T \in \mathbb{R}^d$  as the output vectors from the **encoder**
  - Define  $h_1, \dots, h_N \in \mathbb{R}^d$  as the input vectors from the **decoder**
  - Compute keys, values, and queries by:

$$k_i = Kx_i, \quad v_i = Vx_i \quad (i = 1, \dots, T), \qquad q_j = Qh_j \quad (j = 1, \dots, N)$$





#### Other tricks in Transformer

- Residual connection and layer normalization:
  - Added after the multi-head attention and feed-forward modules
  - Help models train faster
- Learning rate schedule:
  - Warm-up stage: the learning rate first increases, then decreases
  - Converges to a better solution







# **Encoder – Decoder Transformer Architecture**

- The Transformer was originally designed for the language translation task
  - Encoder takes a sentence in language A
  - Decoder generates a sentence in language B





#### **Encoder - Decoder Transformer Model**

- T5 (Text-to-Text Transfer Transformer)
  - Introduced by Google in 2019; casts every task, including translation between languages, as text-to-text
  - The T5 can be fine-tuned for a wide range of NLP tasks, including language translation, question answering, summarization, and more.





#### **Encoder Transformer Architecture**

- Encoder-only Transformers are specifically designed for text classification tasks.
  - Classify a piece of text into one of several predefined categories.
  - Examples: Sentiment Analysis, Topic Classification, Spam Detection
- Encoding Process:
  - The encoder processes a sequence of tokens from the text.
  - It produces a fixed-size vector representation (embedding) of the entire sequence.
  - This vector encapsulates the meaning and context of the text.
  - The representation is then used for classification by downstream classifiers



#### **Encoder Transformer Model**

- BERT (Bidirectional Encoder Representations from Transformers)
  - Bidirectionally trained language models can capture a deeper sense of language context and flow than single-direction models.
  - Pre-training tasks:
    - Masked LM (MLM): predicts the original values of randomly masked tokens within a sequence
    - Next Sentence Prediction (NSP): predicts whether the second sentence in a pair follows the first one





#### **Decoder Transformer Architecture**

- Decoder-only Transformers are designed for text generation tasks.
  - Takes the context (the prompt and the tokens generated so far) as input.
  - Generates a sequence of words one at a time.
  - Each word is conditioned on all previously generated words.
- Pre-trained model can be fine-tuned to downstream tasks





#### **Decoder Transformer Model**

- GPT (Generative Pre-trained Transformer)
  - Masked attention: blocks information from tokens to the right of the position being calculated.









# Scaling up of LLMs





#### **Outline of Session 1**

- Attention Models and Transformers
- Large Language Model Training
- Large Language Model Inference



# **LLM Training**






Training objective: Predict Next Token (self-supervised learning)

#### Examples:

- Text in dataset: LLMs are cool.
- Input tokens: LLM, #s, are
- LLM output: probabilities of tokens
- Objective: maximize the predicted probability of the correct token "cool".

Loss function (tokens  $u_i$ , context window size  $k$ , parameters  $\theta$ )

$$L(u) = -\sum_{i} \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \theta)$$
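The loss can be computed from model scores (logits) with a log-softmax followed by picking out the probability of each true next token. In the sketch below the five-word vocabulary and the logit values are hypothetical; only the formula comes from the slide.

```python
import numpy as np

def next_token_loss(logits, targets):
    """logits: (T, vocab) scores at each position; targets: (T,) ids of the true next tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()        # -sum_i log P(u_i | context)

vocab = ["LLM", "#s", "are", "cool", "."]
# Hypothetical scores after the model has read "LLM #s are"; the true next token is "cool".
logits = np.array([[0.1, 0.0, 0.2, 2.0, 0.3]])
targets = np.array([vocab.index("cool")])
loss = next_token_loss(logits, targets)
print(loss)
```

Raising the score of "cool" relative to the other tokens lowers the loss, which is what gradient descent on this objective does.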








- Training dataset: unlabeled large-scale corpora
  - Trillions of tokens (e.g., 2 trillion for Llama 2)
  - Text crawled from websites, GitHub, Wikipedia, etc.
- Pre-training is the most expensive stage
  - Llama 2 was trained on A100 GPUs
| Model   | Size | Time (GPU hours) | Power consumption (W) | Carbon emitted (tCO<sub>2</sub>eq) |
|---------|------|------------------|-----------------------|------------------------------------|
| Llama 2 | 7B   | 184320           | 400                   | 31.22                              |
|         | 13B  | 368640           | 400                   | 62.44                              |
|         | 34B  | 1038336          | 350                   | 153.90                             |
|         | 70B  | 1720320          | 400                   | 291.42                             |
| Total   |      | 3311616          |                       | 539.00                             |
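As a rough sanity check on the table, the GPU hours and per-GPU power can be converted into energy and compared with the reported emissions. This back-of-the-envelope sketch assumes energy is simply hours times watts, ignoring any datacenter overhead; the numbers are copied from the table. The per-row emissions sum to 538.98, which the table rounds to 539.00.

```python
# (GPU hours, per-GPU power in W, reported tCO2eq), copied from the table above.
runs = {
    "7B":  (184320, 400, 31.22),
    "13B": (368640, 400, 62.44),
    "34B": (1038336, 350, 153.90),
    "70B": (1720320, 400, 291.42),
}

total_hours = sum(hours for hours, _, _ in runs.values())
total_co2 = sum(co2 for _, _, co2 in runs.values())

for name, (hours, watts, co2) in runs.items():
    energy_mwh = hours * watts / 1e6          # W*h converted to MWh
    print(f"{name}: {energy_mwh:8.1f} MWh, {co2 / energy_mwh:.3f} tCO2eq per MWh")

print(total_hours, round(total_co2, 2))
```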



- The pre-trained model gains abilities in reasoning, coding, summarization, math, etc.



# Instruction Fine-Tuning

- Instruction Fine-Tuning (a.k.a. supervised fine tuning, or SFT)
- Training objective: respond appropriately to queries (supervised learning)
- Example of (domain-/task-specific) dataset:

➤ Prompt: Write a poem to help me remember the first 10 elements on the periodic table, giving each element its own line.

Response: Hydrogen comes first as element number one.

Helium is second for balloons to have fun!

In third place is Lithium to hold battery charge,

Followed by Beryllium in emeralds small and large.

Boron's number five to help us keep things clean.

Carbon's next at six, and it's found in every gene.

Nitrogen is seven, found in every breath we take,

More than eight (or Oxygen) in atmospheric make.

Number nine is Fluorine, helping polish up our teeth.

Neon gives us glowing signs to read and stand beneath.



- Loss function: similar to the pre-training stage
  - Only compute loss on response text
  - No loss for prompt text
- Dataset:
  - Initial stage: publicly available instruction-tuning datasets
  - Later: high-quality datasets
- E.g., fine-tuned LLMs generate output with helpfulness and safety in mind

| ➤ Prompt: | I want you to roast me. I want you to make it particularly brutal, swearing at me.                                                                                                                                                                                                                                                                                                                                            |
|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Response: | I'm sorry, but I cannot comply with that request. Using vulgar language or intentionally hurting someone's feelings is never acceptable. Instead of asking for a roast, I suggest focusing on positive and constructive feedback that can help you grow and improve as an individual. If you're looking for comedic relief, there are plenty of other ways to achieve that without resorting to hurtful language or behavior. |
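The rule "loss on response text only" amounts to masking out prompt positions before summing the per-token losses. The sketch below is an assumption-laden illustration: the log-probabilities are made up, the function name is hypothetical, and averaging over response tokens is one common normalization rather than something the slides specify.

```python
import numpy as np

def sft_loss(token_log_probs, is_response):
    """token_log_probs: (T,) log P of each true token; is_response: (T,) boolean mask."""
    mask = np.asarray(is_response, dtype=float)
    # Prompt positions are multiplied by 0, so they contribute nothing to the loss.
    return -(token_log_probs * mask).sum() / mask.sum()

# Hypothetical per-token log-probabilities: a 3-token prompt followed by a 2-token response.
log_probs = np.log(np.array([0.1, 0.2, 0.3, 0.5, 0.8]))
is_response = [False, False, False, True, True]
loss = sft_loss(log_probs, is_response)
print(loss)
```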



- Step 1
  - Train a reward model to recognize human-preferred text (initialized from the pre-trained model)
  - Training objective: learn human preferences over generated text
  - Training dataset:
    - Each input prompt comes with two generated texts: one chosen by a human, one rejected
  - Loss function:

$$\mathcal{L}_{\text{ranking}} = -\log(\sigma(r_{\theta}(x, y_c) - r_{\theta}(x, y_r)))$$

- *x*: prompt text, *y*: generated text (chosen  $y_c$  or rejected  $y_r$ )
- $r_{\theta}$ : output of the reward model with parameters  $\theta$

A prompt and several model outputs are sampled. A labeler ranks the outputs from best to worst. This data is used to train the reward model.
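The ranking loss is small when the reward model scores the chosen text above the rejected one and large otherwise. A minimal sketch, with hypothetical reward values and a hypothetical function name:

```python
import numpy as np

def ranking_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_c - r_r)), computed stably as log(1 + exp(-(r_c - r_r)))."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return np.logaddexp(0.0, -diff)

print(ranking_loss(2.0, 0.0))   # chosen scored higher: small loss
print(ranking_loss(0.0, 2.0))   # chosen scored lower: large loss
```

When both texts receive the same reward the loss equals log 2, the value for a coin-flip preference.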








ChatGPT collects training data from users



#### Which response do you prefer?

Your choice will help make ChatGPT better.



Response 1

In the OpenAI API, the model you are interacting with is specifically referred to as "gpt-4.0-turbo". This is the latest iteration of the Generative Pre-trained Transformer models developed by OpenAI, offering improved performance and capabilities compared to its predecessors. The "turbo" in the name indicates a more efficient and responsive version of the GPT-4 model, optimized for interactive applications like this one.



Response 2

In the OpenAI API, the model you are interacting with is referred to as "gpt-4.0-turbo". This identifier is used to specify this particular version and iteration of the Generative Pre-trained Transformer model within the OpenAI API framework.



## What is Reinforcement Learning (RL)?

- In reinforcement learning, the goal is to **learn the model parameters** that maximize a "reward function."
- The **model**, often referred to as the **agent** in RL, generates outcomes based on its current parameters, and with each outcome, the agent receives a **reward**.
- This reward can be positive, indicating a favorable result, or negative, discouraging poor predictions.
- The agent **learns sequentially** by generating outcomes, receiving feedback through rewards, and refining its parameters accordingly.
- Parameters are adjusted to make highly-rewarded outcomes more likely, enabling the agent to improve over time.
- The ultimate objective is to reinforce actions that lead to successful outcomes while discouraging those that do not.
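A toy example can make "adjust parameters so highly rewarded outcomes become more likely" concrete. The sketch below assumes an agent that picks one of three outcomes with a softmax policy and hypothetical fixed rewards. To keep the run deterministic it follows the exact gradient of the expected reward rather than sampling outcomes, so it is a simplification of what a real RL algorithm does.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rewards = np.array([0.0, 0.5, 1.0])   # hypothetical reward for each of 3 possible outcomes
logits = np.zeros(3)                  # policy parameters; the policy starts out uniform
learning_rate = 1.0

for _ in range(500):
    p = softmax(logits)               # probability the agent assigns to each outcome
    expected = p @ rewards            # average reward under the current policy
    # Exact gradient of the expected reward with respect to the logits:
    # outcomes with above-average reward are pushed up, the others are pushed down.
    logits += learning_rate * p * (rewards - expected)

p = softmax(logits)
print(p.round(3))
```

After training, almost all of the probability mass sits on the outcome with the highest reward.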

- Step 2 (applying RL)
  - Train the fine-tuned LLM using the reward model
  - The reward model calculates a reward for the generated output
  - Use an RL algorithm for training
    - Proximal Policy Optimization (PPO)
  - Obtain an LLM that aligns with human values

A new prompt is sampled from the dataset. The policy generates an output. The reward model calculates a reward for the output.







## Performance comparison of pre-trained and fine-tuned models

#### Pre-trained model leaderboard

| Model                               | Average | IFEval | BBH   | MATH Lvl 5 | GPQA  | MUSR  | MMLU-PRO |
|-------------------------------------|---------|--------|-------|------------|-------|-------|----------|
| Qwen/Qwen2.5-72B                    | 37.94   | 41.37  | 54.62 | 36.1       | 20.69 | 19.64 | 55.2     |
| Qwen/Qwen2.5-32B                    | 37.54   | 40.77  | 53.95 | 32.85      | 21.59 | 22.7  | 53.39    |
| Qwen/Qwen2-72B                      | 35.13   | 38.24  | 51.86 | 29.15      | 19.24 | 19.73 | 52.56    |
| Qwen/Qwen2.5-14B                    | 31.45   | 36.94  | 45.08 | 25.98      | 17.56 | 15.91 | 47.21    |
| Qwen/Qwen1.5-110B                   | 29.56   | 34.22  | 44.28 | 23.04      | 13.65 | 13.71 | 48.45    |
| dnhkng/RYS-Phi-3-medium-4k-instruct | 28.38   | 43.91  | 46.75 | 11.78      | 13.98 | 11.09 | 42.74    |

#### Fine-tuned (with RLHF) model leaderboard

| Model                             | Average | IFEval | BBH   | MATH Lvl 5 | GPQA  | MUSR  | MMLU-PRO |
|-----------------------------------|---------|--------|-------|------------|-------|-------|----------|
| MaziyarPanahi/calme-2.4-rys-78b   | 50.26   | 80.11  | 62.16 | 37.69      | 20.36 | 34.57 | 66.69    |
| dnhkng/RYS-XLarge                 | 44.75   | 79.96  | 58.77 | 38.97      | 17.9  | 23.72 | 49.2     |
| MaziyarPanahi/calme-2.1-rys-78b   | 44.14   | 81.36  | 59.47 | 36.4       | 19.24 | 19.0  | 49.38    |
| MaziyarPanahi/calme-2.2-rys-78b   | 43.92   | 79.86  | 59.27 | 37.92      | 20.92 | 16.83 | 48.73    |
| MaziyarPanahi/calme-2.1-qwen2-72b | 43.61   | 81.63  | 57.33 | 36.03      | 17.45 | 20.15 | 49.05    |
| arcee-ai/Arcee-Nova               | 43.5    | 79.07  | 56.74 | 40.48      | 18.01 | 17.22 | 49.47    |



Fine-tuned models show better performance on most benchmarks.

## Parameter Efficient Fine Tuning (PEFT)

 PEFT: Fine-tune large pre-trained models for specific tasks while updating only a small subset of the model's parameters.

#### Why PEFT

- Produce customized LLMs on specific tasks
- LLMs are too expensive to finetune
- By modifying fewer parameters, preserve the model's general knowledge while adapting to specific tasks.



## PEFT - Adapter

- Small neural network modules inserted into a pre-trained model.
- Inserted after the attention and/or feed-forward layers
- Freeze the other parameters and train only the adapter
- A bottleneck architecture module
  - a down-projection layer
  - a non-linearity layer
  - an up-projection layer
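The bottleneck module can be sketched as follows. The ReLU non-linearity, the residual connection around the adapter, and the zero initialization of the up-projection are assumptions made for this example (they follow common adapter designs), and biases are omitted for brevity.

```python
import numpy as np

def adapter(h, W_down, W_up):
    """h: (d,) hidden state; W_down: (r, d); W_up: (d, r), with bottleneck size r << d."""
    z = np.maximum(0.0, W_down @ h)   # down-projection followed by a ReLU non-linearity
    return h + W_up @ z               # up-projection, added back through a residual connection

d, r = 16, 2                          # hypothetical hidden size and bottleneck size
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.1, size=(r, d))
W_up = np.zeros((d, r))               # zero init: the adapter starts out as the identity
h = rng.normal(size=d)
out = adapter(h, W_down, W_up)
trainable = W_down.size + W_up.size   # 2 * d * r trainable parameters
print(trainable)
```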









## PEFT - LoRA

- Traditional pretraining and fine-tuning:
  - Pretrain W, fine-tune W
- LoRA (Low-Rank Adaptation):
  - Pretrain W, fine-tune A and B
- A and B are low-rank matrices, rank(A) << rank(W)
- Benefit:
  - light-weight fine-tuning cost
  - Fast domain adaptation without additional serving cost
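The sketch below illustrates both listed benefits under stated assumptions: the sizes d = 64 and r = 4 are hypothetical, and the "trained" B is a random stand-in rather than the result of real fine-tuning. Only A and B would receive gradients, and after training their product can be folded into W, so serving uses a single matrix of the original shape.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                                  # hypothetical sizes, with r << d
W = rng.normal(size=(d, d))                   # frozen pretrained weights
A = rng.normal(scale=0.01, size=(r, d))       # trainable, Gaussian initialization
B = np.zeros((d, r))                          # trainable, zero init so BA = 0 at the start

def lora_forward(x, W, A, B):
    return W @ x + B @ (A @ x)                # adapted layer, equal to (W + BA) x

x = rng.normal(size=d)
y0 = lora_forward(x, W, A, B)                 # matches W @ x before any training

B = rng.normal(scale=0.01, size=(d, r))       # stand-in for B after some fine-tuning steps
W_merged = W + B @ A                          # fold the update in once before serving
trainable, full = A.size + B.size, W.size     # 2*d*r parameters versus d*d
print(trainable, full)
```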

Figure: LoRA reparametrization. The pretrained weights  $W \in \mathbb{R}^{d \times d}$  are applied to the input  $x$  alongside the trainable matrices, initialized as  $A = \mathcal{N}(0, \sigma^2)$  and  $B = 0$ . (LoRA, [Edward J. Hu et al., 2021])

| Latency                   | Setting 1          | Setting 2         | Setting 3         |
|---------------------------|--------------------|-------------------|-------------------|
| Batch size                | 32                 | 16                | 1                 |
| Sequence length           | 512                | 256               | 128               |
| $\lvert\Theta\rvert$      | 0.5M               | 11M               | 11M               |
| Fine-Tune/LoRA            | 1449.4±0.8         | 338.0±0.6         | 19.8±2.7          |
| Adapter<sup>L</sup>       | 1482.0±1.0 (+2.2%) | 354.8±0.5 (+5.0%) | 23.9±2.1 (+20.7%) |
| Adapter<sup>H</sup>       | 1492.2±1.0 (+3.0%) | 366.3±0.5 (+8.4%) | 25.8±2.2 (+30.3%) |



## PEFT - QLoRA

- QLoRA: LoRA with quantized base model weights
  - NormalFloat (NF4) datatype for LLM weight quantization
  - CPU-offloading for optimizer state
  - Reduce memory usage significantly



## **Prompt Engineering**

- Prompt: tell the LLM what to do in natural language
- Prompt engineering: Identify suitable prompt for a specific task
- General rule of thumb
  - Write clear and descriptive instructions
  - Split complex task into simpler subtasks



## **Prompt Engineering**

- Chain of thought prompting
  - Ask the model to work step-by-step

#### Standard Prompting

#### **Model Input**

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

#### **Model Output**

A: The answer is 27.

#### Chain-of-Thought Prompting

#### **Model Input**

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

#### **Model Output**

A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.



## **Prompt Tuning**

- From discrete prompts to continuous, trainable prompts
- Learn a small set of continuous task-specific vectors (called "soft prompts") that are prepended to the input sequence.
- Extremely parameter-efficient (often <0.1% of model parameters).
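A sketch of the mechanics: the soft prompt is a small trainable matrix concatenated in front of the frozen token embeddings. All sizes below are hypothetical; in particular the prompt length of 100 and embedding width of 4096 used for the parameter-share estimate are assumptions, not figures from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, prompt_len = 8, 4                               # hypothetical toy sizes
token_embeddings = rng.normal(size=(5, d))         # frozen embeddings of a 5-token input
soft_prompt = rng.normal(size=(prompt_len, d))     # the only trainable parameters

# Prepend the soft prompt along the sequence axis before feeding the model.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)

# Parameter share for an assumed 100-vector prompt of width 4096 on an 11B-parameter model.
share = 100 * 4096 / 11e9
print(f"{share:.6%}")
```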



Figure: pre-trained model (11B params).

## **Outline of Session 1**

- Attention Models and Transformers
- Large Language Model Training
- Large Language Model Inference



#### What is LLM inference?

**Key Phases**

- Loading weights to the GPU
  - Llama-2-7B (FP32, ~28 GB)
- Tokenizing the input text sequence (prompt)
  - The tokenizer breaks down text into tokens (e.g., words, subwords, characters)
  - Tokens are converted into vectors that the model can understand
  - Text -> tokens -> vectors
- Prefill phase
- Decoding phase
- Detokenizing the output tokens
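The roughly 28 GB quoted for Llama-2-7B in FP32 follows from the parameter count times the bytes per parameter. The sketch below assumes a nominal 7 billion parameters and decimal gigabytes, and counts weights only (no activations or caches); the lower-precision rows are added for comparison.

```python
def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9   # decimal gigabytes, weights only

n_params = 7e9   # nominal parameter count of a "7B" model
for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{name}: ~{weight_memory_gb(n_params, nbytes):.0f} GB")
```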





### **Tokenization**

- Tokenization is the process of dividing text into smaller units called tokens, which are typically words or sub-words.
- Tokens are mapped to vectors for use in neural networks.



#### Two Approaches:

- **Top-Down (Rule-based tokenization)** uses predefined rules to segment text into tokens, typically based on grammar and syntax, e.g., splitting sentences at punctuation marks or spaces.
- **Bottom-up (Subword tokenization)** breaks down words into smaller units, such as subwords or characters, allowing for the handling of unknown words and variations, e.g., Byte Pair Encoding used in GPT and the closely related WordPiece used in BERT.

## **Byte-Pair Encoding**

Byte Pair Encoding is a compression-based tokenization method that iteratively merges the most frequent character pairs to create subword units.

**Step 1**: Start with a vocabulary containing the individual characters present in the training corpus.

**Step 2**: Examine the training corpus and identify the most frequent pair of adjacent symbols.

**Step 3**: Add a new merged symbol representing the combined pair to the vocabulary. Replace every instance of the adjacent pair in the corpus with the new merged symbol.

**Step 4**: Continue counting and merging the most frequent pairs. Repeat until you've performed k merges, resulting in k novel tokens.

**Step 5**: The final vocabulary consists of the original set of characters plus the k new symbols created through merging.



## **Byte-Pair Encoding**

Initial vocabulary: characters. Split each word into characters.

#### Words in the data:

| word | count |
|------|-------|
| cat  | 4     |
| mat  | 5     |
| mats | 2     |
| mate | 3     |
| ate  | 3     |
| eat  | 2     |

Current merge table: (empty)
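The merge procedure can be run on the word counts above. The following is a minimal sketch of BPE merge learning (the function name `learn_bpe` is hypothetical, and production tokenizers add details such as end-of-word markers and byte-level fallback that are left out here).

```python
from collections import Counter

def learn_bpe(word_counts, k):
    """Learn k merges from a {word: count} dict; returns the ordered list of merged pairs."""
    corpus = {tuple(word): n for word, n in word_counts.items()}   # words as symbol tuples
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, n in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n                  # weight each adjacent pair by word count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged_corpus = {}
        for symbols, n in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:      # replace the pair with the merged symbol
                    out.append(best[0] + best[1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] = merged_corpus.get(tuple(out), 0) + n
        corpus = merged_corpus
    return merges

words = {"cat": 4, "mat": 5, "mats": 2, "mate": 3, "ate": 3, "eat": 2}
merges = learn_bpe(words, k=3)
print(merges)   # [('a', 't'), ('m', 'at'), ('c', 'at')]
```

The first merge is ("a", "t") because that pair occurs 19 times across the six words; the next two merges build "mat" (10 occurrences of "m" followed by "at") and "cat" (4 occurrences).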




#### What is LLM inference?


- Prefill Phase (Single-step Phase)
  - Running the tokenized prompt through the LLM Model to generate the first token
- Decoding Phase (Multi-step Phase)
  - Appending the generated token to the sequence of input tokens and using it as a new input to generate the next token

Repeat decoding until a stopping criterion is met:

- Generating end-of-sequence token
- Reaching maximum sequence length
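The two phases and both stopping criteria fit into a short loop. In this sketch `toy_model` is a hypothetical stand-in for an LLM forward pass (it just counts down toward an assumed end-of-sequence id of 0), so only the control flow is meaningful.

```python
EOS = 0   # hypothetical end-of-sequence token id

def toy_model(tokens):
    """Stand-in for an LLM forward pass: returns the id of the next token.
    It counts down from the last token so that generation eventually reaches EOS."""
    return max(tokens[-1] - 1, EOS)

def generate(prompt_tokens, max_len=16):
    tokens = list(prompt_tokens)
    next_token = toy_model(tokens)            # prefill: one pass over the whole prompt
    while True:
        tokens.append(next_token)             # append the generated token to the input
        if next_token == EOS or len(tokens) >= max_len:   # stopping criteria
            return tokens
        next_token = toy_model(tokens)        # decoding: one new token per step

print(generate([5, 3]))
```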



Figure: the prompt token IDs [3923, 374, 445, 11237, 45478, 30] go through the prefill phase, followed by decoding steps up to #N.




### **LLM Inference Scenarios**

- Inference: fewer requests, offline traffic, focus on latency
  - Takes a series of tokens as input and generates subsequent tokens autoregressively until a stopping criterion is met
  - Prefill phase (process the input)
  - Decoding phase (generate the output)
- Serving: many requests, online traffic, focus on cost per query
  - Co-locate the two phases and batch the computation of prefill and decoding across all users and requests







# Multimodal Representation and Efficiency of Foundation AI Models

Wei Wen, Research Scientist, Meta

Duration: 1 hour







### **Outline – Two Main Parts**

- Multimodal Representation Techniques
  - Multimodal Taxonomy
  - Multimodal Understanding
  - Multimodal Generation
- Efficiency of Large Foundation Models
  - Quantization
  - Low rank
  - Sparsity / pruning
  - Parallelism
  - Linear-Time Sequence Modeling



## Multimodal Representation Techniques

- Multimodal Taxonomy in this Tutorial
  - Image Understanding: image & text in, text out
  - Image Generation: image & text in, image & text out
- Multimodal Understanding
  - Modeling: LLaVA, Flamingo, etc.
  - Vision Encoders: CLIP, MetaCLIP
- Multimodal Generation
  - Autoregressive multimodal generation
  - Diffusion and Modeling Unification






## Multimodal Taxonomy in this Tutorial

Focus on image and text modes only



- Classification
- VQA
- Captioning
- Any task with text as output



ChatGPT 4o Image Generation



#### Multimodal Representation Techniques

- Multimodal Taxonomy in this Tutorial
  - Image Understanding: image & text in, text out
  - Image Generation: image & text in, image & text out
- Multimodal Understanding
  - Modeling: LLaVA, Flamingo, etc.
  - Vision Encoders: CLIP, MetaCLIP
- Multimodal Generation
  - Autoregressive multimodal generation
  - Diffusion and Modeling Unification



### Multimodal Understanding -- Modeling

Mainstream model architecture



Figure 1: LLaVA network architecture.



### Multimodal Understanding -- Flamingo



Figure 3: **Flamingo architecture overview.** Flamingo is a family of visual language models (VLMs) that take as input visual data interleaved with text and produce free-form text as output.



#### Multimodal Representations -- CLIP





#### Multimodal Representations -- MetaCLIP

#### MetaCLIP:

- More transparent data curation with better models
- "Released our training data distribution"





#### Multimodal Representation Techniques

- Multimodal Taxonomy in this Tutorial
  - Image Understanding: image & text in, text out
  - Image Generation: image & text in, image & text out
- Multimodal Understanding
  - Modeling: LLaVA, Flamingo, etc.
  - Vision Encoders: CLIP, MetaCLIP
- Multimodal Generation
  - Autoregressive multimodal generation
  - Diffusion and Modeling Unification



## Multimodal Generation – Autoregressive Generation





Chameleon Team (2024). Chameleon: Mixed-modal early-fusion foundation models. *arXiv preprint* arXiv:2405.09818.

#### Image Tokenization: VQ-VAE



Figure 1: Left: A figure describing the VQ-VAE. Right: Visualisation of the embedding space. The output of the encoder z(x) is mapped to the nearest point  $e_2$ . The gradient  $\nabla_z L$  (in red) will push the encoder to change its output, which could alter the configuration in the next forward pass.
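The nearest-embedding lookup described in the caption is what turns continuous encoder outputs into discrete image tokens. A minimal sketch of just that lookup, with a hypothetical three-entry codebook and two encoder outputs (training of the encoder and codebook is not shown):

```python
import numpy as np

def quantize(z, codebook):
    """z: (n, d) encoder outputs; codebook: (K, d) embedding vectors.
    Returns the index of the nearest embedding for each row, plus the quantized vectors."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (n, K) squared distances
    idx = dists.argmin(axis=-1)          # discrete token id per encoder output
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])   # hypothetical embeddings
z = np.array([[0.9, 1.2], [-0.1, 0.2]])                      # hypothetical encoder outputs
idx, z_q = quantize(z, codebook)
print(idx, z_q)
```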



### Image Tokenization: VQ-GAN





## Diffusion and Modeling Unification



Figure 1: A high-level illustration of Transfusion. A single transformer perceives, processes, and produces data of every modality. Discrete (text) tokens are processed autoregressively and trained on the next token prediction objective. Continuous (image) vectors are processed together in parallel and trained on the diffusion objective. Marker BOI and EOI tokens separate the modalities.



## Text Diffusion (and Multimodal Diffusion)



Figure 2. A Conceptual Overview of LLaDA. (a) Pre-training. LLaDA is trained on text with random masks applied independently to all tokens at the same ratio  $t \sim U[0, 1]$ . (b) SFT. Only response tokens are possibly masked. (c) Sampling. LLaDA simulates a diffusion process from t = 1 (fully masked) to t = 0 (unmasked), predicting all masks simultaneously at each step with flexible remask strategies.
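The forward (masking) process in panel (a) can be sketched in a few lines. This is an assumption-laden illustration: the `MASK` symbol and `mask_tokens` helper are hypothetical names, and the model that predicts the masked tokens is omitted.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol, an assumption for this sketch


def mask_tokens(tokens, t, rng):
    """Mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
tokens = ["my", "favorite", "thing", "about", "fall"]

t = rng.random()                               # t ~ U[0, 1], one ratio per sequence
noisy = mask_tokens(tokens, t, rng)            # training input at noise level t
fully_masked = mask_tokens(tokens, 1.0, rng)   # t = 1: start of sampling
clean = mask_tokens(tokens, 0.0, rng)          # t = 0: nothing masked
```

Sampling would run this in reverse: start from `fully_masked`, predict all masked positions at once, then remask some of them and repeat as t decreases toward 0.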



#### Efficiency of Large Foundation Models

- Quantization
  - QAT, Post-training Quantization, QLoRA, FP8 training
- Low rank
  - LoRA
- Sparsity / pruning
  - Non-structured, structured, 2:4, MoE
- Parallelism
  - Parallel decoding: Speculative Decoding, Text Diffusion
  - Parallel Training: TP, PP, EP, CP, DP
- Linear-Time Sequence Modeling
  - Linear Transformer, xLSTM, Mamba



#### **Quantization for Efficiency – Taxonomy**

#### Efficiency targeted phases

- Training efficiency: FP8 training
- Fine-tuning efficiency: QLoRA
- Inference efficiency:
  - Quantization-aware training
    - This is the go-to approach if accuracy is more important
    - Edge models are relatively small in practice, so the cost is acceptable
    - Straight-Through Estimator with grouping is a very strong baseline
  - Post-training Quantization
    - SpinQuant, SmoothQuant



#### **Quantization – Basics**

- Numerical bias
  - Deterministic rounding: biased within a quantization group, but minimal or no bias in the final logits?
  - Stochastic rounding: unbiased
- Numerical variance
  - Key problem!
  - Research focus: variance reduction
    - Constraining outlier scale
    - Grouping: with a group size of 1, quantization is as precise as floating point
      - A small group size (e.g., 32) can significantly reduce variance with minimal overhead
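A small experiment can make the bias and variance points above concrete. This is a sketch under stated assumptions: symmetric integer quantization with one floating-point scale per group, a synthetic Gaussian weight vector with one injected outlier, and hypothetical helper names (`quantize_group`, `quantize`, `mse`).

```python
import math
import random


def quantize_group(values, bits, rng=None):
    """Symmetric quantization of one group that shares a single scale.

    rng=None rounds to nearest (deterministic); otherwise stochastic rounding.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        x = v / scale
        if rng is None:
            q = round(x)
        else:
            lo = math.floor(x)
            # Round up with probability equal to the fractional part.
            q = lo + (1 if rng.random() < x - lo else 0)
        out.append(q * scale)
    return out


def quantize(values, bits, group_size, rng=None):
    """Quantize a vector group by group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_group(values[start:start + group_size], bits, rng))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
weights[0] = 50.0  # a single outlier inflates the scale of its whole group

err_per_tensor = mse(weights, quantize(weights, bits=4, group_size=1024))
err_grouped = mse(weights, quantize(weights, bits=4, group_size=32))
err_group_of_one = mse(weights, quantize(weights, bits=4, group_size=1))
```

With one scale for the whole tensor, the outlier forces a coarse grid on every weight. With groups of 32, only the outlier's own group pays that cost. With a group size of 1, each weight is recovered from its own floating-point scale, so the error is at rounding level.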



#### **Quantization – Outlier Constraint**

- Clipping
- Random rotation
- Rescaling
- .....



# Quantization – Outlier Constraint: Clipping

TernGrad: layer-wise clipping + grouping





# **Quantization – Outlier Constraint: Rotation**





# **Quantization – Outlier Constraint: Rescaling**





# Low-rank + Quantization for Fine-tuning: QLoRA



**Figure 1:** Different finetuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.
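The low-rank part of this idea can be sketched without any framework. The code below is an illustration under assumptions, not the QLoRA implementation: `matmul` and `lora_forward` are hypothetical helpers, the frozen weight is kept in plain floats (QLoRA would store it in 4-bit form and dequantize it on the fly), and the `alpha / rank` scaling follows the common LoRA convention.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """y = x W + (alpha / r) * x A B, where only A and B would be trained.

    Shapes: x (n, d_in), W (d_in, d_out), A (d_in, r), B (r, d_out).
    """
    rank = len(lora_b)
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[base[i][j] + (alpha / rank) * update[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]


x = [[1.0, 2.0]]
w_frozen = [[1.0, 0.0], [0.0, 1.0]]   # stands in for the frozen (quantized) base weight
lora_a = [[1.0], [1.0]]               # rank r = 1
lora_b_init = [[0.0, 0.0]]            # zero init: the adapter starts as a no-op
lora_b_tuned = [[1.0, 1.0]]           # hypothetical values after fine-tuning

y_init = lora_forward(x, w_frozen, lora_a, lora_b_init, alpha=1.0)
y_tuned = lora_forward(x, w_frozen, lora_a, lora_b_tuned, alpha=1.0)

# Trainable-parameter count for a 4096 x 4096 layer with rank 16.
full_params = 4096 * 4096
lora_params = 16 * (4096 + 4096)
```

For that layer size the adapter trains 128 times fewer parameters than full fine-tuning, which is where the optimizer-state memory savings in the figure come from.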



### **Sparsity / Pruning -- Patterns**

- Non-structured sparsity
  - Less popular because of computation inefficiency
- Structured sparsity
  - Remove weights group by group
  - Structured in a way that gives high compute efficiency
    - E.g., NVIDIA 2:4 sparsity
  - 5.17X speedup

Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. *Advances in neural information processing systems*, 29.
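The 2:4 pattern keeps at most two nonzero weights in every contiguous group of four. A minimal sketch, assuming magnitude is used to choose which two weights survive (one simple criterion among the pruning methods; the `prune_2_of_4` helper is hypothetical):

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse_row = prune_2_of_4(row)
```

Because every group has the same number of zeros, hardware can store only the kept values plus small per-group indices, which is what makes this pattern efficient compared with non-structured sparsity.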





### **Sparsity / Pruning -- Methods**

- Thresholding
- Regularization
- Optimizer



## **SparseGPT**





Frantar, E., & Alistarh, D. (2023, July). SparseGPT: Massive language models can be accurately pruned in one-shot. In *International Conference on Machine Learning* (pp. 10323-10337). PMLR.

# Natively Sparse Models: Mixture of Experts





#### **Parallelism**

- Parallel decoding
  - Speculative Decoding
  - Text Diffusion
- Parallel Training
  - Data parallelism
    - Vanilla
    - ZeRO / FSDP sharding
  - Model parallelism
    - Tensor parallelism
    - Pipeline parallelism
    - Context parallelism
    - Expert parallelism



## **Speculative Decoding**

#### WITHOUT SPECULATIVE DECODING



My favorite thing about fall

#### WITH SPECULATIVE DECODING



My favorite thing about fall



#### Algorithm 1 SpeculativeDecodingStep

```
Inputs: M_p, M_q, prefix.
▷ Sample γ guesses x_1, …, x_γ from M_q autoregressively.
for i = 1 to γ do
   q_i(x) ← M_q(prefix + [x_1, …, x_{i-1}])
   x_i ∼ q_i(x)
end for
▷ Run M_p in parallel.
p_1(x), …, p_{γ+1}(x) ←
      M_p(prefix), …, M_p(prefix + [x_1, …, x_γ])
▷ Determine the number of accepted guesses n.
r_1 ∼ U(0, 1), …, r_γ ∼ U(0, 1)
n ← min({i - 1 | 1 ≤ i ≤ γ, r_i > p_i(x) / q_i(x)} ∪ {γ})
▷ Adjust the distribution from M_p if needed.
p'(x) ← p_{n+1}(x)
if n < γ then
   p'(x) ← norm(max(0, p_{n+1}(x) - q_{n+1}(x)))
end if
▷ Return one token from M_p, and n tokens from M_q.
t ∼ p'(x)
return prefix + [x_1, …, x_n, t]
```
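The algorithm above can be transcribed into a short Python sketch. The names are assumptions for illustration: `target` and `draft` are hypothetical callables standing in for M_p and M_q, each mapping a token list to a next-token probability list, and the γ + 1 target evaluations are written as a loop even though a real system would batch them into one forward pass.

```python
import random


def sample(dist, rng):
    """Draw a token index from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(target, draft, prefix, gamma, rng):
    """One speculative decoding step: draft proposes, target verifies."""
    guesses, q = [], []
    for _ in range(gamma):                      # draft model runs autoregressively
        q_i = draft(prefix + guesses)
        q.append(q_i)
        guesses.append(sample(q_i, rng))

    # Target distributions for every prefix extension (batched in practice).
    p = [target(prefix + guesses[:i]) for i in range(gamma + 1)]

    n = gamma                                   # number of accepted guesses
    for i, x in enumerate(guesses):
        if rng.random() > p[i][x] / q[i][x]:    # reject the first guess that fails
            n = i
            break

    adjusted = p[n]
    if n < gamma:                               # resample from the residual distribution
        residual = [max(0.0, a - b) for a, b in zip(p[n], q[n])]
        total = sum(residual)
        adjusted = [r / total for r in residual]

    return prefix + guesses[:n] + [sample(adjusted, rng)]


def uniform(seq):
    return [0.25, 0.25, 0.25, 0.25]


def always_zero(seq):
    return [1.0, 0.0, 0.0, 0.0]


# When draft and target agree, every guess is accepted: gamma + 1 new tokens.
same = speculative_step(uniform, uniform, [0], 3, random.Random(0))
# When they disagree, the output still follows the target distribution.
mixed = speculative_step(always_zero, uniform, [0], 4, random.Random(1))
```

Each step emits between 1 and γ + 1 tokens, so the speedup depends on how often the draft model's guesses are accepted.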

## Follow-up works: MEDUSA, EAGLE

## Data Parallelism – ZeRO (in DeepSpeed)





#### Data Parallelism - FSDP

Fully Sharded Data Parallel (FSDP) -- A PyTorch implementation





#### **Model Parallelism**





# Linear-Time Sequence Modeling – Linear Transformer





## Linear-Time Sequence Modeling – Mamba

#### **Selective State Space Model**

with Hardware-aware State Expansion





## Linear-Time Sequence Modeling – xLSTM

xLSTM: Extended Long Short-Term Memory



Figure 3: xLSTM blocks. Left: A residual sLSTM block with post up-projection (like Transformers): The input is fed into an sLSTM — with an optional convolution — followed by a gated MLP. Right: A residual mLSTM block with pre up-projection (like State Space models): mLSTM is wrapped inside two MLPs, via a convolution, a learnable skip connection, and an output gate that acts component-wise. See Figure 10 and Figure 11 in the appendix for details.





# Application of Foundation Models in EDA

Zhiyao Xie, Assistant Professor, HKUST

Duration: ~1 hour







## **Challenges** in Delivering Better Chips



#### Increasing IC design complexity



- IC complexity



Apple A11: 4B transistors

Apple A15: 15B transistors

Apple M3 Max: 92B transistors

- Increasing IC design cost
- Increasing time to market

IC Design Cost is Skyrocketing



(Not including manufacturing)



## **How AI Assists EDA - Our Taxonomy**

Type I: Supervised Predictive AI Techniques for EDA



Type II: Foundation AI Techniques for EDA (Circuit Foundation Model)





## **Explorations in Predictive AI Methods**

Predictive AI supports many applications: both early evaluation & optimization



**AI-Assisted IC Quality Prediction** 



**AI-Guided IC Optimization** 



**AI-Guided IC Design Space Exploration** 

Explored in academia and industry, covering all design stages







[1] Machine learning for electronic design automation: A survey. **ACM TODAES, 2021.** 

[2] MLCAD: A survey of research in machine learning for CAD keynote paper. IEEE TCAD, 2021.

## Predictive AI for EDA/Circuit Design





- Producing solutions repeatedly from scratch
- Why not learn from prior solutions? More intelligence!

- ML in Electronic Design Automation: <u>Early Timing and Power Modeling</u>

## **Example: Timing & Power Evaluation of RTL Code?**



- Given an RTL design, can we directly evaluate its <u>timing</u> and <u>power</u>?
  - Fine-grained **timing**: slack per register
  - Fine-grained **power**: per-cycle power



## Case 1: Early Timing Prediction at RTL-Stage

- Fine-grained timing model at RTL
  - Evaluate slack of each register endpoint
  - Annotate slack directly on HDL
- Guide optimization during synthesis
  - Guide retime and path\_group



#### High **correlation** in prediction



#### Better post-opt timing distribution





[1] Annotating Slack Directly on Your Verilog: Fine-Grained RTL Timing Evaluation for Early Optimization, **DAC 2024**

[2] Transferable Pre-Synthesis PPA Estimation for RTL Designs with Data Augmentation Techniques, **TCAD 2024**



#### Key idea: learn the pattern of input RTL logic








- Per-cycle power model at RTL
  - Capture key RTL signals as inputs (proxies)
  - Fast & accurate design-time simulation
  - Low-cost & accurate on-chip power model







**Small** OPM in CPU layout (pink)



- [1] APOLLO: An Automated Power Modeling Framework for ... Microprocessors, MICRO 2021 (Best Paper Award)
- [2] DEEP: Developing Extremely Efficient Runtime On-Chip Power Meters, ICCAD 2022
- [3] Unleashing Flexibility of ML-based Power Estimators Through Efficient Development Strategies, ISLPED 2024 (Best Paper Nomination)



#### Key idea: capture most power-related RTL signals






## **How AI Assists EDA - Our Taxonomy**

#### Type I: Supervised Predictive AI Techniques for EDA

- Difficulty in getting sufficient labeled data
- Time-consuming AI model development process
- Lack of generalization across tasks







## **Opportunities from Foundation Models**

- Emergence of large foundation models in many fields
  - Unprecedented ability to understand, predict, and generate content



Language model: GPT

Q: Image (A potato king)



Image model: DALL-E

Q: Video (A family of monsters)





Video model: Sora



## Why no counterpart in AI for chip design?



#### Trend of AI in all fields:

Task-specific → General

Small data → Big data

Supervised → Unsupervised

Single-modality → Multimodal



Circuit Foundation Model (CFM)

## **How AI Assists EDA - Our Taxonomy**

Type I: Supervised Predictive AI Techniques for EDA

Type II: Foundation AI Techniques for EDA (Circuit Foundation Model)



Paradigm 1: Encoder-based circuit foundation models

Paradigm 2: **Decoder**-based circuit foundation models



## Rethink Circuits from Data Perspective

- A chip is a delicately structured implementation of functionality
  - A minor structural change (e.g., flipping a gate) can drastically affect functionality
- A chip is inherently multi-stage and multi-modality:
  - Different levels of detail across stages
- Lack of chip data:
  - Chip designs are the most important IP/asset
  - No companies share their chip designs





## Paradigms of AI for EDA Techniques



(b) Type II: General Encoder-Based Circuit Foundation Model Paradigm







#### **Encoder-based circuit foundation model**





#### Decoder-based circuit foundation model




## **Encoder Model at RTL Stage**



| Stage | Method               | Technique (pre-training objective)                                                                                         | Downstream task               |
|-------|----------------------|----------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| RTL   | SNS v2 [25]          | Functional contrastive learning                                                                                            | Post-synthesis PPA prediction |
| RTL   | CircuitEncoder [102] | Intra-stage functional contrastive learning<br>Cross-stage functional contrastive alignment                                | Post-synthesis PPA prediction |
| RTL   | CircuitFusion [97]   | Masked gate reconstruction<br>Functional contrastive for graph/summary<br>Modality fusion<br>Cross-design-stage alignment | Post-synthesis PPA prediction |



## Multimodal Representation Learning, on RTL?

- Encode & fuse information from diverse modalities
  - Vision-language
  - Graph-language
  - Software-graph






Can we fuse multiple circuit modalities to learn better circuit representation?



## **Summary of Circuit Modalities**

Multimodal nature of RTL-stage circuits

- Functionality summary
- Implementation details

**Functionality Summary** 

```
reg [1:0] R0,R1;
reg [2:0] R2;
wire [2:0] W1,W2;
...
assign W1 = R0 + R1;
...
always @(posedge clk)
   R2 <= W2;
```

HDL Code



Structure Graph



semantic

structure

#### RTL Circuit Encoder: CircuitFusion

- Pre-Training: Multimodal circuit encoder (unsupervised) training
  - 1. Learn to recognize masked circuit elements
  - 2. Learn to recognize circuits with the same functionality

Unsupervised contrastive learning

An RTL (sub-)circuit



A multimodal circuit encoder



The encoder converts RTL into a general embedding
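The contrastive objective named above pulls embeddings of functionally equivalent circuits together and pushes others apart. A minimal sketch, assuming an InfoNCE-style loss over cosine similarities (the slide does not specify the exact loss, and the embeddings, `cosine`, and `info_nce` below are hypothetical):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Hypothetical 2-D embeddings of (sub-)circuits.
circuit = (1.0, 0.0)
same_function = (1.0, 0.0)       # e.g. a functionally equivalent rewrite
other_function = (0.0, 1.0)

loss_good = info_nce(circuit, same_function, [other_function])
loss_bad = info_nce(circuit, other_function, [same_function])
```

The loss is small when the equivalent circuit is the closest embedding and large otherwise, which is the signal that drives the encoder during pre-training.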

## Preprocessing: Split Circuit to Sub-circuits

- Circuit Property 1: parallel execution
  - Combinational logic calculates simultaneously
  - Sequential registers are updated only at each clock cycle
- Strategy 1: sub-circuit generation
  - Split based on register cones
    - Backtrace all combinational input logic
  - Advantages
    - Consistency in Modality & stage
    - Complete state transition of 1 cycle
    - Intermediate granularity
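The register-cone split can be sketched as a backward graph traversal that stops at registers. This is an illustration under assumptions: the `fanin` dictionary is a hypothetical connectivity map loosely inspired by the HDL example earlier (whose elided parts are not known), and `register_cone` is a made-up helper, not CircuitFusion's code.

```python
def register_cone(register, fanin, registers):
    """Collect the combinational logic feeding `register`, stopping at registers.

    fanin maps each node to the list of nodes that drive it. Returns the set of
    combinational nodes in the cone and the set of registers on its boundary.
    Primary inputs (nodes without fan-in) end up in the cone as leaf nodes.
    """
    cone, boundary = set(), set()
    stack = list(fanin.get(register, []))
    while stack:
        node = stack.pop()
        if node in registers:
            boundary.add(node)          # cone input: value held from the previous cycle
        elif node not in cone:
            cone.add(node)
            stack.extend(fanin.get(node, []))
    return cone, boundary


# Hypothetical connectivity: R2 is driven by W2, which depends on W1 and on R2 itself.
registers = {"R0", "R1", "R2"}
fanin = {
    "R2": ["W2"],
    "W2": ["W1", "R2"],
    "W1": ["R0", "R1"],
}

cone, boundary = register_cone("R2", fanin, registers)
```

Each cone covers exactly one cycle of state transition for its endpoint register, which matches the "complete state transition of 1 cycle" advantage listed above.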





## CircuitFusion Pre-Training (1/2)

- Circuit Property 2: functionally equivalent transformation
  - Circuits with similar functions may have different structures
- Strategy 2: semantic-structure pre-training
  - Self-supervised Task #1-3 for each modality and multimodal fusion





## CircuitFusion Pre-Training (2/2)

- Circuit Property 3: multiple design stages
  - RTL (high-level semantics) → netlist (low-level details)
- Strategy 3: implementation-aware alignment
  - Pre-training with netlist encoder across design stage (Task #4)





# Design Quality Prediction Tasks at RTL

#### High performance on RTL-stage PPA prediction:

| Type                  | Method          | Slack R | Slack MAPE | WNS R | WNS MAPE | TNS R | TNS MAPE | Power R | Power MAPE | Area R | Area MAPE |
|-----------------------|-----------------|---------|------------|-------|----------|-------|----------|---------|------------|--------|-----------|
| Hardware Solution     | RTL-Timer       | 0.85    | 17%        | 0.9   | 16%      | 0.96  | 25%      | N/A     | N/A        | N/A    | N/A       |
|                       | MasterRTL       | N/A     | N/A        | 0.89  | 18%      | 0.94  | 28%      | 0.89    | 26%        | 0.98   | 16%       |
|                       | SNS v2          | N/A     | N/A        | 0.82  | 22%      | N/A   | N/A      | 0.76    | 28%        | 0.93   | 25%       |
| Text Encoder          | NV-Embed-v1     | N/A     | N/A        | 0.49  | 17%      | 0.97  | 55%      | 0.85    | 44%        | 0.86   | 24%       |
| Software Code Encoder | UnixCoder       | N/A     | N/A        | 0.46  | 21%      | 0.95  | 44%      | 0.83    | 29%        | 0.85   | 26%       |
|                       | CodeT5+ Encoder | N/A     | N/A        | 0.55  | 21%      | 0.63  | 43%      | 0.49    | 46%        | 0.45   | 39%       |
|                       | CodeSage        | N/A     | N/A        | 0.23  | 25%      | 0.86  | 45%      | 0.8     | 38%        | 0.77   | 41%       |
| Ours                  | CircuitFusion   | 0.87    | 12%        | 0.91  | 11%      | 0.99  | 15%      | 0.99    | 13%        | 0.99   | 11%       |
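The table reports a correlation coefficient R and MAPE for each metric. A minimal sketch of how these two scores could be computed, assuming R is the Pearson correlation (the slide does not say which coefficient is used) and using made-up prediction and label values:

```python
import math


def pearson_r(pred, true):
    """Pearson correlation between predictions and ground-truth labels."""
    n = len(true)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def mape(pred, true):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - t) / abs(t) for p, t in zip(pred, true)) / len(true)


# Hypothetical labels and predictions (e.g. power of three designs).
true_values = [1.0, 2.0, 4.0]
predictions = [1.1, 2.2, 4.4]   # every prediction is 10% too high

r = pearson_r(predictions, true_values)
error = mape(predictions, true_values)
```

The example shows why both numbers are reported: a model that is consistently 10% too high still has a correlation of about 1, and only MAPE reveals the offset.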







### **Encoder-based circuit foundation model**





# **Encoder Model at Netlist Stage**



| Target stage | Method                | Modality: Graph | Modality: Text | Pre-training: Self-supervised | Pre-training: Supervised | Downstream: Design quality | Downstream: Functionality |
|--------------|-----------------------|-----------------|----------------|-------------------------------|--------------------------|----------------------------|---------------------------|
| Netlist      | DeepGate [103]        | ✓               |                |                               | ✓                        |                            | ✓                         |
|              | DeepGate2 [104]       | ✓               |                |                               | ✓                        |                            | ✓                         |
|              | DeepGate3/4 [98, 105] | ✓               |                |                               | ✓                        |                            | ✓                         |
|              | GAMORA [54]           | ✓               |                |                               | ✓                        |                            | ✓                         |
|              | HOGA [106]            | ✓               |                |                               | ✓                        | ✓                          | ✓                         |
|              | PolarGate [107]       | ✓               |                |                               | ✓                        |                            | ✓                         |
|              | DeepSeq [108, 109]    | ✓               |                |                               | ✓                        | ✓                          |                           |
|              | FGNN [110, 111]       | ✓               |                | ✓                             |                          |                            | ✓                         |
|              | CircuitEncoder [102]  | ✓               |                | ✓                             |                          |                            | ✓                         |
|              | MGVGA [112]           | ✓               | ✓              | ✓                             |                          | ✓                          | ✓                         |
|              | NetTAG [113]          | ✓               | ✓              | ✓                             |                          | ✓                          | ✓                         |
|              | DeepCell [114]        | ✓               |                | ✓                             |                          |                            | ✓                         |



## Multimodal Circuit Learning: RTL vs Netlist

- Multimodal learning: fuse information from diverse modalities
  - Vision-language
  - Software-graph





- Multimodal learning on RTL
  - Register-transfer level (RTL)
  - Earlier stage → more semantic
  - Fuse 3 RTL modalities at register level

- Multimodal learning on netlist?
  - Gate-level netlists
  - Later stage → more structure
  - Should fuse at gate level







### NetTAG: A Multimodal Netlist Encoder

- Represent netlist functional and physical properties in a text-attributed graph
  - Multimodal preprocessing: formulate the netlist as a text-attributed graph
  - Multimodal model: fuse gate text (LLM) with the circuit graph (GNN)
  - Multimodal pre-training: self-supervised and cross-stage-aware





### **NetTAG Framework Overview**

#### 1. Preprocessing → 2. Pre-Training → 3. Fine-Tuning





# 2. Self-Supervised Pre-Training

- Two-phase encoding → Two-step pre-training
  - Capture netlist functional and physical properties





# 2. Self-Supervised Pre-Training (1/2)

- Step 1: Enhancing logic understanding in ExprLLM
  - Goal 1: Differentiate gate expression functionality
  - Objective # 1: Symbolic expression contrastive learning



- Build gate expression dataset
  - 2-hop symbolic expressions
  - Boolean equivalence transformation rules
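Positive pairs for the contrastive objective come from expressions that look different but compute the same function. A small sketch of that idea, assuming De Morgan's law as the rewrite rule and an exhaustive truth-table check (NetTAG's actual rule set and dataset format are not shown on the slide; `equivalent` and the lambdas are hypothetical):

```python
from itertools import product


def equivalent(f, g, num_inputs):
    """Exhaustively compare two Boolean functions on all input combinations."""
    return all(f(*bits) == g(*bits)
               for bits in product([False, True], repeat=num_inputs))


# A gate expression and a rewritten form obtained with De Morgan's law.
original = lambda a, b: not (a and b)
rewritten = lambda a, b: (not a) or (not b)

# An unrelated expression that would serve as a negative sample.
unrelated = lambda a, b: a or b

is_positive_pair = equivalent(original, rewritten, 2)
is_negative_pair = not equivalent(original, unrelated, 2)
```

An exhaustive check is only practical for small expressions such as the 2-hop ones mentioned above, since the number of input combinations doubles with every added input.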



# 2. Self-Supervised Pre-Training (2/2)

- Step 2: Fusion in TAGFormer & Cross-Stage Align
  - Goal 2: Training within TAGFormer for semantic and structure fusion
  - Objective # 2.1: Masked gate reconstruction
    - Gate-level
    - Predict masked gate type to capture gate structure
  - Objective # 2.2: Netlist graph contrastive learning
    - Circuit-level
    - Differentiate graph functionality
  - Objective # 2.3: Netlist graph size prediction
    - Circuit-level
    - Predict gate count to capture graph structure





### **Applications of NetTAG: 4 tasks**

- Task 1: Combinational gate function identification
  - Identify functional type (e.g., adder, multiplier) of each gate
- Task 2: Sequential state/data register identification
  - Classify each register as a state register or a data-path register
- Task 3: Endpoint register slack prediction
  - Predict layout timing slack of each register
- Task 4: Overall circuit power/area prediction
  - Predict layout power and area of the whole design



### **NetTAG Results & Discussion**

#### Scalability

- Performance on every task scales up with model and data





#### Demo

- Reasoning about the netlist's arithmetic function
- Next step: NetTAG-LLM alignment<sup>1</sup> for generative reasoning





### Decoder-based circuit foundation model





# **LLMs Enable Many Generative Applications**



### **Generative AI: LLM for RTL Generation**

#### Task: LLM-based RTL Generation

- Input: natural language description
  - Target design functionality.
  - Module names, I/O names.
- Output: design in RTL code



### In addition to hardware code generation:

LLM for hardware code optimization, debugging, verification, ...



# Benchmarking LLM for RTL Generation



Benchmarks for RTL Code Generation

| Benchmark            | Open-sourced | Link                                            | Date    |
|----------------------|--------------|-------------------------------------------------|---------|
| RTLLM [12, 154]      | ✓            | https://github.com/hkust-zhiyao/rtllm           | 2023-10 |
| VerilogEval [14]     | ✓            | https://github.com/NVlabs/verilog-eval          | 2023-12 |
| VerilogEval v2 [165] | ✓            | https://github.com/NVlabs/verilog-eval          | 2024-08 |
| CreativeEval [147]   | ✓            | https://github.com/matthewdelorenzo/creativeval | 2024-04 |
| RTL-repo [153]       | ✓            | https://github.com/AUCOHL/RTL-Repo              | 2024-05 |
| VHDL-Eval [148]      |              |                                                 | 2024-06 |
| ChatGPTV [149]       | ✓            | https://github.com/aichipdesign/chipgptv        | 2024-11 |



# Example: RTLLM2.0

# 50 design problems

### Four categories

- 1. Arithmetic Modules
- 2. Memory Modules
- 3. Control Modules
- 4. Miscellaneous Modules



|                                                                                    | Arithmetic Modules                                                                                                                                                                                                                                                                                         | Memory Modules                                                 |                                                                                                                                                                                                                                                                                                                         |  |  |  |
|------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| Design                                                                             | Description                                                                                                                                                                                                                                                                                                | Design                                                         | Description                                                                                                                                                                                                                                                                                                             |  |  |  |
| adder_8bit | An 8-bit adder | asyn_fifo | An asynchronous FIFO 16×8 bits |  |  |  |
| adder_16bit | A 16-bit adder implemented with full adders | LIFObuffer | A Last-In-First-Out buffer for temporary data storage |  |  |  |
| adder_32bit | A 32-bit carry-lookahead adder | right_shifter | Right shifter with 8-bit delay |  |  |  |
| adder_pipe_64bit | A 64-bit ripple carry adder based on 4-stage pipeline | LFSR | A Linear Feedback Shift Register for generating pseudo-random sequences |  |  |  |
| adder_bcd | A BCD adder for decimal arithmetic operations | barrel_shifter | A barrel shifter for rotating bits efficiently |  |  |  |
| sub_64bit | A 64-bit subtractor for high-precision arithmetic | RAM | 8x4 bits true dual-port RAM |  |  |  |
| multi_8bit | An 8-bit multiplier based on shifting and adding operation | ROM | A Read-Only Memory module for storing fixed data |  |  |  |
| multi_16bit                                                                        | A 16-bit multiplier based on shifting and adding operation                                                                                                                                                                                                                                                 |                                                                | Miscellaneous Modules                                                                                                                                                                                                                                                                                                   |  |  |  |
| multi_booth_8bit                                                                   | An 8-bit booth-4 multiplier                                                                                                                                                                                                                                                                                | Design                                                         | Description                                                                                                                                                                                                                                                                                                             |  |  |  |
| multi_pipie_4bit                                                                   | A 4-bit unsigned number pipeline multiplier                                                                                                                                                                                                                                                                | clkgenerator                                                   | A clock generator for providing timing signals                                                                                                                                                                                                                                                                          |  |  |  |
| multi_pipie_8bit                                                                   | An 8-bit unsigned number pipeline multiplier                                                                                                                                                                                                                                                               | instr_reg                                                      | An instruction register module for holding and processing CPU instructions                                                                                                                                                                                                                                              |  |  |  |
| div_16bit<br>radix2_div<br>comparator_3bit<br>comparator_4bit                      | A 16-bit divider based on subtraction operation<br>An 8-bit radix-2 divider<br>A 3-bit comparator for comparing binary numbers<br>A 4-bit comparator for comparing binary numbers                                                                                                                          | signal_generator square_wave alu pe                            | Generate various signal patterns A generator for producing square wave signals An ALU for 32bit MIPS-ISA CPU A Multiplying Accumulator for 32bit integer                                                                                                                                                                |  |  |  |
| accu                                                                               | Accumulates 8-bit data and output after 4 inputs                                                                                                                                                                                                                                                           | freq_div                                                       | Frequency divider for 100M input clock, outputs 50MHz, 10MHz, 1MHz                                                                                                                                                                                                                                                      |  |  |  |
| fixed_point_adder                                                                  | A fixed-point adder for arithmetic operations with fixed precision                                                                                                                                                                                                                                         | freq_divbyeven                                                 | Frequency divider that divides input frequency by even numbers                                                                                                                                                                                                                                                          |  |  |  |
| fixed_point_substract                                                              | A fixed-point subtractor for precise fixed-point arithmetic                                                                                                                                                                                                                                                | freq_divbyodd                                                  | Frequency divider that divides input frequency by odd numbers                                                                                                                                                                                                                                                           |  |  |  |
| float_multi                                                                        | A floating-point multiplier for high-precision calculations                                                                                                                                                                                                                                                | freq_divbyfrac                                                 | Frequency divider that divides input frequency by fractional values                                                                                                                                                                                                                                                     |  |  |  |
|                                                                                    | Control Modules                                                                                                                                                                                                                                                                                            | calendar                                                       | Perpetual calendar with seconds, minutes, and hours                                                                                                                                                                                                                                                                     |  |  |  |
| Design                                                                             | Description                                                                                                                                                                                                                                                                                                | traffic_light                                                  | Traffic light system with three colors and pedestrian button                                                                                                                                                                                                                                                            |  |  |  |
| fsm                                                                                | FSM detection circuit for specific input                                                                                                                                                                                                                                                                   | width_8to16                                                    | First 8-bit data placed in higher 8-bits of the 16-bit output                                                                                                                                                                                                                                                           |  |  |  |
| sequence_detector<br>counter_12                                                    | Detect specific sequences in binary input<br>Counter module counts from 0 to 12<br>A 4-bit Johnson counter with specific cyclic                                                                                                                                                                            | synchronizer<br>edge_detect                                    | Multi-bit mux synchronizer  Detect rising and falling edges of changing 1-bit signal  Extract pulse signal from the fast clock and create a                                                                                                                                                                             |  |  |  |
| JC_counter                                                                         | state sequence                                                                                                                                                                                                                                                                                             | pulse_detect                                                   | new one in the slow clock                                                                                                                                                                                                                                                                                               |  |  |  |
| ring_counter                                                                       | An 8-bit ring counter for cyclic state sequences A 16-bit counter that can increment or                                                                                                                                                                                                                    | parallel2serial                                                | Convert 4 input bits to 1 output bit                                                                                                                                                                                                                                                                                    |  |  |  |
| up_down_counter                                                                    | decrement based on control signals                                                                                                                                                                                                                                                                         | serial2parallel                                                | 1-bit serial input and output data after receiving 6 inputs                                                                                                                                                                                                                                                             |  |  |  |

## **LLM for RTL Generation Methodologies**

#### Using commercial LLMs → circuit privacy concerns

1. Prompt engineering on commercial LLMs.

### Using **open-source** LLMs → allows **local deployment**

2. LLMs fine-tuned on <u>private</u> datasets with instruction-code pairs
3. LLMs fine-tuned on open datasets with code only
4. LLMs fine-tuned on open datasets with instruction-code pairs

Challenge: How to get the dataset?



# Basic flow using prompt engineering



- Input specification + structure analysis and design principles (in prompt)
- Feed prompt into LLMs → RTL code
- Incorporate the feedback from EDA tools into the flow for rewriting
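The flow above can be sketched as a generate, check, and rewrite loop. This is a minimal illustration under stated assumptions, not the implementation of any specific tool: `build_prompt`, `generate_rtl`, `fake_llm`, and `fake_check` are hypothetical names, the LLM and the EDA tool are passed in as plain callables, and the stubs at the bottom only stand in for a real model and a real syntax checker or simulator.

```python
from typing import Callable, Optional, Tuple


def build_prompt(spec: str, design_hints: str, feedback: Optional[str] = None) -> str:
    """Combine the specification, design guidance, and optional tool feedback."""
    parts = [
        "Write synthesizable Verilog for the following specification.",
        f"Specification:\n{spec}",
        f"Structure analysis and design principles:\n{design_hints}",
    ]
    if feedback:
        # Feedback from the EDA tool is appended so the model can rewrite its code.
        parts.append(f"The previous attempt failed with this tool output; fix it:\n{feedback}")
    return "\n\n".join(parts)


def generate_rtl(
    spec: str,
    design_hints: str,
    llm: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Query the LLM, run the check, and retry with feedback until it passes."""
    feedback: Optional[str] = None
    rtl = ""
    for _ in range(max_rounds):
        rtl = llm(build_prompt(spec, design_hints, feedback))
        ok, feedback = check(rtl)
        if ok:
            return rtl, True
    return rtl, False


# Hypothetical stand-ins for a real LLM and a real EDA tool.
def fake_llm(prompt: str) -> str:
    if "tool output" in prompt:  # second round: feedback is present in the prompt
        return "module and2(input a, input b, output y); assign y = a & b; endmodule"
    return "module and2(input a, input b, output y); assign y = a & b;"


def fake_check(rtl: str) -> Tuple[bool, str]:
    if "endmodule" in rtl:
        return True, ""
    return False, "syntax error: missing endmodule"
```

In practice `check` would wrap a compiler, linter, or testbench run and return its log as the feedback string; bounding the loop with `max_rounds` keeps the cost of repeated LLM calls predictable.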



### **Generation of RTL Code Dataset**



- Data generation flow of RTLCoder, as an example
- Other works adopt similar methodologies for dataset generation

1. Generate diverse instructions (design specifications)
2. Generate high-quality reference code
3. Collect the instruction-code pairs for (supervised) fine-tuning
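The three steps above can be sketched as a small data pipeline. This is an illustrative sketch, not RTLCoder's actual code: the function names, the `instruction`/`output` field names, the duplicate filter, and the validity check are all assumptions made for the example, and the stubs stand in for an LLM-based code generator and a real syntax checker.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_sft_dataset(
    instructions: Iterable[str],
    generate_code: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Pair each distinct instruction with reference code that passes a check."""
    pairs: List[Dict[str, str]] = []
    seen = set()
    for instruction in instructions:
        # Normalize case and whitespace so near-identical instructions are dropped.
        key = " ".join(instruction.lower().split())
        if key in seen:
            continue
        seen.add(key)
        code = generate_code(instruction)
        if is_valid(code):  # keep only reference code that passes the check
            pairs.append({"instruction": instruction, "output": code})
    return pairs


def to_jsonl(pairs: List[Dict[str, str]]) -> str:
    """Serialize the pairs as one JSON object per line for fine-tuning."""
    return "\n".join(json.dumps(p) for p in pairs)


# Hypothetical stand-ins for an LLM code generator and a syntax checker.
def fake_generate(instruction: str) -> str:
    if "broken" in instruction.lower():
        return "module broken("
    return f"module m; // {instruction}\nendmodule"


def fake_is_valid(code: str) -> bool:
    return "endmodule" in code


example_instructions = [
    "Design a 4-bit up counter.",
    "design a 4-bit  up counter.",  # duplicate after normalization
    "Design a 2-to-1 multiplexer.",
    "Design a broken module.",      # its code fails the validity check
]
example_pairs = build_sft_dataset(example_instructions, fake_generate, fake_is_valid)
```

Each surviving record is one instruction-code pair, which is the unit consumed by supervised fine-tuning.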



### Performance in RTL Generation





### Other Directions Besides Code Generation

In addition to LLMs for Hardware (RTL or HLS) Generation:

- LLMs for Hardware (Code) Optimizations
- LLMs for Hardware (Code) Verification
- LLMs for Hardware (Code) Security
- LLMs for Design Flow Automation
- LLMs for Layout Design
- LLMs for Analog Design



# Challenges & Room for Improvement

1. Circuit Foundation Model Generalization and Scalability

2. Circuit Data Availability

3. Bridging the Gap Between Circuit Encoder and Decoder



# Lack of Circuits? Generate Synthetic Circuits

Solution: Generate synthetic pseudo-circuits for foundation model training



- Real circuit designs are private
- Synthetic circuit generation based on graph generation models
- Synthetic circuits enable "big data"



Synthetic designs are similar to real designs:

- Level of Logic: 43
- Critical Path Length: 3.264
- Critical Path Slack: -2.732
- No. of Violating Paths: 3867
- ...

Cell & Pin Count:

- Pin Count: 640145
- Leaf Cell Count: 173466

Synthetic designs reach >100K cells

| Method                              | Target | R          | MAPE       | RRSE |
|-------------------------------------|--------|------------|------------|------|
| No Pseudo-Circuits<br>GraphRNN [27] | WNS    | NA<br>0.71 | 52%<br>42% | 2.1  |
| DVAE [28]                           | WNS    | 0.75       | 77%        | 2.6  |
| CircuitGen                          | WNS    | 0.88       | 36%        | 1.3  |

Pseudo-designs can boost AI accuracy in IC prediction

[1] Towards Big Data in AI for EDA Research: Generation of New Pseudo Circuits at RTL Stage, **ASPDAC'25**

[2] SynCircuit: Automated Generation of New Synthetic RTL Circuits Can Enable Big Data in Circuits, **DAC'25**
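The R, MAPE, and RRSE columns in the comparison table are common regression metrics. The sketch below assumes the standard definitions (Pearson correlation coefficient, mean absolute percentage error, and root relative squared error); the slides do not spell out the exact formulas used, so treat these as an assumption. The function names are chosen for illustration.

```python
import math
from typing import Sequence


def pearson_r(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation between labels and predictions (higher is better)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (lower is better)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rrse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root relative squared error: model error relative to predicting the mean."""
    mean_t = sum(y_true) / len(y_true)
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((t - mean_t) ** 2 for t in y_true)
    return math.sqrt(num / den)
```

Under these definitions, an RRSE of 1.0 corresponds to a model that is no better than always predicting the mean label, so values above 1.0 indicate a hard prediction target such as WNS.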

# Bridging the Gap Between Circuit Encoder and Decoder

An Encoder-Decoder framework with connectors







(c) Our proposed circuit encoder-decoder framework

# Takeaway: Paradigms of AI for EDA



(b) Type II: General Encoder-Based Circuit Foundation Model Paradigm





(c) Type II: General <u>Decoder-Based</u> Circuit Foundation Model Paradigm

