3. Transformer
1. Overview
Transformer Basics
A Transformer model is a type of deep learning architecture that was introduced in a 2017 paper titled “Attention is All You Need” by Vaswani et al. It revolutionized natural language processing (NLP) and sequential data processing by improving upon traditional recurrent models (like RNNs or LSTMs). The key innovation of the Transformer is its use of self-attention mechanisms to capture relationships between different parts of an input sequence, without relying on recurrence or convolution.
1.1 Key Components
- Self-Attention Mechanism:
- Self-attention allows each token (e.g., word or subword in NLP) in a sequence to focus on other tokens, regardless of their distance in the sequence. This is in contrast to RNNs, which process sequences in a step-by-step manner.
- The attention mechanism works by computing a weighted sum of all input tokens, where the weights depend on the relevance of one token to another.
There are three key vectors for each token:
- Query (Q): What information this token is looking for in the other tokens.
- Key (K): What this token offers; other tokens' queries are matched against it to decide how strongly to attend to it.
- Value (V): The information the token carries and contributes to the attention output.
The output of attention is computed as:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$
where $d_k$ is the dimensionality of the key vectors.
- Multi-Head Attention:
- Instead of calculating a single attention output, the Transformer splits the attention mechanism into multiple “heads”. Each head independently computes attention, and the results are concatenated and linearly transformed.
- This allows the model to capture different types of relationships within the data at multiple scales (see the attention sketch after this list).
- Positional Encoding:
- Since Transformers do not have a built-in notion of sequence order (unlike RNNs), a positional encoding is added to the input embeddings to provide some information about the order of tokens in the sequence.
- Typically, sinusoidal functions are used to generate these positional encodings, providing continuous, distinguishable values for each position (see the positional-encoding sketch after this list).
- Feedforward Neural Network:
- Each layer of the Transformer also includes a fully connected feedforward network applied to each position independently. It consists of two linear transformations with a ReLU activation in between.
- Layer Normalization and Residual Connections:
- Layer normalization is applied to stabilize and speed up training by normalizing the input of each sub-layer.
- Residual connections allow the model to train deeper networks by enabling easier gradient flow and improving convergence (both are combined in the encoder-layer sketch after this list).
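To make the attention formula and the multi-head mechanism above concrete, here is a minimal PyTorch sketch (assuming PyTorch is available; the names `MultiHeadAttention`, `d_model`, and `num_heads` are illustrative choices, not prescribed by the paper):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                    # attention weights
    return weights @ v, weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate linear projections for Q, K, V plus a final output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch, _, d_model = query.shape

        def split_heads(x):
            # (batch, seq, d_model) -> (batch, num_heads, seq, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(query))
        k = split_heads(self.w_k(key))
        v = split_heads(self.w_v(value))
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads back together and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(batch, -1, d_model)
        return self.w_o(out)

# Self-attention: the same sequence supplies queries, keys, and values.
x = torch.randn(2, 10, 512)                   # (batch, seq_len, d_model)
attn = MultiHeadAttention(d_model=512, num_heads=8)
print(attn(x, x, x).shape)                    # torch.Size([2, 10, 512])
```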
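Likewise, the sinusoidal positional encoding described above can be generated as in the sketch below; the function name and tensor layout are illustrative choices:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position encodings.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)      # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```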
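Finally, a single encoder layer combines multi-head self-attention, the position-wise feedforward network, residual connections, and layer normalization. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` for brevity and the post-norm ordering of the original paper; the sizes (`d_model=512`, `d_ff=2048`) are the paper's base configuration, used here only as an example:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise feedforward network: two linear maps with a ReLU in between.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Sub-layer 1: multi-head self-attention, then residual connection + layer norm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: feedforward network, then residual connection + layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 10, 512)          # (batch, seq_len, d_model)
print(EncoderLayer()(x).shape)       # torch.Size([2, 10, 512])
```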
1.2 Transformer Architecture
A Transformer is composed of two main components:
- Encoder: Takes the input sequence, processes it through multiple layers (each having self-attention and feedforward networks), and generates a contextualized representation of the input.
- Decoder: Processes the target sequence and generates an output sequence. Each decoder layer has an additional cross-attention mechanism to attend to the encoder’s output.
1.2.1 Encoder-Decoder Structure:
- Encoder:
- Processes input data.
- Each layer has a multi-head self-attention and a feedforward network.
- Decoder:
- Processes target data (used during training) or generates output step by step during inference.
- Each layer has three components: self-attention, cross-attention (attending to encoder output), and a feedforward network.
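For reference, PyTorch packages this whole encoder-decoder stack as `nn.Transformer`. The sketch below wires it up with the paper's base hyperparameters; the random tensors stand in for already-embedded (and positionally encoded) source and target sequences:

```python
import torch
import torch.nn as nn

# Full encoder-decoder stack with the paper's base hyperparameters.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, batch_first=True)

src = torch.randn(2, 10, 512)   # already-embedded source sequence (batch, src_len, d_model)
tgt = torch.randn(2, 7, 512)    # already-embedded target sequence (batch, tgt_len, d_model)

# Causal mask so each target position can only attend to earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                # torch.Size([2, 7, 512])
```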
1.2.2 Transformer Use Cases:
- Natural Language Processing: Transformers are used in tasks such as machine translation, text summarization, and sentiment analysis; well-known examples include Google's BERT and OpenAI's GPT models.
- Vision: Transformers have been adapted to vision tasks, such as Vision Transformers (ViT) for image classification.
- Speech Processing: Used for automatic speech recognition and text-to-speech.
- Generative Models: Transformers power models like GPT (Generative Pretrained Transformer), which are used for text generation in applications such as ChatGPT.
1.3 Advantages of Transformers
- Parallelism: Because Transformers do not process data sequentially, training and inference can be parallelized far more effectively than with RNNs, making them faster on modern hardware.
- Long-Range Dependencies: Self-attention allows capturing relationships between tokens regardless of their distance in the sequence, which is difficult for RNNs to handle effectively.
2. Transformer - Advanced
Think Like a Transformer
3. Transformer-Based Models:
- BERT (Bidirectional Encoder Representations from Transformers): A model designed to pretrain deep bidirectional representations by jointly conditioning on both left and right contexts in all layers.
- GPT (Generative Pretrained Transformer): A model trained on vast amounts of text data to generate human-like text. It uses the decoder part of the Transformer.
- T5 (Text-to-Text Transfer Transformer): A unified framework that treats every NLP task as a text-to-text problem.
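All three model families are available as pretrained checkpoints and can be loaded with a few lines of code, for example via the Hugging Face `transformers` library (assuming it is installed; `bert-base-uncased` and `gpt2` are the standard checkpoint names on the Hugging Face Hub):

```python
from transformers import AutoModel, AutoTokenizer, pipeline

# BERT: encoder-only model producing contextualized token representations.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Transformers capture long-range dependencies.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (batch, seq_len, hidden_size)

# GPT-2: decoder-only model used here for text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])
```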