Transformer From Scratch

An implementation of the Transformer, following a series of articles about vision-language models. In this project the model is implemented from scratch using NumPy.

Transformers Numpy

Welcome to the repository for the TransformersNumpy-Version project! This is a from-scratch NumPy implementation of the Transformer, the architecture that modern LLMs are built on; understanding it at this level helps you build this type of model with the right concepts. The project focuses on implementing Transformers, a groundbreaking model in natural language processing and machine learning, using the numpy library.

Introduction

Transformers have revolutionized the way we process and understand natural language, enabling breakthroughs in tasks such as machine translation, sentiment analysis, and question-answering systems. This repository aims to provide a comprehensive implementation of Transformers using numpy, showcasing the core concepts and functionalities of this powerful model.

Key Features

  • Numpy Implementation: The implementation in this repository heavily relies on the numpy library, allowing for efficient computations and easy-to-understand code.
  • Full Implementation: The repository provides a complete implementation of Transformers, including attention mechanisms, positional encoding, and feed-forward networks.

Contents

Here’s an overview of the contents you’ll find in this repository:

  • Decoder.py and Encoder.py: These files contain the main implementation of the Transformer model using numpy. They include classes and functions for attention mechanisms, positional encoding, and the overall Transformer architecture.
  • LayersNumpy.py: This file provides utility functions for data preprocessing and handling, enabling seamless integration with different datasets.
  • transfomers-explained-mathematic-and-code-in-depth.ipynb: This Jupyter Notebook serves as an example to showcase how to use the Transformer model implemented in this repository. It includes a step-by-step walkthrough of training and evaluating the model on a specific task.
  • main.py: This file combines the building blocks into the full Transformer.

1.1 Self-Attention Math
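
The code in this section implements scaled dot-product self-attention. For queries Q, keys K, and values V (all obtained by projecting the input embeddings with one combined weight matrix and splitting the result into three parts), the attention output is

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the dimension of each key vector; the optional mask is added to the scores before the softmax.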


def Self_Attention(input_embedding, WieghtMatrix_QKY, out_wieghts, mask=None, batch_first=True):
    """
    Self-attention takes the embedding matrix (combined with the positional
    encoding, covered in a later section) as input.
    Query, Key and Value all have the same dimension as the input.
    """
    if not batch_first:
        raise NotImplementedError("only batch_first inputs are supported")
    # project the input once, then split into Query, Key and Value
    Query, Key, Value = np.split(input_embedding @ WieghtMatrix_QKY, 3, axis=-1)
    scores = Query @ Key.swapaxes(-1, -2) / np.sqrt(input_embedding.shape[-1])
    if mask is not None:
        assert mask.shape[0] == input_embedding.shape[1], \
            f"mask dimension doesn't match the sequence length of the embedding input: {mask.shape[0]} vs {input_embedding.shape[1]}"
        scores = scores + mask
    Attention = Softmax(scores)
    # weighted sum of the values, followed by the output projection
    return Attention @ Value @ out_wieghts, Attention
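
A minimal usage sketch for the function above, with randomly initialized weights. The shapes (batch of 2, sequence length 4, embedding size 8) are assumptions chosen only for illustration, and the Softmax helper defined further below is assumed to be in scope.

import numpy as np

np.random.seed(0)
B, seq_len, embed_size = 2, 4, 8                      # assumed toy dimensions
x = np.random.randn(B, seq_len, embed_size)           # token embeddings (+ positional encoding)
W_qkv = np.random.randn(embed_size, 3 * embed_size)   # one matrix projecting to Q, K and V
W_out = np.random.randn(embed_size, embed_size)       # output projection

out, attn = Self_Attention(x, W_qkv, W_out)
print(out.shape)          # (2, 4, 8)
print(attn.shape)         # (2, 4, 4), one attention map per batch element
print(attn.sum(axis=-1))  # every row sums to 1 after the softmax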

1.2 Multi-Head Attention

def multiHeads_Attention(input_embedding, wieghtsMatrix_QKY, heads, out_Wieght, mask=None):
    B, seq_len, embed_size = input_embedding.shape
    # project once, then split the last axis into Query, Key and Value
    Query, Key, Value = np.split(input_embedding @ wieghtsMatrix_QKY, 3, axis=-1)
    # reshape [B, seq_len, embed_size] => [B, seq_len, heads, embed_size // heads]
    # and swap axes into [B, heads, seq_len, embed_size // heads]
    Query, Key, Value = [a.reshape(B, seq_len, heads, embed_size // heads).swapaxes(1, 2)
                         for a in (Query, Key, Value)]
    scores = Query @ Key.swapaxes(-1, -2) / np.sqrt(embed_size // heads)
    if mask is not None:
        scores = scores + mask
    atten = Softmax(scores)
    # merge the heads back together and apply the output projection
    return (atten @ Value).swapaxes(1, 2).reshape(B, seq_len, embed_size) @ out_Wieght, atten
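
A small usage sketch with a causal (look-ahead) mask. The dimensions and the additive -inf mask construction are assumptions for illustration; Softmax is again assumed to be in scope.

import numpy as np

B, seq_len, embed_size, heads = 2, 5, 16, 4           # assumed toy dimensions
x = np.random.randn(B, seq_len, embed_size)
W_qkv = np.random.randn(embed_size, 3 * embed_size)
W_out = np.random.randn(embed_size, embed_size)

# additive causal mask: 0 on and below the diagonal, -inf above it
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

out, attn = multiHeads_Attention(x, W_qkv, heads, W_out, mask=mask)
print(out.shape)   # (2, 5, 16)
print(attn.shape)  # (2, 4, 5, 5), one attention map per head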

1.3 Scale Dot Product Attention

import math
import torch.nn as nn


class ScaleDotProductAttention(nn.Module):
    """
    Compute scaled dot-product attention.

    Query : the sentence we focus on (decoder)
    Key   : every sentence whose relationship with the Query is checked (encoder)
    Value : every sentence, same as Key (encoder)
    """

    def __init__(self, config):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)
        self.attention_dropout = nn.Dropout(config["attention_dropout"])

    def forward(self, q, k, v, output_attentions=False):
        # input is a 4-dimensional tensor: [batch_size, head, length, d_tensor]
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product of Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)  # transpose the last two dimensions
        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product

        # 2. pass the scores through softmax to map them into the [0, 1] range
        score = self.softmax(score)
        score = self.attention_dropout(score)

        # 3. multiply with Value
        v = score @ v
        if not output_attentions:
            return (v, None)

        return (v, score)
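
A quick usage sketch for the module above; the config dictionary and tensor shapes are illustrative assumptions.

import torch

config = {"attention_dropout": 0.1}   # assumed config, matching the key read in the constructor
attention = ScaleDotProductAttention(config)

batch_size, heads, length, d_head = 2, 4, 10, 16
q = torch.randn(batch_size, heads, length, d_head)
k = torch.randn(batch_size, heads, length, d_head)
v = torch.randn(batch_size, heads, length, d_head)

out, scores = attention(q, k, v, output_attentions=True)
print(out.shape)     # torch.Size([2, 4, 10, 16])
print(scores.shape)  # torch.Size([2, 4, 10, 10])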

Positional Encoding


import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """
    Compute the sinusoidal positional encoding.

    :param d_model: dimension of the model
    :param max_len: max sequence length
    """
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()

        # same size as the input matrix (so it can be added to the input)
        self.encoding = torch.zeros(max_len, d_model)
        self.encoding.requires_grad = False  # we don't need to compute gradients

        pos = torch.arange(0, max_len)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent each word's position

        _2i = torch.arange(0, d_model, step=2).float()
        # 'i' indexes the even embedding dimensions; "step=2" makes it equal to 2 * i

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # compute the positional encoding that carries the positional information of the words

    def forward(self, x):
        # self.encoding has shape [max_len, d_model]
        batch_size, seq_len = x.size()
        # x holds token ids of shape [batch_size, seq_len], e.g. [128, 30]

        return self.encoding[:seq_len, :]
        # [seq_len, d_model], e.g. [30, 512]
        # it will be added to tok_emb : [128, 30, 512]
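
A short usage sketch: the encoding is sliced to the sequence length and broadcast-added to the token embeddings. The vocabulary size, shapes, and the nn.Embedding used here are illustrative assumptions.

import torch
import torch.nn as nn

d_model, max_len, vocab_size = 512, 512, 1000     # assumed sizes
pos_enc = PositionalEncoding(d_model, max_len)
tok_emb = nn.Embedding(vocab_size, d_model)

tokens = torch.randint(0, vocab_size, (128, 30))  # [batch_size, seq_len]
x = tok_emb(tokens) + pos_enc(tokens)             # [128, 30, 512] + [30, 512]
print(x.shape)                                    # torch.Size([128, 30, 512])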

Residual Connection


class ResidualConnection(nn.Module):
    """
    Skip connection: the input x passes through unchanged and the
    sub-layer's output (the residual) is added on top of it.
    """
    def __init__(self):
        super(ResidualConnection, self).__init__()

    def forward(self, x, residual):
        return x + residual

1.4 Layer Norm And Softmax



def NormLayar(Z, eps=1e-12):
    mean = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    # eps sits inside the square root to keep the division numerically stable
    return (Z - mean) / np.sqrt(var + eps)

def ReLU(Z):
    return np.maximum(Z,0)
    

def Softmax(z):
    """
    Softmax is a non-linear function that maps every element of the matrix
    into the range (0, 1) so that each row along the last axis sums to 1.
    """
    e_x = np.exp(z - z.max(axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)
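
A quick sanity check of the two helpers on a random matrix (the shape is arbitrary):

import numpy as np

Z = np.random.randn(2, 3, 8)

normed = NormLayar(Z)
print(normed.mean(axis=-1))  # approximately 0 per row
print(normed.var(axis=-1))   # approximately 1 per row

probs = Softmax(Z)
print(probs.sum(axis=-1))    # 1 per row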

1.5 MLP


class MLP(nn.Module):
    """
    A multi-layer perceptron module.
    """

    def __init__(self, config):
        super().__init__()
        self.dense_1 = nn.Linear(config["embedding_size"],config["mlp_ratio"] * config["embedding_size"])
        self.activation = nn.GELU()
        self.dense_2 = nn.Linear(config["embedding_size"] * config["mlp_ratio"], config["embedding_size"])
        self.dropout = nn.Dropout(config["hidden_dropout_prob"])

    def forward(self, x):
        x = self.dense_1(x)
        x = self.activation(x)
        x = self.dense_2(x)
        x = self.dropout(x)
        return x
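
A brief usage sketch for the MLP; the config values are assumptions matching the keys read in the constructor.

import torch

config = {
    "embedding_size": 64,       # assumed model width
    "mlp_ratio": 4,             # hidden layer is 4x wider, as in the original Transformer
    "hidden_dropout_prob": 0.1,
}
mlp = MLP(config)

x = torch.randn(2, 10, 64)      # [batch, seq_len, embedding_size]
print(mlp(x).shape)             # torch.Size([2, 10, 64])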

1.6 TransformerEncoder

import numpy as np 
from layer import * 
from layerNumpy import *

def TransfomerEncoder(embed_input, mask,
                      head, Wieghts_QKY,
                      Wieghts_out, FullyLinear1,
                      FullLinear2, eps):
    # the embeddings and attention weights come from PyTorch, so convert them to numpy
    input_embedding = embed_input.numpy()
    # encoder self-attention is used without a mask at this stage
    multiHeads, _ = multiHeads_Attention(input_embedding,
                                         Wieghts_QKY.detach().numpy().T,
                                         head,
                                         Wieghts_out.detach().numpy().T,
                                         mask=None)
    # residual connection followed by layer normalization
    Residual = NormLayar(input_embedding + multiHeads, eps)

    # position-wise feed-forward network with its own residual connection and layer norm
    output = NormLayar(Residual + ReLU(np.matmul(Residual, FullyLinear1)) @ FullLinear2, eps)
    return output
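
A sketch of how the encoder block can be driven from PyTorch weights. It assumes a standard nn.TransformerEncoderLayer (batch_first=True) supplies Wieghts_QKY, Wieghts_out, and the two linear layers, which matches how the weights are read in the Transformer class further below; the dimensions are toy values.

import torch
import torch.nn as nn

embed_size, heads, seq_len = 64, 4, 10
transEncoder = nn.TransformerEncoderLayer(d_model=embed_size, nhead=heads,
                                          dim_feedforward=4 * embed_size,
                                          batch_first=True)

x = torch.randn(2, seq_len, embed_size)   # embedded (and positionally encoded) input
out = TransfomerEncoder(x, None, heads,
                        transEncoder.self_attn.in_proj_weight,
                        transEncoder.self_attn.out_proj.weight,
                        transEncoder.linear1.weight.detach().numpy().T,
                        transEncoder.linear2.weight.detach().numpy().T,
                        eps=1e-12)
print(out.shape)                           # (2, 10, 64)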

1.7 TransformerDecoder

import numpy as np 
from layer import *
from layerNumpy import *


def TransfomerDecoder(embed_input, mask,
                      head, Wieghts_QKY,
                      Wieghts_out, FullyLinear1,
                      FullLinear2, eps, enc=None):
    input_embedding = embed_input.numpy()
    # first sub-layer: self-attention over the decoder input
    multiHeads, _ = multiHeads_Attention(input_embedding,
                                         Wieghts_QKY.detach().numpy().T,
                                         head,
                                         Wieghts_out.detach().numpy().T,
                                         mask=None)
    Residual = NormLayar(input_embedding + multiHeads, eps)

    if enc is not None:
        # trick for the masked attention: take Query/Key slices from the encoder output and
        # concatenate them with part of the decoder residual (the slice sizes assume embed_size = 64)
        Query, Key = np.split(enc, 2, axis=-1)
        enc_ = np.concatenate((Query[:, :, :16], Key[:, :, :16], Residual[:, :, :32]), axis=-1)
        MaskedMultiHeads, _ = multiHeads_Attention(enc_,
                                                   Wieghts_QKY.detach().numpy().T,
                                                   head,
                                                   Wieghts_out.detach().numpy().T,
                                                   mask=mask.numpy())
        Residual = NormLayar(enc + MaskedMultiHeads, eps)

    # position-wise feed-forward network with residual connection and layer norm
    output = NormLayar(Residual + ReLU(np.matmul(Residual, FullyLinear1)) @ FullLinear2, eps)
    return output

Transformer Model


from blocks.encoder import * 
from blocks.decoder import * 
import numpy as np

class Transformer:
    def __init__(self, embedding_size, heads):
        super(Transformer, self).__init__()

        """
        Building blocks
        ---------------
        Encoder : used without the mask at this stage
        Decoder : includes the mask, with the trick of splitting the Query and Key

        Parameters / attributes
        -----------------------
        embedding_size : embedding dimension of the input
        W_QKY_Encoder / W_QKY_Decoder : learnable QKV weight matrices fed into the Encoder / Decoder
        W_Out_Encoder / W_Out_Decoder : output projection weights of the Encoder / Decoder attention
        Linear1_* / Linear2_* : dense layers of the Encoder / Decoder feed-forward blocks

        The weights are read from pre-built PyTorch layers (transEncoder and transDecoder)
        that are expected to exist in the surrounding scope.
        """

        self.embed_input = embedding_size
        self.heads = heads
        self.eps = 1e-12
        self.W_QKY_Encoder = transEncoder.self_attn.in_proj_weight
        self.W_Out_Encoder = transEncoder.self_attn.out_proj.weight
        self.W_QKY_Decoder = transDecoder.self_attn.in_proj_weight
        self.W_Out_Decoder = transDecoder.self_attn.out_proj.weight
        self.Linear1_Encoder = transEncoder.linear1.weight
        self.Linear2_Encoder = transEncoder.linear2.weight
        self.Linear1_Decoder = transDecoder.linear1.weight
        self.Linear2_Decoder = transDecoder.linear2.weight

    def forward(self, enc_, dec_, mask):
        Encoder_ = TransfomerEncoder(enc_, None,
                                     self.heads, self.W_QKY_Encoder,
                                     self.W_Out_Encoder, self.Linear1_Encoder.detach().numpy().T,
                                     self.Linear2_Encoder.detach().numpy().T, self.eps)

        Decoder_ = TransfomerDecoder(dec_, mask,
                                     self.heads, self.W_QKY_Decoder,
                                     self.W_Out_Decoder, self.Linear1_Decoder.detach().numpy().T,
                                     self.Linear2_Decoder.detach().numpy().T, self.eps, enc=Encoder_)

        return Decoder_
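
An end-to-end sketch of driving the class above. It assumes transEncoder and transDecoder are batch_first nn.TransformerEncoderLayer / nn.TransformerDecoderLayer instances created beforehand in the scope the class reads them from (e.g. in main.py), and that the embedding size is 64 so it matches the slice sizes hard-coded in TransfomerDecoder.

import torch
import torch.nn as nn

embed_size, heads, seq_len, batch = 64, 4, 10, 2

# PyTorch layers whose weights the numpy blocks reuse
transEncoder = nn.TransformerEncoderLayer(d_model=embed_size, nhead=heads, batch_first=True)
transDecoder = nn.TransformerDecoderLayer(d_model=embed_size, nhead=heads, batch_first=True)

model = Transformer(embedding_size=embed_size, heads=heads)

enc_in = torch.randn(batch, seq_len, embed_size)   # encoder input embeddings
dec_in = torch.randn(batch, seq_len, embed_size)   # decoder input embeddings
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)  # look-ahead mask

out = model.forward(enc_in, dec_in, mask)
print(out.shape)                                   # (2, 10, 64)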