
1.2 – Fine Tune a Transformer Model (2/2)

Posted on December 2, 2021 by Aritra Sen

In the last post we discussed that we can fine tune a BERT model using the following two techniques –

  1. Update the weights of the pre-trained BERT model along with the classification layer.
  2. Update only the weights of the classification layer and not the pre-trained BERT model. This amounts to using the pre-trained BERT model as a feature extractor (see the short sketch below).
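
The practical difference between the two boils down to which parameters receive gradient updates. Below is a minimal sketch of the two modes (the names bert and classifier are illustrative placeholders here; the actual code appears later in this post) –

# Technique 1: full fine tuning - every parameter stays trainable
for param in bert.parameters():
    param.requires_grad = True    # gradients flow through the whole encoder

# Technique 2: feature extraction - freeze the encoder, train only the new head
for param in bert.parameters():
    param.requires_grad = False   # BERT weights are never updated
for param in classifier.parameters():
    param.requires_grad = True    # only the classification head learns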

In the last post we also did the hands-on coding for the first approach, where we updated the weights of the pre-trained BERT model along with the classification layer. In this post we will cover how to update only the weights of the classification layer and not the pre-trained BERT model. We will use the pre-trained model to extract features and then add a classifier on top of it, as shown below –
Fine Tuning BERT


The dataset we will use is the same Kaggle tweet sentiment analysis dataset (Sentiment140) that we used in the last post.

# Importing the required Library
import transformers
import torch
import numpy as np
from torch.nn import functional as F
import pandas as pd
import tqdm

# Reading the dataset with no columns titles and with latin encoding 
df_raw = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv', encoding = "ISO-8859-1", header=None)

# As the data has no column titles, we will add our own
df_raw.columns = ["label", "time", "date", "query", "username", "text"]
# Omitting every column except for the text and the label, as we won't need any of the other information
# (.copy() avoids pandas' SettingWithCopyWarning when we modify the slice below)
df = df_raw[['label', 'text']].copy()

# map the original labels (4 = positive, 0 = negative) to 1 / 0
label_dict = {4: 1, 0: 0}
df.loc[:, 'label'] = df['label'].map(label_dict)

# doing the train / test split

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(df['text'].values, df['label'].values, test_size=.2)

Next few steps are same as last post we will tokenize the train and validation texts and then create datasets and dataloaders.

from transformers import AutoTokenizer, AdamW, AutoModel
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

train_input_ids = []
train_attention_mask = []
for text in tqdm.tqdm(train_texts):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    train_input_ids.append(encoding['input_ids'])
    train_attention_mask.append(encoding['attention_mask'])
train_input_ids = torch.cat(train_input_ids,dim=0)
train_attention_mask = torch.cat(train_attention_mask,dim=0)

val_input_ids = []
val_attention_mask = []
for text in tqdm.tqdm(val_texts):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=64,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    val_input_ids.append(encoding['input_ids'])
    val_attention_mask.append(encoding['attention_mask'])
val_input_ids = torch.cat(val_input_ids,dim=0)
val_attention_mask = torch.cat(val_attention_mask,dim=0)


# creating the train and validation datasets first then dataloaders
train_dataset = torch.utils.data.TensorDataset(train_input_ids,
                                               train_attention_mask,
                                               torch.tensor(train_labels,dtype=torch.long))
val_dataset = torch.utils.data.TensorDataset(val_input_ids,
                                             val_attention_mask,
                                             torch.tensor(val_labels,dtype=torch.long))

train_loader = torch.utils.data.DataLoader(train_dataset,shuffle=True,batch_size=32)
val_loader = torch.utils.data.DataLoader(val_dataset,shuffle=False,batch_size=32)

# selecting the device based on GPU availability
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In the last post we used the statement below to load the pre-trained BERT model for sentiment analysis –

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')


This pre-trained model is specialized for sequence classification and already includes the classification layer on top of the feature extractor. In this tutorial we want to use BERT only for feature extraction, and on top of it we want to add a trainable classifier. For this purpose we will load the pre-trained model using the AutoModel class from the transformers library and freeze its weights. Freezing the weights makes this part of the model non-trainable (requires_grad=False, i.e. no gradient computation), as shown in the steps below –

# Pretrained bert model for feature extraction
bert = AutoModel.from_pretrained('bert-base-uncased')

# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False

print(bert)
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      ... (BertLayer blocks (1) through (11) repeat the same structure as block (0) and are omitted here for brevity) ...
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
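
As a quick optional sanity check (this snippet is an addition here, not part of the original run), we can confirm that the freeze took effect by counting trainable parameters –

# sanity check: after freezing, no BERT parameter should require gradients
total_params = sum(p.numel() for p in bert.parameters())
trainable_params = sum(p.numel() for p in bert.parameters() if p.requires_grad)
print(f'total: {total_params:,} | trainable: {trainable_params:,}') # trainable should be 0
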
# let's pass one batch of input_ids and attention masks to the BERT model and see how the output looks
for batch in train_loader:
    output = bert(batch[0],batch[1])
    break
print(output)
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 7.6130e-02,  3.0028e-01,  1.9767e-01,  ..., -4.6484e-01,
           6.5636e-01,  2.2707e-01],
         [ 1.0437e+00,  5.4522e-01,  5.0423e-01,  ..., -1.2399e-01,
           6.6002e-01,  4.3423e-01],
         ...,
         [-6.2698e-02,  9.6016e-02,  1.2038e-01,  ...,  2.0962e-01,
          -4.1216e-01, -4.7183e-01]],

        ...]), pooler_output=tensor([[-0.9039, -0.4003, -0.7570,  ..., -0.3390, -0.6723,  0.9357],
        [-0.6973, -0.3033, -0.6883,  ..., -0.3146, -0.5737,  0.8383],
        ...,
        [-0.7645, -0.5344, -0.9521,  ..., -0.7028, -0.7307,  0.7830]]), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

We are interested in the pooler_output rather than the last_hidden_state; this pooler_output is what we will feed to the newly added classification layer –

print(output['pooler_output'])
tensor([[-0.9039, -0.4003, -0.7570,  ..., -0.3390, -0.6723,  0.9357],
        [-0.6973, -0.3033, -0.6883,  ..., -0.3146, -0.5737,  0.8383],
        [-0.7385, -0.3278, -0.4051,  ...,  0.0765, -0.6580,  0.8070],
        ...,
        [-0.5511, -0.2757, -0.8041,  ..., -0.6374, -0.6196,  0.7604],
        [-0.7697, -0.2591, -0.6360,  ..., -0.1250, -0.5208,  0.8672],
        [-0.7645, -0.5344, -0.9521,  ..., -0.7028, -0.7307,  0.7830]])
print(output['pooler_output'].shape) # batch size, feature size
torch.Size([32, 768])
print(output['last_hidden_state'].shape) # batch size, sequence length, feature size
torch.Size([32, 64, 768])
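
As an aside, the pooler_output is simply the final hidden state of the [CLS] token passed through an additional dense layer with a tanh activation. Another common choice (shown here only as a sketch; we stick with pooler_output below) is to take the [CLS] vector straight from last_hidden_state –

# alternative to pooler_output: the raw [CLS] token embedding at sequence position 0
cls_embedding = output['last_hidden_state'][:, 0, :]
print(cls_embedding.shape) # torch.Size([32, 768]) - same shape as pooler_output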

Next we will build a PyTorch model class that takes the frozen BERT feature extractor as input and adds the classifier on top of it.

class BERT_Custom_Clf(torch.nn.Module):

    def __init__(self, bert):
        super(BERT_Custom_Clf, self).__init__()
        
        self.bert = bert
        # dropout layer
        self.dropout = torch.nn.Dropout(0.1)
        
        # relu activation function
        self.relu =  torch.nn.ReLU()
        
        # fully connected layer 1
        self.fc1 = torch.nn.Linear(768,512)

        # fully connected layer 2 (Output layer)
        self.fc2 = torch.nn.Linear(512,2)
        
        # log-softmax activation (pairs with the NLLLoss criterion defined later)
        self.softmax = torch.nn.LogSoftmax(dim=1)

    #define the forward pass
    def forward(self, input_id, attention_mask):

        #pass the inputs to the model
        output = self.bert(input_id, attention_mask=attention_mask)
      
        x = self.fc1(output['pooler_output']) # passing the pooler_output to fully connected layer
        
        x = self.relu(x)
        
        x = self.dropout(x)

        # output layer
        x = self.fc2(x)

        # apply softmax activation
        x = self.softmax(x)
        
        return x 
# pass the pre-trained BERT to our defined architecture
model = BERT_Custom_Clf(bert)

# push the model to device and make it ready for training
model = model.to(device)

model.train()

# create the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)
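
Since BERT's weights are frozen, an optional refinement (a sketch; the full parameter list works too, as frozen parameters simply receive no updates) is to hand the optimizer only the parameters that still require gradients –

# optional alternative: pass only the trainable parameters to the optimizer
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=5e-5)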

# define the loss function
criterion = torch.nn.NLLLoss()
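
A quick note on this pairing: since the model's forward pass ends with LogSoftmax, we use NLLLoss. Applying CrossEntropyLoss directly to raw logits would be numerically equivalent, as this small illustrative check (with made-up values) shows –

# illustrative check: LogSoftmax + NLLLoss equals CrossEntropyLoss on raw logits
logits = torch.tensor([[1.2, -0.3], [0.1, 0.9]]) # hypothetical scores: 2 examples, 2 classes
targets = torch.tensor([0, 1])
log_probs = F.log_softmax(logits, dim=1)
print(torch.nn.NLLLoss()(log_probs, targets))       # same value as the line below
print(torch.nn.CrossEntropyLoss()(logits, targets))
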
# train the model for one epoch

batch_labels = []
batch_prediction = []
for batch in tqdm.tqdm(train_loader):
    optimizer.zero_grad()
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)
    outputs = model(input_ids, attention_mask)
    loss = criterion(outputs,labels)
    preds = torch.argmax(outputs,dim=1)
    loss.backward()
    optimizer.step()
    batch_labels.extend(labels.cpu().numpy())
    batch_prediction.extend(preds.cpu().numpy())
100%|██████████| 40000/40000 [42:24<00:00, 15.72it/s]

from sklearn.metrics import f1_score, accuracy_score
def calculate_model_performance(labels, prediction):
    print('F1 Score:', f1_score(labels, prediction))
    print('Accuracy:', accuracy_score(labels, prediction))


print('Training Performance')
calculate_model_performance(batch_labels, batch_prediction)
Training Performance
F1 Score: 0.7406593270216731
Accuracy: 0.738784375
# making predictions on the validation dataset

batch_labels = []
batch_prediction = []
model.eval()
for batch in tqdm.tqdm(val_loader):
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels = batch[2].to(device)
    with torch.no_grad():
        outputs = model(input_ids,attention_mask)
    preds = torch.argmax(outputs,dim=1)
    batch_labels.extend(labels.cpu().numpy())
    batch_prediction.extend(preds.cpu().numpy())
100%|██████████| 10000/10000 [10:18<00:00, 16.16it/s]
print('Validation Performance')
calculate_model_performance(batch_labels, batch_prediction)
Validation Performance
F1 Score: 0.769396762500276
Accuracy: 0.771578125

Running the training loop for more epochs can further improve the performance on the validation set.
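
A sketch of how the single-epoch loop above could be extended to multiple epochs, with a validation pass after each one (the epoch count is an illustrative choice, not from the original run) –

# minimal multi-epoch sketch; EPOCHS = 3 is an illustrative choice
EPOCHS = 3
for epoch in range(EPOCHS):
    model.train()
    for batch in tqdm.tqdm(train_loader):
        optimizer.zero_grad()
        input_ids, attention_mask, labels = [t.to(device) for t in batch]
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # validation pass after each epoch
    model.eval()
    val_labels_all, val_preds_all = [], []
    with torch.no_grad():
        for batch in val_loader:
            input_ids, attention_mask, labels = [t.to(device) for t in batch]
            outputs = model(input_ids, attention_mask)
            preds = torch.argmax(outputs, dim=1)
            val_labels_all.extend(labels.cpu().numpy())
            val_preds_all.extend(preds.cpu().numpy())
    print(f'Epoch {epoch + 1} validation performance:')
    calculate_model_performance(val_labels_all, val_preds_all)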

Thanks for reading and please comment if you have any questions.
