1.0 – Getting started with Transformers for NLP

Posted on November 22, 2021 by Aritra Sen

In this post we will go through a hands-on implementation of the Hugging Face transformers library for solving a few simple NLP tasks. We will mainly focus on the hands-on part; in case you are interested in learning more about transformers and the attention mechanism, below are a few resources –

  • Getting started with Google Bert
  • Neural Machine Translation
  • Illustrated Transformer

A few of the prerequisites for this post are –

  • Basic knowledge of transformer architecture
  • Deep learning
  • Basics of Pytorch implementation

So if you have PyTorch or TensorFlow installed on your system, you can straightaway install the Hugging Face transformers library with the command below –

! pip install transformers

At first we will start with the very basics of transformers: what a pipeline is, the inner workings of tokenizers, how to use pre-trained models, and how to do sentence classification with a few lines of code using pre-trained models. Then in the next post we will fine-tune a pre-trained BERT model in PyTorch for Twitter sentiment analysis.

Let’s first import the required libraries

import transformers
import torch
import numpy as np
from torch.nn import functional as F
import pandas as pd
import tqdm

Pipelines

We will go through the pipeline component of transformers. Pipelines are a great and easy way to use models for inference.

Pipelines are made of:

  • A tokenizer in charge of mapping raw textual input to tokens.
  • A model to make predictions from the inputs.
  • Some (optional) post-processing for enhancing the model's output.

Next we will try sentiment classification with a pipeline –
classifier = transformers.pipeline('sentiment-analysis') # mention the pipeline name 
result = classifier(['We are learning tranformers'])
print(result)
[{'label': 'POSITIVE', 'score': 0.9731776118278503}]

In the above snippet, thanks to the pipeline abstraction, we can pass a task name to perform that task with minimal coding. The classifier output returns the label along with a probability score. Below are a few of the currently supported tasks, followed by a quick example of one of them –

    "audio-classification"

    "automatic-speech-recognition"

    "image-classification"

    "question-answering"

    "text-classification" (alias "sentiment-analysis" available)

We can also do the same with multiple sentences, as shown below –

results = classifier(['We are learning tranformers', 'I am not happy'])


for result in results:
    print(result)
{'label': 'POSITIVE', 'score': 0.9731776118278503}
{'label': 'NEGATIVE', 'score': 0.9997896552085876}

We can also pass a pre-trained model name as an argument to the pipeline for the sentiment analysis classification task. Here we will get the same result, as under the hood the same model is used in the previous step and in this current step.

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
classifier = transformers.pipeline('sentiment-analysis' , model=model_name)
results = classifier(['We are learning tranformers', 'I am not happy'])

for result in results:
    print(result)
{'label': 'POSITIVE', 'score': 0.9731776118278503}
{'label': 'NEGATIVE', 'score': 0.9997896552085876}

Based on the task we want to perform, we can select different types of models. Using the Hugging Face model hub documentation you can choose the model or checkpoint name.
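
For instance, a sentiment model that predicts a 1-to-5 star rating can be plugged in the same way. A minimal sketch, assuming the 'nlptown/bert-base-multilingual-uncased-sentiment' checkpoint is still available on the hub (its labels are star ratings rather than POSITIVE/NEGATIVE) –

star_classifier = transformers.pipeline('sentiment-analysis',
                                        model='nlptown/bert-base-multilingual-uncased-sentiment')
print(star_classifier(['We are learning tranformers']))  # e.g. a label such as '4 stars' with a score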

We can also choose a custom tokenizer along with the model inside the pipeline to do the classification task, as shown below; we will again get the same result, as both the tokenizer and the model are the same as in the previous steps. Generally, to match the architecture and other properties, we use the same model_name for both the tokenizer and the model. The from_pretrained method is used to fetch the weights and configuration of pre-trained models.

from transformers import AutoTokenizer ,AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = transformers.pipeline('sentiment-analysis' , model=model, tokenizer=tokenizer)
results = classifier(['We are learning tranformers', 'I am not happy'])

for result in results:
    print(result)

{'label': 'POSITIVE', 'score': 0.9731776118278503}
{'label': 'NEGATIVE', 'score': 0.9997896552085876}

Tokenizer:

Now we will discuss a few of the inner workings of the Hugging Face tokenization process. Below are a few things to keep in mind –

1. Basic tokens as output from the tokenizer. We will use the previously created tokenizer object for the steps below –
tokens = tokenizer.tokenize('We are learning tranformers')
print(tokens)
['we', 'are', 'learning', 'tran', '##form', '##ers']
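
The '##' prefixed pieces appear because the misspelled word 'tranformers' is not in the vocabulary, so the WordPiece tokenizer falls back to sub-word units. The correctly spelled word is a single vocabulary entry (it shows up as token id 19081 in the batch example later in this post); a minimal sketch –

print(tokenizer.tokenize('We are learning transformers'))
# expected output: ['we', 'are', 'learning', 'transformers']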

2. We get token_ids from the tokens using the tokenizer object; with this process each token gets mapped to a unique token id (for example, 'we' -> 2057).

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
[2057, 2024, 4083, 25283, 14192, 2545]

3. Now we look at tokenizing a complete sentence and printing the corresponding output. If you look closely at the output below, you can see two things have been printed: input_ids, which is the same as the token_ids above apart from two extra tokens that mark the beginning and end of the sentence, and attention_mask, which tells us whether the tokens are padding or not. If the input_ids are not padding then the corresponding values are 1, and if they are padding then the values are 0.

print(tokenizer('We are learning tranformers'))
{'input_ids': [101, 2057, 2024, 4083, 25283, 14192, 2545, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
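
The two extra ids, 101 and 102, are the special tokens this checkpoint uses to mark the start and end of a sequence. We can map the ids back to tokens to see them; a minimal sketch –

print(tokenizer.convert_ids_to_tokens([101, 2057, 2024, 4083, 25283, 14192, 2545, 102]))
# expected output: ['[CLS]', 'we', 'are', 'learning', 'tran', '##form', '##ers', '[SEP]']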

4. When we work with batches of sentences, we need to be mindful that the sentences may not all be of the same length. To tackle this we need to enable a few options while doing the tokenization –

padding=True – pads the shorter sentences to the length of the longest sentence in the batch.
truncation=True – truncates sentences longer than the given max_length.
return_tensors='pt' – returns the tokens by converting them to PyTorch tensors.

data = ['We are learning tranformers', 'I am not happy' , 'We are happy to learn transformers']

batch = tokenizer(data,padding=True,truncation=True,max_length=512,return_tensors='pt')

print(batch)
{'input_ids': tensor([[  101,  2057,  2024,  4083, 25283, 14192,  2545,   102],
        [  101,  1045,  2572,  2025,  3407,   102,     0,     0],
        [  101,  2057,  2024,  3407,  2000,  4553, 19081,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
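
To see the padding explicitly, we can decode the shorter second sentence back from its input_ids; a minimal sketch, assuming this checkpoint's padding token is '[PAD]' (id 0) –

print(tokenizer.decode(batch['input_ids'][1]))
# expected output: [CLS] i am not happy [SEP] [PAD] [PAD]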

Making Predictions with pre-trained models for a batch

We are using the model defined earlier to make predictions on the batch of data we created in the previous step; inline comments are provided for explanation.

with torch.no_grad(): # disable gradient computation
    results = model(**batch , labels=torch.tensor([1,0,1]))  # without labels loss won't be printed 
    print(results)
    predictions = torch.softmax(results.logits,dim=1) # normalization of logits
    print(predictions)
    classes = torch.argmax(predictions,dim=1) #taking argmax to select the class with highest probability
    print(classes)
    labels = [model.config.id2label[c] for c in classes.tolist()] # pretrained models have an id2label property to get the class names
    print(labels)
SequenceClassifierOutput(loss=tensor(0.0092), logits=tensor([[-1.8314,  1.7600],
        [ 4.7199, -3.7467],
        [-4.1593,  4.4165]]), hidden_states=None, attentions=None)
tensor([[2.6822e-02, 9.7318e-01],
        [9.9979e-01, 2.1033e-04],
        [1.8858e-04, 9.9981e-01]])
tensor([1, 0, 1])
['POSITIVE', 'NEGATIVE', 'POSITIVE']
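
As a quick sanity check, the same batch of sentences can be passed through the classifier pipeline we built earlier; the labels (and, up to rounding, the scores) should match the manual forward pass above, since the pipeline wraps the same tokenizer, model and softmax post-processing –

print(classifier(data))  # expected labels: POSITIVE, NEGATIVE, POSITIVE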

In the next post we will put everything together and fine-tune a pre-trained BERT model for Twitter sentiment analysis, both in PyTorch and using Hugging Face's built-in training process.

Thanks for reading and please comment if you have any questions.

Category: Aritra Sen, Machine Learning, Python
