
Generative AI: LLMs: How to do LLM inference on CPU using Llama-2 1.9

Posted on September 7, 2023 by Aritra Sen

In the last few posts, we talked about how to use the Llama-2 model for different NLP tasks, and in most cases I used a GPU in Kaggle kernels. However, there can be situations where you don’t have a GPU and need to build apps using a CPU only. In this short post we will see how to use the ctransformers library to load Llama-2 and run inference on a CPU. ctransformers provides Python bindings for Transformer models implemented in C/C++ using the GGML library. Run the command below to install the ctransformers library.
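The library is published on PyPI, so a single pip command is enough:

```
pip install ctransformers
```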

The ctransformers library essentially helps to load quantized models on a CPU. With the ever-increasing size of LLMs, quantization plays a crucial role in running these giant models efficiently on commodity hardware with minimal compromise in model performance. Recently, 8-bit and 4-bit quantization has made it possible to run LLMs on consumer hardware. GGML (created by Georgi Gerganov, hence the name) was designed to be used with the llama.cpp library, which is written in C/C++ for efficient inference of Llama models and can load GGML models and run them on a CPU. You can get the Llama-2 7B GGML models at different quantization levels from this Hugging Face link – TheBloke/Llama-2-7B-Chat-GGML at main (huggingface.co)

Based on your choice of quantization, you can download the corresponding model file and place it in a local folder, as shown below –

As you can see, I have downloaded two quantized models from the link above: a 2-bit and a 4-bit quantized model.
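The original folder screenshot is not reproduced here; the layout looked roughly like the sketch below. The file names follow TheBloke’s GGML naming convention and are illustrative, not the exact files shown in the screenshot:

```
llama2_models/
├── llama-2-7b-chat.ggmlv3.q2_K.bin
└── llama-2-7b-chat.ggmlv3.q4_K_M.bin
```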

They follow a particular naming convention: “q” + the number of bits used to store the weights (precision) + a particular variant. Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by TheBloke:

  • q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
  • q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_s: Uses Q3_K for all tensors
  • q4_0: Original quant method, 4-bit.
  • q4_1: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models.
  • q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
  • q4_k_s: Uses Q4_K for all tensors
  • q5_0: Higher accuracy, higher resource usage and slower inference.
  • q5_1: Even higher accuracy, resource usage and slower inference.
  • q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
  • q5_k_s: Uses Q5_K for all tensors
  • q6_k: Uses Q8_K for all tensors
  • q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

Once you have downloaded the model and placed it in your local file system, you can easily load it using the process shown below and see how fast the model loads from disk –
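Here is a minimal sketch of the loading step, assuming the 4-bit GGML file has been saved locally; the path and file name are placeholders, so adjust them to wherever you placed your download:

```python
import time

from ctransformers import AutoModelForCausalLM

start = time.time()

# Load the locally saved 4-bit GGML model; ctransformers runs on the CPU by default.
# The file path below is a placeholder for your local download.
llm = AutoModelForCausalLM.from_pretrained(
    "llama2_models/llama-2-7b-chat.ggmlv3.q4_K_M.bin",
    model_type="llama",    # tells ctransformers this is a Llama-family model
    max_new_tokens=256,    # generation settings for later calls
    temperature=0.7,
)

print(f"Model loaded in {time.time() - start:.2f} seconds")

# The loaded model is directly callable for text generation.
print(llm("Explain quantization of large language models in two sentences."))
```

By default no layers are offloaded to a GPU, so this runs entirely on the CPU.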

The 4-bit quantized Llama-2 model takes around 3.53 GB of disk space. In a similar way, you can also load the 13B Llama-2 models on a CPU for inference; they are available at this link – https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main

Using the 4-bit quantized Llama-2 model and Gradio, I have created the demo shown below using a CPU only.

Demo of LLM using Llama-2 and Gradio
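The demo itself is only shown in the video; below is a minimal sketch of how such a Gradio app could be wired up with ctransformers. The file path and the prompt template are assumptions, not the exact code used in the video:

```python
import gradio as gr
from ctransformers import AutoModelForCausalLM

# Placeholder path to the locally downloaded 4-bit GGML model.
llm = AutoModelForCausalLM.from_pretrained(
    "llama2_models/llama-2-7b-chat.ggmlv3.q4_K_M.bin",
    model_type="llama",
    max_new_tokens=256,
)

def answer(prompt):
    # Wrap the user input in the Llama-2 chat instruction format and generate on CPU.
    return llm(f"[INST] {prompt} [/INST]")

demo = gr.Interface(fn=answer, inputs="text", outputs="text",
                    title="Llama-2 7B (4-bit, CPU only)")
demo.launch()
```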

Do let me know in the comments whether you liked the video, and whether you would like me to create YouTube videos along with blog posts in the future. Thanks for reading.

Reference: Quantize Llama models with GGML and llama.cpp | Towards Data Science
