20 Mar 2025

OPTIMIZING LLAMA 3.2 1B USING QUANTIZATION TECHNIQUES USING BITSANDBYTES FOR EFFICIENT AI DEPLOYMENT

  • Dept. of Manufacturing Engineering and Industrial Management, COEP Technological University Pune, India.
  • Dept. of Mechanical Engineering, COEP Technological University, Pune, India.
  • Dept. of Computer Science & IT, COEP Technological University, Pune, India.

Large Language Models (LLMs) have transformed natural language processing, achieving state-of-the-art performance on a wide range of tasks. However, their high computational and memory requirements pose significant challenges for deployment, especially on resource-constrained hardware. In this paper, we conduct a controlled experiment to optimize the LLaMA 3.2 1B model using post-training quantization techniques implemented with the Bitsandbytes library. We evaluate multiple precision settings, namely BF16, FP16, INT8, and INT4, and compare their accuracy, throughput, latency, and resource-utilization tradeoffs. Experiments are conducted on a workstation GPU (NVIDIA T1000) for accuracy benchmarking and a cloud-based GPU (NVIDIA T4 on Google Colab) for performance benchmarking. Our findings show that lower-precision quantization can significantly reduce memory usage and improve throughput with minimal impact on model accuracy, providing valuable insights for efficient AI deployment in production environments.
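The exact quantization setup is not reproduced on this page; as a minimal sketch of post-training quantization with the Bitsandbytes library via Hugging Face Transformers (assuming the `meta-llama/Llama-3.2-1B` checkpoint, a CUDA GPU, and the `bitsandbytes` package installed), loading the model in INT4 might look like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Gated checkpoint: requires accepting the license on Hugging Face first.
model_id = "meta-llama/Llama-3.2-1B"

# 4-bit NF4 quantization; matmuls run in BF16 compute precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place quantized layers on the available GPU
)
```

For the INT8 setting, the config would instead be `BitsAndBytesConfig(load_in_8bit=True)`; BF16 and FP16 baselines need no quantization config, only `torch_dtype=torch.bfloat16` or `torch.float16` in `from_pretrained`.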


[Neeraj Maddel, Shantipal Ohol and Anish Khobragade (2025); OPTIMIZING LLAMA 3.2 1B USING QUANTIZATION TECHNIQUES USING BITSANDBYTES FOR EFFICIENT AI DEPLOYMENT. Int. J. of Adv. Res. (Mar). 78-88] (ISSN 2320-5407). www.journalijar.com


Neeraj Shashikant Maddel
COEP Technological University, Pune
India



Article DOI: 10.21474/IJAR01/20538
DOI URL: https://dx.doi.org/10.21474/IJAR01/20538