20 Mar 2025

OPTIMIZING LLAMA 3.2 1B USING QUANTIZATION TECHNIQUES USING BITSANDBYTES FOR EFFICIENT AI DEPLOYMENT

  • Dept. of Manufacturing Engineering and Industrial Management, COEP Technological University Pune, India.
  • Dept. of Mechanical Engineering, COEP Technological University, Pune, India.
  • Dept. of Computer Science & IT, COEP Technological University, Pune, India.

Large Language Models (LLMs) have transformed natural language processing, achieving state-of-the-art performance on a wide range of tasks. However, their high computational and memory requirements pose significant challenges for deployment, especially on resource-constrained hardware. In this paper, we conduct a controlled experiment to optimize the LLaMA 3.2 1B model using post-training quantization techniques implemented with the Bitsandbytes library. We evaluate multiple precision settings, namely BF16, FP16, INT8, and INT4, and compare their accuracy, throughput, latency, and resource-utilization tradeoffs. Experiments are conducted on a workstation GPU (NVIDIA T1000) for accuracy benchmarking and a cloud-based GPU (NVIDIA T4 on Google Colab) for performance benchmarking. Our findings show that lower-precision quantization can significantly reduce memory usage and improve throughput with minimal impact on model accuracy, providing valuable insights for efficient AI deployment in production environments.
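The exact quantization setup is not reproduced on this page; as a minimal sketch of post-training quantization with the Bitsandbytes library via Hugging Face Transformers (assuming the `meta-llama/Llama-3.2-1B` checkpoint, a CUDA GPU, and the `bitsandbytes` package installed), loading the model in INT4 might look like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Gated checkpoint: requires accepting the license on Hugging Face first.
model_id = "meta-llama/Llama-3.2-1B"

# 4-bit NF4 quantization; matmuls run in BF16 compute precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place quantized layers on the available GPU
)
```

For the INT8 setting, the config would instead be `BitsAndBytesConfig(load_in_8bit=True)`; BF16 and FP16 baselines need no quantization config, only `torch_dtype=torch.bfloat16` or `torch.float16` in `from_pretrained`.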


[Neeraj Maddel, Shantipal Ohol and Anish Khobragade (2025); OPTIMIZING LLAMA 3.2 1B USING QUANTIZATION TECHNIQUES USING BITSANDBYTES FOR EFFICIENT AI DEPLOYMENT. Int. J. of Adv. Res. (Mar). 78-88] (ISSN 2320-5407). www.journalijar.com


Neeraj Shashikant Maddel
COEP Technological University, Pune
India



Article DOI: 10.21474/IJAR01/20538
DOI URL: https://dx.doi.org/10.21474/IJAR01/20538