Large Language Models (LLMs) are all the rage these days! They have revolutionized areas like natural language processing and speech recognition, and lately computer vision and generative AI (think of models like Sora), enabling machines to understand and generate text, images, and video with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their expensive memory requirements, for both training and inference. Training an LLM demands humongous computing and energy resources, costing hundreds of thousands of dollars. This puts LLMs out of reach of everyone except the large corporations that can afford such resources.

 

Some amazing recent developments in compression methods have now changed the situation dramatically. One can now create small versions of LLMs that can be deployed on cheap hardware and achieve performance that is not much worse than that of the full-size models.

 
One of the key ingredients is a set of recent methods for quantizing models. Previous quantization methods needed calibration data, which is problematic: access to data can be restricted, the data may be biased or of poor quality, and calibration takes a very long time, from several hours to days. A new method called HQQ (Half-Quadratic Quantization) is able to quantize without any calibration data while running orders of magnitude faster, even 100 times faster, so huge networks can be quantized in minutes. The magic behind HQQ is to bring in some classic techniques from convex optimization. The authors formulate an optimization problem involving a sparsity-inducing norm and then decompose it into two sub-problems, which are solved in alternating fashion. The nice thing is that both sub-problems have direct closed-form solutions that are classics in the convex-optimization literature. The algorithm therefore does not need any gradient iterations; it simply applies the two closed-form solutions alternately, and the whole procedure converges in very few steps, which explains the huge gain in speed.
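To make the alternating scheme concrete, here is a minimal sketch of a calibration-free quantizer in the same spirit. It is an illustration, not HQQ's actual implementation: we fix the scale, use the soft-thresholding operator (the closed-form proximal operator of the l1 norm) as a stand-in for HQQ's sparsity-inducing norm, and alternate it with a closed-form zero-point update. All names and hyperparameters are our own.

```python
import numpy as np

def soft_threshold(x, beta):
    # Closed-form proximal operator of the l1 norm (shrinkage):
    # the solution of one of the two sub-problems, no gradients needed.
    return np.sign(x) * np.maximum(np.abs(x) - beta, 0.0)

def hqq_like_quantize(W, bits=8, iters=20, beta=1e-3):
    # Fix a per-tensor scale from the weight range; only the
    # zero-point z and a sparse error term e are updated.
    qmin, qmax = 0, 2 ** bits - 1
    s = (W.max() - W.min()) / (qmax - qmin)
    z = -W.min() / s
    e = np.zeros_like(W)  # sparse error absorbing outliers
    for _ in range(iters):
        Wq = np.clip(np.round(W / s + z), qmin, qmax)  # quantize
        Wdq = s * (Wq - z)                             # dequantize
        e = soft_threshold(W - Wdq, beta)   # sub-problem 1: shrinkage
        z = np.mean(Wq - (W - e) / s)       # sub-problem 2: zero-point
    Wq = np.clip(np.round(W / s + z), qmin, qmax)
    return Wq.astype(np.int32), s, z
```

Both update steps are simple closed-form expressions applied in alternation, which is why the loop needs so few iterations and no calibration data: only the weights themselves are used.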
 
 
While such post-training quantization works well at INT8 precision, going to extreme low precision (such as 1 or 2 bits) can really degrade performance. One would then like to do some fine-tuning to regain performance, but fine-tuning the huge model is prohibitively expensive. Moreover, the gradient descent used for model training observes zero gradients nearly everywhere at low precision, since rounding is piecewise constant, so it cannot make meaningful updates to the quantized weights. Here a technique called QLoRA has been very effective. LoRA (Low-Rank Adaptation of Large Language Models) does not train the whole model at all; instead it adds "adapters", very small trainable matrices (generally less than 1% of the full model's parameters), while keeping the rest of the model frozen. QLoRA is LoRA applied to a quantized model.
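The adapter idea is simple enough to sketch in a few lines. The following is a hypothetical minimal example, not the actual LoRA or QLoRA implementation: the base weight W stays frozen (in QLoRA it would be quantized), and only two small matrices A and B, whose product forms a rank-r update, are trained in full precision. With r much smaller than the layer dimensions, the adapter adds well under 1% extra parameters for this layer.

```python
import numpy as np

class LoRALinear:
    def __init__(self, W, r=4, alpha=8, seed=0):
        # W is the frozen base weight; only A and B are trained.
        # A starts at zero, so the adapter is initially a no-op
        # and training begins from the base model's behavior.
        d_out, d_in = W.shape
        rng = np.random.default_rng(seed)
        self.W = W
        self.A = np.zeros((d_out, r))
        self.B = rng.normal(scale=0.01, size=(r, d_in))
        self.scale = alpha / r

    def forward(self, x):
        # y = W x + (alpha/r) * A (B x): the low-rank path is kept
        # separate, so W can stay frozen in low precision while the
        # small full-precision factors receive the gradients.
        return self.W @ x + self.scale * (self.A @ (self.B @ x))

d = 2048
layer = LoRALinear(np.random.randn(d, d), r=4)
adapter_params = layer.A.size + layer.B.size
fraction = adapter_params / layer.W.size  # about 0.4% extra parameters
```

Because gradients flow only into A and B, the zero-gradient problem of the quantized weights never arises: the frozen low-precision matrix is just a constant in the forward pass.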
 

The results are quite dramatic! The authors of HQQ and their collaborators report that directly applying 1-bit quantization to small models like Llama2-7B yields suboptimal results. However, when the model is fine-tuned, its output quality improves substantially. Remarkably, the fine-tuned 1-bit base model surpasses the performance of QuIP# at 2 bits, despite being trained on only ~2.8K samples with a context window of 1024.
 

2-bit: when given more specialized data, a 2-bit model can perform very well. In fact, the base Llama2-7B 2-bit model, quantized with a version of HQQ and fine-tuned with QLoRA, outperforms the full-precision model on wikitext. The chat model outperforms its full-precision version on GSM8K when given enough math and reasoning data.
 

QLoRA is part of a larger toolkit of methods called Parameter-Efficient Fine-Tuning (PEFT), which train only a very small part of the full network (often less than 1%). With PEFT methods, it becomes possible to fine-tune LLMs on smaller compute resources and in more reasonable time frames (minutes instead of hours or days) and to recover performance almost to the level of the original models.
 

These are very exciting days: a combination of such techniques promises to deliver LLMs that can run on consumer hardware, greatly expanding the range of innovations they can enable. Embedl is excited to be part of enabling this transformation!

 

Professor Devdatt Dubhashi


Chief Scientific Officer and co-founder

Professor Devdatt Dubhashi, Chief Scientific Officer and co-founder of Embedl, is Professor in the Data Science and AI Division of the Department of Computer Science and Engineering at Chalmers University of Technology. He received his Ph.D. in Computer Science from Cornell University, USA, and was a postdoctoral fellow at the Max Planck Institute for Computer Science in Saarbrücken, Germany. He was with BRICS (Basic Research in Computer Science, a center of the Danish National Science Foundation) at the University of Aarhus and then on the faculty of the Indian Institute of Technology (IIT) Delhi before joining Chalmers in 2000. He has led several national projects in machine learning and has been associated with several EU projects. He has been an external expert for the OECD report on "Data-Driven Innovation" and has published regularly in premier machine learning and AI venues such as NIPS, ICML, and AAAI.
