ChatGPT and GPT-4 ushered in a breathtaking revolution in AI, reaching over a hundred million users within a few months and making an impact across society. However, there is concern that the benefits of these technologies may not be shared as widely as desired. These monster large language models (LLMs) have hundreds of billions of parameters, and only the Big Tech companies have the resources to train them, putting them well beyond the reach of most companies and university research groups, let alone innovative startups.

Meta’s recent announcement that it would open source its LLM Llama 2 was heralded as a key step in opening LLMs to wider innovation. Nick Clegg, Meta’s president of global affairs and former UK deputy prime minister, said it was a way of using the “wisdom of crowds” to stimulate innovation and improve safety. Others have argued that it was simply a strategic tactic in the AI race against Meta’s competitors. Meta’s original LLaMA models range from 7 to 65 billion parameters, with the 13-billion-parameter version outperforming the far larger GPT-3 on most benchmarks.

Things started moving fast, as is typical of AI innovation these days. A group at Stanford took the 7-billion-parameter version of LLaMA and fine-tuned it on instruction-following data to create Alpaca, a far smaller model that replicated much of the behavior of OpenAI’s text-davinci-003. The Alpaca group released their data and training recipe so others could replicate the work. A blogger described it as LLMs’ “Stable Diffusion moment” as contributors started creating versions that run on consumer hardware, from ordinary GPUs to the Raspberry Pi 4 and the Google Pixel 6 phone.

The key to moving what was until recently cutting-edge research at Big Tech into the wider open innovation space is compression techniques that make the models smaller and more efficient. One such technique is quantization, which reduces memory and compute requirements by representing weights with fewer bits, for instance using 8-bit integers rather than 32-bit floating-point numbers. At first it seemed that such quantization would entail significant accuracy loss, but a recent paper at NeurIPS introduced a novel quantization scheme that achieved near-lossless compression of LLMs across model scales.
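
To make the idea concrete, here is a minimal sketch of symmetric 8-bit integer quantization in NumPy. It illustrates the general principle only, not the scheme from the NeurIPS paper; the function names and the toy weight matrix are invented for this example.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: float32 weights -> int8 values plus one scale."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)   # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 weights."""
    return q.astype(np.float32) * scale

# Toy example: a small random matrix standing in for one layer of an LLM.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize_int8(q, scale)).max())
```

Storing `q` instead of `w` cuts the memory for that tensor by roughly a factor of four; practical schemes refine this basic recipe with per-channel scales and special handling of outlier weights.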

Other techniques, including pruning (removing redundant weights) and distillation (training a small student model to mimic a large teacher), can be combined with quantization to make LLMs smaller and more efficient still; a sketch of the simplest pruning variant follows below. Embedl’s technology combines the best of these methods, positioning it as a leader in democratizing AI innovation.
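
In the same spirit, here is a hedged sketch of magnitude pruning, which zeroes out the weights with the smallest absolute values. Again, the names and toy data are illustrative, and this is a textbook baseline rather than Embedl’s method.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out roughly the `sparsity` fraction of weights with smallest magnitude."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.random.randn(8, 8).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)       # drop about half the weights
print("fraction zeroed:", np.mean(w_sparse == 0.0))
```

In practice these methods are complementary: a model can be pruned, distilled into a smaller student, and then quantized for deployment.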
