Embedl has reached a major milestone for efficient LLMs: we are introducing FlashHead, a training-free, hardware-friendly drop-in replacement for the dense classification head in language models, and sharing six fully optimized SLMs built with this technology. These models are now available on our Hugging Face page (https://huggingface.co/embedl).
As demand grows for Small Language Models (SLMs) on phones, embedded devices, and low-power edge hardware, it has become crucial to optimize these models for efficient execution. The bottleneck in SLMs today is the classification head, which is the output layer that translates hidden states into token probabilities. In modern models like Llama-3.2, Gemma-3, or Qwen-3, this layer accounts for 20–60% of the parameters, and up to 50% of the compute time. This layer scales directly with vocabulary size, and vocabulary sizes continue to increase. Embedl’s latest work solves this bottleneck in a fundamentally new way.
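To make the scale of this concrete, here is a rough back-of-the-envelope sketch in Python. The hidden sizes, vocabulary sizes, and total parameter counts are approximate public configuration values, used purely to illustrate the range quoted above.

```python
# Rough estimate of how much of a small model the dense output head occupies.
# Hidden/vocabulary sizes and totals are approximate public configs, for
# illustration only.
configs = {
    # model name: (hidden_size, vocab_size, total_params)
    "Llama-3.2-1B": (2048, 128_256, 1.24e9),
    "Gemma-3-270M": (640, 262_144, 0.27e9),
}

for name, (hidden, vocab, total) in configs.items():
    head_params = hidden * vocab  # untied dense head: one row per vocabulary token
    # At batch size 1, each decoded token multiplies a [1, hidden] state with
    # the full [hidden, vocab] matrix, so compute scales with the same product.
    print(f"{name}: head ≈ {head_params / 1e6:.0f}M params, "
          f"{head_params / total:.0%} of the model")
```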
FlashHead replaces the traditional dense output head with a retrieval-style, multi-stage architecture, inspired by information-retrieval systems rather than matrix-heavy deep learning layers. It keeps accuracy aligned with the original model, but dramatically reduces compute.
The method introduces four major innovations, all designed for real hardware efficiency:
Whereas prior work typically restricts retrieval to one or a few most likely clusters, we propose a new inference-efficient multi-probing algorithm. It departs significantly from existing applications of information retrieval in language models by scaling from hundreds of clusters to tens of thousands and searching them with hundreds to thousands of probes, a scale previous methods could not support, recovering the full accuracy of dense logits at a fraction of the cost.
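As a concrete illustration of the idea, here is a minimal sketch of multi-probing over a clustered vocabulary. This is not Embedl's implementation; the tensor shapes, the equal-size cluster table `cluster_tokens`, and the top-k probe selection are assumptions made for readability.

```python
import torch

def multi_probe_logits(hidden, centroids, cluster_tokens, head_weight, num_probes):
    """Sketch of multi-probing: score every cluster centroid cheaply, keep many
    probe clusters, and compute exact logits only for tokens inside them.

    hidden:         [d]     last hidden state for one position
    centroids:      [C, d]  one centroid per cluster
    cluster_tokens: [C, S]  token ids per cluster (equal size S)
    head_weight:    [V, d]  weights of the original dense head
    num_probes:     number of clusters to probe (hundreds to thousands)
    """
    # Stage 1: coarse scoring of all C centroids -- cheap because C << V.
    coarse_scores = centroids @ hidden                      # [C]
    probe_ids = coarse_scores.topk(num_probes).indices      # [num_probes]

    # Stage 2: exact logits only for the probed tokens -- num_probes * S << V.
    candidates = cluster_tokens[probe_ids].reshape(-1)      # [num_probes * S]
    logits = head_weight[candidates] @ hidden               # [num_probes * S]
    return candidates, logits
```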
A key enabler of our multi-probing algorithm is that FlashHead clusters token embeddings into strictly equal-size groups, a first in LLM output-layer design. Equal-size clusters give dense, predictable memory layouts and fast access patterns on GPUs and edge accelerators, something irregular clusters cannot achieve.
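Equal-size clusters can be produced in many ways; the greedy, capacity-constrained assignment below is only one naive sketch of how to obtain the dense `[C, S]` layout described above, not necessarily how FlashHead builds its clusters.

```python
import torch

def balanced_assign(embeddings, centroids, cluster_size):
    """Naive balanced clustering: each token goes to its best-scoring centroid
    that still has free capacity, so every cluster holds exactly `cluster_size`
    tokens and the result is a dense [C, S] table of token ids."""
    V, C = embeddings.shape[0], centroids.shape[0]
    assert V == C * cluster_size, "vocabulary must split into equal-size clusters"

    scores = embeddings @ centroids.T                        # [V, C] similarities
    # Place the most confidently assigned tokens first.
    order = scores.max(dim=1).values.argsort(descending=True)
    capacity = [cluster_size] * C
    clusters = [[] for _ in range(C)]
    for tok in order.tolist():
        # Walk this token's clusters from best to worst until one has room.
        for c in scores[tok].argsort(descending=True).tolist():
            if capacity[c] > 0:
                clusters[c].append(tok)
                capacity[c] -= 1
                break
    return torch.tensor(clusters)                            # [C, S]
```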
Unlike hierarchical softmax or approximate nearest-neighbor techniques, FlashHead is compatible with full probabilistic decoding. It samples probe clusters using a multinomial distribution before selecting tokens, maintaining high-fidelity token sampling for open-ended generation.
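The sketch below shows how such probabilistic decoding could look, reusing the shapes from the earlier example: probe clusters are drawn from a multinomial over the coarse scores, and the output token is then sampled from the exact logits of the probed candidates. It approximates full-softmax sampling and illustrates the interface only; it is not Embedl's sampling code.

```python
import torch

def sample_token(hidden, centroids, cluster_tokens, head_weight,
                 num_probes, temperature=1.0):
    """Sketch of stochastic decoding with multi-probing: sample which clusters
    to probe, then sample the output token from their exact logits."""
    coarse = centroids @ hidden / temperature
    probe_ids = torch.multinomial(torch.softmax(coarse, dim=-1),
                                  num_probes, replacement=False)

    candidates = cluster_tokens[probe_ids].reshape(-1)
    logits = head_weight[candidates] @ hidden / temperature
    token_probs = torch.softmax(logits, dim=-1)
    return candidates[torch.multinomial(token_probs, 1)]
```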
Because the coarse centroid scoring stage is robust to noise, FlashHead allows safe application of low-bit quantization (e.g., int4) without the accuracy drop typically associated with quantizing LLM heads.
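As an illustration, a simple symmetric per-row int4 scheme for the stage-1 centroid matrix could look like the sketch below; the exact quantizer used in the released models may differ.

```python
import torch

def quantize_centroids_int4(centroids):
    """Symmetric per-row int4 quantization of the stage-1 centroid matrix.
    Stage 1 only has to rank clusters, so it tolerates this extra noise well."""
    scale = centroids.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # int4 range [-8, 7]
    q = (centroids / scale).round().clamp(-8, 7).to(torch.int8)  # stored 4-bit values
    return q, scale

def dequantize_centroids(q, scale):
    return q.to(torch.float32) * scale
```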
The result is a drop-in replacement for the dense output head: it requires no retraining and no task-specific calibration, has a hardware-friendly design, is compatible with standard GPU inference stacks, and is equivalent in function and interface for the end user.
Together, these features turn the output layer from a bottleneck into an afterthought, without compromising accuracy.
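One way to picture the training-free, drop-in property: a retrieval-style head can be built directly from the weights of the existing dense head, with no gradient updates. The sketch below assumes the `balanced_assign` helper from the clustering example and random seeding of the centroids; it is an illustration, not the released construction.

```python
import torch

@torch.no_grad()
def build_retrieval_head(head_weight, num_clusters):
    """Training-free construction sketch: group the rows of the existing dense
    head into equal-size clusters and use each group's mean as its centroid."""
    V, d = head_weight.shape
    cluster_size = V // num_clusters  # assumes V divides evenly into clusters
    seeds = head_weight[torch.randperm(V)[:num_clusters]]               # random centroid seeds
    cluster_tokens = balanced_assign(head_weight, seeds, cluster_size)  # [C, S]
    centroids = head_weight[cluster_tokens].mean(dim=1)                 # [C, d]
    return centroids, cluster_tokens
```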
In extensive evaluations across Llama-3.2, Llama-3.1, Qwen-3, and Gemma-3, FlashHead delivers significant improvements over the existing state of the art.
It also scales from 270M to 8B models, demonstrating improved latency on both edge GPUs and CPUs.
FlashHead’s efficiency holds even under quantization. With an int4-quantized first stage, FlashHead outperforms quantized dense heads, preserving higher accuracy while still reducing latency. FlashHead also pairs naturally with mixed-precision quantization:
Established models typically use a homogeneous setting, where all layers share the same quantization level, because per-layer precision choices lead to a combinatorial explosion of possible configurations. For the models we release, the per-layer precision levels have been carefully selected through advanced mixed-precision optimization to ensure an optimal configuration.
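A tiny sketch of why per-layer precision is a search problem rather than a single knob; the layer names and counts below are hypothetical and only meant to illustrate the combinatorial growth.

```python
# Homogeneous quantization is one choice per model; mixed precision is one
# choice per layer, so the space of configurations grows exponentially.
# Layer names and counts are hypothetical, for illustration only.
num_layers = 16 + 1                     # e.g. transformer blocks + output head
bit_choices = ["int4", "int8", "fp16"]

homogeneous_configs = len(bit_choices)              # 3 candidate settings in total
mixed_configs = len(bit_choices) ** num_layers      # 3**17 ≈ 1.3e8 candidates

# A mixed-precision result is just a per-layer mapping, e.g.:
precision_plan = {f"layers.{i}": "int4" for i in range(16)}
precision_plan["layers.0"] = "int8"     # hypothetical: keep a sensitive layer wider
precision_plan["head.stage1"] = "int4"  # the coarse stage tolerates low-bit weights

print(homogeneous_configs, f"{mixed_configs:,}")
```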
One example is Llama‑3.2‑1B running with FlashHead on an NVIDIA RTX 3500 Ada Generation GPU at batch size 1, a setting typical for interactive local use.
Embedl has published six highly optimized, FlashHead-powered LLMs on Hugging Face.
Each model has been integrated with FlashHead and low-bit quantization and tuned for efficient on-device inference.
These releases establish Embedl’s models as the fastest open-source LLMs for on-device inference.
The era of massive data centers is giving way to a complementary trend: capable AI running locally on small, energy-efficient hardware.
However, for this future to scale, every component must be optimized, not just attention or quantization, but the entire inference pipeline.
FlashHead removes a key architectural bottleneck in SLMs, enabling efficient inference on consumer-grade and edge hardware.
By releasing the six optimized models openly, Embedl is accelerating progress across the LLM ecosystem.
With FlashHead, Embedl pushes a new frontier in efficient LLM design, one that takes a holistic view of inference performance. This work sets a new baseline for what’s possible on consumer-grade hardware.
If you’re building for the edge, embedded systems, or cost-sensitive deployments, what Embedl is releasing is likely to reshape your design decisions.
Check out our models on Hugging Face and sign up here to be the first to know when new models are released. More models are on the way, including vision-language models, vision-language–action models, and larger-scale models.
"Built with Llama"