Qwen 3.5 is a new generation of large language models designed for high-quality reasoning and multimodal tasks.
FlashHead is Embedl’s drop-in replacement for the output (LM) head in language models. It reduces the cost of the final layer without retraining, improving inference speed while preserving model quality.
This post shows how the combination performs in practice, and why it works.
We applied FlashHead across the Qwen 3.5 family (0.8B → 27B), validated accuracy on lm-eval benchmarks, and measured on-device performance on multiple devices.
Screenshots from https://github.com/embedl/Edge-Inference-Benchmarks
On edge hardware (Jetson Nano Super, Jetson AGX Orin, Jetson AGX Thor), the LM head is often a disproportionate bottleneck:
FlashHead targets exactly this.
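One way to see why the LM head can weigh so heavily on small models is a back-of-the-envelope parameter count. A dense output head is a `vocab_size × hidden_size` matrix, and a standard decoder reads all of it for every generated token. The sketch below uses made-up round numbers for vocabulary size, hidden size, and total parameter count. They are assumptions for illustration only, not the actual Qwen 3.5 configurations, and the helper names are hypothetical.

```python
def lm_head_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a dense output (LM) head."""
    return vocab_size * hidden_size


def lm_head_share(vocab_size: int, hidden_size: int, total_params: int) -> float:
    """Fraction of all model parameters that sit in the LM head."""
    return lm_head_params(vocab_size, hidden_size) / total_params


def lm_head_megabytes(vocab_size: int, hidden_size: int, bytes_per_weight: int = 2) -> float:
    """Size of the head matrix in MB (default: 16-bit weights)."""
    return lm_head_params(vocab_size, hidden_size) * bytes_per_weight / 1e6


# ASSUMED, illustrative configurations. These are NOT the real Qwen 3.5 values.
VOCAB = 150_000
configs = {
    "small (0.8B total, hidden 1024)": (1_024, 800_000_000),
    "large (27B total, hidden 5120)": (5_120, 27_000_000_000),
}

for name, (hidden, total) in configs.items():
    share = lm_head_share(VOCAB, hidden, total)
    size_mb = lm_head_megabytes(VOCAB, hidden)
    print(f"{name}: head = {share:.1%} of parameters, {size_mb:.0f} MB in 16-bit")
```

With these assumed numbers, the head holds roughly 19% of the parameters in the small configuration and about 3% in the large one. The head grows with hidden size, while the rest of the model grows much faster, so the relative cost of the final layer is highest for the smallest models, which are the ones most likely to run on edge devices.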
Qwen 3.5 already pushes efficiency hard:
That means:
📊 https://github.com/embedl/Edge-Inference-Benchmarks
Prebuilt models
🤗 https://huggingface.co/collections/embedl/qwen35
If you're running models on edge or latency-constrained setups, this is one of the simplest wins available.
FlashHead models for Qwen 3.5 are available now:
🤗 https://huggingface.co/collections/embedl/qwen35
Install FlashHead:
pip install flash-head
Code and integration:
https://github.com/embedl/flash-head