FlashHead is built to reduce the cost of the LM head during inference. This update makes that path faster.
The change is simple from a user point of view: FlashHead now uses a more efficient GPU kernel for decoding. The model stays the same, the interface stays the same, but the runtime does less redundant work.
Results
We tested the update with embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead, a multi-modal model running in vLLM with video (4 FPS, 1280x720 images, 12 frames total).
End-to-end generation got faster on both tested devices:
- A40: 318.91 → 336.73 tokens/sec
- Orin Nano Super: 70.33 → 76.46 tokens/sec
That is a 5.6% speedup on A40 and an 8.7% speedup on Orin Nano Super.
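These percentages follow directly from the measured throughput numbers; a quick sanity check:

```python
# End-to-end throughput before and after the kernel update (tokens/sec, from the benchmark above).
results = {
    "A40": (318.91, 336.73),
    "Orin Nano Super": (70.33, 76.46),
}

for device, (before, after) in results.items():
    speedup_pct = (after / before - 1) * 100
    print(f"{device}: {speedup_pct:.1f}% faster")
# A40: 5.6% faster
# Orin Nano Super: 8.7% faster
```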
The latency of the LM-head decode step itself dropped substantially:
- A40: 341 µs → 191 µs
- Orin Nano Super: 1821 µs → 951 µs
So on Orin Nano Super, the FlashHead decode step is almost 2× faster, which translates into a clear full-model throughput gain.
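The "almost 2×" figure is the ratio of the per-step latencies above:

```python
# LM-head decode latency before/after the update (microseconds, from the numbers above).
latencies_us = {
    "A40": (341, 191),
    "Orin Nano Super": (1821, 951),
}

for device, (before, after) in latencies_us.items():
    print(f"{device}: {before / after:.2f}x faster LM head")
# A40: 1.79x faster LM head
# Orin Nano Super: 1.91x faster LM head
```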
Why it matters
On edge devices, inference speed is often limited by memory movement and small repeated operations, not just raw compute.
FlashHead already avoids doing the full dense vocabulary projection. This update makes the remaining work cheaper by reducing intermediate steps inside the decode path.
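To see why skipping the dense projection matters on memory-bound hardware, here is a toy NumPy sketch. This is not FlashHead's actual kernel, and the random candidate set is purely illustrative; the point is only that a full LM head must read the entire vocabulary weight matrix every decode step, while a shortlisted head touches a small slice of it:

```python
import numpy as np

# Illustrative sizes, not taken from the actual model.
hidden_dim, vocab_size, shortlist_size = 512, 32_000, 256

rng = np.random.default_rng(0)
h = rng.standard_normal(hidden_dim).astype(np.float32)                # one decode step's hidden state
W = rng.standard_normal((vocab_size, hidden_dim)).astype(np.float32)  # full LM-head weight matrix

# Full dense projection: reads all vocab_size * hidden_dim weights per token.
full_logits = W @ h

# Illustrative shortlist: score only a small candidate set of token ids.
candidates = rng.choice(vocab_size, size=shortlist_size, replace=False)
shortlist_logits = W[candidates] @ h

# Memory traffic per decode step: how many weights each variant must move.
print(f"weights read, dense:     {W.size:,}")
print(f"weights read, shortlist: {W[candidates].size:,}")
print(f"reduction: {vocab_size / shortlist_size:.0f}x")
```

At decode batch size 1, that weight read dominates the step, which is why shrinking it shows up directly in the latency numbers above.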
The important part: users do not need to change how they run the model. This is an internal optimization that makes FlashHead faster in normal use.
Same model, faster path
This release does not change the FlashHead method or introduce new model claims.
It is an implementation improvement:
- same FlashHead models
- same vLLM workflow
- lower LM-head latency
- better end-to-end throughput
The improvement is especially relevant for edge deployments, where every repeated decode step matters.
Learn more
For the implementation details of this change, see the pull request: https://github.com/embedl/flash-head/pull/3
Benchmarks
Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
Try it out
Install FlashHead:
pip install flash-head
Code and implementation details:
https://github.com/embedl/flash-head
Example model:
https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead