FlashHead is built to reduce the cost of the LM head during inference. This update makes that path faster.
The change is simple from a user's point of view: FlashHead now uses a more efficient GPU kernel for decoding. The model stays the same and the interface stays the same; the runtime simply does less extra work.
We tested the update with embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead, a multimodal model running in vLLM with video input (4 FPS, 1280x720 images, 12 frames in total).
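For reference, here is a minimal sketch of this kind of setup, loading the example model through vLLM's offline Python API. The prompt and sampling settings are placeholders, the video input pipeline from the benchmark (frame extraction and multimodal packing) is omitted, and it assumes a vLLM build that can load this quantized FlashHead checkpoint.

```python
from vllm import LLM, SamplingParams

# Load the example FlashHead model (assumes vLLM can serve this checkpoint).
llm = LLM(model="embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead")

# Placeholder text-only prompt; the benchmark additionally fed 12 video frames.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Describe what is happening in the scene."], params)
print(outputs[0].outputs[0].text)
```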
End-to-end generation got faster on both tested devices: a 5.6% speedup on the A40 and an 8.7% speedup on the Orin Nano Super.
The LM-head part itself improved substantially: on the Orin Nano Super, the FlashHead decode step is almost 2× faster, which translates into a clear full-model throughput gain.
On edge devices, inference speed is often limited by memory movement and small repeated operations, not just raw compute.
FlashHead already avoids doing the full dense vocabulary projection. This update makes the remaining work cheaper by reducing intermediate steps inside the decode path.
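As a rough illustration of why this matters, here is a small, purely conceptual PyTorch sketch contrasting a full dense vocabulary projection with projecting onto a small candidate subset. The sizes and the random candidate list are made up for illustration; this is not the FlashHead kernel or its selection method, which are described in the repository.

```python
import torch

hidden_dim, vocab_size, num_candidates = 2048, 32_000, 256  # example sizes only

h = torch.randn(1, hidden_dim)            # hidden state for one decode step
W = torch.randn(vocab_size, hidden_dim)   # full LM-head weight matrix

# Dense baseline: every decode step reads all vocab_size rows of W.
full_logits = h @ W.T                     # shape (1, vocab_size)

# Restricted projection: only a shortlist of candidate rows is read.
# (A random shortlist here; a real system would pick candidates cleverly.)
candidate_ids = torch.randint(0, vocab_size, (num_candidates,))
shortlist_logits = h @ W[candidate_ids].T  # shape (1, num_candidates)
```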
The important part: users do not need to change how they run the model. This is an internal optimization that makes FlashHead faster in normal use.
This release does not change the FlashHead method or introduce new model claims.
It is an implementation improvement, and it is especially relevant for edge deployments, where every repeated decode step matters.
You can learn more about this change in the pull request: https://github.com/embedl/flash-head/pull/3
Benchmarks
Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
Install FlashHead:
pip install flash-head
Code and implementation details:
https://github.com/embedl/flash-head
Example model:
https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead