
Faster Multi-Modal Reasoning with FlashHead Triton Kernel

Written by Embedl | May 12, 2026

FlashHead is built to reduce the cost of the LM head during inference. This update makes that path faster.

The change is simple from a user's point of view: FlashHead now uses a more optimized GPU kernel for decoding. The model stays the same, the interface stays the same, but the runtime does less extra work.
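As a rough illustration of why a fused kernel helps (this is not the actual FlashHead kernel, whose details live in the repository), consider two elementwise operations launched as separate GPU calls: the intermediate result has to be written to and read back from global memory. A single Triton kernel that does both steps in one pass removes that round-trip. All names below are invented for the example.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_scale_add_kernel(x_ptr, y_ptr, out_ptr, scale, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Scale and add in one pass: no intermediate tensor touches global memory.
    tl.store(out_ptr + offsets, x * scale + y, mask=mask)

def fused_scale_add(x, y, scale):
    # x and y are contiguous CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_scale_add_kernel[grid](x, y, out, scale, n, BLOCK=1024)
    return out

Compared with running the scale and the add as two separate kernels, the fused version never materializes the intermediate tensor. That kind of saved memory traffic is exactly what matters for small operations repeated at every decode step.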

Results

We tested the update with embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead, a multi-modal model running in vLLM with video input (4 FPS, 1280x720 frames, 12 frames in total).

End-to-end generation throughput improved on both tested devices:

  • A40: 318.91 → 336.73 tokens/sec
  • Orin Nano Super: 70.33 → 76.46 tokens/sec

That is a 5.6% speedup on A40 and an 8.7% speedup on Orin Nano Super.

The latency of the LM-head step itself dropped substantially:

  • A40: 341 µs → 191 µs
  • Orin Nano Super: 1821 µs → 951 µs

So on Orin Nano Super, the FlashHead decode step is almost 2× faster, which translates into a clear full-model throughput gain.

Why it matters

On edge devices, inference speed is often limited by memory movement and small repeated operations, not just raw compute.

FlashHead already avoids doing the full dense vocabulary projection. This update makes the remaining work cheaper by reducing intermediate steps inside the decode path.
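To make "avoids the full dense vocabulary projection" concrete, here is a deliberately simplified sketch of the general pattern: computing logits only for a small candidate set instead of multiplying the hidden state against the entire vocabulary matrix. This is not the FlashHead algorithm (in particular, how candidates are chosen is not shown), and the function and variable names are made up for illustration.

import torch

def shortlist_logits(hidden, head_weight, candidate_ids):
    """Illustrative sketch: logits over a small candidate set only.

    hidden:        (hidden_dim,) last hidden state for one decode step
    head_weight:   (vocab_size, hidden_dim) full LM-head weight matrix
    candidate_ids: (k,) candidate token ids, with k much smaller than vocab_size
    """
    # Gather only the k rows that are needed, then do a (k x hidden_dim)
    # mat-vec instead of the full (vocab_size x hidden_dim) projection.
    sub_weight = head_weight[candidate_ids]   # (k, hidden_dim)
    return sub_weight @ hidden                # (k,) logits for the candidates

The dense projection scales with vocabulary size; the shortlist version scales with the number of candidates. The remaining per-step work is small, which is why trimming intermediate steps in this path shows up directly in decode latency.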

The important part: users do not need to change how they run the model. This is an internal optimization that makes FlashHead faster in normal use.

Same model, faster path

This release does not change the FlashHead method or introduce new model claims.

It is an implementation improvement:

  • same FlashHead models
  • same vLLM workflow
  • lower LM-head latency
  • better end-to-end throughput

The improvement is especially relevant for edge deployments, where every repeated decode step matters.

Learn more

You can learn more about this change in the pull request: https://github.com/embedl/flash-head/pull/3

Benchmarks

Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

Try it out

Install FlashHead:

pip install flash-head

Code and implementation details:
https://github.com/embedl/flash-head

Example model:
https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead
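For a quick first run, a minimal text-only generation with the example model through vLLM's Python API could look like the sketch below. The prompt is invented, video input is omitted (multi-modal inputs need additional setup), and any FlashHead-specific configuration beyond installing the package is assumed to be handled by the checkpoint and the flash-head integration.

from vllm import LLM, SamplingParams

# Load the example FlashHead model (assumes flash-head and vLLM are installed
# and the checkpoint is supported by your vLLM version).
llm = LLM(model="embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(
    ["Describe what a delivery robot should check before crossing a street."],
    params,
)
print(outputs[0].outputs[0].text)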