Running advanced multi-modal reasoning models on edge hardware has traditionally required large GPUs and tens of gigabytes of memory. Today, Embedl changes that. We’re releasing Cosmos Reason2 with FlashHead, a new optimized model that delivers fast multi-modal reasoning on devices with less than 8GB of RAM while maintaining strong reasoning performance.
By combining our Edge2 mixed-precision quantization with FlashHead, our new ultra-efficient output head, we remove major compute and memory bottlenecks in the inference pipeline. The result is a state-of-the-art multi-modal reasoning model that runs faster than previous Cosmos Reason2 deployments, enabling real-time reasoning for robotics, edge AI systems, and interactive applications on compact hardware.
Cosmos Reason2 is a multimodal reasoning VLM (text + image/video → text) based on Qwen3-VL, designed for physical AI workloads where latency and memory bandwidth often matter more than peak datacenter throughput.
With W4A16‑Edge2, we showed that by keeping a small, carefully selected set of sensitive layers in FP16 while quantizing aggressively elsewhere, we can recover most of the baseline-level reasoning accuracy while maintaining the deployment gains of W4A16.
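To make the mixed-precision idea concrete, here is a minimal sketch (not Embedl's actual quantizer; the layer names, per-channel scheme, and fake-quantization shortcut are illustrative assumptions): all layers get simulated 4-bit weights except an explicit list of sensitive layers, which stay in FP16.

```python
import numpy as np

def quantize_w4(weight: np.ndarray) -> np.ndarray:
    """Simulate symmetric per-channel 4-bit weight quantization (W4)."""
    # Per-output-channel scale: map max |w| to the INT4 range [-8, 7].
    scale = np.abs(weight).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(weight / scale), -8, 7)
    # Return dequantized ("fake-quant") weights for accuracy simulation.
    return (q * scale).astype(np.float16)

def apply_mixed_precision(layers: dict, sensitive: set) -> dict:
    """Keep sensitive layers in FP16; quantize everything else to W4."""
    return {
        name: w.astype(np.float16) if name in sensitive else quantize_w4(w)
        for name, w in layers.items()
    }
```

The accuracy/latency trade-off then reduces to choosing the `sensitive` set: the smaller it is, the closer the deployment footprint is to pure W4A16.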
On Jetson AGX Orin (batch size 1), W4A16‑Edge2 more than halves end-to-end latency across text/video/image workloads.
So what’s left to optimize?
Even after the transformer body is quantized, one major cost remains: the dense classification (LM) head, the output layer that converts hidden states into token probabilities.
In modern language model architectures (Llama‑3.2, Gemma‑3, Qwen‑3, etc.), the output head can account for 20–60% of parameters and up to 50% of compute time, and its cost scales directly with vocabulary size (Blog: Ultra-Efficient SLMs). For edge deployments, that’s exactly the wrong place to waste compute.
FlashHead replaces the traditional dense head with a retrieval-style, multi-stage architecture, inspired by information retrieval rather than a “giant matrix multiply.”
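The exact FlashHead internals are not spelled out here, but the retrieval-style idea can be sketched as a two-stage search (an illustrative assumption, not the actual implementation): first score a small set of cluster centroids instead of the full vocabulary, then run the dense head only over tokens in the top-scoring clusters.

```python
import numpy as np

def retrieval_style_head(h, centroids, cluster_ids, head_weights, top_c=4):
    """Illustrative two-stage token selection (not the actual FlashHead).

    h            : hidden state, shape (d,)
    centroids    : K cluster centroids, shape (K, d), with K << vocab size
    cluster_ids  : cluster assignment per token, shape (V,)
    head_weights : dense head weights, shape (V, d)
    """
    # Stage 1: coarse scores over K centroids instead of V vocab rows.
    coarse = centroids @ h
    top_clusters = np.argsort(coarse)[-top_c:]
    # Stage 2: dense scores only for tokens in the shortlisted clusters.
    mask = np.isin(cluster_ids, top_clusters)
    candidate_tokens = np.nonzero(mask)[0]
    fine = head_weights[candidate_tokens] @ h
    return candidate_tokens[np.argmax(fine)]
```

With K centroids and a small candidate set, the per-token cost drops from one V×d matrix-vector product to K×d plus a much smaller shortlist, which is where the repeated decoding savings come from.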
W4A16‑Edge2 already optimizes the transformer body by mixing INT4 weight quantization with FP16 retention in sensitive layers.
Cosmos Reason2 + FlashHead extends the optimization to the final-stage token selection path, turning a repeated per-token bottleneck into a much lighter operation, exactly where decoding latency accumulates.
In other words:
W4A16-Edge2 was designed to preserve the reasoning accuracy of the original unquantized model. FlashHead targets the remaining bottlenecks in the output head. Combined with W4A16-Edge2, it further reduces latency while maintaining the same accuracy.
Below is a table of inference speed (tokens per second) across all supported modalities. More details on benchmarks can be found here.
| Device | Modality | Original (FP16) | W4A16-Edge2 | W4A16-Edge2-FlashHead |
|---|---|---|---|---|
| Orin Nano Super | text | OOM | 56.96 | 75.33 |
| Orin Nano Super | image | OOM | 42.33 | 51.84 |
| Orin Nano Super | video | OOM | 43.62 | 53.48 |
| AGX Orin | text | 45.55 | 99.84 | 132.52 |
| AGX Orin | image | 26.47 | 71.94 | 85.90 |
| AGX Orin | video | 39.58 | 74.93 | 92.21 |
| AGX Thor | text | 65.20 | 116.43 | 189.16 |
| AGX Thor | image | 52.96 | 87.42 | 117.38 |
| AGX Thor | video | 56.22 | 90.03 | 128.15 |
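To put the AGX Orin rows in perspective, the combined W4A16-Edge2-FlashHead stack is roughly 2.3–3.2× faster than the FP16 baseline (values taken directly from the table above):

```python
# Per-modality speedups on AGX Orin, from the benchmark table above.
baseline  = {"text": 45.55,  "image": 26.47, "video": 39.58}   # Original (FP16)
flashhead = {"text": 132.52, "image": 85.90, "video": 92.21}   # W4A16-Edge2-FlashHead

for modality in baseline:
    speedup = flashhead[modality] / baseline[modality]
    print(f"{modality}: {speedup:.2f}x")
```

On Orin Nano Super the FP16 baseline does not run at all (OOM), so the quantized variants are the difference between deployable and not.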
Test on hardware with minimal setup
```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2 \
    --trust-remote-code
```
As physical AI systems move from prototypes to deployed robots and autonomous vehicles, two constraints dominate: memory footprint and per-token latency.
W4A16‑Edge2 established that Cosmos Reason2 can meet strict edge memory + latency constraints without giving up benchmark-level reasoning quality.
Cosmos Reason2 + FlashHead is the next step: it targets the part of the inference loop that still repeats at every token, the output head, and removes it as a practical bottleneck.
- Hugging Face model link
- Edge Inference Benchmarks page