Supercharging Cosmos Reason2 with FlashHead Across the Jetson Platform

Running advanced multi-modal reasoning models on edge hardware has traditionally required large GPUs and tens of gigabytes of memory. Today, Embedl changes that. We’re releasing Cosmos Reason2 with FlashHead, a new optimized model that delivers fast multi-modal reasoning on devices with less than 8GB of RAM while maintaining strong reasoning performance.

By combining our Edge2 mixed-precision quantization with FlashHead, our new ultra-efficient output head, we remove major compute and memory bottlenecks in the inference pipeline. The result is a state-of-the-art multi-modal reasoning model that runs faster than previous Cosmos Reason2 deployments, enabling real-time reasoning for robotics, edge AI systems, and interactive applications on compact hardware.

 

Quick background: Cosmos Reason 2 and Quantization

Cosmos Reason2 is a multimodal reasoning VLM (text + image/video → text) based on Qwen3-VL, designed for physical AI workloads where latency and memory bandwidth often matter more than peak datacenter throughput.

With W4A16‑Edge2, we showed that by keeping a small, carefully selected set of sensitive layers in FP16 while quantizing aggressively elsewhere, we can recover most of the baseline reasoning accuracy while retaining the deployment gains of W4A16.

On Jetson AGX Orin (batch size 1), W4A16‑Edge2 more than halves end-to-end latency across text/video/image workloads.

So what’s left to optimize?

 

The next bottleneck: the dense output head

Even after you quantize the transformer body, a major cost often remains:

The dense classification (LM) head: the output layer that converts hidden states into token probabilities.

In modern language model architectures (Llama‑3.2, Gemma‑3, Qwen‑3, etc.), the output head can account for 20–60% of parameters and up to 50% of compute time, and its cost scales directly with vocabulary size (Blog: Ultra-Efficient SLMs). For edge deployments, that’s exactly the wrong place to waste compute.
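To make that scale concrete, here is a quick back-of-envelope calculation. The hidden size, vocabulary size, and body parameter count below are illustrative assumptions for a ~2B-parameter model, not measured values for Cosmos Reason2:

```python
# Back-of-envelope share of parameters consumed by a dense LM head.
# All numbers below are illustrative assumptions, not measurements.
hidden_size = 2048          # assumed hidden dimension of a ~2B model
vocab_size = 151_936        # approximate Qwen-family tokenizer vocabulary
body_params = 1.7e9         # assumed transformer-body parameter count

# An untied output projection is a d_model x |V| matrix.
head_params = hidden_size * vocab_size
share = head_params / (head_params + body_params)
print(f"LM head: {head_params / 1e6:.0f}M params ({share:.1%} of total)")
```

Because the head is a d_model × |V| matrix, its cost grows linearly with vocabulary size, which is why large multilingual vocabularies hit edge devices hardest.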

FlashHead replaces the traditional dense head with a multi-stage architecture inspired by information retrieval, rather than a single giant matrix multiply.
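To illustrate the idea, here is a toy two-stage head: a cheap first stage scores cluster centroids of the vocabulary, and an exact second stage scores only the tokens in the shortlisted clusters. This is a hypothetical sketch of a retrieval-style head in general, not Embedl's actual FlashHead internals:

```python
import random

random.seed(0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy setup: 1,000-token vocab, 16-dim hidden states, 10 clusters.
dim, vocab, n_clusters = 16, 1000, 10
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]

# Offline: assign each token to a cluster and precompute centroids.
assign = [i % n_clusters for i in range(vocab)]
centroids = []
for c in range(n_clusters):
    members = [embeddings[i] for i in range(vocab) if assign[i] == c]
    centroids.append([sum(col) / len(members) for col in zip(*members)])

def retrieval_style_argmax(h, top_c=3):
    # Stage 1: score only the cluster centroids (cheap).
    ranked = sorted(range(n_clusters),
                    key=lambda c: dot(h, centroids[c]), reverse=True)
    # Stage 2: exact logits only for tokens in the shortlisted clusters.
    cand = [i for i in range(vocab) if assign[i] in ranked[:top_c]]
    return max(cand, key=lambda i: dot(h, embeddings[i]))

h = [random.gauss(0, 1) for _ in range(dim)]
print(retrieval_style_argmax(h))
```

With `top_c=3`, the second stage scores 300 tokens instead of 1,000; the same principle shrinks a 150k-entry vocabulary scan to a small candidate set at every decoding step.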

 

Why Cosmos Reason2 + FlashHead is faster

W4A16‑Edge2 already optimizes the transformer body by mixing INT4 weight quantization with FP16 retention in sensitive layers.

Cosmos Reason2 + FlashHead extends the optimization to the final-stage token selection path, turning a repeated per-token bottleneck into a much lighter operation, exactly where decoding latency accumulates.

In other words:

  • W4A16-Edge2: Quantization without sacrificing reasoning quality.
  • FlashHead: Removes the output-head bottleneck that quantization alone cannot eliminate.
  • Combined: Lower latency with FlashHead while preserving Edge2’s accuracy.
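The division of labor above can be sketched with a back-of-envelope decode-latency model. All costs here are illustrative assumptions, not measured numbers:

```python
# Per-token decode time = transformer body + output head.
# All costs below are illustrative assumptions, not measurements.
body_ms, head_ms = 6.0, 4.0        # assumed FP16 per-token costs
quant_body = body_ms / 2           # W4A16-Edge2 roughly halves the body cost
dense_head = quant_body + head_ms  # quantization alone leaves the head intact
flash_head = quant_body + head_ms * 0.2  # assume a lighter head cuts its cost 5x
print(f"quantized only: {dense_head:.1f} ms/token")
print(f"quantized + lighter head: {flash_head:.1f} ms/token")
```

Once the body is quantized, the untouched head dominates the remaining budget, which is why attacking it yields a further large speedup.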

 

Benchmarks

W4A16-Edge2 was designed to preserve the reasoning accuracy of the original unquantized model. FlashHead targets the remaining bottlenecks in the output head. Combined with W4A16-Edge2, it further reduces latency while maintaining the same accuracy.

Below is a table of inference speed (tokens per second) across all three input modalities (text, image, and video). More details on the benchmarks can be found here.

Device            Modality   Original (FP16)   W4A16-Edge2   W4A16-Edge2-FlashHead
Orin Nano Super   text       OOM               56.96         75.33
Orin Nano Super   image      OOM               42.33         51.84
Orin Nano Super   video      OOM               43.62         53.48
AGX Orin          text       45.55             99.84         132.52
AGX Orin          image      26.47             71.94         85.90
AGX Orin          video      39.58             74.93         92.21
AGX Thor          text       65.20             116.43        189.16
AGX Thor          image      52.96             87.42         117.38
AGX Thor          video      56.22             90.03         128.15
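As a sanity check, the relative gains can be computed directly from the table; the figures below are copied from its text-modality rows:

```python
# Tokens/s for the text modality, per precision variant, from the table above.
tok_s = {
    "Orin Nano Super": {"edge2": 56.96, "flashhead": 75.33},  # FP16 is OOM
    "AGX Orin": {"fp16": 45.55, "edge2": 99.84, "flashhead": 132.52},
    "AGX Thor": {"fp16": 65.20, "edge2": 116.43, "flashhead": 189.16},
}
for device, r in tok_s.items():
    gain = r["flashhead"] / r["edge2"]
    print(f"{device}: FlashHead adds {gain:.2f}x over W4A16-Edge2 alone")
```

FlashHead contributes roughly a further 1.3x on the Orin devices and 1.6x on AGX Thor, on top of the quantization speedup over FP16.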


How to run it

Recommended runtime path

Test on hardware with minimal setup:

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2 \
    --trust-remote-code
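Once the server is up, it exposes vLLM's OpenAI-compatible API. Here is a minimal sketch of building a multimodal chat-completions request; the port, image path, and prompt are placeholders to adapt:

```python
import json

# Request payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The image URL and prompt below are placeholders, not part of the release.
payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "file:///path/to/frame.jpg"}},
            {"type": "text", "text": "Describe what is physically happening."},
        ],
    }],
    "max_tokens": 256,
}
body = json.dumps(payload).encode("utf-8")
# POST `body` to http://localhost:8000/v1/chat/completions with the header
# Content-Type: application/json (via curl, urllib.request, or the openai client).
print(len(body))
```

The same payload works against any of the precision variants; only the `model` string changes.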

 

Why this matters for Physical AI

As physical AI systems move from prototypes to deployed robots and autonomous vehicles, two constraints dominate:

  1. Can it run on target hardware?
  2. Is it responsive enough to be useful in real time?

W4A16‑Edge2 established that Cosmos Reason2 can meet strict edge memory + latency constraints without giving up benchmark-level reasoning quality.

Cosmos Reason2 + FlashHead is the next step: it targets the part of the inference loop that still repeats at every token, the output head, and removes it as a practical bottleneck.

 

Try it now

Hugging Face model link
Edge Inference Benchmarks page


Embedl Models

 
