Qwen 3.5 is a new generation of large language models designed for high-quality reasoning and multimodal tasks.

FlashHead is Embedl’s drop-in replacement for the output (LM) head in language models. It reduces the cost of the final layer without retraining, improving inference speed while preserving model quality.

This post shows how the combination performs in practice, and why it works.

Results

We applied FlashHead across the Qwen 3.5 family (0.8B → 27B) and validated accuracy on lm-eval benchmarks as well as on-device performance across multiple devices.

Screenshots from https://github.com/embedl/Edge-Inference-Benchmarks

What actually changes

  • Up to 1.4× lower end-to-end latency
  • Largest gains on smaller and mid-size models (0.8B–9B)
  • Diminishing but still meaningful gains at the largest size tested (27B)
  • No measurable regression in quality across evals

Where the speedup comes from

On edge hardware (Jetson Orin Nano Super, Jetson AGX Orin, Jetson AGX Thor), the LM head is often a disproportionate bottleneck:

  • Large vocabulary projection
  • Memory-bound operations
  • Poor hardware utilization

FlashHead targets exactly this.
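
To make the bottleneck concrete, here is a rough back-of-the-envelope sketch. It assumes a standard dense LM head (one hidden-size × vocabulary-size matrix) and a memory-bound decode step in which every weight is read once per token, so the head's share of the weights approximates its share of memory traffic. The model sizes, hidden sizes, and vocabulary size below are hypothetical round numbers chosen for illustration, not Qwen 3.5's actual configuration, and the estimate ignores KV-cache traffic and quantization.

```python
def lm_head_share(total_params: float, hidden_size: int, vocab_size: int) -> float:
    """Fraction of all model weights that sit in a dense output (LM) head."""
    head_params = hidden_size * vocab_size  # single projection matrix, no bias
    return head_params / total_params


# Hypothetical configurations (illustrative only, not real Qwen 3.5 values):
# name -> (total parameter count, hidden size)
configs = {
    "small (1B params, hidden 1024)": (1.0e9, 1024),
    "mid (8B params, hidden 4096)": (8.0e9, 4096),
    "large (32B params, hidden 5120)": (32.0e9, 5120),
}
VOCAB_SIZE = 150_000  # assumed vocabulary size

shares = {
    name: lm_head_share(total, hidden, VOCAB_SIZE)
    for name, (total, hidden) in configs.items()
}

for name, share in shares.items():
    print(f"{name}: LM head holds {share:.1%} of the weights")
```

Under these assumptions the head's share shrinks as the model grows, because the head scales with the hidden size while the rest of the network scales faster. That is consistent with the larger gains reported for the smaller models.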

Why this matters for Qwen 3.5

Qwen 3.5 already pushes efficiency hard:

  • Small models punching above their size (e.g. 9B competing with much larger models on reasoning benchmarks)
  • Strong multimodal capability in a relatively compact footprint

That means:

  • You’re more likely to be bottlenecked on inference overheads (like the LM head)
  • Optimizing that layer gives immediate, visible gains
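
How much a faster head helps end to end depends on how much of the per-token latency the head accounts for. A minimal sketch using Amdahl's law follows; the head fraction and head speedup passed in at the bottom are hypothetical inputs, not measured FlashHead numbers.

```python
def end_to_end_speedup(head_fraction: float, head_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the LM head gets faster.

    head_fraction: share of baseline per-token latency spent in the head (0..1).
    head_speedup:  how many times faster the head alone becomes (> 0).
    """
    if not 0.0 <= head_fraction <= 1.0:
        raise ValueError("head_fraction must be between 0 and 1")
    if head_speedup <= 0.0:
        raise ValueError("head_speedup must be positive")
    return 1.0 / ((1.0 - head_fraction) + head_fraction / head_speedup)


def min_head_fraction(target_speedup: float) -> float:
    """Smallest head share of latency that could yield the target speedup,
    reached only in the limit of an infinitely fast head."""
    if target_speedup < 1.0:
        raise ValueError("target_speedup must be at least 1")
    return 1.0 - 1.0 / target_speedup


# Hypothetical example: the head takes 30% of latency and becomes 4x faster.
print(f"{end_to_end_speedup(0.30, 4.0):.2f}x end to end")

# If a 1.4x end-to-end gain came from the head alone, the head must have
# accounted for at least this share of the baseline latency.
print(f"{min_head_fraction(1.4):.1%} minimum head share")
```

The second function shows why the head has to be a substantial part of the baseline latency for gains of this size to be possible at all.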

Benchmarks and raw results

📊 https://github.com/embedl/Edge-Inference-Benchmarks

Prebuilt models

🤗 https://huggingface.co/collections/embedl/qwen35

Summary

  • FlashHead is a drop-in latency optimization for LLMs
  • Qwen 3.5 + FlashHead delivers faster inference with unchanged quality

If you're running models on edge or latency-constrained setups, this is one of the simplest wins available.

Try it out

FlashHead models for Qwen 3.5 are available now:

🤗 https://huggingface.co/collections/embedl/qwen35

Install FlashHead:

pip install flash-head

Code and integration:
https://github.com/embedl/flash-head
