
Intelligence per Watt: Edge versus Cloud

Written by Embedl | Nov 28, 2025 2:25:46 PM

A new paper [1] from a group at Stanford has set up new metrics to evaluate the energy efficiency of AI models: intelligence per watt. How much performance do AI models deliver per unit of energy consumed? The authors introduce metrics tailored to both edge and cloud devices:

  • For edge devices, they consider local models with at most 20B parameters, such as Qwen3 and Llama 3.1, and edge accelerators such as Apple M4 Max and AMD Ryzen AI.
  • For cloud, they consider frontier models with over 100B parameters, such as Qwen3-235B and GPT-OSS-120B, and cloud accelerators such as NVIDIA H200 and AMD MI300X.

They evaluate the performance of models and hardware across various query sets, ranging from knowledge question answering to reasoning tasks. For a given model and hardware pair (m, h) and query q, they measure:

  • Accuracy of model m on query q: acc(m,q)
  • Perplexity of model m on query q: ppl(m,q)
  • Power consumption of model m running on hardware h for query q: P(m,h,q)
  • Latency for model m running on hardware h to generate a response to query q: τ(m,h,q)

Given these measurements, they define power-based metrics, which measure efficiency relative to instantaneous power draw:

  • Accuracy per watt: acc(m,q)/P(m,h,q) (averaged over the query set).
  • Perplexity per watt: 1/(ppl(m,q) · P(m,h,q)) (averaged over the query set; the inverse of perplexity is used so that better, i.e. lower, perplexity yields a higher score).

They also define energy-based metrics, which measure efficiency relative to the total energy consumed per query (a short sketch of all four metrics follows this list):

  • Accuracy per joule: acc(m,q)/(P(m,h,q) · τ(m,h,q)) (averaged over the query set).
  • Perplexity per joule: 1/(ppl(m,q) · P(m,h,q) · τ(m,h,q)) (averaged over the query set).
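
To make these definitions concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that computes the four metrics for a single model-hardware pair from hypothetical per-query measurements. The field names acc, ppl, power_w, and latency_s simply mirror acc(m,q), ppl(m,q), P(m,h,q), and τ(m,h,q), and the numbers are made up.

```python
from statistics import mean

# Hypothetical per-query measurements for one (model, hardware) pair.
# Each entry: accuracy in [0, 1], perplexity (> 1), power draw in watts,
# latency in seconds. Values are invented purely for illustration.
queries = [
    {"acc": 1.0, "ppl": 4.2, "power_w": 28.0, "latency_s": 1.3},
    {"acc": 0.0, "ppl": 9.8, "power_w": 31.5, "latency_s": 2.1},
    {"acc": 1.0, "ppl": 3.6, "power_w": 27.2, "latency_s": 0.9},
]

# Power-based metrics: efficiency relative to instantaneous power draw.
acc_per_watt = mean(q["acc"] / q["power_w"] for q in queries)
ppl_per_watt = mean(1.0 / (q["ppl"] * q["power_w"]) for q in queries)

# Energy-based metrics: efficiency relative to total energy per query,
# where energy = power x latency (joules).
acc_per_joule = mean(q["acc"] / (q["power_w"] * q["latency_s"]) for q in queries)
ppl_per_joule = mean(1.0 / (q["ppl"] * q["power_w"] * q["latency_s"]) for q in queries)

print(f"accuracy per watt:    {acc_per_watt:.4f}")
print(f"perplexity per watt:  {ppl_per_watt:.4f}")
print(f"accuracy per joule:   {acc_per_joule:.4f}")
print(f"perplexity per joule: {ppl_per_joule:.4f}")
```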

They ask:

  • How has intelligence per watt improved across successive generations of local models and accelerators, and what are the relative contributions of model versus accelerator advances?
  • What fraction of current inference queries can be solved by local LMs on local accelerators, and how has this changed over time?
  • What resource savings (e.g., compute, energy, dollar cost) are possible by distributing workloads across local and cloud infrastructure?

For the first question, the conclusion is that intelligence efficiency is improving rapidly and predictably! From 2023 to 2025, intelligence per watt improved 5.3× overall, with model improvements contributing a 3.1× gain in accuracy per watt and accelerator improvements a further 1.7× gain.
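
These two contributions compose roughly multiplicatively, which serves as a quick sanity check on the headline figure (our own arithmetic, not a calculation quoted from the paper):

$$3.1 \times 1.7 \approx 5.3$$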

For the second question, the study shows that local models on edge devices are quickly catching up to frontier models in the cloud: as of October 2025, 88.7% of queries can be handled successfully by small local models, with coverage varying by domain, exceeding 90% for creative tasks (e.g., Arts & Media) but dropping to 68% for technical fields (e.g., Architecture & Engineering). Longitudinal analysis shows consistent improvement: the best local LM matched frontier-model quality on 23.2% of queries in 2023, 48.7% in 2024, and 71.3% in 2025, a 3.1× increase over two years.

For the third question, the study shows that substantial savings are possible by routing queries appropriately between local and cloud infrastructure: compared to a cloud-only deployment, this reduces energy consumption by 80.4%, compute usage by 77.3%, and cost by 73.8%. Moreover, the routing need not be perfect to realize most of these savings while maintaining task quality: a routing system with 80% accuracy (correctly assigning 80% of queries to local vs. cloud) captures 80% of the theoretical maximum gains, achieving a 64.3% energy reduction, a 61.8% compute reduction, and a 59.0% cost reduction with no degradation in answer quality.
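
As a rough illustration of why imperfect routing still captures most of the benefit, here is a minimal sketch (our own simplification, not the paper's routing system) that linearly scales the oracle-routing savings reported above by the router's accuracy:

```python
# Maximum savings reported for oracle (perfect) local-vs-cloud routing,
# relative to a cloud-only deployment (figures from the paper).
ORACLE_SAVINGS = {"energy": 0.804, "compute": 0.773, "cost": 0.738}

def expected_savings(routing_accuracy: float) -> dict:
    """Savings under a simple linear model: a router that assigns the
    correct tier for a given fraction of queries captures that same
    fraction of the oracle savings (an assumption made for illustration)."""
    return {k: routing_accuracy * v for k, v in ORACLE_SAVINGS.items()}

# An 80%-accurate router recovers roughly 64.3% / 61.8% / 59.0% savings,
# matching the numbers quoted in the study.
for name, value in expected_savings(0.80).items():
    print(f"{name}: {value:.1%} reduction")
```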

For the moment, cloud devices still maintain an advantage over edge devices in terms of intelligence per watt, thanks to specially optimized hardware. However, this efficiency disadvantage is offset by complementary system-level benefits: local deployment avoids data center infrastructure costs, and the 88.7% of queries that local models can handle avoid cloud compute entirely, yielding 60-80% resource reductions.

Moreover, as we continue to optimize edge devices with Embedl’s technologies, the gap is likely to narrow rapidly. Overall, the study shows that deploying edge devices will become increasingly important for businesses seeking substantial savings in energy, compute, and infrastructure costs. In future posts, we will discuss some of the innovations that Embedl’s research is bringing to efficient edge AI.

[1] https://arxiv.org/pdf/2511.07885