The DeepSeek Shock

Shockwaves went through Silicon Valley and the stock market following the release of new AI models from a hitherto unknown Chinese company called DeepSeek, on the same day as the inauguration of President Trump. The Financial Times was among the many outlets that reported on the upheaval.

Who are DeepSeek? 

DeepSeek is a Chinese startup founded in May 2023 by Liang Wenfeng, a prominent figure in both the hedge fund and AI industries. DeepSeek operates independently but is solely funded by High-Flyer, a quantitative hedge fund also founded by Liang. DeepSeek's team of about 200 employees consists primarily of young, talented graduates from top Chinese universities, fostering a culture of innovation and high-tech development.

DeepSeek's journey began with the release of DeepSeek Coder in November 2023, an open-source model designed for coding tasks. This was followed by DeepSeek LLM, a 67B-parameter model aimed at competing with other large language models. DeepSeek-V2, launched in May 2024, gained significant attention for its strong performance and low cost, triggering a price war and forcing other AI companies to lower their prices.

DeepSeek-V2 was succeeded by DeepSeek-Coder-V2, a more advanced model with 236 billion parameters designed for complex coding challenges.

The company's latest models, DeepSeek-V3 and DeepSeek-R1, are what set off the storm in the stock market. DeepSeek-V3, a 671B-parameter model, boasts impressive performance on various benchmarks while requiring significantly fewer resources than its peers. DeepSeek-R1, released in January 2025 [1], focuses on reasoning tasks and challenges OpenAI's o1 model with its advanced capabilities.

DeepSeek-R1

DeepSeek appears to have used a range of techniques and very efficient engineering to achieve their results. Details are still a bit sparse in their report, but we can note the following key components:

  • Reinforcement Learning: DeepSeek leverages reinforcement learning, allowing its models to learn to reason through long chains of thought (CoT) and to self-improve via algorithmic rewards. This approach has been particularly effective in developing DeepSeek-R1’s reasoning capabilities.
  • Mixture-of-Experts Architecture: DeepSeek’s models use a mixture-of-experts (MoE) architecture, activating only a small fraction of their parameters for any given token. This selective activation significantly reduces computational cost and improves efficiency (a minimal sketch of the idea follows this list).
  • Multi-Head Latent Attention: DeepSeek-V3 incorporates multi-head latent attention (MLA), which compresses the attention keys and values into a compact latent representation. This more efficient attention mechanism contributes to DeepSeek-V3’s impressive performance on various benchmarks.
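
To make the mixture-of-experts idea concrete, here is a minimal, illustrative sketch of a top-k gated MoE layer in PyTorch. It is not DeepSeek’s implementation (DeepSeek-V3 uses far more experts, a shared expert, and its own load-balancing scheme); all layer sizes and names below are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # routing probabilities
        topk_w, topk_idx = scores.topk(self.k, dim=-1)      # keep only k experts per token
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalise the kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                          # tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += topk_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Only k of the n_experts sub-networks run per token, so the compute per token
# stays roughly constant even as the total parameter count grows.
x = torch.randn(16, 512)
y = TopKMoE()(x)
print(y.shape)  # torch.Size([16, 512])
```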

A visualization of the whole training pipeline, put together by Harris Chan, is available online [2]. As we study and learn more, we will gain a better understanding of these techniques.

1. https://api-docs.deepseek.com/news/news250120
2. https://x.com/SirrahChan/status/1881488738473357753

 

Small is Beautiful

DeepSeek also offers a range of smaller models known as DeepSeek-R1-Distill, which are based on popular open-weight models such as Llama and Qwen. They were produced using a technique called knowledge distillation to transfer knowledge from the large R1 model to smaller models, such as those in the Llama 3 family. These distilled models provide varying levels of performance and efficiency, catering to different computational needs and hardware configurations, and they have shown impressive performance compared to OpenAI's o1-mini model.
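
At its core, knowledge distillation trains a small “student” model to imitate a large “teacher”. The sketch below shows the classic logit-matching formulation; DeepSeek reports producing its distilled models by fine-tuning the smaller models on reasoning data generated by R1, so this is a generic illustration of the idea rather than their exact recipe. All tensors and sizes are toy placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic logit distillation: soft targets from the teacher at temperature T,
    blended with the usual cross-entropy on the ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy example with random logits over a vocabulary of 32 tokens.
student_logits = torch.randn(4, 32, requires_grad=True)
teacher_logits = torch.randn(4, 32)          # produced by the frozen teacher
labels = torch.randint(0, 32, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```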

Making it Efficient Again

The approach of producing smaller, more efficient AI models has long been pursued by our engineers and researchers at Embedl. Not only do we use more sophisticated variants of knowledge distillation, but we employ the full complement of compression techniques in our prize-winning SDK, including pruning, quantization, and neural architecture search.
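
As a concrete illustration of two of these techniques (a generic sketch, not the Embedl SDK API): magnitude pruning zeroes out the weights with the smallest absolute values, and post-training quantization maps the remaining weights to low-precision integers.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor post-training quantization to int8."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale            # dequantize later with q.float() * scale

w = torch.randn(256, 256)
w_sparse = magnitude_prune(w, sparsity=0.7)       # ~70% of weights set to zero
q, scale = quantize_int8(w_sparse)                # 8-bit weights plus one scale factor
print((w_sparse == 0).float().mean().item(), q.dtype)
```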

DeepSeek has demonstrated that the computational resources required for training and deploying AI models can be reduced dramatically, resulting in lower costs. DeepSeek-V3, for example, was reportedly trained for about $5.5 million in compute, a fraction of what US giants like Meta are reported to spend on comparable models. DeepSeek’s API pricing is significantly lower too, making its models accessible to smaller businesses and developers who may not have the resources to invest in expensive proprietary solutions. For instance, DeepSeek-R1’s API costs just $0.55 per million input tokens and $2.19 per million output tokens, compared to OpenAI’s o1 API at $15 and $60 per million, respectively.
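
To put those per-token prices in perspective, here is a back-of-the-envelope comparison for a hypothetical monthly workload (the token volumes are invented for illustration; the prices are the ones quoted above):

```python
# Hypothetical monthly workload: 200M input tokens, 50M output tokens.
input_tokens_m, output_tokens_m = 200, 50       # in millions of tokens

deepseek_r1 = {"input": 0.55, "output": 2.19}   # USD per million tokens, as quoted above
openai_o1   = {"input": 15.00, "output": 60.00}

def monthly_cost(prices):
    return input_tokens_m * prices["input"] + output_tokens_m * prices["output"]

print(f"DeepSeek-R1: ${monthly_cost(deepseek_r1):,.2f}")   # $219.50
print(f"OpenAI o1:   ${monthly_cost(openai_o1):,.2f}")     # $6,000.00
```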

DeepSeek did not have access to the latest generation of Nvidia GPUs because of US government export restrictions. Instead, it used a few thousand “Hopper” H800 GPU accelerators from Nvidia, which have some of their performance capped. The cluster that DeepSeek says it used to train the V3 model had a mere 256 server nodes with eight H800 accelerators each, for a total of 2,048 GPUs.

Embedl, too, has demonstrated impressive performance on much cheaper hardware, and across a much more comprehensive range of commercially available hardware than DeepSeek: besides CPUs and GPUs, also on platforms such as Qualcomm Snapdragon, Texas Instruments, Ambarella, ARM, Intel, and STMicroelectronics.

Open Innovation 

Former Intel CEO Pat Gelsinger posted on X [3]:

Wisdom is learning the lessons we thought we already knew. DeepSeek reminds us of three important learnings from computing history: 

1) Computing obeys the gas law. Making it dramatically cheaper will expand the market for it. The markets are getting it wrong, this will make AI much more broadly deployed. 

2) Engineering is about constraints. The Chinese engineers had limited resources, and they had to find creative solutions. 

3) Open Wins. DeepSeek will help reset the increasingly closed world of foundational AI model work. Thank you DeepSeek team.

Ethan Mollick of the Wharton School at the University of Pennsylvania reacted by exclaiming: “More efficient models mean … [we] will be able to use it to serve more customers and products at lower prices & power impact!”

Many have described it as a “Sputnik moment” in AI for the West, this time with China as the rival. Indeed, it is time for investors to wake up and champion Embedl to take the lead in this race!

3. https://x.com/PGelsinger/status/1883896837427585035

 

Professor Devdatt Dubhashi

 
