Why DeepSeek changes the game, but doesn’t break it

At least it's not from Wuhan...

29 January 2025 · Hashrate 7-Day SMA: 777 EH/s · Hashprice: $59/PH/Day

We’re jumping into the DeepSeek pond with everyone else. Our Editor-in-Chief turned Editor-in-LLMs breaks down the DeepSeek story, ending with some thoughts on what it means for infrastructure.

It’s about a 10 minute read (and worth every second of your time).

Join mining companies like Luxor and CleanSpark at OPNEXT, the Bitcoin scaling conference this April at MicroStrategy HQ.

AI just had its “Sputnik moment.” 

Last week, Chinese large language model (LLM) startup DeepSeek burst into the mainstream, taking U.S. markets by surprise.

DeepSeek is faster, smarter, and leaner than other LLMs like ChatGPT. For everything from content creation to basic queries, it's quicker than its predecessors. Most importantly, the model can "think for itself," and as a result, it's reportedly cheaper to train than the models that came before it.

Sounds great, right? That is, unless you're an American tech company with all of your chips on this country's AI industry. Markets had a meltdown on Monday in response to the advancement. Tech stocks collectively shed over $1 trillion in market cap – roughly half of Bitcoin's market cap. Nvidia alone fell 17% and lost $589 billion in value, the largest single-day loss in the history of the U.S. stock market. Losses from Nvidia and other stocks dragged on the Nasdaq Composite Index, which fell 3.1% on the day.

And the hemorrhage wasn't contained to tech stocks. Energy stocks bled as well: Vistra Corp, a natural gas, nuclear, and renewable power company with heavy operations in Texas, fell roughly 30%, while Constellation Energy, the power company restarting Three Mile Island to service a Microsoft data center, declined more than 20%.

The market's fear with DeepSeek is simple: efficiency gains in LLM computing are coming quicker than expected, meaning the market may need fewer GPUs, fewer data centers, and less energy to feed the AI growth spurt. Coincidentally, the model went viral just days after President Trump announced the $500 billion Project Stargate initiative to accelerate AI infrastructure buildouts in the U.S.

Talking head opinions are split on whether this is catastrophic or bullish for AI. There’s a case to be made that the advancement fuels growth instead of extinguishing it (for example, car engine efficiency improvements increased demand for cars).

But the figure floating around social media for how much DeepSeek cost to train is misleading. The model will cut costs, just not as dramatically as some might think.

Understanding DeepSeek

Chinese engineer Liang Wenfeng founded DeepSeek in May 2023 with backing from High-Flyer, a hedge fund Wenfeng founded in 2016. DeepSeek open-sourced its R1 reasoning model on January 20, and it started making waves online last weekend.

DeepSeek-R1 has a number of features that distinguish it from other models, including:

  • Semantic Contextualization: DeepSeek can read between the lines, so to speak. It uses what's known as "semantic embeddings" to divine the intent and deeper context behind queries, which allows for more nuanced and incisive responses. 

  • Cross-Modal Search: It can interpret and cross-analyze different media, meaning it can digest text, images, videos, audio, etc. simultaneously.

  • Automatic Adaptation: DeepSeek learns and retrains as it goes along – the more data we feed it, the more it adapts, which could make it more reliable without needing frequent retraining. Put differently, we may not need to feed data to models like we did in the past, as they can learn and retrain on the go.

  • Mass Data Processing: DeepSeek can reportedly handle petabytes of data, making it ideal for data sets that may have been too unwieldy for other LLMs.

  • Fewer Parameters: DeepSeek-R1 has 671 billion parameters in total, but it only activates about 37 billion parameters for each output, versus an estimated 500 billion to 1 trillion per output for ChatGPT (OpenAI has not disclosed this figure). Parameters are the internal weights a model learns during training; activating fewer of them per query means less compute per answer, as the sketch below illustrates.
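
To make the "fewer active parameters" idea concrete, here's a toy mixture-of-experts routing sketch in Python. The layer sizes, expert count, and routing rule are illustrative assumptions on our part, not DeepSeek's actual architecture; the point is simply that a router touches only a fraction of the total weights per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts (MoE) layer: many experts exist, but a router
# activates only the top-k per token, so most parameters sit idle.
# Sizes here are made up for illustration, not DeepSeek's real config.
N_EXPERTS, TOP_K, D_MODEL, D_HIDDEN = 8, 2, 16, 64

experts = [
    {"w_in": rng.normal(size=(D_MODEL, D_HIDDEN)),
     "w_out": rng.normal(size=(D_HIDDEN, D_MODEL))}
    for _ in range(N_EXPERTS)
]
router = rng.normal(size=(D_MODEL, N_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]                          # pick top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over winners
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        e = experts[idx]
        out += w * (np.maximum(x @ e["w_in"], 0) @ e["w_out"])  # ReLU MLP expert
    return out

token = rng.normal(size=D_MODEL)
_ = moe_forward(token)

total = N_EXPERTS * 2 * D_MODEL * D_HIDDEN
active = TOP_K * 2 * D_MODEL * D_HIDDEN
print(f"total params: {total:,}, active per token: {active:,} ({active/total:.0%})")
```

With 2 of 8 experts active, only 25% of the layer's weights are touched per token; DeepSeek's 37B-of-671B ratio (roughly 5.5%) is the same trick at scale.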

There are a few others, but those are the big ones. The self-directed, learn-as-it-goes feature is a huge selling point; it unlocks a new level of LLM reasoning that not only saves time and resources, but also opens the door to more effective AI agents that could serve as the basis of autonomous systems for robotics, self-driving cars, logistics, and other industries. 

Not to belabor the point too much, but Pastel Founder and CEO Jeffrey Emmanuel sums up this breakthrough nicely in his article “The Short Case for Nvidia Stock.”

“With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.”
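
To give a flavor of what "carefully crafted reward functions" can look like in practice, here's a minimal, hypothetical sketch of rule-based scoring for a reasoning model. The tags, answer format, and weights are our own assumptions for illustration, not DeepSeek's actual reward code.

```python
import re

def reward(completion: str, ground_truth: str) -> float:
    """Score a model completion with simple, automatically verifiable rules."""
    r = 0.0
    # Format reward: the model must wrap its reasoning in <think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        r += 0.2
    # Accuracy reward: the final boxed answer must match exactly.
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match and match.group(1).strip() == ground_truth:
        r += 1.0
    return r

sample = "<think>2+2 is 4 because ...</think> The answer is \\boxed{4}"
print(reward(sample, "4"))  # 1.2
```

Because rewards like these can be checked mechanically, no massive supervised dataset is needed: the model is simply reinforced toward completions that reason visibly and land on verifiably correct answers.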

Brought to you by Luxor

Get game-changing mining results with Luxor Firmware. Boost hashrate, cut energy costs, protect your hardware, and maximize mining profits with LuxOS.

Click to check out LuxOS!

Why DeepSeek has Wall Street shook

Ok, so DeepSeek is a leaner, smarter rival to ChatGPT, but that's not what really spooked the suits last week – the reported cost of the model did.

The team self-reported that the model only cost $5.6 million to train – a suspect metric.

Breaking it down by GPU hour (one GPU running for one hour, the standard unit for pricing compute), the DeepSeek team claims it trained the model on 2,048 Nvidia H800 GPUs over 2.788 million GPU hours – covering pre-training, context extension, and post-training – at $2 per GPU hour. 

By contrast, OpenAI CEO Sam Altman has said that GPT-4 cost over $100 million to train. Estimates put this at 90-100 days of training on 25,000 Nvidia A100 GPUs, for a total of 54 to 60 million GPU hours at an estimated $2.50-$3.50 per GPU hour. 
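
The arithmetic behind those two figures is simple enough to check. A quick sketch, using only the reported numbers above:

```python
# Back-of-the-envelope training cost math from the figures above.
# All inputs are the reported/estimated numbers, not measured values.

def training_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Total cost = GPU hours x hourly rate."""
    return gpu_hours * usd_per_gpu_hour

# DeepSeek (self-reported): 2.788M GPU hours at $2 per GPU hour
deepseek = training_cost(2.788e6, 2.00)

# GPT-4 (estimated): 25,000 A100s for 90-100 days of round-the-clock uptime
gpt4_hours_low, gpt4_hours_high = 25_000 * 90 * 24, 25_000 * 100 * 24
gpt4_low = training_cost(gpt4_hours_low, 2.50)
gpt4_high = training_cost(gpt4_hours_high, 3.50)

print(f"DeepSeek final run: ${deepseek / 1e6:.1f}M")   # ~$5.6M
print(f"GPT-4 estimate: ${gpt4_low / 1e6:.0f}M-${gpt4_high / 1e6:.0f}M")  # ~$135M-$210M
```

On those inputs, DeepSeek's final run comes out more than 20x cheaper than the low end of the GPT-4 estimate, which is exactly the gap that rattled investors.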

So DeepSeek’s sticker price for training compared to OpenAI’s own is what sent markets into a frenzy on Monday. Investors asked themselves: if DeepSeek can create a better LLM than OpenAI at a fraction of the cost, then why are we spending billions in America to build beaucoups of infrastructure we were told was necessary to make all of this newfangled cyber-wizardry work? And what does this mean for the ROI and profitability of AI/HPC data centers? 

The chart below, showing data center revenue per GW to train DeepSeek and ChatGPT, illustrates the point. 

The problem, though, is that we’re not actually certain that DeepSeek trained its model so cheaply.

The real cost of DeepSeek

Some onlookers are not convinced that DeepSeek was so cheap to stand up, and with good reason. 

To start, in its whitepaper, the DeepSeek team clarifies that the training "costs include only the official training of DeepSeek-V3," not "the costs associated with prior research and ablation experiments on architectures, algorithms, or data." (DeepSeek-V3 is the base model that R1 is built on, and it's where the $5.6 million figure comes from.) Put another way, the $5.6 million covers the final training run, but far more went into refining the model. 

As such, the $5.6 million cost is “*deeply* misleading,” claims Atreides Management CIO Gavin Baker, because it does not include prior research and development.

“This means that it is possible to train an r1 quality model with a [$5.6m] run if a lab has already spent hundreds of millions of dollars on prior research and has access to much larger clusters. Deepseek obviously has way more than 2048 H800s; one of their earlier papers referenced a cluster of 10k A100s.  An equivalently smart team can’t just spin up a 2000 GPU cluster and train r1 from scratch with [$5.6m],” he wrote in a tweet.

Further, Baker points out that DeepSeek leaned on ChatGPT through a process called "distillation," where one team trains its model on the outputs of another model. "[I]t is unlikely they could have trained this without unhindered access to GPT-4o and o1," Baker said.
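
For intuition, distillation boils down to training a student model to match a teacher model's output distribution rather than raw ground truth. A minimal sketch, with toy numbers standing in for real next-token probabilities (none of this is DeepSeek's actual pipeline):

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = np.exp((logits - logits.max()) / temperature)
    return z / z.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened next-token distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)  # student's current guess
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([4.0, 1.0, 0.5, 0.1])  # e.g., a stronger model's logits
student = np.array([2.0, 1.5, 1.0, 0.3])  # the model being trained
print(f"distillation loss: {distillation_loss(teacher, student):.4f}")
```

Minimizing a loss like this nudges the student toward the teacher's behavior, which is why access to a strong teacher model matters so much for training cost.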

DeepSeek, gas guzzling, and Jevons Paradox

While it's dubious that DeepSeek cost only $5.6 million to train, Baker points out that the model's breakthroughs – self-learning, fewer parameters, and so on – do mean that DeepSeek was cheaper to train and is cheaper to use (what's known as "inference" in industry parlance). 

Baker claims that it costs "93% less to use [DeepSeek-R1] than [ChatGPT's] o1 per each API." Whether or not 93% is exact is beside the point: the model makes inference cheaper, and it can even be run locally on hardware like a Mac Studio. This is the real breakthrough with DeepSeek – that AI will be cheaper to use. As one anon put it, it feels similar to when Microsoft gave away Internet Explorer for free, destroying Netscape's pay-for-access model. 

DeepSeek flung the doors open to an entirely new modality for AI, one where “the battle of usage is now more about AI inference vs Training,” to take a line from Chamath Palihapitiya.

So what does this mean for the AI-sparked data center and power plant boom? 

As we floated earlier in the article, have more efficient engines dampened demand for gasoline, or hurt industries that rely on vehicles? Jevons Paradox stipulates that as technological advancements make a resource cheaper to use, demand for that resource rises rather than falls. Bitcoin miners know the effect all too well: ASIC power efficiency has improved year after year, and hashrate has only grown alongside it. The toy model below makes the mechanism explicit. 
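
Here's a small sketch of that dynamic. It assumes a constant-elasticity demand curve for hashrate, and the elasticity value and baseline numbers are made up for illustration; the point is only that when demand is elastic enough, better efficiency grows total consumption.

```python
# Toy illustration of Jevons Paradox with ASIC efficiency (assumed numbers).
# If demand for hashrate is elastic enough, better efficiency (J/TH)
# lowers the cost of hashing and total energy use goes UP, not down.

ELASTICITY = 1.5  # assumed price elasticity of hashrate demand (>1 = elastic)

def relative_energy_use(eff_j_per_th: float, base_eff: float = 30.0) -> float:
    """Energy use vs. baseline under constant-elasticity hashrate demand."""
    cost_ratio = eff_j_per_th / base_eff          # hashing gets cheaper as eff improves
    hashrate_ratio = cost_ratio ** (-ELASTICITY)  # demand response to cheaper hashing
    return hashrate_ratio * cost_ratio            # energy = hashrate x joules per hash

for eff in (30.0, 20.0, 15.0):  # J/TH improving over ASIC generations
    print(f"{eff:>4} J/TH -> relative energy use {relative_energy_use(eff):.2f}x")
```

Halving joules-per-terahash in this toy world raises total energy use about 40%, because hashrate demand grows faster than efficiency improves; the argument for AI compute is the same shape.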

So as far as we can tell, a more powerful competitor may have entered the playing field, but the game hasn't changed. If AI inference and training costs fall (as they were always going to, eventually), that will unlock more applications and furnish greater demand.
