Configuring 100K NVIDIA H100 GPUs Usually Takes Years, But Musk Did It In 19 Days
The cluster in question is xAI's Colossus system in Memphis, Tennessee, and it sports 100,000 NVIDIA Hopper H100 GPUs, making it theoretically the fastest AI training cluster in the world. The "nineteen days" remark is a little misleading, though; that figure covers only the time from hardware installation to the first functional AI training run. The full Colossus project was set up in 122 days from start to finish, according to Musk.
To be clear, though, both of those numbers are unbelievably short time frames. Huang is almost awestruck as he describes xAI's achievement, which we'll quote here in abridged form (Jensen was rambling a bit):
"From the moment that we decided to go ... to training: 19 days. [...] Do you know how many days 19 days is? It's just a couple of weeks, and the mountain of technology, if you were ever to see it, is just unbelievable. [...] What they achieved is singular; never been done before. Just to put it in perspective, 100,000 GPUs — that's easily the fastest supercomputer on the planet, that one cluster. A supercomputer that you would build would take normally three years to plan, and then they can deliver the equipment, and then it takes one year to get it all working. We're talking about 19 days."
While Musk is involved with supercomputing datacenters across all of his businesses, Colossus is part of xAI, his venture to become a big player in the AI space. The new system absolutely dwarfs the 2,000-GPU AI training cluster at Tesla's Austin, Texas facility, which is no surprise given that it also outclasses nearly every supercomputing cluster in the world.
The full interview at BG2 Pod is worth a watch if you're interested in AI and the future of NVIDIA. Huang has some pretty interesting ideas about what the next ten years are going to look like. We won't repeat everything he said here, so check out the video above for yourself.