Configuring 100K NVIDIA H100 GPUs Usually Takes Years, But Musk Did It In 19 Days
The cluster in question is xAI's Colossus system in Memphis, Tennessee, and it sports 100,000 NVIDIA Hopper H100 GPUs, making it theoretically the fastest AI training cluster in the world. The "nineteen days" remark is a little misleading, though; that figure covers only the time from hardware installation to the first functional AI training run. The full Colossus project was set up in 122 days from start to finish, according to Musk.
To be clear, though, both of those numbers are unbelievably short time frames. Huang is almost awestruck as he describes xAI's achievement, which we'll quote here in abridged form (Jensen was rambling a bit):
"From the moment that we decided to go ... to training: 19 days. [...] Do you know how many days 19 days is? It's just a couple of weeks, and the mountain of technology, if you were ever to see it, is just unbelievable. [...] What they achieved is singular; never been done before. Just to put it in perspective, 100,000 GPUs — that's easily the fastest supercomputer on the planet, that one cluster. A supercomputer that you would build would take normally three years to plan, and then they can deliver the equipment, and then it takes one year to get it all working. We're talking about 19 days."
While Musk is involved with supercomputing datacenters across all of his businesses, Colossus is part of xAI, his venture to become a big player in the AI space. The new system absolutely dwarfs the 2,000-GPU AI training cluster at Tesla's Austin, Texas facility, which is no surprise given that it also outclasses nearly every supercomputing cluster in the world.
The full interview at BG2 Pod is worth a watch if you're interested in AI and the future of NVIDIA. Huang has some pretty interesting ideas about what the next ten years are going to look like. We won't repeat everything he said here, so check out the video above for yourself.