Google has built the world’s fastest machine learning (ML) training supercomputer. Using this supercomputer, along with its latest Tensor Processing Unit (TPU) chip, the company set AI performance records in six out of eight industry-leading MLPerf benchmarks, a Google blog said.
Naveen Kumar from Google AI said, “We achieved these results with ML model implementations in TensorFlow, JAX, and Lingvo. Four of the eight models were trained from scratch in under 30 seconds.”
Consider that in 2015, it took more than three weeks to train one of these models on the most advanced hardware accelerator available. Google’s latest TPU supercomputer can train the same model almost five orders of magnitude faster just five years later. MLPerf models are chosen to be representative of cutting-edge machine learning workloads that are common throughout industry and academia. The supercomputer Google used for the MLPerf training round is four times larger than the “Cloud TPU v3 Pod” that set three records in the previous competition.
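The "almost five orders of magnitude" figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes exactly three weeks for the 2015 run and exactly 30 seconds for the new one; the article gives these only as bounds ("more than three weeks", "under 30 seconds"), so the true speedup is at least what is computed here. The function and variable names are illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def orders_of_magnitude(old_seconds, new_seconds):
    """Return log10 of the speedup between two training times."""
    return math.log10(old_seconds / new_seconds)


# Assumed values taken from the bounds quoted in the article.
old_seconds = 21 * SECONDS_PER_DAY  # three weeks (article: "more than")
new_seconds = 30                    # 30 seconds (article: "under")

speedup = old_seconds / new_seconds
orders = orders_of_magnitude(old_seconds, new_seconds)

print(f"speedup of at least {speedup:,.0f}x, about {orders:.2f} orders of magnitude")
```

Under these assumptions the speedup is roughly 60,000 times, a little under five orders of magnitude, which matches the claim.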
The system includes 4096 TPU v3 chips and hundreds of CPU host machines, all connected via an ultra-fast, ultra-large-scale custom interconnect. In total, this system delivers over 430 PFLOPs of peak performance.
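As a rough sanity check on these figures, the quoted peak can be divided evenly across the chips. This is only a sketch: it assumes the 430 PFLOPs figure is spread uniformly over all 4096 chips and treats "over 430" as exactly 430, so the result is a lower-bound estimate rather than an official per-chip specification.

```python
# Figures quoted in the article; "over 430 PFLOPs" is treated as exactly 430.
PEAK_PFLOPS = 430
NUM_CHIPS = 4096

# 1 PFLOP = 1,000 TFLOPs
per_chip_tflops = PEAK_PFLOPS * 1000 / NUM_CHIPS

print(f"roughly {per_chip_tflops:.0f} TFLOPs of peak performance per chip")
```

That works out to about 105 TFLOPs of peak performance per TPU v3 chip.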
Graphics giant Nvidia said it also delivered the world’s fastest Artificial Intelligence (AI) training performance among commercially available chips, a feat that will help big enterprises tackle the most complex challenges in AI, data science, and scientific computing. Nvidia A100 GPUs and DGX SuperPOD systems were declared the world’s fastest commercially available products for AI training, according to MLPerf benchmarks.
The A100 Tensor Core GPU demonstrated the fastest performance per accelerator on all eight MLPerf benchmarks. The A100, the first processor based on the Nvidia Ampere architecture, hit the market faster than any previous Nvidia GPU.
Google said its MLPerf Training v0.7 submissions demonstrate its commitment to advancing machine learning research and engineering at scale and delivering those advances to users through open-source software, Google products, and Google Cloud.