    Technology & Innovation

    MLPerf Releases AI Storage v2.0 Benchmark Results

By Art Ryan | August 6, 2025

MLCommons has announced results for its MLPerf Storage v2.0 benchmark suite, designed to measure the performance of storage systems for machine learning workloads in an architecture-neutral, representative, and reproducible manner. According to MLCommons, the results show that storage system performance continues to improve rapidly, with tested systems serving roughly twice as many accelerators as in the v1.0 benchmark round.

To view the results, visit the Storage benchmark results page.

The v2.0 benchmark adds new tests that replicate checkpointing for AI training systems. The benchmark results provide information for stakeholders who need to configure the frequency of checkpoints to optimize for high performance, particularly at scale.

As AI training systems have continued to scale up to billions and even trillions of parameters, and the largest clusters of processors have reached one hundred thousand accelerators or more, system failures have become a prominent technical challenge. Because data centers tend to run accelerators at near-maximum utilization for their entire lifecycle, both the accelerators themselves and the supporting hardware (power supplies, memory, cooling systems, etc.) are heavily burdened, shortening their expected lifetime. This is a chronic issue, especially in large clusters: if the mean time to failure for an accelerator is 50,000 hours, then a 100,000-accelerator cluster running at full utilization can expect a failure roughly every half-hour, and a cluster with one million accelerators a failure roughly every three minutes. Worse, because AI training usually involves massively parallel computation in which all the accelerators move in lockstep through the same training iteration, the failure of one processor can grind an entire cluster to a halt.
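A quick back-of-the-envelope check, sketched below in Python, reproduces those figures under the simplifying assumption that failures are independent, so the cluster-level mean time to failure is the per-component figure divided by the number of components.

```python
# Back-of-the-envelope cluster failure interval, assuming independent failures:
# cluster MTTF = component MTTF / number of components.
# An illustration of the arithmetic above, not an MLPerf metric.
def cluster_mttf_minutes(component_mttf_hours: float, num_accelerators: int) -> float:
    return component_mttf_hours / num_accelerators * 60

print(cluster_mttf_minutes(50_000, 100_000))    # 30.0 -> one failure every half-hour
print(cluster_mttf_minutes(50_000, 1_000_000))  # 3.0  -> one failure every three minutes
```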

    It is now broadly accepted that saving checkpoints of intermediate training results at regular intervals is essential to keep AI training systems running at high performance. The AI training community has developed mathematical models that can optimize cluster performance and utilization by trading off the overhead of regular checkpoints against the expected frequency and cost of failure recovery (rolling back the computation, restoring the most recent checkpoint, restarting the training from that point, and duplicating the lost work). Those models, however, require accurate data on the scale and performance of the storage systems that are used to implement the checkpointing system.
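The announcement does not single out a specific model, but a widely cited example of this trade-off is the Young/Daly approximation, which balances the cost of writing checkpoints against the expected rework after a failure. A minimal sketch, with illustrative numbers:

```python
import math

# Young/Daly approximation for the optimal checkpoint interval:
# tau_opt ~= sqrt(2 * C * M), where C is the time to write one checkpoint
# and M is the cluster's mean time between failures.
# Values below are illustrative, not from the MLPerf results.
def optimal_checkpoint_interval(checkpoint_write_secs: float,
                                cluster_mtbf_secs: float) -> float:
    return math.sqrt(2 * checkpoint_write_secs * cluster_mtbf_secs)

# A 60 s checkpoint write on a cluster that fails every 30 minutes (1800 s)
# suggests checkpointing roughly every 465 s, i.e. about every 8 minutes.
print(optimal_checkpoint_interval(60, 1800))
```

The formula makes the storage connection explicit: halving the checkpoint write time lets a cluster checkpoint more frequently and lose less work to each failure, which is exactly why the storage measurements below matter.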

    The MLPerf Storage v2.0 checkpoint benchmark tests provide precisely that data, and the results from this round suggest that stakeholders procuring AI training systems need to carefully consider the performance of the storage systems they buy, to ensure that they can store and retrieve a cluster’s checkpoints without slowing the system down to an unacceptable level. For a deeper understanding of the issues around storage systems and checkpointing, as well as of the design of the checkpointing benchmarks, we encourage you to read this post from Wes Vaske, a member of the MLPerf Storage working group.

“At the scale of computation being implemented for training large AI models, regular component failures are simply a fact of life,” said Curtis Anderson, MLPerf Storage working group co-chair. “Checkpointing is now a standard practice in these systems to mitigate failures, and we are proud to be providing critical benchmark data on storage systems to allow stakeholders to optimize their training performance. This initial round of checkpoint benchmark results shows us that current storage systems offer a wide range of performance specifications, and not all systems are well-matched to every checkpointing scenario. It also highlights the critical role of software frameworks such as PyTorch and TensorFlow in coordinating training, checkpointing, and failure recovery, as well as some opportunities for enhancing those frameworks to further improve overall system performance.”

Continuing from the v1.0 benchmark suite, the v2.0 suite measures storage performance in a diverse set of ML training scenarios. It emulates the storage demands across several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators’ “think time”, the benchmark can generate accurate storage access patterns without running the actual training, making it more accessible to all. The test focuses on a given storage system’s ability to keep pace, since the simulated accelerators must maintain a specified level of utilization.
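The announcement does not detail the emulation internals, but the pattern it describes can be sketched as follows: a simulated accelerator issues real reads against the storage under test, sleeps for the emulated compute “think time”, and passes only if the measured utilization stays above a threshold. All names and parameters here are hypothetical, not MLPerf Storage’s actual code.

```python
import time

# Hypothetical sketch of the emulated-accelerator pattern described above:
# real I/O against the storage system under test, a sleep standing in for
# the accelerator's compute, and a utilization check at the end.
def run_simulated_accelerator(read_batch, think_time_secs: float, num_steps: int,
                              min_utilization: float = 0.90) -> bool:
    io_time = compute_time = 0.0
    for _ in range(num_steps):
        start = time.monotonic()
        read_batch()                    # real read from the storage under test
        io_time += time.monotonic() - start
        time.sleep(think_time_secs)     # stand-in for the accelerator's compute
        compute_time += think_time_secs
    utilization = compute_time / (compute_time + io_time)
    return utilization >= min_utilization  # pass only if storage kept pace
```

Under this scheme a slow storage system inflates io_time and drags utilization below the threshold, which is how the benchmark turns “can the storage keep the accelerators busy?” into a pass/fail measurement.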

    The v2.0 results show that submitted storage systems have substantially increased the number of accelerators they can simultaneously support, roughly twice the number compared to the systems in the v1.0 benchmark.

    “Everything is scaling up: models, parameters, training datasets, clusters, and accelerators. It’s no surprise to see that storage system providers are innovating to support ever larger scale systems,” said Oana Balmau, MLPerf Storage working group co-chair.

    The v2.0 submissions also included a much more diverse set of technical approaches to delivering high-performance storage for AI training, including:

• 6 local storage solutions;
• 2 solutions using in-storage accelerators;
• 13 software-defined solutions;
• 12 block systems;
• 16 on-prem shared storage solutions;
• 2 object stores.

    “Necessity continues to be the mother of invention: faced with the need to deliver storage solutions that are both high-performance and at unprecedented scale, the technical community has stepped up once again and is innovating at a furious pace,” said Balmau.

The MLPerf Storage benchmark was created over three years through a collaborative engineering process by 35 leading storage solution providers and academic research groups. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.

The v2.0 benchmark results, from a broad set of technology providers, reflect the industry’s recognition of the importance of high-performance storage solutions. MLPerf Storage v2.0 includes more than 200 performance results from 26 submitting organizations: Alluxio, Argonne National Lab, DDN, ExponTech, FarmGPU, H3C, Hammerspace, HPE, JNIST/Huawei, Juicedata, Kingston, KIOXIA, Lightbits Labs, MangoBoost, Micron, Nutanix, Oracle, Quanta Computer, Samsung, Sandisk, Simplyblock, TTA, UBIX, IBM, WDC, and YanRong. The submitters represent seven different countries, demonstrating the value of the MLPerf Storage benchmark to the global community of stakeholders.

“The MLPerf Storage benchmark has set new records for an MLPerf benchmark, both for the number of organizations participating and the total number of submissions,” said David Kanter, Head of MLPerf at MLCommons. “The AI community clearly sees the importance of our work in publishing accurate, reliable, unbiased performance data on storage systems, and it has stepped up globally to be a part of it. I would especially like to welcome first-time submitters Alluxio, ExponTech, FarmGPU, H3C, Kingston, KIOXIA, Oracle, Quanta Computer, Samsung, Sandisk, TTA, UBIX, IBM, and WDC.”

“This level of participation is a game-changer for benchmarking: it enables us to openly publish more accurate and more representative data on real-world systems,” Kanter continued. “That, in turn, gives the stakeholders on the front lines the information and tools they need to succeed at their jobs. The checkpoint benchmark results are an excellent case in point: now that we can measure checkpoint performance, we can think about optimizing it.”

    Source: https://insideainews.com/