Why GPU-powered AI needs fast storage

Don't let I/O bottlenecks impede the training of your AI models


Advertorial The process of training modern AI models can be incredibly resource-intensive – a single training run for a large language model can require weeks or months of high-performance compute, storage, and networking, even with the parallel processing capabilities of graphics processing units (GPUs).

As a result, many organizations are expanding their compute, storage, and networking infrastructure to keep up with AI-driven demand.

But there's a problem. AI training workloads operate on massive data sets, so it's crucial that the storage system can deliver data fast enough to prevent the GPUs from being starved of work.

IBM Storage Scale System 6000 has been engineered to address these performance-intensive requirements. It helps speed up data transfer by using NVIDIA GPUDirect Storage to establish a direct path between GPU memory and local or remote NVMe or NVMe-oF storage, removing the host server's CPU and DRAM from the data path.
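
As a rough illustration of what that direct path looks like from application code – this is a minimal sketch using NVIDIA's open source KvikIO Python bindings for the cuFile API, not IBM's implementation, and the file path and buffer size are hypothetical placeholders:

import cupy
import kvikio

# Allocate the destination buffer in GPU memory (size is a placeholder)
buf = cupy.empty(1_000_000, dtype=cupy.float32)

# cuFile/GPUDirect Storage moves data from NVMe straight into GPU memory,
# skipping the bounce buffer in host DRAM
f = kvikio.CuFile("/gpfs/dataset/shard-000.bin", "r")  # hypothetical path
nbytes = f.read(buf)
f.close()
print(f"Read {nbytes} bytes directly into GPU memory")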

IBM Storage Scale, the software that runs on Scale System 6000 hardware, provides a POSIX-style file system optimized for multi-threaded read-write operations across multiple nodes, and can act as an intermediate caching layer between the GPUs and object storage. This active file management (AFM) capability is designed to load data into the GPUs faster whenever a training job is started or restarted, which can become a significant advantage when running AI training workloads.
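
AFM itself is configured inside Storage Scale rather than in application code, but the behavior is essentially that of a read-through cache. Purely as a conceptual sketch – with hypothetical paths and a stand-in fetch function, not Storage Scale internals – the pattern looks like this:

import os
import shutil

CACHE_DIR = "/gpfs/cache"          # fast parallel file system (hypothetical)
OBJECT_STORE = "/mnt/objectstore"  # stand-in for a remote object store

def fetch_from_object_store(key: str, dest: str) -> None:
    # Stand-in for the slow path (an object-store GET in a real system)
    shutil.copyfile(os.path.join(OBJECT_STORE, key), dest)

def open_cached(key: str):
    # Serve reads from the fast tier, populating it on first access;
    # a restarted training job finds its data already cached locally
    local = os.path.join(CACHE_DIR, key)
    if not os.path.exists(local):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        fetch_from_object_store(key, local)
    return open(local, "rb")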

If a model training process were to be interrupted by a power outage or other error, the entire training run would usually need to be started from scratch. To safeguard against this, the training process stops from time to time to save a checkpoint – a snapshot of the model's entire internal state, including weights, learning rates and other variables – that allows training to be resumed from its last stored state rather than from the beginning.
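
In framework terms, a checkpoint is just that state serialized to storage. A minimal PyTorch-style sketch – with a hypothetical model, optimizer, and path – looks like this:

import torch

model = torch.nn.Linear(1024, 1024)                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step: int, path: str) -> None:
    # Snapshot the full training state: weights, optimizer moments, progress
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path: str) -> int:
    # Restore the last stored state instead of restarting from scratch
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]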

However, checkpoint storage requirements grow in step with model size, and some large language models are trained on trillions of tokens. The active file management capabilities in Storage Scale are critical here, enabling training workloads to resume more quickly from the latest checkpoint. For a multi-day or multi-week training run, that can have a major impact.
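
A back-of-the-envelope calculation shows why this matters. Assuming roughly 12 bytes per parameter – fp32 weights plus two fp32 Adam optimizer moments, with real figures varying by precision and optimizer – checkpoint sizes quickly reach hundreds of gigabytes:

def checkpoint_bytes(n_params: float, bytes_per_param: int = 12) -> float:
    # fp32 weights (4 bytes) + two fp32 Adam moments (8 bytes) per parameter
    return n_params * bytes_per_param

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B parameters -> ~{checkpoint_bytes(n) / 1e9:.0f} GB per checkpoint")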

As organizations build AI-based applications to deliver new kinds of business capabilities, foundation models are likely to continue increasing in complexity and size. That's why GPU clusters need to be paired with data storage systems that won't let I/O bottlenecks impede that progress.

Contributed by IBM.
