Why GPU-powered AI needs fast storage

Don't let I/O bottlenecks impede the training of your AI models


Advertorial The process of training modern AI models can be incredibly resource-intensive – a single training run for a large language model can require weeks or months of high-performance compute, storage, and networking, even with the parallel processing capabilities of graphics processing units (GPUs).

As a result, many organizations are expanding their compute, storage, and networking infrastructure to keep up with AI-driven demand.

But there's a problem. AI training workloads operate on massive data sets, so it's crucial that the storage system can deliver data fast enough to prevent the GPUs from being starved of work.

IBM Storage Scale System 6000 has been engineered to address these performance-intensive requirements. It helps speed up data transfer by using NVIDIA GPUDirect Storage to establish a direct path between GPU memory and local or remote NVMe or NVMe-oF storage, removing the host server's CPU and DRAM from the data path.
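
As a rough illustration of what that direct path looks like from application code – this is a minimal sketch using NVIDIA's open source KvikIO Python bindings for the cuFile API, not IBM's implementation, and the file path and buffer size are hypothetical placeholders:

import cupy
import kvikio

# Allocate the destination buffer in GPU memory (size is a placeholder)
buf = cupy.empty(1_000_000, dtype=cupy.float32)

# cuFile/GPUDirect Storage moves data from NVMe straight into GPU memory,
# skipping the bounce buffer in host DRAM
f = kvikio.CuFile("/gpfs/dataset/shard-000.bin", "r")  # hypothetical path
nbytes = f.read(buf)
f.close()
print(f"Read {nbytes} bytes directly into GPU memory")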

IBM Storage Scale, the software that runs on Scale System 6000 hardware, provides a POSIX-style file system optimized for multi-threaded read-write operations across multiple nodes, and can act as an intermediate caching layer between the GPUs and object storage. This active file management (AFM) capability is designed to load data into the GPUs faster whenever a training job is started or restarted, which can become a significant advantage when running AI training workloads.
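
AFM itself is configured inside Storage Scale rather than in application code, but the behavior is essentially that of a read-through cache. Purely as a conceptual sketch – with hypothetical paths and a stand-in fetch function, not Storage Scale internals – the pattern looks like this:

import os
import shutil

CACHE_DIR = "/gpfs/cache"          # fast parallel file system (hypothetical)
OBJECT_STORE = "/mnt/objectstore"  # stand-in for a remote object store

def fetch_from_object_store(key: str, dest: str) -> None:
    # Stand-in for the slow path (an object-store GET in a real system)
    shutil.copyfile(os.path.join(OBJECT_STORE, key), dest)

def open_cached(key: str):
    # Serve reads from the fast tier, populating it on first access;
    # a restarted training job finds its data already cached locally
    local = os.path.join(CACHE_DIR, key)
    if not os.path.exists(local):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        fetch_from_object_store(key, local)
    return open(local, "rb")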

If a model training process were to be interrupted by a power outage or other error, the entire training run would usually need to be started from scratch. To safeguard against this, the training process stops from time to time to save a checkpoint – a snapshot of the model's entire internal state, including weights, learning rates and other variables – that allows training to be resumed from its last stored state rather than from the beginning.
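
In framework terms, a checkpoint is just that state serialized to storage. A minimal PyTorch-style sketch – with a hypothetical model, optimizer, and path – looks like this:

import torch

model = torch.nn.Linear(1024, 1024)                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step: int, path: str) -> None:
    # Snapshot the full training state: weights, optimizer moments, progress
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path: str) -> int:
    # Restore the last stored state instead of restarting from scratch
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]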

However, checkpoint storage requirements grow in step with model size, and some large language models are trained on trillions of tokens. The active file management capabilities in Storage Scale are critical here, enabling training workloads to resume more quickly from the latest checkpoint. For a multi-day or multi-week training run, that can have a major impact.
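
A back-of-the-envelope calculation shows why this matters. Assuming roughly 12 bytes per parameter – fp32 weights plus two fp32 Adam optimizer moments, with real figures varying by precision and optimizer – checkpoint sizes quickly reach hundreds of gigabytes:

def checkpoint_bytes(n_params: float, bytes_per_param: int = 12) -> float:
    # fp32 weights (4 bytes) + two fp32 Adam moments (8 bytes) per parameter
    return n_params * bytes_per_param

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B parameters -> ~{checkpoint_bytes(n) / 1e9:.0f} GB per checkpoint")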

As organizations build AI-based applications to deliver new kinds of business capabilities, foundation models are likely to continue increasing in complexity and size. That's why GPU clusters need to be paired with data storage systems that won't let I/O bottlenecks impede that progress.

Contributed by IBM.
