
Is All-Flash Storage Needed for Deep Learning? 

Organizations building deep learning (DL) data pipelines may struggle to meet their accelerated I/O needs, and whenever I/O is the question, the usual answer is “throw flash/SSD at it.” Certainly, expensive all-flash storage arrays are highly beneficial for line-of-business applications (and to storage vendors’ sales). But DL applications and workflows are inherently different from typical file-based workloads and should not be architected the same way.

Let’s start by looking inside those servers. DL uses several hidden layers of neural networks, such as convolutional (CNN), long short-term memory (LSTM), and/or recurrent (RNN) layers. These multiple layers are accommodated by servers configured with multiple GPUs, creating a massively parallel compute layer for training and inferencing.
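To make that compute layer concrete, here is a minimal sketch in PyTorch of a small network mixing convolutional and recurrent (LSTM) layers, replicated across whatever local GPUs are present. The layer sizes and the use of DataParallel are illustrative assumptions, not a prescription for a production training setup.

    import torch
    import torch.nn as nn

    class TinyConvLSTM(nn.Module):
        """Toy model combining a convolutional layer with an LSTM layer."""
        def __init__(self, n_classes=10):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional (CNN) layer
                nn.ReLU(),
                nn.AdaptiveAvgPool2d((8, 8)),
            )
            self.lstm = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)  # recurrent (LSTM) layer
            self.head = nn.Linear(64, n_classes)

        def forward(self, x):                       # x: (batch, 3, height, width)
            feats = self.conv(x)                    # (batch, 16, 8, 8)
            seq = feats.flatten(2).transpose(1, 2)  # (batch, 64, 16): spatial positions as a sequence
            out, _ = self.lstm(seq)
            return self.head(out[:, -1])            # classify from the last sequence step

    model = TinyConvLSTM()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)              # replicate the layers across all local GPUs
    model = model.to("cuda" if torch.cuda.is_available() else "cpu")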

Servers currently available for DL training include the Tesla V100-based NVIDIA DGX-1 and DGX-2 servers and DGX workstations, the Cisco UCS C480 ML, the Dell EMC PowerEdge C4140, and the HPE Apollo line. Like other AI processing, inferencing can be done on nearly any server equipped with NVIDIA T4 GPUs.

Each Tesla V100 has 5,120 CUDA cores and 640 Tensor cores optimized for mixed-precision matrix math, while each T4 has 2,560 CUDA cores and 320 Tensor cores. Even T4s provide far more parallelism than traditional CPUs with only 24 to 28 cores.

Furthermore, each server is equipped with a large memory configuration (0.5 to 1 TB of DRAM) and 16-32 TB of NVMe flash (sometimes more), usually clustered to create a distributed cache layer.

In the Groove

Any AI pipeline, including DL, starts with creating the right data sets. Getting there requires various techniques and frameworks, such as Bayesian active learning, and open-source software such as RAPIDS, which covers indexing, querying, processing, model training, and visualization.

RAPIDS, built on Apache Arrow, is a framework that enables the entire pipeline to run in GPU memory. With this efficiency of data interchange, data needs to be read only once, processed, and written back to persistent storage when done. DRAM and flash are deployed mainly in the compute layer for caching, staging, and RAPIDS-style frameworks. The storage layer is expected to deliver massive throughput to feed that memory and keep the GPUs busy.
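As an illustration of that read-once, process-in-GPU-memory, write-once pattern, here is a minimal cuDF sketch; the file names and column names are hypothetical.

    import cudf  # RAPIDS GPU DataFrame library

    # Read the raw data set from persistent storage into GPU memory once.
    df = cudf.read_parquet("raw_events.parquet")

    # Index, filter, and aggregate entirely on the GPU, with no intermediate writes.
    df = df[df["signal_strength"] > 0.5]
    summary = df.groupby("sensor_id").agg({"signal_strength": "mean"})

    # Write the processed result back to persistent storage only when done.
    summary.to_parquet("curated_features.parquet")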

DL’s workflow stages have varied I/O needs:

  • Ingest from multiple edge devices, such as remote sensors
  • Read, index/contextualize and label data sets, then write them back
  • Train the neural network
  • Run inference as the neural net makes predictions against new data sets
  • Manage the lifecycle of data sets

These read/write requirements are quite distinct from typical contention-based workloads (read, modify, write) and from the demands that drive distributed file systems and hierarchical namespaces.
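A rough sketch of that pattern, with an in-memory dict standing in for an object store and a hypothetical label_fn doing the work: data sets are read in full and written back as new objects, never modified in place.

    def run_labeling_stage(object_store, raw_keys, label_fn):
        """Read each raw object in full, label it, and write the result as a new object."""
        for key in raw_keys:
            raw = object_store[key]                         # one large sequential read
            object_store[key + ".labeled"] = label_fn(raw)  # write a new object; originals are untouched

    # Example: "label" raw sensor readings by thresholding them.
    store = {"sensor/0001": [0.2, 0.9, 0.7], "sensor/0002": [0.1, 0.4]}
    run_labeling_stage(store, ["sensor/0001", "sensor/0002"],
                       label_fn=lambda readings: [int(r > 0.5) for r in readings])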

DL environments also use engines and orchestration layers to manage the workflow, such as Kubeflow, TensorFlow Extended (TFX), NVIDIA GPU Cloud, AWS SageMaker, or Valohai, along with container-based management such as Docker, Heptio, Red Hat OpenShift, or others. These applications need to be well integrated with storage as they monitor and optimize ingest, inferencing, versioning, and other stages.

Some workflows will also need scale-out, file-based APIs, such as POSIX NFS, that are object-storage-friendly. Cloud-native workflows and frameworks will support S3 APIs, but file-based data sets need file APIs, such as when a Kubernetes pod needs access to persistent storage. Log-structured file systems capable of scaling to multiple readers and writers, and Container Storage Interface (CSI) plug-ins, establish interoperability for file and object storage platforms.
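For example, the same data set might be read through the S3 API by a cloud-native framework and through a file path by a container with a CSI-mounted volume; the endpoint, bucket, key, and mount path below are hypothetical.

    import boto3

    # Object path: cloud-native frameworks talk S3 directly.
    s3 = boto3.client("s3", endpoint_url="http://object-store.example.com")
    obj = s3.get_object(Bucket="training-data", Key="images/batch-0001.tar")
    payload_via_s3 = obj["Body"].read()

    # File path: the same object exposed inside a Kubernetes pod via a CSI-mounted volume.
    with open("/mnt/training-data/images/batch-0001.tar", "rb") as f:
        payload_via_file = f.read()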

This is simply to say that DL data pipelines are complex, demanding, and involve an ecosystem of integrated pieces. The storage architecture can add to the challenge or ameliorate it.

Behind the Boxes

The storage infrastructure has to provide massive parallelism, throughput in the range of hundreds of gigabytes per second to keep GPUs busy, and capacity scalability to accommodate large data sets. Given the high cost of flash arrays, it becomes exorbitantly expensive to build a flash tier capable of that scale. A DL data set is likely to be 10-15 petabytes initially, growing to 50+ petabytes, and data sets of hundreds of petabytes are not uncommon. At current street prices of $1.50 per gigabyte, 15 petabytes of all-flash capacity costs more than $22 million.
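The arithmetic behind that figure is simple; the sketch below uses the $1.50-per-gigabyte street price assumed above.

    PRICE_PER_GB = 1.50          # USD, the street-price assumption above
    GB_PER_PB = 1_000_000        # decimal petabytes

    for capacity_pb in (15, 50):
        cost = capacity_pb * GB_PER_PB * PRICE_PER_GB
        print(f"{capacity_pb} PB of all-flash: ${cost / 1e6:.1f} million")
    # 15 PB -> $22.5 million; 50 PB -> $75.0 million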

Frankly, there is no need for it. Flash in the compute layer already accommodates the I/O needs of ingest, training, and inferencing. Why duplicate it in the storage layer?

Object storage can provide that throughput, even on traditional spinning disk, and do it economically. Yes, hard disk AFR (annualized failure rate) is a consideration, but it is easy to address with storage software features that deliver durability through erasure coding, high availability, and geographic redundancy.
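The arithmetic of erasure coding is what makes this economical; the 8+4 layout below is just one illustrative choice, not a figure from any particular product.

    DATA_FRAGMENTS = 8       # fragments carrying data
    PARITY_FRAGMENTS = 4     # fragments carrying parity

    raw_to_usable = (DATA_FRAGMENTS + PARITY_FRAGMENTS) / DATA_FRAGMENTS
    tolerated_losses = PARITY_FRAGMENTS   # any 4 fragments (drives or nodes) can fail per object

    print(f"Raw-to-usable capacity ratio: {raw_to_usable:.2f}x")   # 1.50x, versus 2-3x for replication
    print(f"Fragment losses tolerated per object: {tolerated_losses}")

Durability then comes from spreading fragments across drives, nodes, and sites rather than from the reliability of any single disk.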

Arrays unfortunately have limited utility with off-premises data. In its paper Three Ways That AI Will Impact Your Data Management and Storage Strategy (Chirag Dekate, Julia Palmer, Arun Chandrasekaran; 27 August 2018), research firm Gartner, Inc. laid out criteria for AI storage and evaluated how well various options deliver on them. In addition to performance, scalability, cost, metadata enrichment, and workflow integration, a key consideration was “portability”: flexibility to accommodate data at the edge, core, and cloud; the ability to burst compute to the cloud; and support for public cloud as well as hybrid public/private and multi-cloud.

Flash storage flops at AI/DL due to choke points, fixed scale, excessive costs, poor workflow integration, poor metadata support and poor portability. (To be fair, Gartner also dinged parallel file systems on most of these counts, including limited portability, because they require custom agents.)

Cloud-native object storage, with an architecture that can deliver highest throughput and concurrency, can effectively meet the requirements of the entire AI/DL data pipeline in a single storage tier, delivering business outcomes and insights faster. That’s the very purpose of DL.

Data engineers can use RAPIDS or distributed caching to their advantage, and use the inherent capabilities of GPU servers, with a high-performing object-based back end to provide massive concurrency, throughput, scale, cloud enablement, metadata management – and save a bundle, too.

Shailesh Manjrekar is head of product and AI/ML solutions at SwiftStack.
