LLM / FLOPs / GPU Hours | 2023 Aug Week 2 Datumo Newsletter

How much does it cost to train giant AI models?

Andrew Ng explaining the significance of AI. Image: CB Insights.
Many speakers today liken AI to infrastructure, much like electricity. Current AI technology has a cost structure in which the initial setup cost is substantial, similar to the delivery networks of Naver and Coupang, but the average cost of production falls as utilization grows.
To be precise, this explanation applies to 'giant AI' and 'foundation models.' The core of the ongoing wave of generative AI is the idea that pre-training colossal models with enormous parameter counts on extensive datasets lets them be easily adapted to specialized domains through fine-tuning.
Well-crafted foundation models like these can be applied across many industries. However, they are expensive. The most familiar foundation models today include OpenAI's GPT, backed by investments from Microsoft reported to exceed $10 billion, and Google's PaLM series.
So, how much does it cost to train a giant AI like an LLM? In essence, you can calculate the cost by knowing how long you used which GPU instances. In this letter, we will introduce model training, computational volume, and GPU-related concepts, and gradually work through the factors that need to be considered together. :)

Iterating Countless Computations to Find Optimal Parameters

Training compute for notable models (unit: petaFLOPs)
Source: Our World in Data, 'Computation used to train notable artificial intelligence systems'
First, we must consider how much computation is required for AI 'training.' Training is the process by which an AI model finds appropriate parameter values. (Parameters determine how the model processes the input information.) Through training, AI iterates by substituting numbers into parameters and performing calculations, gradually reducing the difference between predictions and actual results. This iterative process involves countless computations, and as the number of operations increases, more computing resources are necessary for training.
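The training loop described above can be sketched in a few lines of toy Python. The data, learning rate, and one-parameter model y = w·x below are illustrative assumptions, not any real model's setup:

```python
# A minimal sketch of training: one parameter w is repeatedly adjusted
# to shrink the gap between prediction and target. Every pass involves
# arithmetic operations (FLOPs), and real models repeat this across
# billions of parameters and tokens.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target); true relation is y = 2x
w = 0.0    # the single trainable parameter
lr = 0.05  # learning rate

for epoch in range(200):
    for x, y in data:
        pred = w * x         # forward pass: compute the prediction
        error = pred - y     # difference between prediction and target
        w -= lr * error * x  # gradient step: nudge w to reduce the error

print(round(w, 3))  # converges toward 2.0
```

Scaling this loop up to hundreds of billions of parameters is what makes the operation counts in the next section so large.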
The unit commonly used to measure the number of operations is FLOPs (Floating point OPerations).
*Note that FLOPs is distinct from FLOPS: the two differ only in the case of the final letter. FLOPS, with a capital S, stands for "FLoating point OPerations per Second" and measures hardware processing speed. Some documents use "FLOP" instead of "FLOPs" to denote the number of operations in AI training.

'PaLM: Scaling Language Modeling with Pathways,' Google Research, April 2022.
"Pre-training FLOPs consumption based on model parameter count."
The number of FLOPs required for model training varies based on factors such as the model's architecture, parameter count, scale of training data, and the number of epochs (iterations of training on the dataset). The structure involves increased computational load with more parameters, larger datasets, and greater repetition during training.
As an example, Google trained PaLM models with 8 billion and 540 billion parameters, respectively, on 780 billion tokens. The FLOPs counts for these models are approximately 4.29×10^22 and 2.56×10^24, respectively. Considering that 10^9 is a 'billion' and 10^12 is a 'trillion,' these numbers are truly immense. The difference in FLOPs between the two models is roughly 60-fold, in line with the ratio of their parameter counts.
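The quoted figures can be roughly reproduced with the widely used "6 × N × D" rule of thumb (N = parameters, D = training tokens), a back-of-the-envelope sketch rather than the PaLM paper's exact accounting; it ignores attention overhead, so it lands slightly below the exact numbers above:

```python
# Rough training FLOPs via the common 6 * N * D approximation:
# about 6 FLOPs per parameter per token (forward + backward pass).

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total pre-training FLOPs for a dense transformer."""
    return 6 * n_params * n_tokens

palm_8b = approx_training_flops(8e9, 780e9)      # 8B params, 780B tokens
palm_540b = approx_training_flops(540e9, 780e9)  # 540B params, 780B tokens

print(f"{palm_8b:.2e}")    # 3.74e+22, close to the quoted ~4.29e22
print(f"{palm_540b:.2e}")  # 2.53e+24, close to the quoted ~2.56e24
print(round(palm_540b / palm_8b))  # 68, roughly the 540/8 parameter ratio
```

With the same token count for both models, the FLOPs ratio tracks the parameter ratio, which is why the two quoted figures differ by roughly the same factor as the model sizes.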

FLOPs for training and inference across parameter count and training data scale.
Source: a16z enterprise

GPU Hours, MFU, and Quantization

Hourly Costs Based on GPU Models. For A100 80GB, it's slightly over $2 per hour.
Only a few companies possess computing resources capable of performing such calculations. Many companies acquire computing resources through third-party cloud computing services like Amazon Web Services (AWS) and Microsoft Azure, rather than relying on local hardware or data centers. This approach helps reduce initial investment costs, allows scalability as needed, and alleviates the burden of maintenance. The billing structure of these cloud services follows a pay-as-you-go model, where you pay for what you use.
In this context, the billing unit is not FLOPs but 'GPU hours' (or 'GPU days'): how long a GPU instance has been in use. You can estimate approximate GPU hours from a calculated FLOPs value, but a more accurate assessment also requires the concept of Model FLOPs Utilization (MFU) and the model's parameter scale.

'PaLM: Scaling Language Modeling with Pathways,' Google Research, April 2022.
First, 'Model FLOPs Utilization' (MFU) is the ratio of a model's observed training throughput to the hardware's theoretical peak, i.e., how fully the model exploits GPU performance during training. A higher MFU means less hardware time and lower cost for the same amount of computation. During pre-training of the Google PaLM model, MFU reached 46.2%; in general, MFU tends to stay below 50%.
Next, the parameter count influences the choice of GPU memory. Typically, models with more parameters require high-cost GPUs with larger memory for training. However, even with the largest memory chips, it's not always feasible to load all parameters onto a single GPU.
As a result, large-scale AI training employs parallel and distributed computing, along with quantization techniques. Larger models need to be distributed across multiple devices, transitioning into the realm of engineering in designing these computational systems.
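A quick way to see why a single GPU is not enough is to compare the memory footprint of the weights alone against one card. The precisions and the 80 GB figure below are illustrative assumptions, and real training needs several times more memory for optimizer states and activations:

```python
# Rough memory footprint of just the model weights at different numeric
# precisions, versus a single 80 GB GPU (e.g. an A100 80GB).

BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}
GPU_MEMORY_GB = 80

def weights_gb(n_params: float, dtype: str) -> float:
    """Size of the raw weights in gigabytes at the given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    size = weights_gb(70e9, dtype)  # a 70B-parameter model
    verdict = "fits" if size <= GPU_MEMORY_GB else "needs sharding"
    print(f"70B @ {dtype}: {size:.0f} GB -> {verdict}")
# fp32 (280 GB) and fp16/bf16 (140 GB) exceed one card even for the
# weights alone, which is why large models are sharded across devices;
# int8 (70 GB) squeezes in, illustrating why quantization helps.
```

This is the arithmetic behind the move to parallel and distributed training: once the weights themselves outgrow one device, system design becomes an engineering problem in its own right.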

Significantly More Affordable Fine-Tuning Compared to Pre-Training

Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave
Source: MosaicML
Now, with the groundwork laid so far, we can read the table above. It comes from MosaicML, known for the commercially usable open-source LLM 'MPT,' and shows the approximate cost of training a MosaicGPT model with 7 billion parameters on 134 billion tokens.
If NVIDIA A100s are used for training, the required GPU hours come to around 11,462, equating to roughly $25,300. With the H100, which offers higher performance than the A100, both GPU hours and cost decrease. In the table, 'fp8' and 'bf16' denote the reduced numeric precisions (quantization formats) used to compress the model.
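These figures can be sanity-checked from the concepts above: GPU hours follow from total FLOPs, per-GPU peak throughput, and MFU, and cost follows from the hourly rate. The 312 TFLOPS peak (A100 bf16), 45% MFU, and $2.21/hour values below are assumptions for illustration, not MosaicML's exact settings:

```python
# Estimate GPU hours and cost for a 7B-parameter model on 134B tokens.

total_flops = 6 * 7e9 * 134e9  # 6*N*D rule of thumb
peak_flops_per_gpu = 312e12    # assumed A100 bf16 peak, FLOPs per second
mfu = 0.45                     # assumed model FLOPs utilization
rate_per_gpu_hour = 2.21       # assumed A100 80GB cloud price, $/hour

gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
gpu_hours = gpu_seconds / 3600
cost = gpu_hours * rate_per_gpu_hour

print(f"{gpu_hours:,.0f} GPU hours, ~${cost:,.0f}")
# lands near the ~11,462 hours / ~$25,300 quoted in the table above
```

Note that total GPU hours are the same whether you run 8 GPUs for a long time or 512 GPUs for a short time; parallelism changes wall-clock time, not (to first order) the bill.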

'Llama 2: Open Foundation and Fine-Tuned Chat Models,' Meta, July 2023.
Let's also take a look at the recently unveiled Llama 2 model card from Meta AI. It's evident that the 70B model's pre-training took 1.72 million GPU hours based on A100-80GB. This is over a hundred times more than the 7B MosaicGPT, signifying training with significantly more parameters on an extensive dataset. The simple extrapolated GPU cost alone exceeds $3.5 million.
-Note: Meta conducted Llama 2 pre-training on its own supercomputer.
"We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute." (Llama 2: Open Foundation and Fine-Tuned Chat Models)
In contrast, fine-tuning costs significantly less. Stanford University mentioned that fine-tuning the previous-generation LLaMA 7B into "Alpaca 7B" took only 3 hours on eight 80GB A100 GPUs. Most cloud computing services charge less than $100 for that amount of compute.
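The Alpaca arithmetic is simple enough to check directly. The hourly rate below is an assumption based on the A100 80GB pricing mentioned earlier, not Stanford's actual bill:

```python
# Fine-tuning cost estimate: 8 A100 80GB GPUs for 3 hours.

n_gpus = 8
hours = 3
rate_per_gpu_hour = 2.21  # assumed $/GPU-hour for an A100 80GB

cost = n_gpus * hours * rate_per_gpu_hour
print(f"~${cost:.0f}")  # ~$53, comfortably under the $100 figure
```

Compare this with the millions of dollars of pre-training compute above, and the economics of fine-tuning an existing foundation model become obvious.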
During fine-tuning, the training data is far smaller and not all parameters need to be updated. This is why many companies opt for fine-tuning, or simply use APIs, rather than developing models from scratch. These days there are plenty of articles arguing for cultivating ultra-scale AI from the perspective of technological sovereignty and national competitiveness, which personally make for an engaging read.
This concludes our letter. Thank you for reading!

References:
* Andreessen Horowitz article
You can find more professional and extensive content through the links provided.

#1.
Alibaba rolls out open-sourced AI model to take on Meta's Llama 2 (Link)
Alibaba Cloud announced on the 8th that it has open-sourced 'Qwen-7B' and 'Qwen-7B-Chat,' large language models with 7 billion parameters.
#2.
CoreWeave, a GPU-focused cloud compute provider, lands $221M investment (Link)
'GPU cloud' startup CoreWeave is expected to generate hundreds of millions of dollars through its partnership with NVIDIA, as the article notes. You can find CoreWeave's GPU pricing policy outlined here.
#3.
Meta disbands protein-folding team in shift towards commercial AI (Link)
Meta Platforms has disbanded its unprofitable pure-science research team, according to recent reports. Meta CEO Mark Zuckerberg has called this year the 'Year of Efficiency,' initiating major structural adjustments and sharply reducing or dismantling unprofitable ventures in the reorganization.
#Fine-tune your AI
From data planning, collection, and processing to screening and analysis, we turn your existing data into high-quality AI training data.
Implement a generative model optimized for your target function.
The Data for Smarter AI
Datumo is an all-in-one data platform for your smarter AI.
Datumo Inc.
contact@datumo.com