H100 Supply and Demand Analysis: How long will the chip war last?
Author: Clay Pascal
Compiled by: wenli, Lavida, yunhao
Recommended by: Cage, Huaiwei
Source: Overseas Unicorns
The breakthrough of large models rests on improvements in hardware computing power and cloud computing capacity. The NVIDIA H100, regarded as the "nuclear bomb" of GPUs, is facing the most serious shortage in its history. Sam Altman has stated directly that the GPU shortage limits the speed of OpenAI's technology upgrades in fine-tuning, dedicated capacity, 32K context windows, and multimodality.
This article is compiled from GPU Utils. The author mainly discusses how long the shortage of GPUs (especially the NVIDIA H100) will last, from the perspective of supply and demand.
On the demand side, the NVIDIA H100 is unquestionably a hard requirement for training large models. Current market demand for the H100 is estimated at about 432,000 units; at roughly US$35,000 per unit, that is about US$15 billion worth of GPUs. The 432,000 figure does not include Chinese companies such as ByteDance (TikTok), Baidu, and Tencent, which need large quantities of the H800.
On the supply side, the shortage of H100 is directly limited by TSMC's production capacity, and in the short term, NVIDIA has no other alternative chip factories. Because of the limited shipments, NVIDIA also has its own strategy on how to allocate these GPUs. For NVIDIA, how to ensure that these limited GPUs flow to AI dark horses rather than potential competitors such as Google, Microsoft, and AWS is very important.
How long will this AI arms race around the H100 last? The answer is not yet clear. Although NVIDIA said that it will increase supply in the second half of the year, it seems that the shortage of GPUs may continue until 2024.
Around the H100 shortage, the market may enter a "vicious cycle": scarcity causes GPU capacity to be regarded as a moat for AI companies, which leads to more GPU hoarding, which further intensifies the scarcity.
**The following is the table of contents of this article; it is recommended to read it alongside the main points.**
👇
01 Background
02 H100 demand analysis
03 H100 supply-side analysis
04 How to get H100
05 Summary
01. Background
As of August 2023, the development of artificial intelligence has been constrained by the bottleneck in GPU supply.
"One of the reasons the AI boom is underestimated is the GPU/TPU shortage. The shortage of GPUs and TPUs limits the speed of product launches and model training, but these constraints are hidden. What we mainly see is NVIDIA's stock price soaring, not R&D progress being constrained. Things will improve once supply and demand are balanced."
—Adam D'Angelo, CEO of Quora, Poe.com, former Facebook CTO
Sam Altman said that the shortage of GPUs has limited the progress of OpenAI projects, such as fine-tuning, dedicated capacity, 32K context windows, multi-modality, etc.
Large-scale H100 clusters at both small and large cloud providers are running out of capacity.
"Everybody wants NVIDIA to make more A/H100s."
"Due to the current GPU shortage, it is better for OpenAI that fewer people use our products";
"We'd actually be happy if people used OpenAI products less because we don't have enough GPUs."
—Sam Altman, CEO, OpenAI
On the one hand, Sam Altman's words subtly show that OpenAI's products are loved by users around the world; at the same time, they also underline the fact that OpenAI needs more GPUs to further promote and upgrade its features.
Azure and Microsoft are also facing a similar situation, and an anonymous person mentioned:
• The company is restricting employees' GPU use, and everyone has to queue up to apply for computing power, like college students in the 1970s queuing for computer time. From my point of view, OpenAI is currently sucking up all the GPU resources;
• The cooperation between Microsoft and CoreWeave announced in June this year is essentially about boosting Microsoft's GPU/computing-power supply.
CoreWeave:
CoreWeave is a cloud computing-power provider; according to its official website, its services are 80% cheaper than traditional cloud vendors. In April 2023, CoreWeave received investment from NVIDIA in its Series B round and obtained a large number of new H100 cards. In June, Microsoft signed an agreement with CoreWeave under which Microsoft will invest billions of dollars over the next few years in cloud computing infrastructure.
In July, CoreWeave launched the world's fastest AI supercomputer project in partnership with NVIDIA, and Inflection AI built one of the world's most complex large language models on CoreWeave Cloud using infrastructure that supports MLPerf submissions. In addition, CoreWeave pledged its NVIDIA H100 accelerators as collateral and announced in August that it had completed $2.3 billion in debt financing.
To sum up, the supply of H100 GPUs is already quite tight. There are even rumors that **Azure and GCP are practically out of capacity, and AWS is close to running out as well.**
The reason for the shortage is that NVIDIA only supplies so many H100 GPUs to these cloud providers. As NVIDIA's H100 GPU output cannot meet the demand, the computing power that these cloud providers can provide will naturally begin to be in short supply.
If you want to understand the bottleneck of computing power, you can focus on the following questions:
• What are the specific reasons for this situation?
How big is the demand? For example, in which fields the demand for artificial intelligence is increasing relatively rapidly;
How big is the supply? Whether the production capacity of GPU manufacturers such as NVIDIA is sufficient to meet demand;
• How long will this shortage last? When will the supply and demand of GPUs gradually reach an equilibrium point?
• What are the ways in which this shortage can be effectively alleviated?
02. H100 Demand Analysis
Key questions for analyzing the computing-power bottleneck from the demand side:
Specifically, what is it that people want to buy but have trouble getting?
How big is the demand for GPU in the current market?
Why do businesses prefer the NVIDIA H100 over other GPUs?
What types of GPUs are currently on the market?
Where can enterprises buy GPUs? What are their prices?
**Who needs the H100?**
Enterprises with a demand of more than 1,000 H100 or A100:
• Startups training LLM:
OpenAI (via Azure), Anthropic, Inflection (via Azure and CoreWeave), Mistral AI;
• Cloud Service Providers (CSPs):
In addition to the three giants of Azure, GCP, and AWS, there are also Oracle, and GPU cloud providers such as CoreWeave and Lambda;
• Other tech giants:
For example, Tesla. (**Editor's note:** Meta, Apple, and other giants not mentioned by the original author also have heavy GPU demand; Google mainly uses TPUs for computation, and its H100 demand comes mainly from Google Cloud Platform.)
In addition to the companies above, any company doing a lot of LLM fine-tuning also needs to reserve at least 100 H100s or A100s.
Companies using private clouds (CoreWeave, Lambda) and companies holding hundreds to thousands of H100s are almost all working on LLMs and some diffusion models. Some companies fine-tune existing models, but more AI startups are building new large models of their own from scratch. **These companies typically sign contracts with private cloud providers in the $10-50 million range over 3 years and use a few hundred to a few thousand GPUs.**
For companies that only use a small number of on-demand H100s, LLM-related tasks make up a large share of their GPU usage; LLMs can account for more than 50% of it.
Private clouds are increasingly favored by enterprises; although enterprises usually default to the large cloud providers, those incumbents also risk losing this business.
**• Are large AI labs more constrained by inference or by training?**
This depends on how attractive their products are: the appeal of a company's products largely determines how it allocates resources. With limited resources, inference and training get different priorities at different companies. Sam Altman's view is that if a choice must be made, OpenAI would lean toward strengthening inference capability, but at present OpenAI is constrained on both fronts.
Why the H100 is a must-have for LLM training
Most of the current market uses NVIDIA H100 GPUs. This is because the H100 GPU is the fastest in terms of LLM inference and training, and it also has the best inference cost performance. Specifically, most enterprises choose to use the 8-GPU HGX H100 SXM server.
According to my analysis, for the same job, H100 is more advantageous in terms of cost. The V100 GPU is a good option if you can find a used unit, but that's often not possible.
—— an anonymous person
In terms of inference, we found the A10G GPU to be more than adequate and much less expensive.
—— A private cloud executive
We have noticed that Falcon 40B and LLaMA 2 70B are also being used heavily, so that statement is no longer accurate; for these larger models, interconnect speed becomes very important for inference.
— (Another) Private Cloud Executive
Falcon 40B:
Falcon is a foundational large language model with 40 billion parameters. Falcon 40B aims to achieve better results with less training compute: it used only 75% of GPT-3's training compute, 40% of Chinchilla's, and 80% of PaLM-62B's. On May 25, 2023, the UAE's Technology Innovation Institute announced that it would open-source Falcon 40B for research and commercial use. After its release, it briefly topped the Hugging Face open-source LLM leaderboard.
**• What are the common needs of LLM startup teams?**
**For LLM training, LLM startups often choose H100 GPUs with 3.2 Tb/s InfiniBand. While almost everyone prefers the H100 for training, for inference these companies pay more attention to cost-effectiveness, i.e. performance per dollar.**
The H100's performance per dollar still has some issues compared with the A100, but the H100 remains preferred because of its better scaling and faster training times; compressing the time it takes to start, train, or improve a model is critical for startups.
"For multi-node training, they all require an A100 or H100 GPU with InfiniBand networking. The only non-A/H100 requirement we observed was for inference, where the workload was single GPU or single node."
—— A private cloud executive
The main factors affecting LLM training are listed below (a rough sketch of how the first two interact follows the list):
**• Memory bandwidth:** higher memory bandwidth speeds up loading the large volumes of data read from memory;
**• Compute throughput (FLOPS, floating-point operations per second):** Tensor Cores or equivalent matrix-multiplication units, which mainly determine calculation speed;
**• Cache and cache latency:** caches hold data for repeated access and have a significant impact on performance;
**• Additional features:** low-precision numeric formats such as FP8 (8-bit floating point) can speed up training and inference;
**• Compute parallelism:** related to the number of CUDA cores, which mainly affects how much work can run in parallel;
**• Interconnect speed:** fast inter-node interconnect bandwidth such as InfiniBand affects the speed of distributed training.
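To make the interplay between memory bandwidth and compute throughput concrete, here is a rough roofline-style sketch. The peak figures are illustrative order-of-magnitude assumptions, not official specifications of any particular GPU.

```python
# Rough roofline-style check: is a workload compute-bound or memory-bound?
# All numbers below are illustrative assumptions, not official specs.

peak_flops = 1.0e15        # assumed peak throughput, FLOP/s (order of magnitude at low precision)
mem_bandwidth = 3.0e12     # assumed memory bandwidth, bytes/s

# Machine balance: FLOPs the GPU can perform per byte it can load.
machine_balance = peak_flops / mem_bandwidth   # ~333 FLOP/byte here

def bound_by(flops_per_byte: float) -> str:
    """Compare a kernel's arithmetic intensity with the machine balance."""
    return "compute-bound" if flops_per_byte >= machine_balance else "memory-bound"

# Large matrix multiplications (the bulk of LLM training) have high arithmetic
# intensity, while bandwidth-heavy steps such as KV-cache reads are much lower.
print(bound_by(1000.0))  # e.g. a big GEMM          -> compute-bound
print(bound_by(10.0))    # e.g. a bandwidth-heavy op -> memory-bound
```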
**The H100 is preferred over the A100 in part because of its lower cache latency and FP8 compute capability.**
The H100 really is the first choice: it is up to 3x more efficient than the A100 but costs only 1.5-2x as much. Considering the cost of the whole system, the H100's performance per dollar is also much higher; at the system level, it may be 4-5 times higher per dollar.
—— A deep learning researcher
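As a back-of-the-envelope check of the quoted ratios (the speedup and price figures are the quote's, not independent measurements), the GPU-only performance per dollar works out roughly as follows:

```python
# Back-of-the-envelope performance-per-dollar check using the ratios quoted above.
h100_speedup = 3.0        # "up to 3x more efficient than the A100"
h100_price_ratio = 1.75   # "costs only 1.5 - 2x the A100" (midpoint)

perf_per_dollar_gain = h100_speedup / h100_price_ratio
print(f"H100 perf/$ vs A100 (GPU only): ~{perf_per_dollar_gain:.1f}x")
# At the whole-system level, non-GPU costs (CPUs, networking, power, hosting)
# stay roughly fixed per server while throughput rises, which is roughly how
# one could arrive at the quoted 4-5x system-level perf/$ figure.
```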
**Why is numerical precision so important?**
Low-precision floating-point numbers improve training and inference speed. For example, FP16 has half the memory footprint of FP32 and is up to three times faster in calculation speed. During LLM training, to balance speed and precision, methods such as mixed precision and adaptive precision are used to accelerate large language models, so support for multiple precisions is an important consideration for LLM training. Google proposed the bfloat16 (BF16) numerical format, which widens the numeric range while reducing precision and performs better than FP32 for training.
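As an illustration of how low-precision formats are used in practice, below is a minimal mixed-precision training step in PyTorch; it assumes a CUDA GPU and uses a placeholder model and random data, so it is a sketch rather than a production recipe.

```python
import torch
import torch.nn as nn

# Minimal mixed-precision training step (sketch; model and data are placeholders).
device = "cuda"
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # loss scaling, needed for FP16

x = torch.randn(8, 1024, device=device)
target = torch.randn(8, 1024, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # Matmuls run in FP16 on Tensor Cores; numerically sensitive ops stay in FP32.
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()

# With bfloat16 (the wider-exponent BF16 format mentioned above), the GradScaler
# is typically unnecessary:
#   with torch.autocast(device_type="cuda", dtype=torch.bfloat16): ...
```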
**• Besides GPUs, what are the other cost drivers in LLM training and operation?**
GPUs are currently the most expensive component of LLM training infrastructure, but the other costs are not low either, and they also affect LLM training and operating costs:
System memory and NVMe SSDs are expensive: Large models require a lot of high-speed memory and high-speed SSDs to cache and load data, and both components are expensive;
High-speed networks are expensive: High-speed networks such as InfiniBand (used for communication between nodes) are very expensive, especially for large, distributed training.
Perhaps 10%-15% of the total cost of running a cluster goes to electricity and hosting, split roughly evenly between the two. Electricity-related costs (electricity, data center construction, land, staff) are about 5%-8%; hosting costs (land, buildings, staff) are about 5%-10%.
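Applying the rough shares above to an assumed (purely illustrative) annual running cost gives a sense of the split:

```python
# Illustrative split of annual cluster running costs using the rough shares above.
# The total is an arbitrary example figure, not a real budget.
total_annual_cost = 10_000_000          # example: $10M/year to run a cluster
electricity_share = (0.05, 0.08)        # electricity, DC construction, land, staff
hosting_share = (0.05, 0.10)            # hosting: land, buildings, staff

elec = tuple(total_annual_cost * s for s in electricity_share)
host = tuple(total_annual_cost * s for s in hosting_share)
print(f"electricity-related: ${elec[0]:,.0f} - ${elec[1]:,.0f}")
print(f"hosting-related:     ${host[0]:,.0f} - ${host[1]:,.0f}")
# Combined: roughly 10-15% of the total, with the remainder dominated by the
# GPUs themselves, other server components, and networking.
```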
Our main concerns are networking and reliable data centers. AWS was not a good fit due to network limitations and unreliable hardware.
——Deep Learning Researcher
**• How does GPUDirect technology help in LLM training?**
NVIDIA's GPUDirect is not required for LLM training, but it can help performance:
GPUDirect can improve performance, but the difference is not necessarily critical. It mostly depends on where your system bottleneck is; for some architectures/software implementations, the bottleneck is not the network. **But when the network is the bottleneck, GPUDirect can improve performance by 10%-20%, which is a considerable number given how expensive training runs are.**
Nonetheless, GPUDirect RDMA is now so ubiquitous that its popularity almost speaks for itself. I think GPUDirect support is weak for non-Infiniband networks, but most GPU clusters optimized for neural network training have Infiniband networks/cards. The bigger factor for performance is probably NVLink, since it's rarer than Infiniband, but it's also only critical if you employ a specific parallelization strategy.
So features like powerful networking and GPUDirect can make less sophisticated software work out of the box. However, GPUDirect is not strictly required if cost or legacy infrastructure is considered.
—— A deep learning researcher
GPUDirect:
GPUDirect Storage, a data-transfer technology introduced by NVIDIA, is mainly used to speed up moving data from various storage devices into GPU memory; it can increase bandwidth by 2 to 8 times and reduce end-to-end latency by up to 3.8 times. In the past, the CPU was responsible for loading data from storage into the GPU, which greatly limited hardware performance.
The standard path for transferring data from an NVMe disk to GPU memory goes through a bounce buffer in system memory, which means an extra data copy. The core of GPUDirect Storage is to avoid the bounce buffer, eliminating that extra copy, and to use direct memory access (DMA) to place data directly into GPU memory.
**Why can't LLM companies use AMD GPUs?**
An executive at a private cloud company said that buying AMD GPUs is theoretically feasible, but it takes time to go from purchase to actually operating the equipment, which means entering the market later. CUDA is therefore NVIDIA's current moat.
A MosaicML study noted that AMD GPUs are also suitable for large-model training. They ran a simple PyTorch-based training task with no code modifications compared to running on NVIDIA hardware. The authors showed that as long as the code base is built on PyTorch, it can run on AMD without additional adaptation, and they plan to verify the AMD system's performance on a larger compute cluster in the future.
At the same time, there is a view that with a single model training run costing close to US$300 million, no one will risk relying at scale on chips from AMD or other startups, especially when chip demand is on the order of more than 10,000 units.
A retiree in the semiconductor industry also mentioned that AMD's supply situation is not optimistic, and TSMC's CoWoS production capacity has been absorbed by NVIDIA, so although MI250 may be a viable alternative, it is also difficult to obtain.
H100 vs. A100
NVIDIA A100:
The A100 is the upgrade of the NVIDIA V100: compared with the V100, its performance is improved by up to 20 times, making it well suited to workloads such as AI and data analytics. Built from 54 billion transistors, the A100 integrates third-generation Tensor Cores with acceleration for sparse matrix operations, which is especially useful for AI inference and training. In addition, multiple A100 GPUs can be combined via NVIDIA NVLink interconnect technology for larger AI inference workloads.
NVIDIA H100:
The H100 is the successor to the A100 and the latest chip optimized for large models. It is based on the Hopper architecture, built on TSMC's custom 5nm-class process (4N), and a single chip contains 80 billion transistors. NVIDIA also introduced the Transformer Engine, which combines multiple precisions with dynamic handling of Transformer networks, enabling the H100 to greatly reduce model training time. Based on the H100, NVIDIA has also launched machine-learning workstations and supercomputers, such as the DGX H100, a "giant GPU" formed from 8 H100s joined by 4 NVLink switch chips.
Compared to the A100, the H100's 16-bit inference speed is about 3.5 times faster, and the 16-bit training speed is about 2.3 times faster.
Most people tend to buy the H100 for model training and inference, and use the A100 mainly for model inference. However, one may also consider the following factors:
**• Cost:** the H100 is more expensive than the A100;
**• Capacity:** the A100 and H100 differ in compute power and memory;
**• Adopting new hardware:** using the H100 requires corresponding adjustments to software and workflows;
**• Risk:** there are more unknown risks in setting up the H100;
**• Software optimization:** some software has already been optimized for the A100.
Overall, despite the H100's higher performance, there are times when it makes sense to choose the A100, **which makes upgrading from the A100 to the H100 not an easy decision; many factors have to be weighed.**
In fact, in a few years the A100 will be where the V100 is today. Given its performance constraints, almost no one trains LLMs on the V100 now, but it is still used for inference and other tasks. Likewise, the price of the A100 may drop as more AI companies turn to the H100 to train new models, but there will always be demand for the A100, especially for inference.
I think that could lead to a flood of A100s in the market again as some hugely funded startups end up going out of business.
— (Another) Private Cloud Executive
But over time, the A100 will be used for more and more inference tasks rather than for training the latest, larger models. **The V100's performance can no longer support large-model training, and high-memory cards are better suited to large models, so cutting-edge teams prefer the H100 or A100.**
The main reason for not using the V100 is the lack of the brain floating point (bfloat16, BF16) data type; without it, models are hard to train easily. The poor performance of OPT and BLOOM is largely due to the absence of this data type (OPT was trained in float16, and BLOOM's prototyping was mostly done in FP16, which made it impossible to carry the results over to training runs done in BF16).
——Deep Learning Researcher
**• What is the difference between NVIDIA's H100, GH200, DGX GH200, HGX H100, and DGX H100?**
• H100 = 1x H100 GPU;
• HGX H100 = NVIDIA server reference platform. Used by OEMs to build 4-GPU or 8-GPU servers, manufactured by third-party OEMs such as Supermicro;
• DGX H100 = Official NVIDIA H100 server with 8x H100, NVIDIA is its sole supplier;
• GH200 = 1x H100 GPU plus 1x Grace CPU;
• DGX GH200 = 256x GH200, coming late 2023, probably only from NVIDIA;
• MGX for large cloud computing companies.
Of these, most companies chose to purchase the 8-GPU HGX H100 instead of the DGX H100 or 4-GPU HGX H100 servers.
**How much do these GPUs cost?**
One DGX H100 (SXM) with 8x H100 GPUs costs $460,000, of which required support services account for about $100,000. Startups can get an Inception discount of about $50,000, applicable to up to 8 DGX H100 boxes (64 H100s in total).
Approximate pricing for the other configurations is as follows:
1x HGX H100 (SXM) with 8x H100 GPUs is priced between $300,000-380,000 depending on specs (network, storage, memory, CPU) and vendor margins and support levels. If the specs are exactly the same as the DGX H100, businesses may pay a higher price of $360,000 to $380,000 including support.
1x HGX H100 (PCIe) with 8x H100 GPUs is approximately $300k including support, depending on specs.
The market price for a PCIe card is around $30,000 to $32,000.
SXM graphics cards are not sold as single cards, so pricing is difficult. Generally only sold as 4GPU and 8GPU servers.
About 70-80% of the demand in the market is for SXM H100, and the rest is for PCIe H100. Demand for the SXM segment is on the rise, as only PCIe cards were available in previous months. Given that most companies are buying 8GPU HGX H100s (SXMs), that's roughly $360,000-$380,000 per 8 H100s, including other server components.
DGX GH200 contains 256x GH200, and each GH200 contains 1x H100 GPU and 1x Grace CPU. According to estimates, the cost of DGX GH200 may be between 15 million - 25 million US dollars.
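Dividing the server prices quoted above (the article's figures) by eight gives a rough per-GPU cost; the arithmetic below is only an illustration of that division.

```python
# Rough per-GPU cost implied by the server prices quoted above (article's figures).
servers = {
    "DGX H100 (8x SXM, incl. support)":  (460_000, 460_000),
    "HGX H100 (8x SXM, typical range)":  (360_000, 380_000),
    "HGX H100 (8x PCIe, incl. support)": (300_000, 300_000),
}
for name, (lo, hi) in servers.items():
    print(f"{name}: ${lo / 8:,.0f} - ${hi / 8:,.0f} per GPU (incl. share of server)")

# Standalone PCIe card for comparison: ~$30,000 - 32,000 (no server around it).
```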
**What is the market demand for GPUs?**
• GPT-4 may have been trained on 10,000 to 25,000 A100s;
• Meta has about 21,000 A100s, Tesla has about 7,000 A100s, and Stability AI has about 5,000 A100s;
• Falcon 40B was trained on 384 A100s;
• Inflection is using 3,500 H100s for its GPT-3.5-equivalent model.
We will have 22,000 GPUs in use by December, and well over 3,500 units in use today.
— Mustafa Suleyman, CEO, Inflection AI
**According to Elon Musk, GPT-5 training may use 30,000-50,000 H100s.** Morgan Stanley suggested in February 2023 that GPT-5 would use 25,000 GPUs and that GPT-5 was already in training, but Sam Altman denied this in May, saying OpenAI was not training GPT-5, so Morgan Stanley's information may not be accurate.
GCP has about 25,000 H100s, and Azure may have 10,000-40,000 H100s. It should be similar for Oracle. Additionally, most of Azure's capacity will be provisioned to OpenAI.
CoreWeave maintains approximately 35,000 to 40,000 H100s, but this is based on orders, not actuals.
**How many H100s do startups order?** For LLM fine-tuning, typically tens or hundreds of units; for LLM training, thousands.
**How many H100s might companies in the LLM sector need?**
• OpenAI may need 50,000, Inflection may need 24,000, and Meta may need 25,000 (there are also sayings that Meta actually needs 100,000 or more);
• Large cloud service providers, such as Azure, Google Cloud, AWS and Oracle may each need 30,000;
• Private cloud service providers, such as Lambda and CoreWeave, and other private clouds may add up to 100,000;
• Anthropic, Helsing, Mistral, and Character may need 10k each.
The numbers above are estimates and guesses, and some may be double-counted (for example, customers who lease capacity from the clouds). **In total, current calculations put the figure at about 432,000 H100s; at roughly US$35,000 each, that is about US$15 billion worth of GPUs. And the 432,000 figure does not include Chinese companies like ByteDance (TikTok), Baidu, and Tencent, which require a lot of H800s.**
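For transparency, the tally can be reproduced directly from the listed estimates; note that the listed items alone sum to less than the ~432,000 headline figure, which also covers smaller buyers.

```python
# Reproducing the article's rough demand tally (all figures are the article's
# estimates/guesses and may double-count, e.g. cloud capacity leased by labs).
demand_estimates = {
    "OpenAI": 50_000,
    "Inflection": 24_000,
    "Meta": 25_000,                                  # some say 100k+
    "Large CSPs (Azure, GCP, AWS, Oracle)": 4 * 30_000,
    "Private clouds (Lambda, CoreWeave, ...)": 100_000,
    "Anthropic, Helsing, Mistral, Character": 4 * 10_000,
}
total_units = sum(demand_estimates.values())
price_per_unit = 35_000                              # USD, approximate
print(f"~{total_units:,} H100s listed, ~${total_units * price_per_unit / 1e9:.1f}B")
# -> ~359,000 units and ~$12.6B from the items listed; the article's ~432,000 /
#    ~$15B total also includes smaller buyers (fine-tuning startups, financial
#    firms, and so on).
```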
In addition, some financial firms, such as Jane Street, JP Morgan, Two Sigma, and Citadel, are also deploying A100s/H100s in the hundreds to thousands.
**How does this compare with NVIDIA's data center revenue?** NVIDIA's data center revenue was $4.28 billion for February-April 2023. For the May-July 2023 quarter, data center revenue could be around $8 billion. **This is based primarily on the assumption that the higher revenue guidance for that quarter is driven mainly by data center revenue rather than by other business areas.**
Therefore, it may take some time for the supply shortage to ease. But it is also possible that the computing-power shortage has been exaggerated: first, most companies do not buy all the H100s they need at once, but upgrade gradually; second, NVIDIA is actively increasing production capacity.
Having 400,000 H100s in the market as a whole is not out of reach, especially considering that everyone is deploying H100s in 4- or 5-figure quantities these days.
—— A private cloud executive
Summary
• Most large CSPs (Azure, AWS, GCP, and Oracle) and private clouds (CoreWeave, Lambda, and others) want more H100 GPUs than they can currently get access to, and most of the large AI product companies are also chasing more H100s.
• Typically these companies want an 8-GPU HGX H100 chassis with SXM cards. Depending on specs and support, each 8-GPU server costs roughly $360,000-380,000. There may be excess demand for hundreds of thousands of H100 GPUs, worth more than $15 billion in total;
• With limited supply, NVIDIA could have raised prices to find a market equilibrium price, and to some extent it did. All in all, the ultimate decision on how to allocate the H100 GPU depends on which customers NVIDIA itself prefers to allocate it to.
03. H100 Supply-Side Analysis
Bottleneck from TSMC
The H100 is produced by TSMC. **Can NVIDIA use other fabs to produce more H100s? Not yet, at least.**
NVIDIA has worked with Samsung in the past, but Samsung could not meet its needs for cutting-edge GPUs, so for now NVIDIA can only have the H100 and its other 5nm GPUs produced by TSMC. **Perhaps in the future NVIDIA will work with Intel, or continue working with Samsung on related technology, but neither will happen in the short term, so the H100 supply shortage will not ease soon.**
TSMC's 5-nanometer (N5) technology entered mass production in 2020. N5 is TSMC's second EUV process technology, offering higher speed and lower power consumption than the earlier N7. TSMC also launched 4-nanometer (N4) technology, an enhanced version of N5 with further improvements in performance and power consumption, which entered mass production in 2022.
The H100 is built on TSMC's 4N process, an enhanced process in the 5nm family rather than a true 4nm node. **Apart from NVIDIA, Apple also uses this family of processes, but it has largely moved to N3 and reserved most of the N3 capacity.** Qualcomm and AMD are also big customers of the N5 family.
The A100 uses TSMC's N7 process.
7-nanometer (N7) is the process node TSMC put into mass production in 2019. On the basis of N7, TSMC also introduced the N7+ process, a 7nm process using EUV (extreme ultraviolet lithography) that increases transistor density by 15%-20% while reducing chip power consumption.
Generally, front-end fab capacity is planned more than 12 months in advance. TSMC and its major customers jointly plan production demand for the coming year, so the current H100 shortage is partly due to TSMC and NVIDIA misjudging this year's H100 demand last year.
Fab Capacity:
In the semiconductor process flow, "fab" is short for fabrication, and fab capacity refers to wafer-fabrication capacity.
According to another source, it usually takes 6 months from the start of production for an H100 to reach customers (production, packaging, and testing), though this has not been confirmed.
A retired professional in the semiconductor industry pointed out that wafer production capacity is not the bottleneck of TSMC, but the real bottleneck lies in the aforementioned CoWoS (three-dimensional stacking).
CoWoS (Chip on wafer on Substrate, three-dimensional stacking):
It is TSMC's 2.5D integrated packaging technology: chips are first bonded to a silicon interposer wafer via the CoW (Chip on Wafer) process, and the CoW assembly is then bonded to the substrate, forming CoWoS.
According to DigiTimes, TSMC has begun to expand its CoWoS production capacity, and plans to increase CoWoS production capacity from 8,000 wafers per month to 11,000 wafers per month by the end of 2023, and to around 14,500 to 16,600 wafers per month by the end of 2024. Major tech giants such as NVIDIA, Amazon, Broadcom, Cisco, and Xilinx have all increased demand for TSMC's advanced CoWoS packaging.
H100 Memory
**Memory type, memory bus width, and memory clock speed together determine a GPU's memory bandwidth.** NVIDIA designed the H100's bus width and clock speed as part of the GPU architecture. HBM3 memory is used mainly on the H100 SXM, and HBM2e mainly on the H100 PCIe.
HBM is difficult to produce and the supply is very limited, so producing HBM is a nightmare. But once the HBM is produced, the rest of the design becomes easy.
——A deep learning researcher
**Memory type, memory bus width, and memory clock speed are three important indicators of memory performance.**
Memory Bus Width:
It refers to the width of the data transmission channel between the memory module and the motherboard. A wider memory bus width can provide a larger data path, thereby increasing the data transmission speed between the memory and the processor.
Memory Clock Speed:
Refers to the working clock frequency of the memory module. A higher memory clock speed means that the memory can perform read and write operations faster and provide a higher data transmission speed.
HBM (High Bandwidth Memory):
HBM is a high-bandwidth memory technology used to provide fast memory access in GPUs and other high-performance computing devices. The memory used in traditional graphics cards and computing devices is usually based on GDDR (Graphics Double Data Rate), which balances performance and power consumption. HBM achieves higher bandwidth and lower power consumption by placing memory stacks close to the GPU on the same package and stacking multiple DRAM dies connected by high-speed vertical interconnects (TSVs).
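As a rough illustration of how bus width and effective data rate combine into bandwidth, the sketch below uses commonly cited approximations for the H100 SXM's HBM3 interface; treat the specific figures as assumptions.

```python
# How bus width and effective data rate combine into memory bandwidth.
# Figures are commonly cited approximations for H100 SXM (HBM3), used here
# as assumptions for illustration only.
bus_width_bits = 5120              # assumed total HBM interface width
data_rate_gbps_per_pin = 5.2       # assumed effective transfer rate per pin

bandwidth_gb_s = bus_width_bits / 8 * data_rate_gbps_per_pin
print(f"~{bandwidth_gb_s:,.0f} GB/s")   # ~3,300 GB/s, i.e. roughly 3.3 TB/s
```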
For HBM3 memory, NVIDIA may use SK Hynix exclusively or predominantly. It is unclear whether the H100 uses Samsung memory, but NVIDIA certainly does not currently use Micron's.
As far as HBM3 is concerned, generally speaking, SK Hynix has the largest output, followed by Samsung, and the third-ranked Micron has a large output gap with the former two. It appears that SK Hynix has ramped up production, but NVIDIA still wants them to produce more, while Samsung and Micron haven't managed to ramp up production yet.
**What else goes into manufacturing GPUs?**
GPU production also involves many metals and components, and shortages of raw materials at these stages can likewise create GPU supply bottlenecks, for example:
**• Metals and chemicals:** including silicon (a metalloid), rare earths, copper, tantalum, gold, aluminum, nickel, tin, indium, and palladium, used at every stage from wafer fabrication to final GPU assembly;
**• Components and packaging materials:** such as substrates, solder balls, bonding wires, and thermal compounds, used to assemble and connect the GPU's components, which are critical to its operation;
**• Energy consumption:** manufacturing GPU chips relies on high-precision equipment and requires large amounts of electricity.
**How is NVIDIA addressing the H100 shortage?**
NVIDIA has said it will increase supply in the second half of this year. Its CFO said on the earnings call that the company is doing its best to resolve the supply problem, but beyond that conveyed no further information and gave no specific figures for the H100.
"We're working through our supply issues for the quarter, but we've also bought a lot of stock for the second half of the year."
"We believe that the supply in the second half of the year will be significantly higher than in the first half."
-- Colette Kress, Nvidia's CFO, on the February-April 2023 earnings call
A private cloud executive believes that **a vicious circle may emerge in the market next: scarcity causes GPU capacity to be regarded as a moat for AI companies, which leads to more GPU hoarding, which in turn further exacerbates the scarcity of GPUs.**
Judging from the historical interval between NVIDIA's architecture launches, the H100's next-generation successor may not be released until late 2024 to early 2025. Until then, the H100 will remain NVIDIA's top-end GPU (the GH200 and DGX GH200 don't count; they are not pure GPUs, and both use the H100 as their GPU).
In addition, it is expected that there will be a 120GB version with larger memory in the future.
04. How to get H100
Sellers of the H100
Original Equipment Manufacturers (OEMs) such as Dell, HPE, Lenovo, Supermicro and Quanta are selling the H100 and HGX H100, while ordering InfiniBand needs to be done through NVIDIA Mellanox.
Mellanox is one of the major global InfiniBand suppliers. In 2015, Mellanox's share in the global IB market reached 80%. In 2019, NVIDIA acquired Mellanox for $125 per share, for a total transaction value of approximately $6.9 billion. This acquisition enables NVIDIA to further expand its market share in high-performance computing and data centers, and strengthens NVIDIA's competitiveness in the field of AI.
By combining Mellanox's high-speed interconnect technology with NVIDIA's GPU accelerators, NVIDIA can provide data centers with higher bandwidth and lower latency solutions. In addition to Mellanox, the IB technology of QLogic, another supplier in the IB field, was acquired by Intel Corporation in 2012.
GPU clouds like CoreWeave and Lambda buy GPUs from OEMs and lease them to startups. The hyperscale cloud players (Azure, GCP, AWS, Oracle) buy more directly from NVIDIA, but they also sometimes work with OEMs.
For DGX, the purchase is also done through OEM. Although customers can communicate with NVIDIA on purchasing requirements, the purchase is through OEM instead of directly placing a purchase order with NVIDIA.
The delivery times for the 8 GPU HGX servers are terrible and the 4 GPU HGX servers are pretty good, but the reality is that everyone wants 8 GPU servers.
**• How long does it take from placing an order to deploying the H100?**
Deployment is phased. For an order of 5,000 GPUs, a customer might get access to 2,000-4,000 of them within 4-5 months, and the rest within about 6 months.
Startups that want GPUs generally do not place orders with an OEM or reseller. They typically use public clouds such as Oracle, rent access from private clouds such as Lambda and CoreWeave, or lease access through providers such as FluidStack and through OEMs and providers that work with data centers.
**• Should an enterprise build its own data center or use colocation?**
Factors to consider when building a data center include how long it takes, whether the company has the hardware talent and experience, and the scale of the capital investment.
Renting and hosting a server is much easier. If you want to build your own data center, you have to lay a dark fiber line to your location to connect to the Internet, and the cost of fiber is $10,000 per kilometer. During the Internet boom, most of the infrastructure was already built and paid for. Now, you can just rent, and it's pretty cheap.
—— A private cloud executive
Renting versus building a data center is not an all-or-nothing decision; depending on actual needs, enterprises have the following options:
On-demand cloud: purely use cloud services for leasing;
Reserved cloud;
Hosting (purchasing a server, cooperating with a provider to host and manage the server);
Self-hosting (purchasing and hosting a server yourself).
Most Startups that need a lot of H100 will opt for reserved cloud or colocation.
**How do enterprises choose a cloud provider?**
There is a view that Oracle's infrastructure is not as reliable as the three major clouds, but that it is willing to spend more time on customer technical support. Some private cloud practitioners said they were sure that services built on Oracle would have many dissatisfied customers, while CEOs at some other companies believe Oracle's networking capability is stronger.
**Generally, startups choose the provider with the strongest combination of service support, price, and capacity.**
The main differences between several large cloud service companies are:
**• Networking:** AWS and Google Cloud have been slower to adopt InfiniBand because they have their own approaches, but most startups looking for large A100/H100 clusters want InfiniBand;
**• Availability:** for example, most of Azure's H100 computing power is used by OpenAI, which means there may not be much left for other customers.
**Although there is no factual basis for it, there is speculation that NVIDIA is more inclined to prioritize GPU supply for cloud providers that have not developed competing machine-learning chips.** All three major cloud providers are developing their own machine-learning chips, and AWS's and Google's NVIDIA alternatives are already on the market, taking some of NVIDIA's share. This has also led to market speculation that NVIDIA is more willing to cooperate with Oracle for this reason.
Some of the big cloud companies have better prices than others. As one private cloud exec noted, "For example, A100 on AWS/AZURE is much more expensive than GCP."
Oracle told me they will have "tens of thousands of H100s" in service later this year, but their pricing is higher than other companies'. They didn't give me H100 pricing, but for the A100 80GB they quoted close to $4/hour, almost 2x what GCP quoted for the same power consumption and configuration.
— Anonymous
Smaller clouds have an advantage in terms of pricing, except in some cases where one of the big cloud companies might do an odd deal in exchange for equity.
So on the whole, in terms of the closeness of cooperation with NVIDIA, Oracle and Azure > GCP and AWS, but this is just a guess.
Oracle was an early adopter of the A100 and hosts NVIDIA-based clusters in partnership with NVIDIA (NVIDIA is itself also an Azure customer).
**• Which large cloud company has the best network performance?**
Azure, CoreWeave, and Lambda all use InfiniBand. Oracle's network performance is good at 3,200 Gbps, but it uses Ethernet rather than InfiniBand and can be around 15-20% slower than IB for use cases like training LLMs with high parameter counts. AWS's and GCP's networks are not as good.
**• How do enterprises choose cloud services at present?**
A survey of 15 companies found that all 15 chose AWS, GCP, or Azure; Oracle was not among their choices.
Most businesses tend to stick with their existing cloud. But for startup teams, the choice is more pragmatic: whoever can supply the computing power wins.
**• Who is NVIDIA working with on DGX Cloud?**
"NVIDIA is partnering with leading cloud service providers to host DGX Cloud infrastructure, starting with Oracle Cloud Infrastructure" — in other words, NVIDIA sells it, but it is leased through existing cloud providers (first Oracle, then Azure, followed by Google Cloud; NVIDIA has not worked with AWS on this).
NVIDIA CEO Jensen Huang said on NVIDIA's earnings call that "the ideal mix is 10% NVIDIA DGX cloud and 90% CSP cloud".
• The cloud giants' H100 rollout schedules:
CoreWeave was among the first. As an investor in CoreWeave, and to intensify competition among the large cloud companies, NVIDIA completed delivery to CoreWeave first.
The H100 schedule of other cloud service companies is as follows:
• Azure announced the availability of H100 for preview on March 13;
• Oracle announced limited supply of H100 on March 21;
• Lambda Labs announced on March 21 that it will launch the H100 in early April;
• AWS announced on March 21 that the H100 will be in preview in a few weeks;
• Google Cloud announced the start of the H100 private preview on May 10th.
**• Which cloud services are different companies using?**
• OpenAI: Azure
• Inflection: Azure and CoreWeave
• Anthropic: AWS and Google Cloud
• Cohere: AWS and Google Cloud
• Hugging Face: AWS
• Stability AI: CoreWeave and AWS
• Character.ai: Google Cloud
• X.ai: Oracle
• NVIDIA: Azure
**How to get more GPU quota?**
The ultimate bottleneck is whether you can obtain an allocation of computing power from NVIDIA.
**• How does NVIDIA select customers?**
NVIDIA usually allocates a certain number of GPUs to each customer, and in this process **NVIDIA cares most about "who the end customer is". For example, Azure saying "we want to buy 10,000 H100s to support Inflection" leads to a different outcome from Azure saying "we want to buy 10,000 H100s for Azure itself".** If NVIDIA is interested in a particular end customer, the cloud company can receive additional GPU quota. NVIDIA therefore wants to know as much as possible who the end customers are, and it leans toward large enterprises or startups with strong backing.
Yes, that appears to be the case. NVIDIA likes to give GPU access to AI startups (many of which have close ties to NVIDIA). Inflection, an AI company NVIDIA has invested in, is testing a huge H100 cluster on CoreWeave.
—— A private cloud executive
If a cloud company brings an end customer to NVIDIA and indicates it is ready to purchase a certain quantity of H100s, and NVIDIA is interested in that end customer, NVIDIA will generally grant a quota for it; this effectively increases the cloud company's total allocation, because it is independent of the quota NVIDIA originally gave the cloud company.
NVIDIA allocating large capacity to private clouds is a special case: **CoreWeave has more H100s than GCP. NVIDIA is reluctant to allocate significant resources to companies that try to compete with it directly (AWS Inferentia and Trainium, Google TPUs, Azure's Project Athena).**
But at the end of the day, if you submit a purchase order and money to NVIDIA, commit to a bigger deal with more upfront funding, and indicate your low-risk profile, you're bound to get more GPU quota than anyone else.
05. Summary
Even though, as Sam Altman said, "the era of large models is coming to an end," for now we are still limited by GPUs. On the one hand, companies like OpenAI already have products with excellent PMF, such as ChatGPT, but because they are limited by GPUs they need to purchase large amounts of computing power. On the other hand, many teams are hoarding GPUs for the chance of working on LLMs in the future, regardless of whether they can create something like ChatGPT.
But there is no doubt that NVIDIA's right to speak will not be shaken.
At this stage, the LLM product with the best product-market fit (PMF) is ChatGPT. The following uses ChatGPT as an example to explain why GPUs are in short supply:
Because ChatGPT is so popular with users, its ARR (annual recurring revenue) may exceed 500 million US dollars;
ChatGPT runs on the API of GPT-4 and GPT-3.5;
The APIs of GPT-4 and GPT-3.5 require a GPU to run, and a large number of GPUs are required. OpenAI hopes to release more functions for ChatGPT and its API, but it cannot be realized due to the limited number of GPUs;
OpenAI purchased a large number of NVIDIA GPUs through Microsoft (Azure);
To manufacture the H100 SXM GPU, NVIDIA uses TSMC for fabrication, TSMC's CoWoS packaging technology, and HBM3 memory sourced mainly from SK Hynix.
In addition to OpenAI, many companies in the market are training their own large models. Setting aside how much of the LLM boom is a bubble and how likely PMF products are to emerge in the end, the LLM race has clearly pushed up the market's demand for GPUs. Moreover, some companies that don't yet need GPUs are stockpiling them in advance out of concern about the future. So it is a bit like "the expectation of a supply shortfall exacerbates the supply shortfall."
So another force driving up GPU demand is enterprises that want to build new LLMs or participate in AI in the future:
The importance of large models has become a consensus: mature enterprises hope to train LLMs on their own data and expect this to bring more business value, while startups hope to build their own LLMs and turn them into commercial value. Either way, GPUs are a hard requirement for training large models;
These enterprises talk to the large cloud vendors (Azure, Google Cloud, AWS), trying to obtain enough H100s;
In the process, they find that the cloud vendors don't have enough H100s to allocate, and that some have flawed network configurations, so CoreWeave, Oracle, Lambda, and FluidStack also become options for buying GPUs and owning them outright, and perhaps they also talk with OEMs and NVIDIA;
In the end, they got a lot of GPUs;
Now, they are trying to match their product to the market;
In case it wasn't already clear, the path is not easy: remember that OpenAI achieved product-market fit on a smaller model and then scaled it up. Now, to achieve product-market fit, you have to fit your users' use cases better than OpenAI's models do, and for that you will need more GPUs than OpenAI did to start.
**At least through the end of 2023, there will be shortages for enterprises deploying hundreds or thousands of H100s. By the end of 2023 the situation may become clearer, but for now it appears the GPU shortage could continue into 2024.**
References
Comment from a custom LLMs-for-enterprises startup founder
Message from someone at a cloud provider
Conversations with people at cloud companies and GPU providers
Tesla Q1 2023 (covers Jan 1 2023 to Mar 31 2023) earnings call
A comment from someone at a cloud company
A ballpark guesstimate from someone at a cloud company