Calculating the Total Cost of a GPU Cluster

AI model training and inference at scale can be expensive. Enterprise buyers often evaluate cloud GPU providers on a cost-per-hour basis, zeroing in on the most visible line item: the GPUs. However, price per GPU-hour alone can be misleading. In practice, two cloud offerings with identical rates can produce materially different total cost of ownership once the full set of cost drivers behind training and inference is considered.

The calculations included in this paper were meant to provide real-world data backing up the intuition that users and providers have built over the years on the importance of running reliable, performant, and easy-to-use clusters. In other words, even in scenarios where pricing per GPUhour is equal, there are always hidden costs across storage, network, control plane, support, goodput, setup, and debugging expenses. We demonstrate that in three real-world scenarios, AWS can be 9% to 113% more expensive on a TCO-adjusted basis versus Nebius when using real-world pricing or holding GPU-hr pricing constant. We also demonstrate that silver-tier neoclouds can be 4% to 8% more expensive when holding GPU-hr pricing equal.

Complete this form to
download the book

TCO SemiAnalysis Whitepaper & Webinar