AMD Instinct MI350X • 288GB Enterprise GPU | CloudRift

CloudRift’s Post

1,451 followers

Running Llama 3 70B in FP16 on NVIDIAs H100s is a coordination problem before it's a compute problem. 140 GB of weights, 80 GB per GPU -> so you're already in tensor parallelism, NVLink topology, sharded KV cache, and inter-GPU traffic on the critical path. Two GPUs minimum, and the failure domain grows with every one you add. The #MI350X has 288 GB of HBM3e and 8 TB/s of bandwidth on a single accelerator. The same model fits on one GPU with 148 GB left for KV cache and long context. One machine, one process, one thing that can break. We have MI350X VMs, hosted by Grafica Veneta S.p.A., available on demand at $3.65/hour, with discounts for longer commitments. Hourly billing, no minimums, from a single accelerator up to 8-GPU configurations with 2.3 TB of HBM3e. If sharding is what's slowing your inference stack down, this is the GPU to try first: https://lnkd.in/gw2hK-c9 #AMDInstinct #ROCm #LLMinference

AMD Instinct MI350X • 288GB Enterprise GPU | CloudRift cloudrift.ai

To view or add a comment, sign in

CloudRift’s Post

Explore content categories