Running Llama 3 70B in FP16 on NVIDIA H100s is a coordination problem before it's a compute problem. With 140 GB of weights and 80 GB per GPU, you're already dealing with tensor parallelism, NVLink topology, a sharded KV cache, and inter-GPU traffic on the critical path. Two GPUs minimum, and the failure domain grows with every one you add.

The #MI350X has 288 GB of HBM3e and 8 TB/s of bandwidth on a single accelerator. The same model fits on one GPU with 148 GB left for KV cache and long context. One machine, one process, one thing that can break.

We have MI350X VMs, hosted by Grafica Veneta S.p.A., available on demand at $3.65/hour, with discounts for longer commitments. Hourly billing, no minimums, from a single accelerator up to 8-GPU configurations with 2.3 TB of HBM3e.

If sharding is what's slowing your inference stack down, this is the GPU to try first: https://lnkd.in/gw2hK-c9

#AMDInstinct #ROCm #LLMinference
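
The memory arithmetic in the post can be checked with a quick back-of-the-envelope sketch. This is a rough estimate only: it assumes 2 bytes per parameter for FP16, uses decimal gigabytes, and counts weights alone, ignoring activations, KV cache, and runtime overhead. The function names are illustrative and not part of any library.

```python
import math

def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB; FP16 assumed as 2 bytes per parameter."""
    return params_billion * bytes_per_param

def min_gpus(weights_gb: float, hbm_gb: float) -> int:
    """Fewest GPUs whose combined HBM can hold the weights alone."""
    return math.ceil(weights_gb / hbm_gb)

weights = weight_gb(70)                 # Llama 3 70B in FP16
h100_count = min_gpus(weights, 80)      # 80 GB per H100
mi350x_count = min_gpus(weights, 288)   # 288 GB per MI350X
headroom_gb = 288 - weights             # left for KV cache on one MI350X
eight_gpu_tb = 8 * 288 / 1000           # total HBM3e in an 8-GPU configuration

print(f"weights: {weights:.0f} GB")
print(f"H100s needed (weights only): {h100_count}")
print(f"MI350Xs needed (weights only): {mi350x_count}")
print(f"headroom on one MI350X: {headroom_gb:.0f} GB")
print(f"8x MI350X HBM3e: {eight_gpu_tb:.1f} TB")
```

In practice the usable headroom is smaller than this estimate, since the serving runtime and activations also occupy HBM.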
CloudRift’s Post