Last month we benchmarked libvirt and QEMU configurations across six GPU systems: H200, MI350X, RTX PRO 6000 Blackwell, 4090, 4080, and 5090. We wanted to see how much performance the defaults leave on the table. On the H200 host, default libvirt gave us 77 GB/s of memory bandwidth. The same hardware with vCPU pinning and guest NUMA topology exposure delivered 561 GB/s, a 7.3x gain from configuration alone. MI350X and RTX PRO 6000 showed the same shape with smaller magnitudes (2.8x and 3.2x). That kind of result tempts you to enable NUMA exposure on every guest. We'd have done the same a year ago. But when we ran NCCL benchmarks on the H200 with NUMA exposed and GPUs spanning multiple nodes, bandwidth for cross-node collectives dropped 57%. The setting that helps host memory access can quietly damage multi-GPU communication when GPU-to-node placement isn't deliberate. If you want one safe starting point, use vCPU pinning with a single guest NUMA node. That captures most of the latency improvement and sidesteps the NCCL penalty. Layer in NUMA exposure only after you've verified GPU placement and confirmed the workload is actually bandwidth-bound. Separately, enabling SMT siblings (the BIOS default on most systems) gained us only 2-3% of total CPU throughput while halving per-thread performance. Slawek's recommendation is to skip SMT for GPU workloads unless you're optimizing for high-density MIG or vGPU consolidation. Full walkthrough on our blog: https://lnkd.in/g_kHRifw #GPUcloud #NUMA #NCCL
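For the "vCPU pinning with a single guest NUMA node" starting point, here is a minimal sketch of what the libvirt domain XML fragment could look like, generated from Python. The `<vcpu>`, `<cputune>`/`<vcpupin>`, and `<numatune>`/`<memory>` elements are standard libvirt domain XML; the helper names `pinning_fragment` and `render`, the host CPU IDs, and the host node number are assumptions for illustration and are not taken from the benchmark setup. Leaving out a guest `<numa>` topology keeps the guest on a single NUMA node.

```python
import xml.etree.ElementTree as ET


def pinning_fragment(host_cpus, host_node):
    """Build libvirt <vcpu>, <cputune>, and <numatune> elements.

    Each guest vCPU is pinned 1:1 to one host CPU, and guest memory is
    bound to a single host NUMA node. host_cpus and host_node are
    placeholders: read the real values from your host topology.
    """
    vcpu = ET.Element("vcpu", placement="static")
    vcpu.text = str(len(host_cpus))

    cputune = ET.Element("cputune")
    for guest_id, host_id in enumerate(host_cpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(guest_id), cpuset=str(host_id))

    numatune = ET.Element("numatune")
    ET.SubElement(numatune, "memory", mode="strict", nodeset=str(host_node))
    return [vcpu, cputune, numatune]


def render(elements):
    """Serialize the elements so they can be pasted inside <domain>."""
    return "\n".join(ET.tostring(e, encoding="unicode") for e in elements)


if __name__ == "__main__":
    # Hypothetical host: CPUs 8-11 sit on NUMA node 1, next to the GPU.
    print(render(pinning_fragment([8, 9, 10, 11], host_node=1)))
```

Pick host CPUs from the same node the passed-through GPU is attached to; that placement is the part worth verifying before exposing more topology to the guest.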
CloudRift
Software Development
Santa Clara, California 1,451 followers
GPU control-plane and monetization for national telcos and regulated industries: virtualization, monetization, inference
About us
Building sovereign AI infrastructure enabling enterprises to deploy and operate GPU cloud environments with full control over data, security, and economics:
- Secure, multi-tenant GPU cloud platform for telcos, universities, and government organizations
- GPU virtualization & orchestration (virtualization via NV AI Enterprise or open-source stack, full AMD platform support via ROCm ecosystem; bare-metal deployments; K8s)
- Enterprise-grade isolation & access control (RBAC, SSH governance, audit logging, SOC 2 alignment)
- Billing, SKUs, and revenue-sharing infrastructure for GPU monetization
- LLM inference & deployment (vLLM, SGLang, custom model serving, custom kernel optimizations, prefill-decode disaggregation)
- Advanced inference scheduling (quotas, rate-limits, smart queues, auto-scaling)
- Storage, networking, and workload scheduling for high-performance, reliable deployments
- Website
- https://cloudrift.ai
- Industry
- Software Development
- Company size
- 11-50 employees
- Headquarters
- Santa Clara, California
- Type
- Privately Held
- Founded
- 2024
- Specialties
- cloud, software, distributed computing, machine learning, and infrastructure
Locations
- Primary: Santa Clara, California 95050, US
Updates
-
Running Llama 3 70B in FP16 on NVIDIA H100s is a coordination problem before it's a compute problem. With 140 GB of weights and 80 GB per GPU, you're already dealing with tensor parallelism, NVLink topology, a sharded KV cache, and inter-GPU traffic on the critical path. Two GPUs minimum, and the failure domain grows with every one you add. The #MI350X has 288 GB of HBM3e and 8 TB/s of bandwidth on a single accelerator. The same model fits on one GPU with 148 GB left for KV cache and long context. One machine, one process, one thing that can break. We have MI350X VMs, hosted by Grafica Veneta S.p.A., available on demand at $3.65/hour, with discounts for longer commitments. Hourly billing, no minimums, from a single accelerator up to 8-GPU configurations with 2.3 TB of HBM3e. If sharding is what's slowing your inference stack down, this is the GPU to try first: https://lnkd.in/gw2hK-c9 #AMDInstinct #ROCm #LLMinference
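The sizing arithmetic behind those numbers, as a minimal sketch. It counts weights only (FP16 at 2 bytes per parameter, with 1 GB taken as 10^9 bytes) and ignores KV cache, activations, and runtime overhead, so treat the GPU counts as lower bounds. The function names are illustrative.

```python
import math


def weight_gb(params_billion, bytes_per_param):
    """Weight footprint in GB: 1e9 params * bytes/param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def min_gpus(weights_gb, hbm_gb):
    """Fewest GPUs whose combined HBM holds the weights alone."""
    return math.ceil(weights_gb / hbm_gb)


def headroom_gb(weights_gb, hbm_gb, gpus):
    """HBM left over for KV cache and activations after loading weights."""
    return gpus * hbm_gb - weights_gb


if __name__ == "__main__":
    llama70b = weight_gb(70, 2)  # FP16
    for name, hbm in (("80 GB GPU", 80), ("288 GB GPU", 288)):
        n = min_gpus(llama70b, hbm)
        print(name, "needs", n, "GPU(s), headroom", headroom_gb(llama70b, hbm, n), "GB")
```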
-
V100 32GB VMs are now on CloudRift at $0.29 per GPU/hour, in partnership with Cato Digital. A few things worth knowing: The V100 is older silicon (Volta, launched 2017) but still capable. A single 32GB V100 fits a LoRA fine-tune of Llama 3 8B, Whisper Large inference, or a batch embedding pipeline. For workloads that do not require Hopper-class hardware, it is more than enough. The same GPU on AWS or Azure runs above $3 per GPU/hour, and the 32GB variant is usually only sold in 8-GPU bundles. CloudRift offers it as a single-GPU VM, billed by the second, with no minimums. The capacity comes from Cato Digital. Rather than commissioning new servers, they redeploy enterprise GPU systems retired from Meta and NVIDIA fleets, which gives the hardware a longer operational life and a smaller carbon footprint than buying new. If you are running fine-tuning, batch inference, rendering, or scientific compute on a budget, this is roughly a tenfold reduction in compute cost for workloads where it fits. Thanks to the Cato Digital team and Colin Murcray for the partnership! → https://lnkd.in/gPChGrQM #GPU #AI #SustainableAI #MLOps
-
Dmitry built a complete LLM compiler from scratch to document how a modern ML compiler stack works end to end. 5,000 lines, two weeks, no libraries, just pure Python and raw CUDA. The pipeline takes a PyTorch graph through six intermediate representations: Torch IR, Tensor IR, Loop IR, Tile IR, Kernel IR, CUDA. Each lowering moves closer to the hardware: decompose Torch ops, convert to loops and fuse, schedule kernels, render CUDA. GELU at seq=32 runs in 31 µs in eager PyTorch and 6 µs in our stack, a 4.87x speedup. Softmax sits at parity with eager. Matmul lands anywhere from 50% of NVIDIA cuBLAS performance to slightly above it, depending on shape. Vendor kernels are still hard to beat at full prefill on the FFN-width matmuls, which is why every production stack falls back to cuBLAS, cuDNN, and CUTLASS on the heavy hitters. https://lnkd.in/g6qbVdFv #CUDA #MLCompilers #PyTorch
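To make "decompose ops, convert to loops and fuse" concrete, here is a toy pure-Python illustration, not code from the compiler itself. It assumes the tanh approximation of GELU (the post does not say which GELU variant was benchmarked). The unfused version runs one full pass per primitive op and materializes every intermediate, which is roughly what op-by-op eager execution does; the fused version does all the math in a single loop, which is what a fused kernel does per element.

```python
import math

K = math.sqrt(2.0 / math.pi)  # constant in the tanh approximation of GELU


def gelu_unfused(xs):
    """One pass (and one temporary list) per primitive op."""
    cubed = [x * x * x for x in xs]
    inner = [x + 0.044715 * c for x, c in zip(xs, cubed)]
    scaled = [K * v for v in inner]
    t = [math.tanh(v) for v in scaled]
    return [0.5 * x * (1.0 + tv) for x, tv in zip(xs, t)]


def gelu_fused(xs):
    """Single pass: every intermediate stays in a local variable."""
    out = []
    for x in xs:
        inner = K * (x + 0.044715 * x * x * x)
        out.append(0.5 * x * (1.0 + math.tanh(inner)))
    return out


if __name__ == "__main__":
    sample = [-2.0, -0.5, 0.0, 0.5, 2.0]
    print(gelu_fused(sample))
```

On a GPU the temporaries in the unfused form become separate kernel launches and round trips through memory, which is where a fused kernel can save time on small shapes.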
-
Check out our new deep dive comparing open-source VFIO, NVIDIA AI Enterprise, and AMD GPU virtualization.
We wrote a deep dive comparing open-source VFIO, NVIDIA AI Enterprise, and AMD GPU virtualization. We were pleasantly surprised by how easy it was to use the AMD virtualization stack. It relies on modern SR-IOV virtualization and was by far the easiest to set up. https://lnkd.in/e_tUGr3U
-
Exciting collaboration with dstack: rent and orchestrate on-demand clusters of AMD MI350X! Go Team Red!
dstack 0.20.15 is out. Nice milestone for the AMD compute ecosystem: with dstack, you can now provision AMD MI350X GPUs via CloudRift, one of the first providers to offer them on-demand. Also in this release: ROCm 7.x compatibility and fixes for SSH fleets. Release notes: https://lnkd.in/gs5D4Acc
-
Want to optimize your LLM for your hardware? Check out our latest article to see how you can run your own benchmark for free in March.
I spent the past week tuning Qwen3 Coder on two prosumer GPUs, RTX 5090 and RTX PRO 6000.
The optimization process:
1. Framework selection (vLLM vs SGLang -- the winner depends on the GPU and quantization)
2. Finding max context length that fits VRAM without hurting throughput
3. Sweeping max concurrent requests to find the sweet spot between throughput and latency
Results:
- RTX 5090 (32GB): 1,157 tok/s with 114K context on vLLM
- PRO 6000 (96GB): 988 tok/s with 262K context on vLLM (up to 1,207 tok/s if you trade latency for throughput)
Everything is open source: recipes, benchmarking data, and the tool I built to run these sweeps (https://lnkd.in/eWfNXCbW).
---
Want to run your own benchmarks? I'm opening the infrastructure for free this month. I have GCP credits expiring in March and CloudRift GPUs available. GPUs include RTX 4090, RTX 5090, L40S, PRO 6000, H100, H200, B200, and AMD MI350X.
How to participate:
1. Fork cloudrift-ai/deplodock
2. Add your recipe to experiments/
3. Open a PR -- I'll run it and post results back
Join the Discord if you want help writing recipes or have questions. Let's build a proper open benchmark dataset together. https://lnkd.in/e-u4hEzK
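One way to turn a concurrency sweep into a "sweet spot" is to take the highest-throughput setting that still meets a latency budget. The sketch below assumes that rule; it is not how the tool linked above picks its settings, and `SweepPoint`, `pick_sweet_spot`, and the choice of p95 latency as the metric are hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SweepPoint:
    """One measured configuration from a max-concurrency sweep."""
    concurrency: int
    tokens_per_s: float
    p95_latency_s: float


def pick_sweet_spot(points: Iterable[SweepPoint],
                    latency_budget_s: float) -> Optional[SweepPoint]:
    """Return the highest-throughput point within the latency budget.

    Returns None when no point meets the budget, so the caller can
    decide whether to relax the budget or lower the context length.
    """
    within = [p for p in points if p.p95_latency_s <= latency_budget_s]
    if not within:
        return None
    return max(within, key=lambda p: p.tokens_per_s)
```

Raising the budget trades latency for throughput, which is the same trade-off behind the two PRO 6000 numbers in the post.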
-
AMD Instinct MI350X GPUs are available for rental! Built for the most demanding AI workloads:
- Up to 256 compute units with CDNA 4 architecture
- 288 GB HBM3E memory with up to 8 TB/s bandwidth
- 128 GB/s inter-GPU connectivity via Infinity Fabric
- SR-IOV virtualization support via MxGPU, enabling secure, near-native GPU sharing across multiple VMs with hardware-level isolation
- Full ROCm ecosystem support for PyTorch, JAX, and HPC frameworks
Whether you're running massive LLM training, multi-tenant inference, or HPC simulations, the MI350X delivers the memory capacity and bandwidth to handle models that simply don't fit elsewhere. Rent here: https://www.cloudrift.ai/
-
Check out our latest LLM Inference benchmark for B200 / H200 / H100 / RTX PRO 6000.
🫰 Who is the king of cost-effective inference? B200 vs H200 vs H100 vs RTX PRO 6000. We present an LLM inference throughput benchmark based on the vllm serve and vllm bench serve benchmarking tools, to understand the cost efficiency of the most popular NVIDIA GPUs. The Pro 6000 is significantly cheaper and has the latest Blackwell architecture, but it has slower GDDR (vs HBM) memory and lacks NVLink. The B200 has the best specs, but is it worth the premium price tag? https://lnkd.in/guABDz26
-
B200 vs H200 vs H100 vs RTX PRO 6000. Who is the king of cost-efficient inference?
Not long ago, NVIDIA's Blackwell architecture landed in datacenters with the B200 and RTX PRO 6000, promising major improvements in both performance and efficiency over the previous Hopper generation. But how do these gains translate to real-world LLM inference? I present an LLM inference throughput benchmark for RTX PRO 6000 SE vs H100 vs H200 vs B200, based on the vllm serve and vllm bench serve benchmarking tools. The Pro 6000 is significantly cheaper and has the latest Blackwell architecture, but it has slower GDDR (vs HBM) memory and lacks NVLink. The B200 has much better specs than all the others, but does the premium price tag make sense over more affordable alternatives? https://lnkd.in/guABDz26
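Cost efficiency in a comparison like this usually comes down to tokens generated per dollar: measured throughput divided by the hourly rental price. A minimal sketch of that metric follows, assuming throughput in tokens per second and price in dollars per GPU-hour. The function names are illustrative, and you would plug in your own measured throughput and current prices; no benchmark numbers are reproduced here.

```python
def tokens_per_dollar(tokens_per_s, price_per_hour):
    """Tokens generated for each dollar of rental time."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_s * 3600.0 / price_per_hour


def rank_by_cost_efficiency(results):
    """Sort GPU names from most to least tokens per dollar.

    results maps a GPU name to a (tokens_per_s, price_per_hour) pair.
    """
    return sorted(
        results,
        key=lambda name: tokens_per_dollar(*results[name]),
        reverse=True,
    )
```

A GPU with lower raw throughput can still come out on top under this metric if its hourly price is low enough, which is the question the benchmark sets out to answer.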