CloudRift

CloudRift · 2026-02-11T03:28:36.462Z

B200 vs H200 vs H100 vs RTX PRO 6000. Who is the king of cost-efficient inference?

Software Development

Santa Clara, California 1,451 followers

GPU control-plane and monetization for national telcos and regulated industries: virtualization, monetization, inference

View all 7 employees

About us

Building sovereign AI infrastructure enabling enterprises to deploy and operate GPU cloud environments with full control over data, security, and economics:
- Secure, multi-tenant GPU cloud platform for telcos, universities, and government organizations
- GPU virtualization & orchestration (virtualization via NV AI Enterprise or open-source stack, full AMD platform support via ROCm ecosystem; bare-metal deployments; K8s)
- Enterprise-grade isolation & access control (RBAC, SSH governance, audit logging, SOC 2 alignment)
- Billing, SKUs, and revenue-sharing infrastructure for GPU monetization
- LLM inference & deployment (vLLM, SGLang, custom model serving, custom kernel optimizations, prefill-decode disaggregation)
- Advanced inference scheduling (quotas, rate-limits, smart queues, auto-scaling)
- Storage, networking, and workload scheduling for high-performance, reliable deployments

Website: https://cloudrift.ai
Industry: Software Development
Company size: 11-50 employees
Headquarters: Santa Clara, California
Type: Privately Held
Founded: 2024
Specialties: cloud, software, distributed computing, machine learning, and infrastructure

Locations

Primary

Santa Clara, California 95050, US

Updates

CloudRift

1,451 followers
22h
Last month we benchmarked libvirt and QEMU configurations across six GPU systems: H200, MI350X, RTX PRO 6000 Blackwell, 4090, 4080, and 5090. We wanted to see how much performance the defaults are leaving on the table.

On the H200 host, default libvirt gave us 77 GB/s of memory bandwidth. The same hardware with vCPU pinning and guest NUMA topology exposure delivered 561 GB/s, a 7.3x gain from configuration alone. MI350X and RTX PRO 6000 showed the same shape with smaller magnitudes (2.8x and 3.2x).

That kind of result tempts you to enable NUMA exposure on every guest. We'd have done the same a year ago. But when we ran NCCL benchmarks on the H200 with NUMA exposed and GPUs spanning multiple nodes, cross-node collectives dropped 57% in bandwidth. The setting that helps host memory access can quietly damage multi-GPU communication when GPU-to-node placement isn't deliberate.

If you want one safe starting point, use vCPU pinning with a single guest NUMA node. That captures most of the latency improvement and sidesteps the NCCL penalty. Layer in NUMA exposure only after you've verified GPU placement and confirmed the workload is actually bandwidth-bound.

Separately, SMT siblings (enabled by default at the BIOS level on most systems) gained us 2-3% of total CPU throughput while halving per-thread performance. Slawek's recommendation is to skip SMT for GPU workloads unless you're optimizing for high-density MIG or vGPU consolidation.

Full walkthrough on our blog: https://lnkd.in/g_kHRifw

#GPUcloud #NUMA #NCCL AMD
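
As a rough illustration of the "safe starting point" described in this post (vCPU pinning with a single guest NUMA node), the sketch below generates the `<cputune>` fragment of a libvirt domain XML. The helper name `build_cputune` and the host core IDs 8-15 are assumptions made up for this example; they do not come from the post or the blog. `<cputune>` and `<vcpupin>` are standard libvirt domain XML elements, and a guest with no `<numa>` cells defined under `<cpu>` sees a single NUMA node.

```python
import xml.etree.ElementTree as ET


def build_cputune(host_cpus):
    """Return a libvirt <cputune> element that pins vCPU i to host_cpus[i]."""
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in enumerate(host_cpus):
        # One <vcpupin> per guest vCPU, each bound to exactly one host core.
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))
    return cputune


# Hypothetical layout: eight host cores on the NUMA node local to the GPU.
host_cpus = list(range(8, 16))
xml_text = ET.tostring(build_cputune(host_cpus), encoding="unicode")
print(xml_text)
```

Which host cores are actually local to a given GPU depends on the machine, so the core list would need to be checked against the real host topology before using a fragment like this.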

GPU VM Performance: Do vCPU Pinning and NUMA Topology Really Matter? | CloudRift Blog cloudrift.ai

CloudRift

1,451 followers
5d
Running Llama 3 70B in FP16 on NVIDIA H100s is a coordination problem before it's a compute problem. With 140 GB of weights and 80 GB per GPU, you're already dealing with tensor parallelism, NVLink topology, sharded KV cache, and inter-GPU traffic on the critical path. Two GPUs minimum, and the failure domain grows with every one you add.

The #MI350X has 288 GB of HBM3e and 8 TB/s of bandwidth on a single accelerator. The same model fits on one GPU with 148 GB left for KV cache and long context. One machine, one process, one thing that can break.

We have MI350X VMs, hosted by Grafica Veneta S.p.A., available on demand at $3.65/hour, with discounts for longer commitments. Hourly billing, no minimums, from a single accelerator up to 8-GPU configurations with 2.3 TB of HBM3e.

If sharding is what's slowing your inference stack down, this is the GPU to try first: https://lnkd.in/gw2hK-c9

#AMDInstinct #ROCm #LLMinference
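
The memory arithmetic behind this post can be checked in a few lines. This is a back-of-the-envelope sketch that assumes a round 70 billion parameters at 2 bytes each (FP16) and counts weights only; the helper `min_gpus` is a name made up for this example, and real deployments also need room for KV cache and activations.

```python
import math

PARAMS = 70e9          # approximate parameter count of Llama 3 70B
BYTES_PER_PARAM = 2    # FP16 stores each weight in 2 bytes

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9


def min_gpus(weights_gb, hbm_gb):
    """Fewest GPUs whose combined memory holds the weights alone."""
    return math.ceil(weights_gb / hbm_gb)


h100_gpus = min_gpus(weights_gb, 80)      # 80 GB per H100
mi350x_gpus = min_gpus(weights_gb, 288)   # 288 GB per MI350X
mi350x_headroom_gb = 288 - weights_gb     # left over for KV cache and context

print(weights_gb, h100_gpus, mi350x_gpus, mi350x_headroom_gb)
# prints: 140.0 2 1 148.0
```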

AMD Instinct MI350X • 288GB Enterprise GPU | CloudRift cloudrift.ai

CloudRift

1,451 followers
1w Edited
V100 32GB VMs are now on CloudRift at $0.29 per GPU/hour, in partnership with Cato Digital. A few things worth knowing:

The V100 is older silicon (Volta, launched 2017) but still capable. A single 32GB V100 fits a LoRA fine-tune of Llama 3 8B, Whisper Large inference, or a batch embedding pipeline. For workloads that do not require Hopper-class hardware, it is more than enough.

The same GPU on AWS or Azure runs above $3 per GPU/hour, and the 32GB variant is usually only sold in 8-GPU bundles. CloudRift offers it as a single-GPU VM, billed by the second, with no minimums.

The capacity comes from Cato Digital. Rather than commissioning new servers, they redeploy enterprise GPU systems retired from Meta and NVIDIA fleets, which gives the hardware a longer operational life and a smaller carbon footprint than buying new.

If you are running fine-tuning, batch inference, rendering, or scientific compute on a budget, this is roughly a tenfold reduction in compute cost for workloads where it fits.

Thanks to the Cato Digital team and Colin Murcray for the partnership! → https://lnkd.in/gPChGrQM

#GPU #AI #SustainableAI #MLOps

CloudRift

1,451 followers
2w
Dmitry built a complete LLM compiler from scratch to document how a modern ML compiler stack works end to end. 5,000 lines, two weeks, no library use, just pure Python and raw CUDA.

The pipeline takes a PyTorch graph through six intermediate representations: Torch IR, Tensor IR, Loop IR, Tile IR, Kernel IR, CUDA. Each lowering moves closer to the hardware: decompose Torch ops, convert to loops and fuse, schedule kernels, render CUDA.

GELU at seq=32 runs 31 µs in eager PyTorch and 6 µs in our stack, a 4.87x speedup. Softmax sits at parity with eager. Matmul lands anywhere from 50% of NVIDIA cuBLAS performance to slightly above it, depending on shape.

Vendor kernels are still hard to beat at full prefill on the FFN-width matmuls, which is why every production stack falls back to cuBLAS, cuDNN, and CUTLASS on the heavy hitters.

https://lnkd.in/g6qbVdFv

#CUDA #MLCompilers #PyTorch
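
To make the "decompose Torch ops, convert to loops and fuse" step a little more concrete, here is a pure-Python sketch. It is not code from the compiler described in this post. It assumes the exact erf-based GELU definition, GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), and the function names are made up for this example. The unfused version builds one temporary list per primitive op, while the fused version does the same math in a single loop, which is the kind of rewrite that cuts memory traffic in a generated kernel.

```python
import math


def gelu_unfused(xs):
    """GELU as a chain of primitive elementwise ops, one temporary per op."""
    scaled = [x / math.sqrt(2.0) for x in xs]
    erfs = [math.erf(s) for s in scaled]
    shifted = [1.0 + e for e in erfs]
    return [0.5 * x * s for x, s in zip(xs, shifted)]


def gelu_fused(xs):
    """The same computation in a single loop with no intermediate lists."""
    inv_sqrt2 = 1.0 / math.sqrt(2.0)
    return [0.5 * x * (1.0 + math.erf(x * inv_sqrt2)) for x in xs]


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
unfused_out = gelu_unfused(inputs)
fused_out = gelu_fused(inputs)
print(fused_out)
```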

A Principled ML Compiler Stack in 5,000 Lines of Python cloudrift.ai

CloudRift

1,451 followers
1mo
Check out our new deep dive comparing open-source VFIO, NVIDIA AI Enterprise, and AMD GPU virtualization.

Dmitry Trifonov
1mo

We wrote a deep dive comparing open-source VFIO, NVIDIA AI Enterprise, and AMD GPU virtualization. We were pleasantly surprised by how easy it was to use the AMD virtualization stack. It relies on modern SR-IOV virtualization and was by far the easiest to set up. https://lnkd.in/e_tUGr3U

GPU Virtualization with VFIO, NVAI Enterprise, and AMD SR-IOV itnext.io

CloudRift

1,451 followers
1mo
Exciting collaboration with dstack: rent and orchestrate on-demand clusters of AMD MI350X! Go Team Red!
dstack

2,921 followers
1mo

dstack 0.20.15 is out. Nice milestone for the AMD compute ecosystem: with dstack, you can now provision AMD MI350X GPUs via CloudRift, one of the first providers to offer them on-demand. Also in this release: ROCm 7.x compatibility and fixes for SSH fleets. Release notes: https://lnkd.in/gs5D4Acc
CloudRift

1,451 followers
2mo
Want to optimize your LLM for your hardware? Check out our latest article to see how you can run your own benchmark for free in March.

Dmitry Trifonov
2mo

I spent the past week tuning Qwen3 Coder on two prosumer GPUs, RTX 5090 and RTX PRO 6000.

The optimization process:
1. Framework selection (vLLM vs SGLang -- the winner depends on the GPU and quantization)
2. Finding max context length that fits VRAM without hurting throughput
3. Sweeping max concurrent requests to find the sweet spot between throughput and latency

Results:
- RTX 5090 (32GB): 1,157 tok/s with 114K context on vLLM
- PRO 6000 (96GB): 988 tok/s with 262K context on vLLM (up to 1,207 tok/s if you trade latency for throughput)

Everything is open source: recipes, benchmarking data, and the tool I built to run these sweeps (https://lnkd.in/eWfNXCbW).

---

Want to run your own benchmarks? I'm opening the infrastructure for free this month. I have GCP credits expiring in March and CloudRift GPUs available. GPUs include RTX 4090, RTX 5090, L40S, PRO 6000, H100, H200, B200, and AMD MI350X.

How to participate:
1. Fork cloudrift-ai/deplodock
2. Add your recipe to experiments/
3. Open a PR -- I'll run it and post results back

Join the Discord if you want help writing recipes or have questions. Let's build a proper open benchmark dataset together. https://lnkd.in/e-u4hEzK
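
The third optimization step in this post, sweeping max concurrent requests to balance throughput against latency, could be scored along the following lines. This is a minimal sketch and not the deplodock tool: the `SweepPoint` fields, the `pick_sweet_spot` helper, the latency budget, and all the numbers are made up for illustration and are not measured results.

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    max_concurrency: int    # concurrent requests allowed during the run
    tokens_per_s: float     # aggregate output throughput
    p95_latency_s: float    # 95th-percentile end-to-end request latency


def pick_sweet_spot(points, latency_budget_s):
    """Highest-throughput point whose latency stays within budget, else None."""
    eligible = [p for p in points if p.p95_latency_s <= latency_budget_s]
    return max(eligible, key=lambda p: p.tokens_per_s, default=None)


# Illustrative sweep only: throughput rises with concurrency, latency rises faster.
sweep = [
    SweepPoint(8, 400.0, 2.0),
    SweepPoint(16, 700.0, 3.5),
    SweepPoint(32, 950.0, 6.0),
    SweepPoint(64, 1000.0, 12.0),
]

best = pick_sweet_spot(sweep, latency_budget_s=8.0)
print(best)
```

With an 8-second budget the sketch selects the 32-request point. The 64-request point has slightly higher throughput but breaks the budget, which is the same latency-for-throughput trade the results in this post mention.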

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 itnext.io

CloudRift

1,451 followers
2mo
AMD Instinct MI350X GPUs are available for rental! Built for the most demanding AI workloads:
- Up to 256 compute units with CDNA 4 architecture
- 288 GB HBM3E memory with up to 8 TB/s bandwidth
- 128 GB/s inter-GPU connectivity via Infinity Fabric
- SR-IOV virtualization support via MxGPU, enabling secure, near-native GPU sharing across multiple VMs with hardware-level isolation
- Full ROCm ecosystem support for PyTorch, JAX, and HPC frameworks

Whether you're running massive LLM training, multi-tenant inference, or HPC simulations, the MI350X delivers the memory capacity and bandwidth to handle models that simply don't fit elsewhere.

Rent here: https://www.cloudrift.ai/

Rent Datacenter GPUs for AI & ML • Cloudrift cloudrift.ai

CloudRift

1,451 followers
3mo
Check out our latest LLM Inference benchmark for B200 / H200 / H100 / RTX PRO 6000.

Dmitry Trifonov
3mo

🫰 Who is the king of cost-effective inference? B200 vs H200 vs H100 vs RTX PRO 6000. We present an LLM inference throughput benchmark based on the vllm serve and vllm bench serve benchmarking tools, to understand the cost efficiency of the most popular NVIDIA GPUs. The PRO 6000 is significantly cheaper and has the latest Blackwell architecture, but it has slower GDDR (vs HBM) memory and lacks NVLink. The B200 has the best specs, but is it worth the premium price tag? https://lnkd.in/guABDz26
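
One simple way to put "cost efficiency" into a single number is tokens generated per dollar of rental time. The sketch below assumes that metric; the post does not say which formula the benchmark uses, and the function name, throughputs, and prices are placeholders made up for this example, not benchmark results.

```python
def tokens_per_dollar(tokens_per_s, price_per_hour_usd):
    """Tokens produced per dollar, given throughput and an hourly rental price."""
    return tokens_per_s * 3600.0 / price_per_hour_usd


# Placeholder inputs: a cheaper, slower GPU versus a faster, pricier one.
cheap_gpu = tokens_per_dollar(tokens_per_s=1000.0, price_per_hour_usd=2.0)
fast_gpu = tokens_per_dollar(tokens_per_s=3000.0, price_per_hour_usd=8.0)

print(cheap_gpu, fast_gpu)
# prints: 1800000.0 1350000.0
```

In this made-up case the faster GPU has three times the throughput but four times the price, so the cheaper GPU comes out ahead on tokens per dollar. That is the kind of trade-off the benchmark is meant to settle with measured numbers.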

Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000 medium.com

CloudRift

1,451 followers
3mo
B200 vs H200 vs H100 vs RTX PRO 6000. Who is the king of cost-efficient inference?

Dmitry Trifonov
3mo

NVIDIA’s Blackwell architecture recently landed in datacenters with the B200 and RTX PRO 6000, promising major improvements in both performance and efficiency over the previous Hopper generation. But how do these gains translate to real-world LLM inference? I present an LLM inference throughput benchmark for RTX PRO 6000 SE vs H100 vs H200 vs B200, based on the vllm serve and vllm bench serve benchmarking tools. The PRO 6000 is significantly cheaper and has the latest Blackwell architecture, but it has slower GDDR (vs HBM) memory and lacks NVLink. The B200 has much better specs than all the others, but does the premium price tag make sense over more affordable alternatives? https://lnkd.in/guABDz26

Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000 medium.com

Funding

CloudRift 1 total round

Last Round

Seed Apr 24, 2025

US$ 2.8M

Investors

Surface Ventures

Employees at CloudRift

Dmitry Trifonov

Natalia Trifonova

Heiko Polinski

Daksh Kaushik