[LIVE] [sergeykrin9/nvidia-nim-cascade] 2 endpoint(s)
ChatGPT
API Proxy/ChatGPT
2,187 characters
Status: [LIVE]
Endpoints found:
https://integrate.api.nvidia.com/v1/chat/completions
https://integrate.api.nvidia.com/v1/chat
Source: https://github.com/sergeykrin9/nvidia-nim-cascade
<img src="https://capsule-render.vercel.app/api?type=waving&color=0:1a1a1d,50:76B900,100:1a1a1d&height=200§ion=header&text=NVIDIA%20NIM%20Cascade&fontSize=48&fontColor=f5f1e8&fontAlignY=38&desc=9-model%20fallback%20chain%20·%20~360%20RPM%20·%20$0/month%20·%20Replaces%20OpenAI%20fallback&descSize=14&descAlignY=62" alt="NVIDIA NIM Cascade" />
<div align="center">
<p>
<a href="https://github.com/SergeyKrin9/nvidia-nim-cascade/stargazers"><img src="https://img.shields.io/github/stars/SergeyKrin9/nvidia-nim-cascade?style=for-the-badge&color=76B900&logo=github&labelColor=1a1a1d"></a>
<img src="https://img.shields.io/badge/license-MIT-76B900?style=for-the-badge&labelColor=1a1a1d">
<img src="https://img.shields.io/badge/Python-3.10%2B-76B900?style=for-the-badge&logo=python&logoColor=white&labelColor=1a1a1d">
<img src="https://img.shields.io/badge/NVIDIA%20NIM-free%20tier-76B900?style=for-the-badge&logo=nvidia&logoColor=white&labelColor=1a1a1d">
</p>
<sub>🐍 Python · ⚡ ~360 RPM · 💸 $0/mo · 🔁 9-model fallback · ✅ OpenAI-compatible API</sub>
</div>
> [!IMPORTANT]
> **Key insight that saved us $1,500/month.**
> NVIDIA NIM rate-limits **per MODEL, not per ACCOUNT**. Same endpoint, different `model` field = independent 40 RPM buckets. 9 different models → ~360 RPM at zero cost. Replaces OpenAI gpt-4o-mini fallback in production with zero downtime.
## The insight
Most teams that hit `meta/llama-3.1-70b-instruct` quickly run into 429 rate limits. They assume the limit is **per account**. It isn't.
**NVIDIA NIM enforces rate limits per-model.** Hitting the same OpenAI-compatible endpoint with a different `model` field gives you a fresh 40 RPM bucket. Less-popular slugs (Nemotron, gpt-oss-20b, mixtral) almost never 429.
So: a sequential cascade across 9 model slugs gives you **~360 RPM of free LLM capacity** with identical OpenAI-compatible code.
## Verified working models (May 2026)
API endpoint: `https://integrate.api.nvidia.com/v1` (Open