NVIDIA shipped Nemotron 3 in three sizes: Nano (30B), Super (120B), and Ultra (550B). All three are open-weight MoE models trained for tool use and long context. Tarsk already lists Nemotron 3 endpoints from nine providers, so you can point an agent at one without writing a custom integration.
The model line
Nemotron 3 targets agent workloads: multi-step tool calls, long files in context, retries when a command fails. NVIDIA released weights, training data (where redistribution is allowed), and recipes under the Open Model License. You can self-host or call a hosted endpoint.
The backbone mixes Mamba state-space layers with Transformer attention in a Mixture-of-Experts layout. Each forward pass activates a small slice of total parameters. Nano runs about 3B of 30B. Ultra runs about 55B of 550B.
Nano, Super, Ultra
- Nano (30B total, ~3B active) fits tight inference budgets. NVIDIA reports 4× higher throughput than Nemotron 2 Nano on comparable hardware. Good fit for debug loops, summarization, and retrieval-heavy agents.
- Super (120B total, ~12B active) adds LatentMoE routing, Multi-Token Prediction, and NVFP4 training. NVIDIA positions it for multi-agent setups like ticket routing and IT automation.
- Ultra (550B total, ~55B active) landed at Computex 2026. Same MoE tricks at frontier scale. NVIDIA cites up to 5× throughput versus the prior generation and roughly 30% lower operating cost on Blackwell hardware.
All three tiers accept up to 1M tokens of context. Checkpoints and training artifacts are on Hugging Face. Super and Ultra also ship NVFP4-quantized variants tuned for Blackwell GPUs.
What changes for your agents
Chat billing counts tokens per turn. Agent billing counts tokens across every tool call, planning step, and retry. Nemotron 3 optimizes for that second pattern.
- MoE routing limits active parameters per token so a 120B model does not behave like a dense 120B model at inference time.
- Multi-Token Prediction generates several tokens per step, which speeds long outputs and supports speculative decoding without a separate draft model.
- Mamba layers scale context length closer to linear cost than standard attention, which matters when you paste a whole repo into the prompt.
- NeMo Gym and NeMo RL give you post-training environments if you need a domain-specific agent on top of the base checkpoints.
Providers in Tarsk
Tarsk syncs provider catalogs on a schedule. When OpenRouter or Together AI lists a new Nemotron endpoint, it shows up under Settings → Models without waiting for a Tarsk release.
First-party and local
- NVIDIA at build.nvidia.com: Nano, Super, Ultra, plus content-safety and embedding variants.
- Ollama Cloud:
nemotron-3-nano:30bandnemotron-3-super.
Ultra (recent catalog additions)
- OpenRouter:
nvidia/nemotron-3-ultra-550b-a55b, a free tier, and a content-safety model. - Together AI:
nvidia/nemotron-3-ultra-550b-a55b. - OpenCode Zen:
nemotron-3-ultra-free(replaced the deprecated Super free tier).
Nano and Super
- Kilo Gateway: Nano 30B, Super 120B (paid and free), Nano Omni reasoning.
- NanoGPT: Super with a thinking variant, Nano, Nano Omni reasoning.
- Nebius Token Factory: Nano Omni, Super 120B, Nano 30B.
- Cortecs:
nemotron-3-super-120b-a12b.
Try it in Tarsk
- Add a provider key under Settings → Providers (NVIDIA, OpenRouter, or Together AI are the fastest paths to Ultra).
- Open Settings → Models, search
nemotron, enable the tier you want. - Pick that model in your thread and send a task with tools turned on.
Start with Nano if you care about cost per step. Move to Super for production agent fleets. Use Ultra when a task needs long reasoning chains and you have the budget for 550B-class inference.
Summary
Run a Nemotron agent
Download Tarsk, add a provider key, and enable a Nemotron model in your next thread.