AI Infrastructure · Developer Tooling
NVIDIA GTC 2026: Vera Rubin Platform, Dynamo 1.0, and the Agentic Scaling Phase
Seven co-designed chips, five rack types, and a new inference operating system — Jensen Huang reframes AI factories as the unit of compute for the agentic era
NVIDIA's annual GTC developer conference ran March 16–19 in San Jose, and Jensen Huang's keynote contained a density of announcements that warrants careful disaggregation. The headline hardware announcement was the Vera Rubin platform — the successor to Blackwell, architected from the ground up for inference and agentic workloads rather than the training throughput that characterised earlier GPU generations. The platform comprises seven co-designed chips in full production: the Rubin GPU, the Vera CPU, the Groq 3 LPU (from the $20 billion Groq acquisition completed in December), the NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and the Spectrum-6 Ethernet switch. The NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs on NVLink 6, and achieves 10× higher inference throughput per watt at one-tenth the cost per token versus Blackwell — a figure that, if it holds under production workloads, fundamentally alters the economics of large-context LLM serving.
The Groq 3 LPU integration is the most architecturally distinctive element of the platform. The chip carries 500 MB of on-die SRAM and achieves 150 TB/s of internal bandwidth — a design philosophy directly opposed to NVIDIA's HBM-centric memory architecture. Where HBM stacks provide high external bandwidth to a GPU die that caches only a fraction of model weights on-die, Groq's LPU architecture aims to hold the entire KV cache for a running context in on-chip SRAM, eliminating the HBM round-trip for cache reads during auto-regressive decoding. In practice this matters most for long-context inference — the multi-step reasoning chains that define agentic workflows — where the KV cache for a 128k-token context window, stored at the model's weight precision, can exceed the on-chip SRAM capacity of conventional GPUs, forcing expensive HBM accesses on every decode step. NVIDIA is positioning the Groq 3 LPX rack deployed alongside the NVL72 as a specialised decode accelerator, with the NVL72 handling prefill.
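The arithmetic behind that claim is easy to reproduce. A minimal sketch, using illustrative dimensions for a hypothetical 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, 16-bit values — none of these figures come from NVIDIA or Groq):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Hypothetical 70B-class model with grouped-query attention (illustrative numbers)
per_token = kv_cache_bytes(1, n_layers=80, n_kv_heads=8, head_dim=128)
total = kv_cache_bytes(128_000, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{per_token / 1024:.0f} KiB per token")        # 320 KiB per token
print(f"{total / 2**30:.1f} GiB for a 128k context")  # 39.1 GiB for a 128k context
```

At these sizes the full 128k cache dwarfs not only the tens of megabytes of on-chip cache on a conventional GPU but also the LPU's 500 MB SRAM — which is presumably why the LPX is pitched as a decode accelerator paired with NVL72 racks rather than a standalone replacement.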
$1T: Jensen Huang's projected Blackwell + Vera Rubin purchase orders through 2027
10×: NVL72 inference throughput per watt vs Blackwell
150 TB/s: Groq 3 LPU internal bandwidth
336B: Rubin GPU transistor count, with 288 GB HBM4
On the software side, NVIDIA announced general availability of Dynamo 1.0, described as an "operating system for AI factories." The framing is deliberate: Dynamo manages the scheduling, routing, and resource allocation of inference requests across a heterogeneous pool of NVL72, Groq LPX, and Vera CPU racks, applying disaggregated prefill-decode scheduling, KV cache migration between racks, and SLO-aware request routing. The KV cache migration capability — moving cached context representations from one physical rack to another without re-computing the prefill — is the critical enabling mechanism for multi-agent workflows where different specialist sub-agents may be hosted on different hardware. Without cross-rack KV cache portability, every agent hand-off would require a full prefill re-compute, incurring both latency and cost that make complex multi-agent pipelines economically prohibitive at scale.
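Dynamo's scheduler is not public in detail, but the disaggregation idea itself can be sketched in a few lines. Everything below — the `Rack` class, the pool names, and the string stand-in for a migrated KV cache — is illustrative, not Dynamo's API:

```python
from dataclasses import dataclass

@dataclass
class Rack:
    """Toy stand-in for an inference rack. Illustrative only."""
    name: str
    queue_depth: int = 0

    def prefill(self, req):
        self.queue_depth += 1
        req["kv_cache"] = f"kv@{self.name}"  # pretend: materialised KV cache
        return req

    def decode(self, req):
        self.queue_depth += 1
        # The KV cache arrives with the request instead of being recomputed here.
        return f"decoded {req['id']} using {req['kv_cache']} on {self.name}"

def route(req, prefill_pool, decode_pool):
    # Prefill is compute-bound: run it once, on the least-loaded prefill rack.
    if req.get("kv_cache") is None:
        min(prefill_pool, key=lambda r: r.queue_depth).prefill(req)
    # Decode is bandwidth-bound: because the cache migrates with the request,
    # any decode rack can continue the context without a prefill re-compute.
    return min(decode_pool, key=lambda r: r.queue_depth).decode(req)

prefill_pool = [Rack("nvl72-a"), Rack("nvl72-b")]
decode_pool = [Rack("lpx-a"), Rack("lpx-b")]
out = route({"id": "req-1", "kv_cache": None}, prefill_pool, decode_pool)
print(out)  # decoded req-1 using kv@nvl72-a on lpx-a
```

The point of the sketch is the hand-off: if `kv_cache` could not travel between pools, every call to `route` after an agent hand-off would have to re-run `prefill` from scratch.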
For developers building on top of the Vera Rubin platform, the practical entry point is the updated NVIDIA Agent Toolkit and the NemoClaw integration with the OpenClaw agent platform. OpenClaw — the open-source autonomous agent framework that surged in early 2026 — is now receiving first-party NVIDIA support through the NemoClaw stack, including CUDA-accelerated tool execution and sandboxed sub-agent orchestration. Jensen Huang also previewed Feynman, the next-generation architecture beyond Vera Rubin, though details were sparse — largely a roadmap signal that NVIDIA's hardware cadence extends well into 2028. Among the first cloud providers expected to deploy Vera Rubin NVL72 instances later in 2026 are AWS, Google Cloud, Microsoft Azure, and OCI.
Developer Tooling · AI Coding
Next.js 16.2, Cursor Composer 2, and the OpenAI–Astral Acquisition: Three Signals in the AI Coding Stack
An 87% faster dev server, a fine-tuned Kimi K2.5 coding model priced 86% below its predecessor, and a Python toolchain acquisition that mirrors Anthropic's Bun play
Three announcements this week collectively define the current competitive topology of AI-assisted software development. Taken individually, each is a meaningful product update. Read together, they reveal a structural pattern: every layer of the developer toolchain — the frontend framework, the IDE-native coding model, and the runtime/tooling substrate — is being pulled simultaneously toward tighter AI integration, lower inference cost, and vertical control by the major AI labs.
Next.js 16.2, released on March 18, leads with a number that earns its headline: dev server startup is roughly 87% faster than 16.1, translating to approximately a 4× time-to-URL improvement on the default application. The mechanism is Turbopack becoming lazier about compilation in the productive sense — deferring module compilation until the browser requests a specific route, rather than eagerly compiling the entire dependency graph on startup. For large Next.js applications with hundreds of routes and heavy server component trees, this makes the difference between a 30-second wait and a 4-second iteration loop. The rendering side of the release is equally significant: a contribution to React itself replaces the JSON.parse reviver approach to Server Components payload deserialisation with a two-step plain parse plus JavaScript walk, eliminating costly C++/JS boundary crossings and delivering 25–60% faster HTML rendering in real applications. The improvement is non-breaking and requires no configuration changes — upgrading the Next.js and React versions is sufficient.
npx @next/codemod@canary upgrade latest
npx create-next-app@latest my-app
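The reviver-versus-two-step distinction generalises beyond React. As an analogy only — Python's `json.loads(object_hook=...)` plays the role of a `JSON.parse` reviver, and the payload shape here is invented — both approaches produce the same tree; the React change is about where the per-node work runs, not what it computes:

```python
import json

payload = '{"type": "div", "children": [{"type": "span", "text": "hi"}]}'

# Reviver-style: a hook fires for every object *during* parsing.
def hook(obj):
    if "type" in obj:
        obj["hydrated"] = True
    return obj

hydrated_a = json.loads(payload, object_hook=hook)

# Two-step style: a plain parse first, then a single walk over the tree.
def walk(node):
    if isinstance(node, dict):
        if "type" in node:
            node["hydrated"] = True
        for v in node.values():
            walk(v)
    elif isinstance(node, list):
        for v in node:
            walk(v)
    return node

hydrated_b = walk(json.loads(payload))
assert hydrated_a == hydrated_b
```

In CPython both paths stay in one runtime, so the analogy is structural only; the React win comes from the JS engine's plain `JSON.parse` fast path avoiding a C++/JS boundary crossing per node.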
The AI improvements in 16.2 deserve specific attention: the release introduces AGENTS.md scaffolding in create-next-app (providing structured context that coding agents can consume when modifying the project), browser log forwarding to the development terminal (so an agent running in a headless context can observe client-side errors without browser access), and experimental Next.js DevTools MCP integration for AI agents to query route metadata, component trees, and build diagnostics directly. These are not cosmetic additions. They represent Vercel's architectural position that the primary user of the development server is increasingly an autonomous coding agent rather than a human developer, and that the tooling surface should be designed accordingly.
Cursor's release of Composer 2 on March 19 is the most direct competitive statement of the week against OpenAI and Anthropic. The model is a fine-tuned variant of the open-source Kimi K2.5 architecture, trained through continued pretraining on code-only data followed by reinforcement learning on long-horizon multi-step coding tasks. The results on standard benchmarks are measured but real: 61.3 on CursorBench, 61.7 on Terminal-Bench 2.0 (surpassing Claude Opus 4.6's 58.0 and Claude Opus 4.5's 52.1), and 73.7 on SWE-bench Multilingual. GPT-5.4 still leads Terminal-Bench 2.0 at 75.1 in its highest configuration, and Cursor makes no claim to universal superiority — the pitch is capability per dollar. Composer 2 Standard costs $0.50/$2.50 per million input/output tokens, versus the predecessor Composer 1.5 at $3.50/$17.50 — an 86% price reduction on both input and output that repositions Composer 2 as the default cost-efficient option for Cursor's one million daily active users.
| Model | CursorBench | Terminal-Bench 2.0 | SWE-bench Multilingual | Input $/M tok |
|---|---|---|---|---|
| GPT-5.4 (high) | — | 75.1 | — | ~$15.00 |
| Composer 2 Fast | 61.3 | 61.7 | 73.7 | $1.50 |
| Claude Opus 4.6 | — | 58.0 | — | ~$15.00 |
| Composer 2 Standard | 61.3 | 61.7 | 73.7 | $0.50 |
| Composer 1.5 | 44.2 | 47.9 | 65.9 | $3.50 |
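The capability-per-dollar pitch is easy to quantify from the published prices. A back-of-envelope sketch with an invented task profile (200k input tokens, 20k output tokens — meant to resemble a long agentic session, not a published benchmark), using the per-million-token prices quoted for Composer 1.5 and Composer 2 Standard:

```python
def task_cost(in_tok, out_tok, in_price, out_price):
    # Prices are dollars per million tokens.
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Hypothetical agentic task: 200k tokens in, 20k tokens out.
old = task_cost(200_000, 20_000, 3.50, 17.50)  # Composer 1.5
new = task_cost(200_000, 20_000, 0.50, 2.50)   # Composer 2 Standard
print(f"${old:.2f} -> ${new:.2f} ({1 - new/old:.0%} cheaper)")  # $1.05 -> $0.15 (86% cheaper)
```

Because both the input and output prices dropped by the same ratio, the ~86% saving holds for any input/output mix, which is what makes the headline figure robust.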
The third announcement — OpenAI's agreement to acquire Astral, the company behind uv, Ruff, and ty — is the most structurally interesting for the Python ecosystem's long-term governance. Astral's toolchain has, in roughly two years, displaced pip, virtualenv, pyenv, flake8, black, isort, and mypy as the default choice for new Python projects, accumulating hundreds of millions of downloads per month in the process. The acquisition mirrors Anthropic's December 2025 acquisition of Bun (the JavaScript runtime that powers Claude Code's CLI), though the strategic rationale differs somewhat: where Bun was a core runtime dependency of Claude Code that Anthropic brought in-house to guarantee maintenance continuity, Astral represents a bid by OpenAI to embed its Codex agent directly into the dependency management, linting, and type-checking steps of every Python project — not merely the code generation step. The deal terms were not disclosed and regulatory approval is pending; OpenAI has committed to keeping Astral's open-source tools actively maintained post-close.
The pattern is now clear: Anthropic acquires Bun (JavaScript runtime), OpenAI acquires Astral (Python toolchain). The coding agent wars are being fought not just at the model layer but at the developer infrastructure layer — whoever owns the toolchain owns the agent's eyes and hands inside the codebase.
Engineering Culture · AI Agents
Stripe's Minions Ship 1,300 Pull Requests per Week: What Unattended Coding Agents Look Like in Production
One-shot sandboxed agents, sub-10-second devbox spin-up, and human review as the only gate — Stripe's architecture for autonomous software development at scale
Stripe disclosed this week that its internal AI coding agents, called Minions, are now generating over 1,300 pull requests per week, with every line of code written by the agents and every PR reviewed by a human before merge. The disclosure is notable not for the volume — which will seem conservative in a few years — but for the specific architectural choices Stripe made in deploying unattended agents at this scale in a payments infrastructure context where a single defective PR can affect millions of transactions.
Minions are explicitly not interactive copilots. The design pattern is one-shot and asynchronous: an engineer submits a task specification through Slack, a CLI, or a web interface, and the agent takes complete ownership of the task from that point — reading the relevant codebase context, writing the implementation, generating tests, producing documentation, and opening a pull request — without further engineer interaction during execution. Each Minion runs in an isolated, pre-warmed devbox that spins up in under 10 seconds. The pre-warming is architecturally significant: it means the agent environment is not provisioned on demand (which would introduce cold-start latency that disrupts the async workflow) but is maintained in a ready state, implying Stripe is carrying a standing pool of idle compute as an operational cost of the system. The isolation guarantee is equally important in a payments context: each agent session has no network access beyond approved internal endpoints, preventing a compromised or hallucinating agent from exfiltrating code or making unauthorised external calls.
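The pre-warmed-pool pattern is straightforward to sketch. This is a toy illustration of the trade involved — standing idle capacity in exchange for zero cold-start on the request path — not Stripe's implementation; all names and sizes are invented:

```python
import itertools
import queue
import threading

class DevboxPool:
    """Toy sketch of a pre-warmed sandbox pool: boxes are provisioned ahead
    of demand so an agent task acquires one with no cold start."""

    def __init__(self, warm_target=4):
        self._warm = queue.Queue()
        self._ids = itertools.count()
        for _ in range(warm_target):
            self._provision()

    def _provision(self):
        # Real provisioning (VM boot, repo clone, dependency install) is the
        # slow part; doing it here, off the request path, hides that latency.
        self._warm.put(f"devbox-{next(self._ids)}")

    def acquire(self):
        box = self._warm.get()  # instant while the pool is warm
        # Replenish in the background to hold the pool at its target size.
        threading.Thread(target=self._provision, daemon=True).start()
        return box

pool = DevboxPool(warm_target=2)
print(pool.acquire())  # devbox-0
```

The standing pool is exactly the "idle compute as an operational cost" that Stripe is implicitly accepting: warm boxes burn money while waiting, but every `acquire` returns in queue-pop time rather than provisioning time.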
The human review gate — the only gate — is a deliberate architectural choice that reflects a specific theory of where human judgment adds value in an agentic code production pipeline. Stripe is not using automated test suites as a merge gate beyond what would be applied to any human-authored PR. The implicit claim is that experienced human code reviewers, reviewing a complete PR with full diff context, can reliably detect the quality and safety issues that matter in a payments codebase — and that adding further automated review stages would introduce review latency that undermines the productivity case for agents without catching failure modes that reviewers miss. This is a reasonable position for a company with Stripe's engineering culture, but it transfers poorly to organisations where reviewer quality is more variable or where the domain (security-critical cryptography, regulatory compliance code) requires specialist knowledge that reviewers may not reliably possess.