Semiconductor Fabrication · Lithography
imec Takes Delivery of ASML EXE:5200 High-NA EUV — Sub-2nm R&D Moves to European Soil
The $400M EXE:5200 at imec Leuven is one of fewer than a dozen worldwide; full qualification targeted for Q4 2026, enabling A14-node process development
Belgian chip research institute imec confirmed delivery of ASML's EXE:5200 High-NA EUV lithography system this month — a machine that costs roughly $400 million and is one of fewer than a dozen units in existence globally. The acquisition is part of imec's five-year strategic partnership with ASML, backed by EU Chips Act funding (IPCEI and NanoIC pilot line programs) and the Flemish and Dutch governments. Full system qualification is targeted for Q4 2026, after which imec's global partner ecosystem — comprising essentially every major fabless design house and memory manufacturer — will have access to sub-2nm and advanced DRAM process development on European territory for the first time.
The physics of High-NA EUV requires some unpacking to appreciate why this system is categorically different from the 0.33 NA EUV tools that enabled 7nm through 2nm production. Numerical aperture in an optical system governs the finest resolvable feature through the Rayleigh criterion: R = k₁ · λ / NA, where λ for EUV is 13.5 nm and k₁ is a process-dependent constant typically around 0.32 for dense line/space patterns. At 0.33 NA, this resolves to approximately 13 nm half-pitch per exposure — which is why 3nm and 2nm nodes required multi-patterning to achieve tighter pitches. At 0.55 NA (the EXE:5200), the resolution improves to approximately 8 nm half-pitch in single exposure, eliminating two to three multi-patterning steps per critical layer. Each eliminated patterning step reduces overlay error accumulation, simplifies process flow, and improves yield by removing one opportunity for misalignment-induced shorts or opens.
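The Rayleigh arithmetic is simple enough to check directly. A minimal sketch, assuming the k₁ ≈ 0.32 dense line/space figure quoted above:

```python
# Rayleigh criterion for EUV lithography: R = k1 * lambda / NA
# Assumes k1 ~= 0.32 for dense line/space patterns, lambda = 13.5 nm (EUV).

WAVELENGTH_NM = 13.5  # EUV source wavelength
K1 = 0.32             # process-dependent constant (assumed, dense lines/spaces)

def half_pitch_nm(numerical_aperture: float, k1: float = K1) -> float:
    """Minimum resolvable half-pitch for a single exposure, in nm."""
    return k1 * WAVELENGTH_NM / numerical_aperture

for na in (0.33, 0.55):
    print(f"NA = {na:.2f} -> ~{half_pitch_nm(na):.1f} nm half-pitch")
# NA = 0.33 -> ~13.1 nm half-pitch
# NA = 0.55 -> ~7.9 nm half-pitch
```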
The EXE:5200's 0.55 NA achieves ~8 nm half-pitch per single exposure — vs 13 nm at 0.33 NA — eliminating two to three multi-patterning steps per critical layer and resetting the yield baseline for sub-2nm logic.
The engineering challenges introduced by the higher NA are non-trivial. The larger optical cone angle means the anamorphic projection optics must use a 4:1 reduction ratio in one axis and 8:1 in the other — requiring new reticle stitching infrastructure and resizing the maximum printable field from 26×33 mm² (standard) to 26×16.5 mm². The smaller field requires more die exposures per wafer, partially offsetting the throughput gains from eliminated multi-patterning steps. Additionally, mask-induced shift (MIS) and aberration uniformity across the anamorphic field demand mask fabrication tolerances an order of magnitude tighter than current EUVL masks — a supply chain bottleneck that ASML CEO Christophe Fouquet has acknowledged as the primary gating factor for high-volume manufacturing entry in 2027–2028.
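To see how the half field translates into exposure count, here is a rough sketch that simply tiles full fields onto a 300 mm wafer, ignoring partial edge fields, scribe lanes, and stitching overhead, all of which matter in practice:

```python
import math

# Rough count of exposure fields on a 300 mm wafer: full field (0.33 NA)
# vs anamorphic half field (0.55 NA). A sketch only: real steppers shoot
# partial edge fields and usable area depends on edge-exclusion rules.

WAFER_DIAMETER_MM = 300
EDGE_EXCLUSION_MM = 3  # assumed edge exclusion

def fields_per_wafer(field_w_mm: float, field_h_mm: float) -> int:
    usable_r = WAFER_DIAMETER_MM / 2 - EDGE_EXCLUSION_MM
    usable_area = math.pi * usable_r ** 2
    return int(usable_area // (field_w_mm * field_h_mm))

full = fields_per_wafer(26, 33)    # 0.33 NA full field
half = fields_per_wafer(26, 16.5)  # 0.55 NA anamorphic half field
print(full, half)  # ~79 vs ~158 fields
```

Roughly twice the exposures per wafer, which is the throughput headwind that the eliminated multi-patterning steps have to pay back.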
Intel's 14A node is the first commercial high-volume target for High-NA, building on the EXE:5200B acceptance testing completed at the D1X Oregon fab in December 2025. Samsung is deploying its second High-NA unit in the first half of 2026 for 2nm (SF2) and advanced HBM4 DRAM patterning. TSMC, notably, has chosen to defer High-NA adoption, opting instead to push 0.33 NA EUV to its limits through enhanced multi-patterning for the A14 (1.4nm) node. The economic rationale is clear: TSMC's 2nm wafer pricing is already at $30,000 — a 10–20% premium over 3nm — and adding High-NA capital expenditure (at $380M+ per system, requiring multiple units per production bay) would compress margins or force further customer price increases. Intel and Samsung, both seeking to close the foundry gap with TSMC, are making a different calculation: that process density leadership justifies the equipment cost premium.
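The capital-cost argument can be made concrete with a back-of-envelope amortisation. Every input below other than the headline system price is an assumption for illustration, not a vendor figure:

```python
# Back-of-envelope High-NA capex per wafer exposure pass.
# All throughput/utilisation/depreciation inputs are assumptions.

SYSTEM_COST_USD = 400e6      # headline EXE:5200 price from the article
THROUGHPUT_WPH = 200         # assumed sustained wafers per hour
UTILIZATION = 0.75           # assumed uptime fraction
DEPRECIATION_YEARS = 5       # assumed straight-line depreciation

wafers_per_year = THROUGHPUT_WPH * UTILIZATION * 24 * 365
capex_per_pass = SYSTEM_COST_USD / (wafers_per_year * DEPRECIATION_YEARS)
print(f"{wafers_per_year:,.0f} wafer passes/year, "
      f"~${capex_per_pass:.0f} capex per exposure pass")
# ~1,314,000 passes/year -> roughly $61 per pass. Small against a
# $30,000 2nm wafer, but it multiplies across every critical layer
# and every tool in the bay.
```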
| System | NA | Min. Pitch (single exposure) | Field Size | Status |
| --- | --- | --- | --- | --- |
| TWINSCAN NXE:3800 | 0.33 | ~13 nm | 26 × 33 mm² | HVM (3nm–2nm) |
| TWINSCAN EXE:5000 | 0.55 | ~8 nm | 26 × 16.5 mm² | R&D at Intel, Samsung, TSMC |
| TWINSCAN EXE:5200B | 0.55 | ~8 nm | 26 × 16.5 mm² | Pre-HVM — Intel, Samsung, imec, SK hynix |
For GAA nanosheet transistors — the device architecture that replaces FinFET at 2nm and below — High-NA EUV's tighter patterning tolerances are not merely a density benefit. GAA gate formation requires patterning the internal surfaces of nanosheet stacks, where overlay error accumulates across each sheet layer. A 1 nm overlay improvement at the lithography step translates to a threshold voltage distribution tightening of roughly 15–20 mV at the device level, which cascades into reduced leakage variance and improved energy efficiency at the circuit level. The thermal design power (TDP) consequences are material: chips manufactured on a well-yielding GAA process enabled by High-NA EUV are expected to deliver a 30–40% improvement in performance-per-watt relative to FinFET 3nm — not through faster clocks, but through leakage current reduction and the ability to drive supply voltages lower while maintaining timing margin.
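Overlay contributions across successive patterning steps are conventionally budgeted in quadrature, which is why eliminating steps tightens the distribution faster than intuition suggests. A sketch with an assumed per-step overlay contribution:

```python
import math

# Overlay error budget across successive patterning steps, combined in
# quadrature (root-sum-square), the usual budgeting convention.
# The per-step overlay figure is an illustrative assumption.

def accumulated_overlay_nm(per_step_nm: float, steps: int) -> float:
    return math.sqrt(steps * per_step_nm ** 2)

PER_STEP_NM = 1.5  # assumed single-step overlay contribution
triple = accumulated_overlay_nm(PER_STEP_NM, steps=3)  # 0.33 NA multi-patterning
single = accumulated_overlay_nm(PER_STEP_NM, steps=1)  # 0.55 NA single exposure
print(f"triple patterning: {triple:.2f} nm, single exposure: {single:.2f} nm")
# triple patterning: 2.60 nm, single exposure: 1.50 nm
```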
Memory Architecture · Heterogeneous Interconnects
UCIe 3.0, CXL 4.0, and the HBM4 Mass Production Ramp: The Three-Layer Memory Hierarchy Takes Shape
64 GT/s die-to-die links, 2 TB/s HBM4 stacks, and 100+ TB coherent CXL memory pools are converging on a disaggregated compute fabric architecture
Three interconnect and memory standards are maturing in parallel this year, and their convergence is defining the memory hierarchy for the next five years of AI and HPC infrastructure. Understanding the role each plays — and where they interface — requires separating three distinct problems: on-package bandwidth (HBM4), die-to-die integration (UCIe 3.0), and rack-scale memory disaggregation (CXL 4.0).
SK hynix's 16-layer, 48 GB HBM4, debuted at CES 2026, extends the architecture JEDEC specified in the JESD270-4 HBM4 standard (the successor to HBM3's JESD238). The 16-layer stack achieves bandwidth exceeding 2 TB/s per stack by widening the through-silicon via (TSV) interface bus to 2048 bits and pushing the signalling rate to roughly 8 Gbps per pin. The move from 12 to 16 layers is not purely a capacity scaling story: the additional DRAM dies provide more independent banks, allowing finer-grained burst scheduling that reduces effective access latency at the controller level by approximately 12% in random access patterns. The base die in HBM4 supports customer-specific logic integration — termed custom HBM (cHBM) — enabling compute-in-memory operations, Processing-In-Memory (PIM) acceleration, and direct on-die attention score computation for transformer inference. Samsung is pursuing a parallel cHBM path for its Exynos 2600 and Tesla AI chip designs.
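The headline bandwidth falls straight out of the interface arithmetic; a quick check using the figures above:

```python
# Per-stack HBM4 bandwidth from interface width and per-pin rate.
# 2048-bit bus and ~8 Gbps/pin are the figures discussed above.

BUS_WIDTH_BITS = 2048
PIN_RATE_GBPS = 8.0  # signalling rate per pin

bandwidth_gbytes = BUS_WIDTH_BITS * PIN_RATE_GBPS / 8  # bits -> bytes
print(f"{bandwidth_gbytes / 1000:.2f} TB/s per stack")  # 2.05 TB/s
# For comparison, HBM3E: 1024 bits x 9.2 Gbps / 8 ~= 1.18 TB/s.
```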
>2 TB/s · HBM4 per-stack bandwidth
64 GT/s · UCIe 3.0 data rate (2× UCIe 2.0)
128 GT/s · CXL 4.0 signalling (PCIe 7.0 physical layer)
100+ TB · CXL 4.0 coherent memory pool per rack
UCIe 3.0, specified in mid-2025 and now in its production IP integration phase, doubles the data rate from UCIe 2.0's 32 GT/s to 64 GT/s. The protocol adds runtime recalibration — a mechanism allowing the physical layer to dynamically adjust equalisers and timing margins while the system is live, compensating for thermal drift in 3D-stacked packages where bond pitches are shrinking toward 6 µm (hybrid bonding territory). The most strategically important feature is the UCIe DFx Architecture (UDA), a vendor-agnostic management fabric enabling real-time link telemetry across chiplets from different foundries and vendors. The "known good die" (KGD) problem — the yield risk of assembling untested chiplets into a multi-die package — is addressed through UDA's early-stage testability hooks that allow individual dies to be burned in and characterised before bonding. AMD's MI400 GPU, using 2nm compute chiplets alongside HBM4, is the first production silicon expected to leverage UCIe 3.0's full feature set.
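For scale, the raw per-module arithmetic is straightforward. A sketch assuming a 64-lane advanced-package module; lane counts and protocol overheads vary by implementation, so treat it as an upper bound:

```python
# Raw die-to-die bandwidth for one UCIe module at the 3.0 data rate.
# 64-lane advanced-package module width is an assumption here.

LANES = 64      # advanced-package module width (assumed)
RATE_GTPS = 64  # UCIe 3.0 per-lane data rate, GT/s

raw_gbps = LANES * RATE_GTPS  # gigabits/s per direction, before overheads
print(f"{raw_gbps / 8:.0f} GB/s per module per direction")  # 512 GB/s
# At UCIe 2.0's 32 GT/s the same module tops out at 256 GB/s,
# which is the doubling the spec bump delivers.
```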
CXL 4.0, released by the CXL Consortium in November 2025, rides the PCIe 7.0 physical layer at 128 GT/s and introduces bundled ports that aggregate multiple physical connections into a single logical attachment delivering 1.5 TB/s. The key capability enabled by CXL 3.x/4.0 is multi-host memory pooling with cache coherency: a pool of CXL-attached DRAM in a 2-socket server can be addressed by both CPUs and by PCIe-attached accelerators as part of a coherent address space, with the CXL controller maintaining MESI-style coherence across the fabric. The production deployment timeline for multi-rack CXL memory pooling (100+ TB per rack) is late 2026–2027. Near-term deployments are focused on single-rack KV cache offloading for LLM inference, where storing KV cache in CXL-attached memory (at 4–5× lower cost than GPU VRAM) while keeping hot activations in HBM achieves 3.8–6.5× inference throughput improvements over RDMA-based approaches. PNM-KV, a processing-near-memory architecture for KV cache token page selection, demonstrates up to 21.9× throughput by co-locating selection logic with CXL memory — eliminating the CPU round-trip for cache eviction decisions.
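The capacity pressure driving KV cache offload is easy to quantify with the standard KV cache footprint formula. A sketch using a hypothetical 70B-class model configuration (the shape parameters are illustrative, not a vendor spec):

```python
# KV cache footprint for transformer inference -- the working set that
# CXL-attached memory is being used to absorb. Model shape below is a
# hypothetical Llama-70B-class configuration with grouped-query attention.

LAYERS = 80
KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # fp16/bf16

def kv_cache_gib(seq_len: int, batch: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # 2x: K and V
    return per_token * seq_len * batch / 2**30

print(f"{kv_cache_gib(seq_len=128_000, batch=8):.0f} GiB")
# ~312 GiB: far beyond HBM capacity on any single accelerator,
# which is exactly the shape of workload the CXL tier targets.
```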
The architectural geometry emerging from UCIe 3.0 + HBM4 + CXL 4.0 can be described as three concentric memory tiers. The innermost tier is HBM4 on-package: >2 TB/s bandwidth, latency on the order of 100 ns, attached to the compute die through hybrid bonding at sub-6 µm pitch — essentially L4 cache semantics for AI workloads. The middle tier is CXL-attached DRAM in the same server (CXL 2.0/3.x): roughly 70–100 ns of added latency over local DRAM (~170–250 ns total), ~200–400 GB/s bandwidth, providing memory expansion without cache coherency complexity for capacity-constrained inference. The outermost tier is multi-rack CXL 4.0 pooled memory: latency in the 200–500 ns range but coherent across the fabric, enabling the "memory-as-a-service" model where AI workloads' KV caches can grow to 150+ GB without GPU VRAM constraints. The research proposal to use UCIe-S instead of LPDDR6 as an on-package memory interface reports bandwidth density up to 10× higher than HBM4 alternatives at equivalent energy per bit — a result that suggests HBM's packaging model, which requires heterogeneous DRAM+logic stacking, may eventually yield to a UCIe-native memory chiplet model as hybrid bonding matures.
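One way to read the hierarchy is as a placement problem. A toy model, with numbers mirroring the table below and a deliberately simplistic placement rule (illustrative only, not a product API):

```python
from dataclasses import dataclass

# Toy model of the three-tier hierarchy described above. Tier numbers
# mirror the table below; the placement rule is an illustration.

@dataclass
class Tier:
    name: str
    bandwidth_gbps: float  # GB/s
    latency_ns: float

TIERS = [
    Tier("HBM4 on-package", 2000, 100),
    Tier("CXL in-server DRAM", 300, 200),
    Tier("CXL 4.0 rack pool", 1500, 400),
]

def place(bytes_per_access: int, reuse: float) -> Tier:
    """Hot, high-reuse data goes inward; cold capacity goes outward."""
    if reuse > 0.5:
        return TIERS[0]
    return TIERS[1] if bytes_per_access < 4096 else TIERS[2]

print(place(64, reuse=0.9).name)        # hot activations -> HBM4 on-package
print(place(1 << 20, reuse=0.05).name)  # cold KV pages -> CXL 4.0 rack pool
```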
| Technology | Bandwidth | Latency | Coherency | Use Tier |
| --- | --- | --- | --- | --- |
| HBM4 (on-package) | >2 TB/s / stack | ~100 ns | N/A (direct) | L4 cache / working set |
| CXL 2.0 DRAM expander | ~100 GB/s | ~170–250 ns | Host-coherent | Memory expansion |
| CXL 3.x pooled DRAM | ~400 GB/s fabric | ~200–400 ns | Peer-to-peer | KV cache / model offload |
| CXL 4.0 multi-rack pool | 1.5 TB/s port | ~200–500 ns | Full fabric coherency | 100+ TB AI memory fabric |
| DDR5-6400 (CPU-attached) | ~100 GB/s total | ~65 ns | Full MESI | CPU working set / OS |
One implementation signal worth watching: SK hynix's CMM-Ax product integrates computing capabilities directly into CXL memory modules — effectively embedding an FPGA-class processing element inside the memory expander. This positions CXL-attached memory not merely as a capacity tier but as an active processing resource, analogous to near-data processing (NDP) architectures explored in academic literature for the past decade. For graph neural network inference, where memory access patterns are deeply irregular and data reuse is low, co-locating selection logic with the data eliminates the bandwidth cost of moving candidate data to the GPU for evaluation — a pattern that AMD's FPGA-in-package research confirmed delivers 2–4× throughput improvement for sparse workloads.
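The data-movement argument for near-memory selection can be sketched in a few lines; the page counts and sizes below are illustrative assumptions:

```python
# Why near-data selection pays off for irregular, low-reuse access
# patterns: compare bytes crossing the CXL link when candidate pages
# are shipped to the GPU for scoring vs scored in the expander with
# only the winners shipped. All sizes are illustrative assumptions.

CANDIDATES = 10_000     # KV pages considered per eviction decision
PAGE_BYTES = 16 * 1024  # assumed KV page size
SELECTED = 64           # pages the GPU actually needs

host_side = CANDIDATES * PAGE_BYTES  # move everything, score on GPU
near_data = SELECTED * PAGE_BYTES    # score in-expander, move winners
print(f"{host_side / near_data:.0f}x less link traffic")  # ~156x
```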