AI infrastructure is now one of the largest and fastest-growing areas in semiconductors. Hyperscalers, cloud providers, and sovereign initiatives are building out capacity at unprecedented scale, spanning GPUs, TPUs, custom ASICs, networking, and vertically integrated platforms. The pace of change makes the market feel wide open, but the underlying structure is becoming clearer. The picture below describes the Western market; China, constrained by export controls, is building a separate, self-contained AI stack around domestic suppliers.

Here’s the shape that’s emerging.

GPU Platforms: Clear Leadership and a Strong Second Source

Nvidia remains the anchor platform for AI training and inference, with a decade-deep stack spanning GPUs, interconnects, networking, and the CUDA/TensorRT software layer of libraries and compilers. Every major cloud provider is deploying multi-generation Nvidia systems at enormous scale.

AMD has become the credible alternative. MI300/MI350-class accelerators are now widely deployed, supported by a maturing ROCm ecosystem and direct co-design work with leading AI developers and cloud providers.
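
To make the hardware–software point concrete, here is a minimal sketch of what second-sourcing looks like at the framework level (the layer sizes below are arbitrary and purely illustrative): PyTorch’s ROCm builds expose the familiar torch.cuda API over HIP, so most model code written for Nvidia hardware runs on AMD accelerators without changes.

```python
# Minimal sketch: the same PyTorch code path runs on Nvidia (CUDA) or AMD (ROCm)
# builds, because PyTorch's ROCm wheels expose the torch.cuda API over HIP.
# The layer and batch sizes here are arbitrary and purely illustrative.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(8, 4096, device=device)

with torch.no_grad():
    y = model(x)

# ROCm builds set torch.version.hip; CUDA builds set torch.version.cuda instead.
if device.type != "cuda":
    backend = "CPU (no accelerator found)"
elif getattr(torch.version, "hip", None):
    backend = "ROCm/HIP"
else:
    backend = "CUDA"
print(f"Ran on {device} via {backend}; output shape {tuple(y.shape)}")
```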

This space is already well defined. It reflects years of hardware–software integration that is difficult for new entrants to replicate.

Custom AI ASICs: A Large, Established, and Now Clearly Defined Category

Custom silicon has become a central pillar of AI infrastructure, and the partnerships are crystallizing into distinct strategic blocs.

Broadcom has emerged as the dominant partner for Google’s TPU roadmap and is increasingly central to Meta’s internal silicon efforts. It is executing on multi-billion-dollar volume ramps in both compute and high-speed networking, anchoring one of the most important ASIC lanes in the industry.

Google TPU, co-developed with Broadcom, powers Google’s internal workloads and is now gaining traction beyond Google’s own walls. Reports indicate that Meta is evaluating TPUs for its own clusters, a strong signal that the TPU has matured into a viable cross-hyperscaler alternative to Nvidia for targeted training and inference workloads.

Marvell has carved out a distinct leadership role in custom compute and AI infrastructure. It is a key partner for AWS (Trainium/Inferentia connectivity) and Microsoft (Maia accelerator), and it supplies the high-speed data interconnects that let large-scale AI clusters communicate efficiently.

MediaTek is staking out a position of its own. It is contributing to next-generation TPU development with Google, leveraging cost-optimized engineering execution. In parallel, it benefits from a broader architectural alignment across Arm and Nvidia (via emerging NVLink interoperability paths), which improves system-level integration across parts of the ecosystem.

Non-GPU accelerators such as Groq continue to play targeted roles in low-latency inference and high-throughput streaming workloads, though at smaller scale relative to the major platform providers.

This category already has clearly defined lanes and is well populated with established players who have deep design partnerships and long-term roadmaps.

Interconnect Lanes: Integrated Fabrics vs. Open Standards

If GPUs define compute lanes, cluster interconnects define ecosystem lock-in, and this is where the lanes are perhaps most visible.

Nvidia continues to drive its vertically integrated InfiniBand and NVLink stack, treating networking and compute as a unified platform. This delivers best-in-class performance, but it requires full ecosystem commitment: it is a structurally closed lane.

A large coalition including AMD, Broadcom, Marvell, Microsoft, and Meta has rallied behind the Ultra Ethernet Consortium (UEC), whose objective is to make standard Ethernet performant enough for AI training and inference at scale. Where Nvidia offers maximally optimized integration, UEC offers open flexibility, enabling hyperscalers to mix and match accelerators without being tied to a proprietary interconnect model.
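
To see where this choice surfaces in software, here is a minimal sketch assuming a single GPU and a single process (the environment values are examples only, not recommendations): the collective call an AI framework issues is the same regardless of fabric, while NCCL’s transport selection decides underneath whether traffic rides InfiniBand or standard Ethernet.

```python
# Minimal sketch: the application-level collective API does not change with the fabric.
# Whether all_reduce traffic uses InfiniBand or Ethernet/TCP is a platform decision,
# steered here via real NCCL environment variables (values are illustrative only).
# Assumes a single GPU and a single process, purely for demonstration.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_IB_DISABLE", "1")        # "1" forces the socket (Ethernet) transport
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # NIC used for the socket transport (example name)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # identical call whether the cluster runs NVLink/InfiniBand or UEC-style Ethernet
print("all_reduce result:", t.item())

dist.destroy_process_group()
```

The call itself does not change; the commitment lives in the hardware and fabric beneath it.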

This is no longer just “data center plumbing.” It is a fundamental architectural choice that shapes entire AI platforms.

Internal Silicon Efforts Are Redefining the Landscape

Hyperscalers increasingly treat silicon as strategic differentiation. Amazon continues to expand Trainium and Inferentia. Google is deepening its commitment to TPU. Meta is investing heavily in purpose-built internal accelerators tailored to its own workloads.

These internal programs shrink the merchant total addressable market and create tighter vertical lock-in across each company’s AI stack.

In contrast, providers such as Oracle and CoreWeave are capturing massive training workloads specifically by remaining pure merchant-silicon buyers, leaning fully into neutral Nvidia and AMD infrastructure rather than competing with their customers through proprietary chips. This neutrality has become a strategic differentiator.

The Narrowing Lanes for Merchant Entrants

Given entrenched GPU platforms, established ASIC suppliers, proprietary and open interconnect ecosystems, and deep training-software lock-in, the open space for new merchant compute entrants is relatively narrow.

Inference-oriented accelerators may still succeed in targeted opportunities such as power-optimized data centers, cost-sensitive inference clusters, regional or specialized deployments, and enterprise-specific edge-to-cloud workflows.

But moving from niche deployments to global platform status requires software ecosystems, compatibility with existing training stacks, multi-generation roadmaps, and deep hyperscaler integration, all nontrivial challenges in a consolidating landscape. This dynamic applies across both commercial hyperscalers and sovereign AI deployments, which often begin multi-vendor but converge on a smaller set of established platforms as workloads scale.

The Broader Takeaway

AI infrastructure is no longer a blank slate. It is an ecosystem with clear GPU platform leaders, robust and differentiated custom ASIC providers, distinct interconnect philosophies, growing internal silicon programs, and increasingly defined architectural lanes across the stack.

Multiple players will participate, but the lanes are becoming clear. Future success will depend on ecosystem fit, software depth, and the ability to integrate with the platforms that already dominate training and large-scale deployment.

Understanding where real opportunity exists — and where consolidation is already taking hold — will be essential to navigating the next decade of AI infrastructure.