Ref: https://www.xiaoyuzhoufm.com/episode/688cc1cc8e06fe8de7d920cd?s=eyJ1IjoiNjE4M2U4NGJlMGY1ZTcyM2JiNjgzZDViIn0%3D

The world of Artificial Intelligence often highlights groundbreaking models and dazzling applications. But beneath the surface, a critical, often-overlooked component is driving this revolution: AI Infrastructure (AI Infra). Just as a powerful engine is essential for a high-performance car, robust AI Infra is the backbone of cutting-edge AI.

Understanding AI Infra: A Cloud Analogy

At its core, AI Infra can be understood through an analogy with cloud computing:

  • Hardware: Think of AI chips (like GPUs) as the fundamental building blocks.
  • Software: This mirrors cloud services, encompassing:
    • Infrastructure as a Service (IaaS): providing raw compute, storage, and networking.
    • Platform as a Service (PaaS): Offering development platforms and tools.
    • Software as a Service (SaaS): This includes training and inference frameworks, which are crucial for developing and deploying AI models.

Similar to the early days of search engine infrastructure, building AI Infra requires immense initial computational power and highly specialized capabilities. To achieve excellence in AI, core infrastructure is indispensable. We are currently in a pivotal moment for AI Infra development.

AI Infra vs. Traditional Infra: Similarities and Key Differences

Many fundamental characteristics of AI Infra are shared with traditional infrastructure, such as computation, communication, storage, and fault tolerance. However, key distinctions emerge:

  • Hardware Platforms: AI workloads rely heavily on specialized accelerators such as GPUs (working alongside conventional CPUs), which impose very different communication requirements, such as high-bandwidth collective operations between devices.
  • High Customization: AI Infra often requires a much higher degree of customization to optimize for specific AI models and tasks.

Unlike rapidly evolving algorithms, Infra inherently emphasizes accumulation and long-term development. Many performance metrics carry over from traditional systems: the “first token wait” (time to first token, TTFT) in AI serving, for example, is analogous to the load time of a traditional application. Training, in particular, shares similarities with big-data processing paradigms like MapReduce.
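The “first token wait” metric mentioned above is simple to measure in practice: it is the wall-clock time from issuing a request until the first token of the response arrives. The sketch below is a minimal illustration against a hypothetical streaming API (the `stream_tokens` generator and its delays are invented stand-ins, not a real service):

```python
import time

def stream_tokens():
    """Hypothetical stand-in for a streaming inference API:
    yields tokens one at a time after a server-side delay."""
    time.sleep(0.05)  # simulated prefill / queueing latency
    for tok in ["Hello", ",", " world"]:
        yield tok
        time.sleep(0.01)  # simulated per-token decode latency

def time_to_first_token(token_stream):
    """Measure TTFT: wall-clock time until the first token arrives."""
    start = time.monotonic()
    first = next(token_stream)
    return time.monotonic() - start, first

ttft, first = time_to_first_token(stream_tokens())
```

In this toy setup, TTFT is dominated by the initial 0.05 s "prefill" delay, which mirrors the real situation: prompt processing and queueing usually dominate TTFT, while per-token decode latency governs throughput afterwards.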

Strategic Investment in AI Infra

For companies, strategic investment in AI Infra is crucial:

  • Large-Scale GPU Clusters: Even modest gains in GPU utilization translate into enormous cost savings, often dwarfing personnel costs.
  • Smaller-Scale Operations: Investment decisions should be tailored to the specific computational needs and scale of the business.
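To see why utilization dominates the economics at cluster scale, here is a back-of-envelope sketch. The cluster size, hourly rate, and utilization figures are invented for illustration, not quoted from the source:

```python
def annual_idle_cost(num_gpus, hourly_rate, utilization):
    """Cost of GPU-hours paid for but not productively used over one year.
    All inputs are illustrative assumptions, not real pricing."""
    hours_per_year = 24 * 365
    total_spend = num_gpus * hourly_rate * hours_per_year
    return total_spend * (1.0 - utilization)

# Illustrative: a 1,000-GPU cluster at $2 per GPU-hour.
waste_at_40 = annual_idle_cost(1000, 2.0, 0.40)
waste_at_60 = annual_idle_cost(1000, 2.0, 0.60)
savings = waste_at_40 - waste_at_60  # ~ $3.5M per year recovered
```

Under these assumed numbers, lifting utilization from 40% to 60% recovers roughly $3.5M per year on a single 1,000-GPU cluster, which is why a small Infra team that improves utilization can pay for itself many times over.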

AI Infra itself is becoming increasingly specialized. To thrive, Infra companies need a vertical focus, either by optimizing for specific hardware or by specializing in particular model types. AI Infra professionals need to understand both sides of the coin: they should be “more knowledgeable about hardware than model developers, and more knowledgeable about models than hardware engineers.” Close collaboration and vertical integration with upstream and downstream partners are essential.

The Impact of Infra on Model Performance

AI Infra directly influences model effectiveness. Higher efficiency in training and iteration, enabled by superior Infra, allows faster progress with the same computational resources. Defining success metrics, however, is complex. Metrics like MFU (Model FLOPs Utilization) don’t always capture the desired outcomes, which can vary widely, from inference speed to training speed. In Reinforcement Learning (RL), for instance, inference and training are interwoven. The industry is still grappling with agreeing on a single, universally accepted key metric for AI Infra performance.
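MFU itself is straightforward to compute: achieved training FLOPs divided by the hardware’s theoretical peak. The sketch below uses the common ~6N FLOPs-per-token approximation for a dense transformer’s forward-plus-backward pass; the model size, throughput, and per-GPU peak figures are illustrative assumptions, not measurements:

```python
def mfu(tokens_per_sec, n_params, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved training FLOPs / hardware peak.
    Uses the common ~6 * N FLOPs-per-token approximation for dense
    transformer training (forward + backward); an approximation, not exact."""
    achieved_flops = tokens_per_sec * 6 * n_params
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Illustrative numbers (assumptions): a 7B-parameter model training at
# 200k tokens/s on 64 GPUs, each with ~312 TFLOPs peak (A100-class, BF16).
u = mfu(200_000, 7e9, 64, 312e12)  # roughly 0.42
```

This also makes the article’s caveat concrete: MFU rewards keeping matrix units busy during training, but it says nothing about, say, inference latency or RL rollout throughput, so a system can score well on MFU while failing the metric a team actually cares about.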

Collaboration Between Infra and Algorithms

The most effective collaboration occurs when Infra and algorithm teams operate as a single, small unit, jointly discussing trade-offs. In larger organizations, if Infra is merely seen as a cost-reduction center, its impact can be limited. Ideal leadership for AI efforts requires a deep understanding of both Infra and algorithms. Organizational structure is paramount, as many legacy structures are ill-suited for the demands of the AI era. For example, Product Managers should possess a strong grasp of large model technology, and model architecture decisions shouldn’t be solely the domain of algorithm engineers.

Current Challenges in AI Infra

Several challenges demand breakthroughs in AI Infra:

  • RL Engineering Complexity: The inherent complexity of Reinforcement Learning engineering presents significant hurdles.
  • Model-Hardware Co-design: The intricate interplay between models and hardware requires an iterative, “walk-one-step-at-a-time” approach.

Insights from the Silicon Valley Landscape

A quick look at Silicon Valley reveals some trends:

  • Snowflake: Commercially successful largely because of how it manages data ingress (getting customer data onto its platform), even though its core technology is arguably not the “strongest.”
  • Databricks: Possesses stronger core technology, with commercialization similar to Snowflake.
  • Other Startups: Many focus on inference, likely because training technology is often proprietary and difficult to offer as a third-party service.

The Impact of Open-Source Models

Open-source models have had a mixed impact on AI Infra:

  • Positive: They have encouraged extreme optimization, pushing the boundaries of Infra technology.
  • Negative: They can lead to over-concentration in specific areas. For instance, optimizations for Llama models might not be directly transferable to others like DeepSeek.

The US-China Gap in AI Infra

A significant scale gap exists between China and the United States in AI Infra. The US benefits from greater funding and talent, enabling more comprehensive co-design across the entire stack, from models down to chips. The “shocking moments” delivered by model advances since 2022, particularly the rise of RL, have also placed new demands on RL Infra.

Advice for AI Infra Professionals

For those working in AI Infra, the advice is clear:

  • Specialize: Gravitate towards either the model side or the hardware side.
  • Passion and Co-design: Cultivate genuine interest and commit to co-design efforts.

Finally, the wisdom encapsulated in “The Bitter Lesson” (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) continues to increase in relevance for the AI Infra community.