Facebook: Reimagining Our Infrastructure for the AI Age

Our artificial intelligence (AI) compute needs will grow dramatically over the next decade as we break new ground in AI research, ship more cutting-edge AI applications and experiences across our family of apps, and build our long-term vision of the metaverse.

We are executing on an ambitious plan to build the next generation of Meta’s AI infrastructure, and today we’re sharing some details on our progress.

This includes our first custom silicon chip for running AI models, a new AI-optimized data center design and the second phase of our 16,000-GPU supercomputer for AI research. These efforts, along with additional projects still underway, will enable us to develop larger, more sophisticated AI models and then deploy them efficiently at scale. AI is already at the core of our products, enabling better personalization, safer and fairer products, and richer experiences while also helping businesses reach the audiences they care about most.

We’re even reimagining how we code by deploying CodeCompose, a generative AI-based coding assistant we developed to make our developers more productive throughout the software development lifecycle.

By rethinking how we innovate across our infrastructure, we’re creating a scalable foundation to power emerging opportunities in areas like generative AI and the metaverse.

AI at the Heart of Our Infrastructure

Since breaking ground on our first data center back in 2010, we’ve built a global infrastructure that currently serves as the engine for the more than three billion people who use our family of apps every day. AI has been an important part of these systems for many years, from our Big Sur hardware in 2015 to the development of PyTorch and our supercomputer for AI research.

Now, we’re advancing our infrastructure in exciting new ways:

MTIA (Meta Training and Inference Accelerator): This is our in-house, custom accelerator chip family targeting inference workloads. MTIA provides greater compute power and efficiency than CPUs, and it is customized for our internal workloads. By deploying both MTIA chips and GPUs, we’ll deliver better performance, decreased latency, and greater efficiency for each workload.
Next-Gen Data Center: Our next-generation data center design will support our current products while enabling future generations of AI hardware for both training and inference. This new data center will be an AI-optimized design, supporting liquid-cooled AI hardware and a high-performance AI network connecting thousands of AI chips together for data center-scale AI training clusters. It will also be faster and more cost-effective to build, and it will complement other new hardware such as our first in-house-developed ASIC solution, MSVP, which is designed to power the constantly growing video workloads at Meta.
Research SuperCluster (RSC) AI Supercomputer: Meta’s RSC, which we believe is one of the fastest AI supercomputers in the world, was built to train the next generation of large AI models to power new augmented reality tools, content understanding systems, real-time translation technology and more. It features 16,000 GPUs, all accessible across the 3-level Clos network fabric that provides full bandwidth to each of the 2,000 training systems.
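
As a rough illustration of the RSC figures above, the short Python sketch below works out how many GPUs each training system would hold. It uses only the two numbers quoted in the text (16,000 GPUs and 2,000 training systems). The even split across systems is an assumption made for the arithmetic, and the function name is hypothetical.

```python
# Figures quoted in the RSC description above.
TOTAL_GPUS = 16_000
TRAINING_SYSTEMS = 2_000


def gpus_per_system(total_gpus: int, systems: int) -> int:
    """Return GPUs per training system.

    Assumes GPUs are spread evenly across systems; the text does not
    state this explicitly, so treat the result as an estimate.
    """
    if systems <= 0:
        raise ValueError("systems must be a positive integer")
    if total_gpus % systems != 0:
        raise ValueError("GPUs do not divide evenly across systems")
    return total_gpus // systems


if __name__ == "__main__":
    print(gpus_per_system(TOTAL_GPUS, TRAINING_SYSTEMS))  # prints 8
```

Under that assumption, each of the 2,000 training systems would contain eight GPUs.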

The Benefits of an End-to-End Integrated Stack

Custom-designing much of our infrastructure enables us to optimize an end-to-end experience from the physical layer to the virtual layer to the software layer to the actual user experience.

We design, build and operate everything, from the data centers to the server hardware to the mechanical systems that keep everything running. Because we control the stack from top to bottom, we’re able to customize it for our specific needs. For example, we can easily co-locate GPUs, CPUs, network and storage if it will better support our workloads. If that means we need different power or cooling solutions as a result, we can rethink those designs as part of one cohesive system.

This will be increasingly important in the years ahead. Over the next decade, we’ll see increased specialization and customization in chip design, purpose-built and workload-specific AI infrastructure, new systems and tooling for deployment at scale, and improved efficiency in product and design support. All of this will deliver increasingly sophisticated models built on the latest research — and products that give people around the world access to this emerging technology.

Our infrastructure vision is guided by a focus on delivering long-term value and impact. We believe our track record of building world-class infrastructure positions us to continue leading in AI over the next decade and beyond.

Learn more about our AI investments.
