The post Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters appeared first on Engineering at Meta.
Once complete, our AI cluster, Prometheus, will deliver 1 gigawatt of capacity to enhance existing AI experiences and enable new ones across Meta products. Prometheus’ infrastructure will span several data center buildings in a single larger region, interconnecting tens of thousands of GPUs.
A key piece of scaling and connecting this infrastructure is backend aggregation (BAG), which we use to seamlessly connect GPUs and data centers with robust, high-capacity networking. By leveraging modular hardware, advanced routing, and resilient topologies, BAG ensures both performance and reliability at unprecedented scale.
As our AI clusters continue to grow, we expect BAG to play an important role in meeting future demands and driving innovation across Meta’s global network.
BAG is a centralized Ethernet-based super spine network layer that primarily functions to interconnect multiple spine layer fabrics across various data centers and regions within large clusters. Within Prometheus, for example, the BAG layer serves as the aggregation point between regional networks and Meta’s backbone, enabling the creation of mega AI clusters. BAG is designed to support immense bandwidth needs, with inter-BAG capacities reaching the petabit range (e.g., 16-48 Pbps per region pair).

To address the challenge of interconnecting tens of thousands of GPUs, we’re deploying distributed BAG layers regionally.
BAG layers are strategically distributed across regions to serve subsets of L2 fabrics, adhering to distance, buffer, and latency constraints. Inter-BAG connectivity utilizes either a planar (direct match) or spread connection topology, chosen based on site size and fiber availability.

So far, we’ve discussed how the BAG layers are interconnected; now let’s see how a BAG layer connects downstream to L2 fabrics.
We’ve used two main fabric technologies, Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF), to build L2 networks.
Below is an example of DSF L2 zones across five data center buildings connected to the BAG layer via a special backend edge pod in each building.

Below is an example of NSF L2 connected to BAG planes. Each BAG plane connects to matching Spine Training Switches (STSWs) from all spine planes. Effective oversubscription is 4.98:1.

Careful management of oversubscription ratios assists in balancing scale and performance. Typical oversubscription from L2 to BAG is around 4.5:1, while BAG-to-BAG oversubscription varies based on regional requirements and link capacity.
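As a concrete illustration of how such a ratio is computed, the sketch below assumes a hypothetical L2 fabric with 576 downstream 800G ports and 128 uplinks toward BAG; the port counts are invented for the example, not Prometheus’s actual configuration:

```python
def oversubscription(downstream_gbps: float, upstream_gbps: float) -> float:
    """Ratio of downstream (GPU-facing) to upstream (BAG-facing) capacity."""
    return downstream_gbps / upstream_gbps

# Hypothetical L2 fabric: 576 x 800G ports toward GPUs, 128 x 800G uplinks to BAG.
ratio = oversubscription(576 * 800, 128 * 800)
print(f"{ratio:.1f}:1")  # 4.5:1
```

A higher ratio trades per-GPU burst bandwidth for fewer expensive long-haul links, which is why the BAG-to-BAG figure is allowed to vary by region.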
Meta’s implementation of BAG uses a modular chassis equipped with Jericho3 (J3) ASIC line cards, each providing up to 432x800G ports for high-capacity, scalable, and resilient interconnect. The central hub BAG employs a larger chassis to accommodate numerous spokes and long-distance links with varied cable lengths for optimized buffer utilization.
Routing within BAG uses eBGP with link bandwidth attributes, enabling Unequal Cost Multipath (UCMP) for efficient load balancing and robust failure handling. BAG-to-BAG connections are secured with MACsec, aligning with network security requirements.
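The effect of the link bandwidth attribute can be sketched as weighted path selection: traffic is hashed per flow onto next hops in proportion to their advertised capacity. The next-hop names and weights below are illustrative, not Meta’s configuration:

```python
import hashlib

def ucmp_next_hop(flow_key: bytes, links: dict[str, int]) -> str:
    """Hash a flow onto next hops in proportion to advertised link bandwidth.

    `links` maps next-hop name -> bandwidth weight, mirroring how the BGP
    link-bandwidth extended community drives unequal-cost load balancing.
    """
    total = sum(links.values())
    # A deterministic per-flow hash keeps all packets of one flow on one path.
    bucket = int.from_bytes(hashlib.sha256(flow_key).digest()[:8], "big") % total
    for hop, weight in sorted(links.items()):
        if bucket < weight:
            return hop
        bucket -= weight
    raise RuntimeError("unreachable")

# Unequal capacities, e.g., after a partial link failure on one plane.
links = {"bag-plane-1": 800, "bag-plane-2": 800, "bag-plane-3": 400}
hop = ucmp_next_hop(b"10.0.0.1|10.0.1.9|443", links)
```

Because the hash is per flow, a failed or drained plane simply drops out of the weight table and flows rebalance without packet reordering within a flow.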
The network design meticulously details port striping, IP addressing schemes, and comprehensive failure domain analysis to ensure high availability and minimize the impact of failures. Failure modes are analyzed at the BAG, data hall, and power distribution levels. We also employ various strategies to mitigate blackholing risks, including draining affected BAG planes and conditional route aggregation.
An important advantage of BAG’s distributed architecture is that it keeps the distance from the L2 edge small, which is important for shallow-buffer NSF switches. Longer BAG-to-BAG cable distances dictate that we use deep-buffer switches for the BAG role. This provides a large headroom buffer to support lossless congestion control protocols like PFC.
As a technology, BAG is playing an important role in Meta’s next generation of AI infrastructure. By centralizing the interconnection of regional networks, BAG helps enable the gigawatt-scale Prometheus cluster, ensuring seamless, high-capacity networking across tens of thousands of GPUs. This thoughtful design, leveraging modular hardware and resilient topologies, positions BAG to not only meet the demands of Prometheus but also to drive the future innovation and scalability of Meta’s global AI network for years to come.
The post No Display? No Problem: Cross-Device Passkey Authentication for XR Devices appeared first on Engineering at Meta.
Passkeys are a significant leap forward in authentication, offering a phishing-resistant, cryptographically secure alternative to traditional passwords. Generally, the standard cross-device passkey flow, where someone registers or authenticates on a desktop device by approving the action on their nearby mobile device, is done in a familiar way with QR codes scanned by their phone camera. But how can we facilitate this flow for XR devices with a head-mounted display or no screen at all, or for other devices with an inaccessible display like smart home hubs and industrial sensors?
We’ve taken a novel approach to adapting the WebAuthn passkey flow and FIDO’s CTAP hybrid protocol for this unique class of devices that either lack a screen entirely or whose screen is not easily accessible to another device’s camera. Our implementation has been developed and is now broadly available on Meta Quest devices powered by Meta Horizon OS. We hope that this approach can also ensure robust security built on the strength of existing passkey frameworks, without sacrificing usability, for users of a variety of other screenless IoT devices, consumer electronics, and industrial hardware.

The standard cross-device flow relies on two primary mechanisms:
For devices with no display, the QR code method is impossible. Proximity-based discovery is feasible, but initiating the user verification step and confirming the intent without any on-device visual feedback can introduce security and usability risks. People need clear assurance that they are approving the correct transaction on the correct device.
Scanning a QR code sends the authenticator device a command to initiate a hybrid (cross-device) login flow with a nonce that identifies the unauthenticated device client. But if a user has a companion application – like the Meta Horizon app – that uses the same account as the device, we can use that application to pass this same request to the authenticator OS and execute it using general link/intent execution.
We made the flow easy to navigate by using in-app notifications to show users when a login request has been initiated, take them directly into the application, and immediately execute the login request.
For simplicity, we opted to begin the hybrid flow as soon as the application is opened since the user would have had to take some action (clicking the notification or opening the app) to trigger this and there is an additional user verification step in hybrid implementations on iOS and Android.
Here’s how this plays out on a Meta Quest with the Meta Horizon mobile app:

When a passkey login is initiated on the Meta Quest, the headset’s browser locally constructs the same payload that would have been embedded in a QR Code – including a fresh ECDH public key, a session-specific secret, and routing information used later in the handshake. Instead of rendering this information into an image (QR code), the browser encodes it into a FIDO URL (the standard mechanism defined for hybrid transport) that instructs the mobile device to begin the passkey authentication flow.
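The shape of that payload can be illustrated with a simplified sketch. Note the liberties taken: CTAP 2.2 hybrid transport actually encodes a compressed P-256 ECDH key and CBOR data into the FIDO URL, whereas this sketch uses random placeholder bytes and JSON purely to show which fields travel together:

```python
import base64
import json
import os
import secrets

def build_hybrid_payload() -> dict:
    """Stand-in for the data a QR code would normally carry in hybrid transport.

    Field contents are placeholders; the real encoding is CBOR, not JSON.
    """
    return {
        "ecdh_public_key": base64.urlsafe_b64encode(os.urandom(33)).decode(),
        "session_secret": base64.urlsafe_b64encode(secrets.token_bytes(16)).decode(),
        "tunnel_domain": 0,            # routing info used later in the handshake
        "operation": "get_assertion",  # authentication, as opposed to registration
    }

def encode_fido_url(payload: dict) -> str:
    """Pack the payload into a FIDO URL instead of rendering it as a QR image."""
    blob = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode().rstrip("=")
    return "FIDO:/" + blob

fido_url = encode_fido_url(build_hybrid_payload())
```

The key point is that nothing about the payload requires a display: the same bytes that would have been rendered as a QR code can travel over any trusted channel.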
After the FIDO URL is generated, the headset requires a secure and deterministic method for transferring it to the user’s phone. Because the device cannot present a QR code, the system leverages the Meta Horizon app’s authenticated push channel to deliver the FIDO URL directly to the mobile device. When the user selects the passkey option in the login dialog, the headset encodes the FIDO URL as structured data within a GraphQL-based push notification.
The Meta Horizon app, signed in with the same account as the headset, receives this payload and validates the delivery context to ensure it is routed to the correct user.
After the FIDO URL is delivered to the mobile device, the platform’s push service surfaces it as a standard iOS or Android notification indicating that a login request is pending. When the user taps the notification, the operating system routes the deep link to the Meta Horizon app. The app then opens the FIDO URL using the system URL launcher and invokes the operating system passkey interface.
For users who have notifications turned off, launching the Meta Horizon app directly will also trigger a query to the backend for any pending passkey requests associated with the user’s account. If a valid request exists (requests expire after five minutes), the app automatically initiates the same passkey flow by opening the FIDO URL.
Once the FIDO URL is opened, the mobile device begins the hybrid transport sequence, including broadcasting the BLE advertisement, establishing the encrypted tunnel, and producing the passkey assertion. In this flow, the system notification and the app launch path both serve as user consent surfaces and entry points into the standard hybrid transport workflow.
Once the user approves the action on their mobile device, the secure channel is established as per WebAuthn standards. The main difference is the challenge exchange timing:
The inaccessible device then acts as the conduit, forwarding the response to the relying party server to complete the transaction, exactly as a standard display-equipped device would.
This novel implementation successfully bypasses the need for an on-device display in the cross-device flow and still complies with the proximity and other trust challenges that exist today for cross-device passkey login. We hope that our solution paves the way for secure, passwordless authentication across a wider range of different platforms and ecosystems, moving passkeys beyond just mobile and desktop environments and into the burgeoning world of wearable and IoT devices.
We are proud to build on top of the excellent work already done in this area by our peers in the FIDO Alliance and mobile operating systems committed to this work and building a robust and interoperable ecosystem for secure and easy login.
The post Rust at Scale: An Added Layer of Security for WhatsApp appeared first on Engineering at Meta.
WhatsApp provides default end-to-end encryption for over 3 billion people to message securely each and every day. Online security is an adversarial space, and to continue ensuring users can keep messaging securely, we’re constantly adapting and evolving our strategy against cybersecurity threats – all while supporting the WhatsApp infrastructure to help people connect.
For example, WhatsApp, like many other applications, allows users to share media and other types of documents. WhatsApp helps protect users by warning about dangerous attachments like APKs, yet rare and sophisticated malware could be hidden within a seemingly benign file like an image or video. These maliciously crafted files might target unpatched vulnerabilities in the operating system, libraries distributed by the operating system, or the application itself.
To help protect against such potential threats, WhatsApp is increasingly using the Rust programming language, including in our media sharing functionality. Rust is a memory-safe language offering numerous security benefits. We believe that this is the largest rollout globally of any library written in Rust.
To help explain why and how we rolled this out, we should first look back at a key OS-level vulnerability that sent an important signal to WhatsApp around hardening media-sharing defenses.
In 2015, Android devices, and the applications that ran on them, became vulnerable to the “Stagefright” vulnerability. The bug lay in the processing of media files by operating system-provided libraries, so WhatsApp and other applications could not patch the underlying vulnerability. Because it could often take months for people to update to the latest version of their software, we set out to find solutions that would keep WhatsApp users safe, even in the event of an operating system vulnerability.
At that time, we realized that a cross-platform C++ library already developed by WhatsApp to send and consistently format MP4 files (called “wamedia”) could be modified to detect files which do not adhere to the MP4 standard and might trigger bugs in a vulnerable OS library on the receiver side – hence putting a target’s security at risk. We rolled out this check and were able to protect WhatsApp users from the Stagefright vulnerability much more rapidly than by depending on users to update the OS itself.
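The core of such a conformance check is structural: MP4 files are a sequence of boxes, each led by a 4-byte big-endian size and a 4-byte type, and a size that is impossibly small or overruns the buffer is exactly the kind of malformed input that trips vulnerable parsers. The sketch below is a teaching simplification, not wamedia itself (real parsers also handle size == 0, 64-bit sizes, and per-box-type rules):

```python
import struct

def mp4_boxes_wellformed(data: bytes) -> bool:
    """Minimal structural check of top-level MP4 boxes."""
    offset = 0
    while offset < len(data):
        if len(data) - offset < 8:
            return False  # truncated box header
        size, _box_type = struct.unpack_from(">I4s", data, offset)
        if size < 8 or offset + size > len(data):
            return False  # malformed size that could trip a vulnerable parser
        offset += size
    return True

good = struct.pack(">I4s", 16, b"ftyp") + b"isom" + b"\x00" * 4  # one 16-byte box
bad = struct.pack(">I4s", 4, b"ftyp")                            # size < 8-byte header
```

Rejecting files like `bad` at the application layer is what let WhatsApp shield users from an OS-level bug without waiting for an OS update.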
But because media checks run automatically on download and process untrusted inputs, we identified early on that wamedia was a prime candidate for using a memory safe language.

Rather than an incremental rewrite, we developed the Rust version of wamedia in parallel with the original C++ version. We used differential fuzzing and extensive integration and unit tests to ensure compatibility between the two implementations.
Two major hurdles were the initial binary size increase due to bringing in the Rust standard library and the build system support required for the diverse platforms supported by WhatsApp. WhatsApp made a long-term bet to build that support. In the end, we replaced 160,000 lines of C++ (excluding tests) with 90,000 lines of Rust (including tests). The Rust version showed performance and runtime memory usage advantages over the C++ version. Given this success, Rust was fully rolled out to all WhatsApp users and many platforms: Android, iOS, Mac, Web, Wearables, and more. With this positive evidence in hand, memory safe languages will play an ever increasing part in WhatsApp’s overall approach to application and user security.
Over time, we’ve added more checks for non-conformant structures within certain file types to help protect downstream libraries from parser differential exploit attempts. Additionally, we check higher risk file types, even if structurally conformant, for risk indicators. For instance, PDFs are often a vehicle for malware, and more specifically, the presence of embedded files and scripting elements within a PDF further raise risks. We also detect when one file type masquerades as another, through a spoofed extension or MIME type. Finally, we uniformly flag known dangerous file types, such as executables or applications, for special handling in the application UX. Altogether, we call this ensemble of checks “Kaleidoscope.” This system protects people on WhatsApp from potentially malicious unofficial clients and attachments. Although format checks will not stop every attack, this layer of defense helps mitigate many of them.
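One of those checks, detecting a file type masquerading as another, can be sketched by comparing the claimed extension against the file’s leading magic bytes. The table below is deliberately tiny; a production system covers far more types and also inspects MIME headers:

```python
# Illustrative magic-byte table; production systems cover many more types.
MAGIC_BYTES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",  # also the container format for APK and Office files
}

def masquerades(filename: str, head: bytes) -> bool:
    """True when the claimed extension disagrees with the file's magic bytes."""
    ext = filename.rsplit(".", 1)[-1].lower()
    expected = MAGIC_BYTES.get(ext)
    if expected is None:
        return False  # unknown extension: this sketch has no opinion
    return not head.startswith(expected)

# An APK (ZIP container) renamed to .png would be flagged:
suspicious = masquerades("photo.png", b"PK\x03\x04...")
```

Flagged files can then be routed into the special UX handling described above rather than opened by a downstream parser expecting the claimed type.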
Each month, these libraries are distributed to billions of phones, laptops, desktops, watches, and browsers running on multiple operating systems for people on WhatsApp, Messenger, and Instagram. This is the largest ever deployment of Rust code to a diverse set of end-user platforms and products that we are aware of. Our experience speaks to the production-readiness and unique value proposition of Rust on the client-side.
This is just one example of WhatsApp’s many investments in security. It’s why we built default end-to-end encryption for personal messages and calls, offer end-to-end encrypted backups, and use key transparency technology to verify a secure connection, provide additional calling protections, and more.
WhatsApp has a strong track record of being loud when we find issues and working to hold bad actors accountable. For example, WhatsApp reports CVEs for important issues we find in our applications, even if we do not find evidence of exploitation. We do this to give people on WhatsApp the best chance of protecting themselves by seeing a security advisory and updating quickly.
To ensure application security, we first must identify and quantify the sources of risk. We do this through internal and external audits like NCC Group’s public assessment of WhatsApp’s end-to-end encrypted backups, fuzzing, static analysis, supply chain management, and automated attack surface analysis. We also recently expanded our Bug Bounty program to introduce the WhatsApp Research Proxy – a tool that makes research into WhatsApp’s network protocol more effective.
Next, we reduce the identified risk. Like many others in the industry, we found that the majority of the high severity vulnerabilities we published were due to memory safety issues in code written in the C and C++ programming languages. To combat this we invest in three parallel strategies:
WhatsApp has added protections like CFI, hardened memory allocators, safer buffer handling APIs, and more. C and C++ developers have specialized security training, development guidelines, and automated security analysis on their changes. We also have strict SLAs for fixing issues uncovered by the risk identification process.
Rust enabled WhatsApp’s security team to develop a secure, high performance, cross-platform library to ensure media shared on the platform is consistent and safe across devices. This is an important step forward in adding additional security behind the scenes for users and part of our ongoing defense-in-depth approach. Security teams at WhatsApp and Meta are highlighting opportunities for high impact adoption of Rust to interested teams, and we anticipate accelerating adoption of Rust over the coming years.
The post Adapting the Facebook Reels RecSys AI Model Based on User Feedback appeared first on Engineering at Meta.
Delivering personalized video recommendations is a common challenge for user satisfaction and long-term engagement on large-scale social platforms. At Facebook Reels, we’ve been working to close this gap by focusing on “interest matching” – ensuring that the content people see truly aligns with their unique preferences. By combining large-scale user surveys with recent advances in machine learning, we are now able to better understand and model what people genuinely care about, which has led to significant improvements in both recommendation quality and overall user satisfaction.
Traditional recommendation systems often rely on engagement signals – such as likes, shares, and watch time – or heuristics to infer user interests. However, these signals can be noisy and may not fully capture the nuances of what people actually care about or want to see. Models trained only on these signals tend to recommend content that has high short-term user value measured by watch time and engagement but doesn’t capture true interests that are important for long-term utility of the product. To bridge this gap, we needed a more direct way to measure user perception of content relevance. Our research shows that effective interest matching goes beyond simple topic alignment; it also encompasses factors like audio, production style, mood, and motivation. By accurately capturing these dimensions, we can deliver recommendations that feel more relevant and personalized, encouraging people to return to the app more frequently.

To validate our approach, we launched large-scale, randomized surveys within the video feed, asking users, “How well does this video match your interests?” These surveys were deployed across Facebook Reels and other video surfaces, enabling us to collect thousands of in-context responses from users every day. The results revealed that previous interest heuristics only achieved a 48.3% precision in identifying true interests, highlighting the need for a more robust measurement framework.
By weighting responses to correct for sampling and nonresponse bias, we built a comprehensive dataset that accurately reflects real user preferences – moving beyond implicit engagement signals to leverage direct, real-time user feedback.
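The weighting step is a standard inverse-probability correction: responses from users who were unlikely to be sampled or to respond are upweighted so the estimate reflects the full population. The response probabilities below are made up for illustration:

```python
def weighted_mean(responses: list[tuple[float, float]]) -> float:
    """Inverse-probability-weighted mean of survey scores.

    Each tuple is (score, response_probability): the estimated chance that a
    user like this one was both sampled and chose to respond.
    """
    numerator = sum(score / p for score, p in responses)
    denominator = sum(1.0 / p for _, p in responses)
    return numerator / denominator

# Toy data: heavy users (p=0.8) respond often; light users (p=0.2) rarely do.
data = [(4.0, 0.8), (5.0, 0.8), (2.0, 0.2)]
corrected = weighted_mean(data)              # pulled toward underrepresented users
naive = sum(s for s, _ in data) / len(data)  # would overweight heavy responders
```

Without the correction, the dataset would systematically reflect the preferences of the users most likely to answer surveys rather than the user base as a whole.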

Each day, a proportion of user viewing sessions on the platform is randomly chosen to display a single-question survey asking, “To what extent does this video match your interests?” on a 1-5 scale. The survey aims to gather real-time feedback from users about the content they have just viewed.
The main candidate ranking model used by the platform is a large multi-task, multi-label model. We trained a lightweight UTIS alignment model layer on the collected user survey responses, using the main model’s existing predictions as input features. The survey responses used to train our model were binarized, which simplifies modeling and denoises variance in responses. In addition, new features were engineered to capture user behavior, content attributes, and interest signals, with an objective function that optimizes prediction of the extent to which content matches users’ interests.
The UTIS model outputs the probability that a user is satisfied with a video, and is designed to be interpretable, allowing us to understand the factors contributing to users’ interest matching experience.
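The shape of such an alignment layer can be sketched as a single logistic unit over the main model’s task predictions, trained on the binarized survey labels. The feature names, weights, and bias below are invented for illustration, not Meta’s actual model:

```python
import math

def binarize(score: int, threshold: int = 4) -> int:
    """Collapse a 1-5 survey answer to interested (>= threshold) vs. not."""
    return 1 if score >= threshold else 0

def utis_probability(main_preds: dict[str, float],
                     weights: dict[str, float], bias: float) -> float:
    """A logistic unit over the main ranking model's task predictions."""
    z = bias + sum(weights[name] * value for name, value in main_preds.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical main-model outputs for one (user, video) pair:
preds = {"p_like": 0.30, "p_watch_complete": 0.65, "p_share": 0.05}
weights = {"p_like": 1.2, "p_watch_complete": 2.0, "p_share": 0.8}
p_interest_match = utis_probability(preds, weights, bias=-1.5)
```

Because the layer is a thin, linear-in-features function of named predictions, inspecting the learned weights shows which engagement signals actually track stated interest, which is what makes the model interpretable.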

We have experimented with and deployed several use cases of the UTIS model in our ranking funnel, all of which showed successful tier 0 user retention metric improvements:
The UTIS model score is now one of the inputs to our ranking system. Videos predicted to be of high interest receive a modest boost, while those with low predicted interest are demoted. This approach has led to:
Since launching this approach, we’ve observed robust offline and online performance.
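The boost-and-demote step described above can be sketched as a simple multiplicative adjustment on the ranking score. The thresholds and multipliers are illustrative; production values are tuned against retention metrics:

```python
def adjust_rank_score(base_score: float, utis_p: float,
                      boost: float = 1.10, demote: float = 0.90,
                      hi: float = 0.7, lo: float = 0.3) -> float:
    """Modest multiplicative boost or demotion driven by the UTIS probability."""
    if utis_p >= hi:
        return base_score * boost   # high predicted interest: small boost
    if utis_p <= lo:
        return base_score * demote  # low predicted interest: small demotion
    return base_score               # middle band: leave the score alone

adjusted = adjust_rank_score(base_score=100.0, utis_p=0.8)
```

Keeping the adjustment modest lets the interest signal shift rankings without overriding the engagement predictions the main model was trained on.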
By integrating survey-based measurement with machine learning, we are creating a more engaging and personalized experience – delivering content on Facebook Reels that feels truly tailored to each user and encourages repeat visits. While survey-driven modeling has already improved our recommendations, there remain important opportunities for improvement, such as better serving users with sparse engagement histories, reducing bias in survey sampling and delivery, further personalizing recommendations for diverse user cohorts and improving the diversity of recommendations. To address these challenges and continue advancing relevance and quality, we are also exploring advanced modeling techniques, including large language models and more granular user representations.
Improve the Personalization of Large-Scale Ranking Systems by Integrating User Survey Feedback
The post CSS at Scale With StyleX appeared first on Engineering at Meta.
Build a large enough website with a large enough codebase, and you’ll eventually find that CSS presents challenges at scale. It’s no different at Meta, which is why we open-sourced StyleX, a solution for CSS at scale. StyleX combines the ergonomics of CSS-in-JS with the performance of static CSS. It allows atomic styling of components while deduplicating definitions to reduce bundle size and exposes a simple API for developers.
StyleX has become the standard at companies like Figma and Snowflake. Here at Meta, it’s the standard styling system across Facebook, Instagram, WhatsApp, Messenger, and Threads.
On this episode of the Meta Tech Podcast, meet Melissa, a software engineer at Meta and one of StyleX’s maintainers. Pascal Hartig talks to her about all things StyleX—its origins, how open source has been a force multiplier for the project, and what it’s like interacting with large companies across the industry as they’ve adopted StyleX.
Download or listen to the episode below:
You can also find the episode wherever you get your podcasts, including:
The Meta Tech Podcast is a podcast brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.
Send us feedback on Instagram, Threads, or X.
And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.
The post Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption appeared first on Engineering at Meta.
The survey was initially distributed on official social media accounts by the survey creators, and subsequently shared organically across further platforms including Reddit, email newsletters, Mastodon, LinkedIn, Discord, and Twitter. When respondents were asked which platform they heard about the survey from, Reddit emerged as the most effective channel, but significant engagement also came from email newsletters and Mastodon, reflecting the diverse spaces where Python developers connect and share knowledge.
The respondent pool was predominantly composed of developers experienced with Python and typing. Nearly half reported over a decade of Python experience, and another third had between five and 10 years. While there was representation from newcomers, the majority of participants brought substantial expertise to their responses. Experience with type hints was similarly robust, with most respondents having used them for several years and only a small minority indicating no experience with typing.
The survey results reveal that Python’s type hinting system has become a core part of development for most engineers. An impressive 86% of respondents report that they “always” or “often” use type hints in their Python code, a figure that remains consistent with last year’s Typed Python survey.
For the first time this year the survey also asked participants to indicate how many years of experience they have with Python and with Python typing. We found that adoption of typing is similar across all experience levels, but there are some interesting nuances:

Overall, the data shows that type hints are widely embraced by the Python community, with strong support from engineers at all experience levels. However, we should note there may be some selection bias at play here, as it’s possible developers who are more familiar with types and use them more often are also more likely to be interested in taking a survey about it.
When asked what developers loved about the Python type system there were some mixed reactions, with a number of responses just stating, “nothing” (note this was an optional question). This indicates the presence of some strong negative opinions towards the type system among a minority of Python users. The majority of responses were positive, with the following themes emerging prominently:

In addition to assessing positive sentiment towards Python typing, we also asked respondents what challenges and pain points they face. With over 800 responses to the question, “What is the hardest part about using the Python type system?” the following themes were identified:
A little less than half of respondents had suggestions for what they thought was missing from the Python type system, the most commonly requested features being:
The developer tooling landscape for Python typing continues to evolve, with both established and emerging tools shaping how engineers work.
Mypy remains the most widely used type checker, with 58% of respondents reporting using it. While this represents a slight dip from 61% in last year’s survey, Mypy still holds a dominant position in the ecosystem. At the same time, new Rust-based type checkers like Pyrefly, Ty, and Zuban are quickly gaining traction, now used by over 20% of survey participants collectively.

When it comes to development environments, VS Code leads the pack as the most popular IDE among Python developers, followed by PyCharm and (Neo)vim/vim. The use of type checking tools within IDEs also mirrors the popularity of the IDEs themselves, with VS Code’s default (Pylance/Pyright) and PyCharm’s built-in support being the first and third most popular options, respectively.
When it comes to learning about Python typing and getting help, developers rely on a mix of official resources, community-driven content, and AI-powered tools, a similar learning landscape to what we saw in last year’s survey.

Official documentation remains the go-to resource for most developers. The majority of respondents reported learning about Python typing through the official docs, with 865 citing it as their primary source for learning and 891 turning to it for help. Python’s dedicated typing documentation and type checker-specific docs are also heavily used, showing that well-maintained, authoritative resources are still highly valued.
Blog posts have climbed in popularity, now ranking as the second most common way developers learn about typing, up from third place last year. Online tutorials, code reviews, and YouTube videos also play a significant role.
Community platforms are gaining traction as sources for updates and new features. Reddit, in particular, has become a key channel for discovering new developments in the type system, jumping from fifth to third place as a source for news. Email newsletters, podcasts, and Mastodon are also on the rise.
Large language models (LLMs) are now a notable part of the help-seeking landscape. Over 400 respondents reported using LLM chat tools, and nearly 300 use in-editor LLM suggestions when working with Python typing.
The 2025 Python Typing Survey highlights the Python community’s sustained adoption of typing features and tools to support their usage. It also points to clear opportunities for continued growth and improvement, including:
To learn more about Meta Open Source, visit our website, subscribe to our YouTube channel, or follow us on Facebook, Threads, X, Bluesky and LinkedIn.
This survey ran from 29th Aug to 16th Sept 2025 and received 1,241 responses in total.
Thanks to everyone who participated! The Python typing ecosystem continues to evolve, and your feedback helps shape its future.
Also, special thanks to the JetBrains PyCharm team for providing the graphics used in this piece.
The post DrP: Meta’s Root Cause Analysis Platform at Scale appeared first on Engineering at Meta.
DrP is a root cause analysis (RCA) platform designed by Meta to programmatically automate the investigation process, significantly reducing the mean time to resolve (MTTR) for incidents and alleviating on-call toil.
Today, DrP is used by over 300 teams at Meta, running 50,000 analyses daily, and has been effective in reducing MTTR by 20-80%.
By understanding DrP and its capabilities, we can unlock new possibilities for efficient incident resolution and improved system reliability.
DrP is an end-to-end platform that automates the investigation process for large-scale systems. It addresses the inefficiencies of manual investigations, which often rely on outdated playbooks and ad-hoc scripts. These traditional methods can lead to prolonged downtimes and increased on-call toil as engineers spend countless hours triaging and debugging incidents.
DrP offers a comprehensive solution by providing an expressive and flexible SDK to author investigation playbooks, known as analyzers. These analyzers are executed by a scalable backend system, which integrates seamlessly with mainstream workflows such as alerts and incident management tools. Additionally, DrP includes a post-processing system to automate actions based on investigation results, such as mitigation steps.

DrP’s key components include:

The process of creating automated playbooks, or analyzers, begins with the DrP SDK. Engineers enumerate the investigation steps, listing inputs and potential paths to isolate problem areas. The SDK provides APIs and libraries to codify these workflows, allowing engineers to capture all required input parameters and context in a type-safe manner.
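The DrP SDK itself isn’t shown in the post, but the idea — all inputs captured up front in a type-safe way, with investigation steps expressed as code — can be illustrated with a toy analyzer. Everything below (the `Analyzer` interface, the record names, the deploy-time data source) is a hypothetical sketch, not DrP’s real SDK:

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a typed analyzer; the Analyzer interface,
// record names, and deploy-time data source are assumptions, not DrP's SDK.
interface Analyzer<I, R> {
    R analyze(I input);
}

// Typed input: all required parameters and context are captured up front.
record LatencyInput(String service, long windowStartMs, long windowEndMs) {}

record Finding(String suspect, String evidence) {}

// One investigation step codified: did a deploy land inside the incident window?
class LatencyAnalyzer implements Analyzer<LatencyInput, List<Finding>> {
    private final Map<String, Long> deployTimes; // stand-in for a real data source

    LatencyAnalyzer(Map<String, Long> deployTimes) {
        this.deployTimes = deployTimes;
    }

    @Override
    public List<Finding> analyze(LatencyInput in) {
        Long t = deployTimes.get(in.service());
        if (t != null && t >= in.windowStartMs() && t <= in.windowEndMs()) {
            return List.of(new Finding("deploy",
                "deploy at t=" + t + " is inside the incident window"));
        }
        return List.of(); // nothing suspicious found by this step
    }
}
```

Because the input is a plain typed value, an analyzer like this can be invoked identically from a UI, a CLI, or an alert trigger.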
Once created, analyzers are tested and sent for code review. DrP offers automated backtesting integrated into code review tools, ensuring high-quality analyzers before deployment.
In production, analyzers integrate with tools like UI, CLI, alerts, and incident management systems. Analyzers can automatically trigger upon alert activation, providing immediate results to on-call engineers and improving response times. The DrP backend manages a queue for requests and a worker pool for secure execution, with results returning asynchronously.
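The backend shape described here — a queue of requests feeding a worker pool, with results returned asynchronously — can be sketched minimally with standard Java concurrency primitives. Class and method names are illustrative assumptions, not DrP’s actual backend:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Minimal model of a request queue plus worker pool with async results.
// This is a sketch of the general pattern, not DrP's implementation.
class AnalysisBackend {
    // The executor's internal queue holds pending requests; the fixed pool
    // bounds how many analyzers execute concurrently.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Submit an analyzer run; the caller gets the result asynchronously.
    CompletableFuture<String> submit(Supplier<String> analyzer) {
        return CompletableFuture.supplyAsync(analyzer, workers);
    }

    void shutdown() { workers.shutdown(); }
}
```

An alert firing would translate into one `submit` call, with the future’s completion feeding results back to the on-call engineer.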
DrP has demonstrated significant improvements in reducing MTTR across various teams and use cases. By automating manual investigations, DrP enables faster triage and mitigation of incidents, leading to quicker system recovery and improved availability.
The automation provided by DrP reduces the on-call effort during investigations, saving engineering hours and reducing on-call fatigue. By automating repetitive and time-consuming steps, DrP allows engineers to focus on more complex tasks, improving overall productivity.
DrP has been successfully deployed at scale at Meta, covering over 300 teams and 2000 analyzers, executing 50,000 automated analyses per day. Its integration into mainstream workflows, such as alerting systems, has facilitated widespread adoption and demonstrated its value in real-world scenarios.
Looking ahead, DrP aims to evolve into an AI-native platform, playing a central role in advancing Meta’s broader AI4Ops vision and enabling more powerful, automated investigations. This transformation will deliver more accurate and insightful analysis results, while streamlined ML algorithms, SDKs, UI, and integrations will make analyzers easier to author and run.
DrP: Meta’s Efficient Investigations Platform at Scale
We wish to thank the contributors to this effort across many teams throughout Meta.
Team – Eduardo Hernandez, Jimmy Wang, Akash Jothi, Kshitiz Bhattarai, Shreya Shah, Neeru Sharma, Alex He, Juan-Pablo E, Oswaldo R, Vamsi Kunchaparthi, Daniel An, Rakesh Vanga, Ankit Agarwal, Narayanan Sankaran, Vlad Tsvang, Khushbu Thakur, Srikanth Kamath, Chris Davis, Rohit JV, Ohad Yahalom, Bao Nguyen, Viraaj Navelkar, Arturo Lira, Nikolay Laptev, Sean Lee, Yulin Chen
Leadership – Sanjay Sundarajan, John Ehrhardt, Ruben Badaro, Nitin Gupta, Victoria Dudin, Benjamin Renard, Gautam Shanbhag, Barak Yagour, Aparna Ramani
The post DrP: Meta’s Root Cause Analysis Platform at Scale appeared first on Engineering at Meta.
Kenan and Emanuel, from Meta’s Wearables org, join Pascal Hartig on the Meta Tech Podcast to talk about the unique challenges of designing game-changing wearable technology, from its novel display technology to emerging UI patterns for display glasses.
You’ll also learn what particle physics and hardware design have in common and how to celebrate even the incremental wins in a fast-moving culture.
Download or listen to the episode below:
You can also find the episode wherever you get your podcasts, including:
The Meta Tech Podcast is brought to you by Meta and highlights the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.
Send us feedback on Instagram, Threads, or X.
And if you’re interested in learning more about career opportunities at Meta visit the Meta Careers page.
The post How We Built Meta Ray-Ban Display: From Zero to Polish appeared first on Engineering at Meta.
Some functions within operating systems, or provided by third parties, come with a risk of misuse that could compromise security. To mitigate this, we wrap or replace these functions with our own secure-by-default frameworks. These frameworks play an important role in helping our security and software engineers maintain and improve the security of our codebases while preserving developer speed.
But implementing these frameworks comes with practical challenges, like design tradeoffs. Building a secure framework on top of Android APIs, for example, requires a thoughtful balance between security, usability, and maintainability.
With the emergence of AI-driven tools and automation, we can scale the adoption of these frameworks across Meta’s large codebase. AI can assist in identifying insecure usage patterns, suggesting or automatically applying secure framework replacements, and continuously monitoring compliance. This not only accelerates migration but also ensures consistent security enforcement at scale.
Together, these strategies empower our development teams to ship well-secured software efficiently, safeguarding user data and trust while maintaining high developer productivity across Meta’s vast ecosystem.
Designing secure-by-default frameworks for use by a large number of developers shipping vastly different features across multiple apps is an interesting challenge. There are a lot of competing concerns such as discoverability, usability, maintainability, performance, and security benefits.
Practically speaking, developers only have a finite amount of time to code each day. The goal of our frameworks is to improve product security while being largely invisible and friction-free to avoid slowing developers down unnecessarily. This means that we have to correctly balance all those competing concerns discussed above. If we strike the wrong balance, some developers could avoid using our frameworks, which could reduce our ability to prevent security vulnerabilities.
For example, if we design a framework that improves product security in one area but introduces three new concepts and requires developers to provide five additional pieces of information per call site, some app developers may try to find a way around using them. Conversely, if we provide these same frameworks that are trivially easy to use, but they consume noticeable amounts of CPU and RAM, some app developers may, again, seek ways around using them, albeit for different reasons.
These examples might seem a bit obvious, but they are taken from real experiences over the last 10+ years developing ~15 secure-by-default frameworks targeting Android and iOS. Over that time, we’ve established some best practices for designing and implementing these new frameworks.
To the maximum extent possible, an effective framework should embody the following principles:
Now that we’ve looked at the design philosophy behind our frameworks, let’s look at one of our most widely used Android security frameworks, SecureLinkLauncher.
SecureLinkLauncher (SLL) is one of our widely-used secure frameworks. SLL is designed to prevent sensitive data from spilling through the Android intents system. It exemplifies our approach to secure-by-default frameworks by wrapping native Android intent launching methods with scope verification and security checks, preventing common vulnerabilities such as intent hijacking without sacrificing developer velocity or familiarity.
The system consists of intent senders and intent receivers. SLL is targeted to intent senders.
SLL offers a semantic API that closely mirrors the familiar Android Context API for launching intents, including methods like startActivity() and startActivityForResult(). Instead of invoking the potentially insecure Android API directly, such as context.startActivity(intent);, developers use SecureLinkLauncher with a similar method call pattern, for example, SecureLinkLauncher.launchInternalActivity(intent, context);. Internally, SecureLinkLauncher delegates to the stable Android startActivity() API, ensuring that all intent launches are securely verified and protected by the framework.
public void launchInternalActivity(Intent intent, Context context) {
  // Verify that the target activity is internal (same package)
  if (!isInternalActivity(intent, context)) {
    throw new SecurityException("Target activity is not internal");
  }
  // Delegate to Android's startActivity to launch the intent
  context.startActivity(intent);
}
Similarly, instead of calling context.startActivityForResult(intent, code); directly, developers use SecureLinkLauncher.launchInternalActivityForResult(intent, code, context);. SecureLinkLauncher (SLL) wraps Android’s startActivity() and related methods, enforcing scope verification before delegating to the native Android API. This approach provides security by default while preserving the familiar Android intent launching semantics.
One of the most common ways that data is spilled through intents is incorrect targeting of the intent. As an example, the following intent isn’t targeting a specific package. This means it can be received by any app with a matching <intent-filter>. While the developer’s intention might be that their Intent ends up in the Facebook app based on the URL, the reality is that any app, including a malicious application, could add an <intent-filter> that handles that URL and receive the intent.
Intent intent = new Intent(FBLinks.PREFIX + "profile");
intent.putExtra(SECRET_INFO, userId);
startActivity(intent);
// startActivity can't guarantee which app will receive this implicit intent
In the example below, SLL ensures that the intent is directed to one of the family apps, as specified by the developer’s scope for implicit intents. Without SLL, these intents can resolve to both family and non-family apps, potentially exposing SECRET_INFO to third-party or malicious apps on the user’s device. By enforcing this scope, SLL can prevent such information leaks.
SecureLinkLauncher.launchFamilyActivity(intent, context);
// launchFamilyActivity would make sure intent goes to the meta family apps
In a typical Android environment, two scopes – internal and external – might seem sufficient for handling intents within the same app and between different apps. However, Meta’s ecosystem is unique, comprising multiple apps such as Facebook, Instagram, Messenger, WhatsApp, and their variants (e.g., WhatsApp Business). The complexity of inter-process communication between these apps demands more nuanced control over intent scoping. To address this need, SLL provides a more fine-grained approach to intent scoping, with scopes tailored to specific use cases, such as same-app (internal) launches and launches between Meta family apps.
By leveraging these scopes, developers can ensure that sensitive data is shared securely and intentionally within the Meta ecosystem, while also protecting against unintended or malicious access. SLL’s fine-grained intent scoping capabilities, which are built upon the secure-by-default framework principles discussed above, empower developers to build more robust and secure applications that meet the unique demands of Meta’s complex ecosystem.
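To make the idea of fine-grained scoping concrete, here is a toy model of a scope check. The scope names (beyond the internal and family scopes shown above) and the package allow-list are assumptions for illustration; this is not SLL’s real implementation:

```java
import java.util.Set;

// Illustrative model of fine-grained intent scopes; this is a sketch of the
// concept, not SecureLinkLauncher's actual API or logic.
enum LaunchScope { INTERNAL, FAMILY, EXTERNAL }

class ScopePolicy {
    // Hypothetical allow-list of Meta family app packages.
    private static final Set<String> FAMILY = Set.of(
        "com.facebook.katana", "com.instagram.android", "com.whatsapp");

    static boolean isAllowed(LaunchScope scope, String callerPkg, String targetPkg) {
        return switch (scope) {
            case INTERNAL -> callerPkg.equals(targetPkg); // same app only
            case FAMILY   -> FAMILY.contains(targetPkg);  // any Meta family app
            case EXTERNAL -> true;                        // explicitly opted out
        };
    }
}
```

A launcher built on a check like this can refuse to send an intent whose resolved target falls outside the declared scope, which is the property SLL enforces.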
Adopting these frameworks in a large codebase is non-trivial. The main complexity is choosing the correct scope, as that choice relies on information that is not readily available at existing call sites. While one could imagine a deterministic analysis attempting to infer the scope based on dataflows, that would be a large undertaking. Furthermore, it would likely have some precision-scalability trade-off.
Instead, we explored using Generative AI for this case. AI can read the surrounding code and attempt to infer the scope based on variable names and comments surrounding the call site. While this approach isn’t always perfect, it doesn’t need to be. It just needs to provide good enough guesses, such that code owners can one-click accept suggested patches.
If the patches are correct in most cases, this is a big timesaver that enables efficient adoption of the framework. This complements our recent work on AutoPatchBench, a benchmark designed to evaluate AI-powered patch generators that leverage large language models (LLMs) to automatically recommend and apply security patches. Secure-by-default frameworks are a great example of the kinds of code modifications that an automatic patching system can apply to improve the security of a code base.
We’ve built a framework leveraging Llama as the core technology, which takes locations in the codebase that we want to migrate and suggests patches for code owners to accept:

The AI workflow starts with a call site we want to migrate including its file path and line number. The location is used to extract a code snippet from the code base. This means opening the file where the call site is present, copying 10-20 lines before and after the call site location, and pasting this into the prompt template that gives general instructions as to how to perform the migration. This description is very similar to what would be written as an onboarding guide to the framework for human engineers.
The prompt is then provided to a Llama model (llama4-maverick-17b-128e-instruct). The model is asked to output two things: the modified code snippet, where the call site has been migrated; and, optionally, some actions (like adding an import to the top of a file). The main purpose of actions is to work around a limitation of this approach: not all required code changes are local to the code snippet. Actions enable the model’s fix to reach outside the snippet for some limited, deterministic changes. This is useful for adding imports or dependencies, which are rarely local to the code snippet but are necessary for the code to compile. The code snippet is then inserted back into the code base and any actions are applied.
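The extraction step described above — open the file, copy a window of lines around the call site, and paste it into a prompt template of general migration instructions — can be sketched as follows. The window size and template text are illustrative assumptions, not the real pipeline:

```java
import java.util.List;

// Rough sketch of the snippet-extraction step: given a file's lines and a
// 1-indexed call-site line, take a fixed window of context around it and
// paste it into a prompt template.
class SnippetExtractor {
    static String extract(List<String> fileLines, int callSiteLine, int window) {
        int from = Math.max(0, callSiteLine - 1 - window);       // convert to 0-indexed
        int to = Math.min(fileLines.size(), callSiteLine + window);
        return String.join("\n", fileLines.subList(from, to));
    }

    static String buildPrompt(String instructions, String snippet) {
        // General migration instructions, followed by the code to migrate.
        return instructions + "\n\n---\n" + snippet + "\n---\n";
    }
}
```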
Finally, we perform a series of validations on the code base. We run each validation with and without the AI changes and report only the difference.
If any errors arise during validation, their error messages are included in the prompt (along with the “fixed” code snippet) and the AI is asked to try again. We repeat this loop up to five times and give up if no successful fix is produced. If validation succeeds, we submit a patch for human review.
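The fix-and-retry loop above can be sketched like this. The `Function` below stands in for the combined model call and build/lint validation, which are not public; all names are illustrative:

```java
import java.util.Optional;
import java.util.function.Function;

// Sketch of the fix-and-retry loop: ask the model for a patch, validate it,
// and feed validation errors back into the next prompt. Up to five attempts,
// then give up.
class RetryLoop {
    record Attempt(String prompt) {}
    record Result(String patch, String error) {
        boolean ok() { return error == null; }
    }

    static Optional<String> run(Function<Attempt, Result> modelAndValidate, String basePrompt) {
        String prompt = basePrompt;
        for (int i = 0; i < 5; i++) {
            Result r = modelAndValidate.apply(new Attempt(prompt));
            if (r.ok()) {
                return Optional.of(r.patch()); // validation passed: send for human review
            }
            // Include the failed patch and its error message in the next prompt.
            prompt = basePrompt + "\nPrevious attempt:\n" + r.patch()
                    + "\nError:\n" + r.error();
        }
        return Optional.empty(); // no successful fix after five tries
    }
}
```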
By adhering to core design principles such as providing an API that closely resembles existing OS patterns, relying solely on public and stable OS APIs, and designing frameworks that cover broad user bases rather than niche use cases, developers can create robust, secure-by-default features that integrate seamlessly into existing codebases.
These same design principles help us leverage AI to smoothly adopt frameworks at scale. While there are still challenges around the accuracy of generated code – for example, the AI choosing the incorrect scope or using incorrect syntax – the internal feedback loop allows the LLM to automatically move past easily solvable problems without human intervention, increasing scalability and reducing developer frustration.
Internally, this project helped prove that AI can be impactful for adopting security frameworks across a diverse codebase in a way that is minimally disruptive to our developers. There are now projects tackling similar problems across a range of codebases and languages – including C/C++ – using diverse models and validation techniques. We expect this trend to continue and accelerate in 2026 as developers become more comfortable with state-of-the-art AI tools and the quality of code they are capable of producing.
As our codebase grows and security threats become more sophisticated, the combination of thoughtful framework design and intelligent automation will be essential to protecting user data and maintaining trust at scale.
The post How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks appeared first on Engineering at Meta.