In just the past few years, large machine learning models have made incredible strides. Today’s models achieve impressive results across a range of applications, from software engineering and scientific research to content creation and data analysis. With the arrival of models like Kimi-K2.5 and GLM-5, the pace of progress shows no sign of slowing down. (Kimi-K2.5 has an impressive 1 trillion parameters, roughly one and a half times as many as the DeepSeek V3 model family that was released just last year.) And as these models continue to grow in size and capability, so does the demand for memory, computing power, and energy.
One of the most effective ways teams are addressing these constraints is low-bit inference, a set of widely adopted techniques that make AI models faster and cheaper to run by reducing how much memory and compute they need when serving real user requests. At Dropbox, products like Dropbox Dash rely on a variety of models to deliver fast, reliable, and cost-effective AI-powered search and understanding across vast amounts of user content. Making this possible, and making the technology accessible to individuals and businesses, requires careful attention to model efficiency, hardware utilization, and latency constraints.
In this article, we’ll dive into the current landscape of low-bit compute for efficient inference. We’ll cover the different types of quantization, why and when they’re needed, and the key optimization challenges required to deploy advanced AI models in production.
At Dropbox, almost all the models used in-house are attention-based architectures used for tasks like understanding text, images, videos, and audio—core capabilities behind Dash’s ability to search, summarize, and reason over large collections of user content. As these models grow in size and complexity, efficiently serving them in production becomes a central challenge for delivering responsive user experiences. In attention-based models, most of the compute comes from repeated matrix multiplications in two main parts of the model:
The first is the linear layers, the dense projections that appear throughout the model. These include the query, key, value, and output projections in each attention block, along with the feed-forward (MLP) layers that follow them.
The second is the attention mechanism itself, where the model evaluates relationships across the input to determine which information is most relevant, a step that significantly increases compute cost with longer context sizes.
On GPUs, these matrix multiplications are handled by specialized hardware. NVIDIA GPUs use Tensor Cores, while AMD GPUs use Matrix Cores. These dedicated processors are accessed through matrix multiply-accumulate (MMA) instructions and are designed specifically to accelerate matrix operations—the heavy-duty math that underpins large-scale linear algebra in neural networks—delivering substantial performance gains compared to executing the same work on general-purpose CUDA Cores.
One notable property of these cores is their scaling behavior. As numerical precision is reduced, these cores can perform more matrix operations per second, typically resulting in higher FLOPS (floating point operations per second, or how much math the hardware can do in a given time). In practice, halving the precision often allows these cores to roughly double throughput. This scaling behavior plays a key role in improving both performance and efficiency when running large-scale AI workloads.
Fig. 1: Tensor Core dense matrix multiplication performance (FLOPs) across different NVIDIA RTX 6000 variants and data precisions
Lowering numerical precision is accomplished through quantization, a technique that reduces the number of bits used to represent numerical values. By quantizing tensors, for example, from 16-bit to 8-bit or 4-bit, the memory footprint is reduced because each element requires fewer bits. This is typically done by rescaling the data to fit within a smaller representable range. For instance, 8-bit quantization maps values to 256 bins, restricting each tensor element to one of these discrete levels while approximating the original floating-point values. Quantization to lower than 8 bits typically requires an additional process called bitpacking, where multiple low-bit elements are combined into a native data type such as uint8 or int32, since 4-bit formats are not natively supported.
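To make this concrete, here is a minimal sketch of per-tensor symmetric 8-bit quantization and of bitpacking two 4-bit values into a uint8, written in NumPy. The function names and the single per-tensor scale are illustrative simplifications, not a production recipe.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric 8-bit quantization: rescale values into [-127, 127] integer bins."""
    scale = np.abs(x).max() / 127.0                        # fit the data into the int8 range
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale                                        # approximate x with q * scale

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Bitpack signed 4-bit values in [-8, 7] into uint8, two elements per byte."""
    nibbles = (q.astype(np.int8) & 0x0F).astype(np.uint8)  # keep the low 4 bits (two's complement)
    return nibbles[0::2] | (nibbles[1::2] << 4)

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
print("max abs error:", np.abs(q.astype(np.float32) * scale - x).max())
```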
Lowering precision not only improves speed and memory usage, but also improves energy efficiency, since lower-bit data requires less power for both memory transfer and computation. For instance, with FP4 support, Blackwell offers significant energy savings compared to the H100.
There have also been attempts to explore lower bits such as binary and ternary weights (restricting weights to two or three discrete levels), which would offer even more theoretical energy efficiency. However, this form of quantization isn’t well suited for modern GPUs because it can’t fully leverage Tensor/Matrix Cores. Although there have been experimental efforts to explore custom hardware or specialized accelerators tailored to such schemes, this approach hasn’t yet seen broad industry adoption as a result of limited ecosystem support and model quality concerns. In short, while lower precision can dramatically improve efficiency, real-world gains depend on how well those formats are supported by existing hardware and software ecosystems.
In the following section, we examine different quantization configurations and highlight their key trade-offs when deployed on modern GPUs.
Quantization is not a single technique, but a family of approaches that differ in how numerical values are represented, scaled, and executed on hardware. These design choices directly affect model accuracy, performance, and how efficiently modern GPUs can accelerate inference. As a result, quantization formats are closely tied to the capabilities and constraints of the underlying hardware.
In practice, these differences matter because Dropbox runs a diverse set of AI workloads—such as multimedia understanding—across multiple generations of hardware, each with distinct performance characteristics. Some workloads are highly latency sensitive, prioritizing fast per-request execution, while others are throughput oriented and optimized for processing large volumes of data efficiently. Quantization formats influence how well a model can adapt to these constraints, determining whether computation is bound by software overhead, memory bandwidth, or specialized hardware units like Tensor Cores. Framing quantization through this lens helps clarify why different formats unlock different tradeoffs across our stack, and why no single approach is optimal for every workload we deploy.
With the introduction of the MXFP microscaling format, which standardizes low-bit data types with native hardware support, quantization methods for large language models can be broadly grouped into two categories: pre-MXFP formats, which rely on explicit dequantization and software-managed scaling, and MXFP formats, which move these operations directly into Tensor Core hardware. The sections below walk through both approaches, highlighting how they differ in practice and why those differences matter for real-world inference workloads.
Prior to the introduction of MXFP, quantization primarily relied on integer data types for sub-byte formats. Common configurations included A16W4 (16-bit activations, 4-bit weights) for weight-only quantization, and either integer or floating-point 8-bit formats for both activations and weights, such as A8W8 (8-bit activations, 8-bit weights). While 8-bit quantization can usually be applied directly with simple scaling, sub-byte weight quantization generally requires calibration or more advanced algorithms to maintain model quality. For example, A16W4 relies on techniques such as AWQ or HQQ—quantization methods designed to preserve model quality at low bit widths—while lower-bit formats like A16W3, A16W2, and BitNet require increasingly sophisticated quantization or quantization-aware training methods to achieve acceptable accuracy.
When activations and weights use different data types, the typical approach is to explicitly dequantize the lower-bit tensors to match the higher-precision format before performing the matrix multiply-accumulate (MMA) operation. This strategy can improve performance in memory-bound scenarios, where reducing data movement is the primary concern. However, in compute-bound workloads, the additional dequantization step can offset these gains and even slow execution due to the extra arithmetic involved.
This trade-off is especially visible in weight-only quantization, which reduces data transfer but does not accelerate the extra computation required to run matrix multiplications. The choice between activation quantization (such as A8W8) and weight-only quantization (such as A16W4) ultimately depends on the characteristics of the inference workload. Weight-only quantization often performs better in local deployments with smaller batch sizes and reasoning-heavy tasks, where memory bandwidth is a limiting factor. In contrast, activation quantization tends to be more effective for large-context prefills and high-throughput serving scenarios, where compute becomes the dominant bottleneck.
Fig. 2: A8W8 vs. A16W4 decoding performance across various batch sizes. A8W8 tends to outperform A16W4 in more compute-bound scenarios. A16W4 tends to perform worse than 16-bit matrix multiplication due to the additional cost of explicit dequantization
Popular methods such as AWQ and HQQ rely on linear quantization with grouping, a design that balances efficiency with accuracy. In symmetric linear quantization, dequantization is expressed as a simple scaling operation. A more flexible variant, asymmetric linear quantization, introduces an additional offset, allowing dequantization to be implemented as a fused multiply-add operation that maps efficiently to modern GPU hardware.
Grouping further improves accuracy by assigning shared parameters to small blocks of tensor elements rather than individual values. These groups typically consist of contiguous elements of size 32, 64, or 128. While simple, this approach substantially reduces quantization error at low-bit widths and has become a core component of most practical low-bit quantization schemes.
Fig. 3: Linear quantization overview where a matrix W is decomposed into Wq (low-bit tensor) and additional floating-point scales (s) and zero-points (z)
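As a rough illustration of the scheme in Fig. 3, here is a hedged sketch of asymmetric linear quantization with grouping in PyTorch. The 4-bit width and group size of 64 are illustrative choices; real methods such as AWQ or HQQ add calibration and bitpacking on top of this basic recipe.

```python
import torch

def quantize_grouped(w: torch.Tensor, nbits: int = 4, group_size: int = 64):
    """Asymmetric linear quantization with grouping: W ~ s * (Wq - z), per group."""
    qmax = 2**nbits - 1
    wg = w.reshape(-1, group_size)                               # contiguous groups
    lo = wg.min(dim=1, keepdim=True).values
    hi = wg.max(dim=1, keepdim=True).values
    s = (hi - lo).clamp(min=1e-8) / qmax                         # one scale per group
    z = torch.round(-lo / s)                                     # one zero-point per group
    wq = torch.clamp(torch.round(wg / s + z), 0, qmax)
    return wq, s, z

def dequantize_grouped(wq, s, z, shape):
    """Dequantization is a fused multiply-add: s * wq + (-s * z)."""
    return (s * (wq - z)).reshape(shape)

w = torch.randn(4096, 4096)
wq, s, z = quantize_grouped(w)
err = (dequantize_grouped(wq, s, z, w.shape) - w).abs().mean()
print(f"mean abs error: {err:.4f}")
```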
On the activation side, two 8-bit approaches are commonly used: channel-wise quantization and per-block quantization. Channel-wise quantization is straightforward and efficient, making it well suited for on-the-fly inference. The required rescaling can be applied directly after matrix multiplication, allowing for a highly efficient implementation on modern GPUs.
Per-block quantization, popularized by systems such as JetFire and DeepSeek V3, takes a more fine-grained approach. By dividing tensors into small tiles and assigning an independent scale to each block, this method limits the impact of outliers and reduces quantization error. It is particularly effective in quantization-aware training, where preserving pre-training accuracy is critical, while still delivering practical Tensor Core speedups.
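The difference between the two activation schemes comes down to how the scales are laid out. The sketch below, a simplification assuming int8 values and square tiles (the 128×128 tile size is an illustrative choice), shows one scale per row for channel-wise quantization versus one scale per tile for per-block quantization.

```python
import torch

def quantize_channelwise_int8(x: torch.Tensor):
    """Channel-wise: one scale per row; rescaling can be folded in after the matmul."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_per_block_int8(x: torch.Tensor, block: int = 128):
    """Per-block: one scale per (block x block) tile, limiting the reach of outliers."""
    r, c = x.shape
    tiles = x.reshape(r // block, block, c // block, block)
    scale = tiles.abs().amax(dim=(1, 3), keepdim=True) / 127.0
    q = torch.clamp(torch.round(tiles / scale), -127, 127).to(torch.int8)
    return q.reshape(r, c), scale.squeeze(-1).squeeze(1)
```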
Beyond linear quantization, several non-linear approaches, including QuIP# and GPTVQ, have explored alternative representations to push precision even lower. While these methods can achieve higher accuracy at very low bit widths, they face practical challenges. Linear 4-bit quantization already delivers strong accuracy and can often be applied on the fly using techniques such as HQQ, avoiding expensive offline quantization passes. In addition, deploying non-linear formats efficiently requires custom fused kernels and deep integration into inference frameworks. Even then, low-bit weights must still be converted into a form compatible with Tensor Cores, making linear quantization both simpler and more practical on current GPU architectures.
Quantization techniques are also well-suited for optimizing the attention module. Methods such as Flash Attention 3 and Sage Attention use 8-bit quantization to accelerate attention-related matrix multiplications, improving throughput and memory efficiency with minimal impact on model accuracy.
The MXFP microscaling format introduces a new standard for low-bit data types that fundamentally changes how quantized models run on modern GPUs. Unlike earlier formats, MXFP provides native hardware support for quantization, allowing Tensor Cores to operate directly on quantized activations, weights, and their associated scaling factors in a single fused operation. In contrast, pre-MXFP approaches required explicit dequantization steps before or after matrix-multiply-accumulate (MMA) operations, adding overhead and limiting achievable performance.
MXFP quantizes both activations and weights using a micro-scaling approach, similar in spirit to methods like AWQ and HQQ discussed earlier, but implemented directly in hardware. It uses symmetric quantization with a fixed block size of 32 and applies shared scaling factors stored in the E8M0 format. MXFP also supports mixed-precision MMA operations on some hardware, such as MXFP8 × MXFP4, giving practitioners flexibility to balance performance and accuracy. For example, activations can use MXFP8, MXFP6, or MXFP4 while the weights can remain in MXFP4. A breakdown of the MX types is demonstrated in the table below (source: Open Compute Project, OCP Microscaling Formats (MX) Specification, Version 1.0, Table 1).
Fig. 4: MX dtype breakdown
The E8M0 format for the scales represents positive powers of two in the range [2⁻¹²⁷, 2¹²⁷]. The scales are typically quantized as follows: scale = weight.amax(axis=1, keepdim=True) / max_val. As a result, scale values are effectively limited to values at or below 1, and extremely small magnitudes are rarely needed. In many cases, values as small as 2⁻¹⁵ are sufficient to capture near-zero weights. This observation suggests that scales could theoretically be represented with fewer bits than E8M0, although doing so would introduce additional complexity.
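The snippet below sketches how an E8M0 shared scale can be derived for an MXFP4 block, assuming E2M1 elements with a maximum representable magnitude of 6. Rounding the exponent up keeps the scaled values in range; actual hardware and library implementations may differ in the details.

```python
import torch

FP4_MAX = 6.0  # largest magnitude representable by an E2M1 (MXFP4) element

def e8m0_block_scales(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """One power-of-two (E8M0) scale per block of 32 elements."""
    amax = w.reshape(-1, block).abs().amax(dim=1, keepdim=True)
    ideal = (amax / FP4_MAX).clamp(min=2.0**-127)   # ideal floating-point scale
    exponent = torch.ceil(torch.log2(ideal))        # round up to the next power of two
    return 2.0 ** exponent                          # typically at or below 1 for weights

w = torch.randn(4096, 4096)
scales = e8m0_block_scales(w)
print(scales.min(), scales.max())
```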
While E8M0 offers hardware-friendly implementation and flexibility, constraining scale values strictly to powers of two leads to a noticeable accuracy drop when using MXFP4. Fortunately, this loss can largely be mitigated through simple post-training adjustments, restoring most of the original model quality, as we demonstrated in our blog post.
To address remaining numerical limitations, NVIDIA introduced NVFP4 as an alternative to MXFP4. NVFP4 uses a smaller group size of 16 rather than 32 and employs E4M3 FP8 scaling factors, providing higher precision for scale representation. Because FP8 has a relatively large minimum representable value, a global per-tensor floating-point multiplier is applied to normalize the scaling range, achieving improved numerical stability.
Although MXFP4 and NVFP4 are standardized formats, their implementation depends on the GPU architecture. Different compute capabilities rely on different Tensor Core instructions. For example, sm_100 architectures use the tcgen05.mma instruction, while sm_120 architectures use mma.sync, both incorporating the block_scale modifier. As a result, kernels compiled for sm_100 are not portable to sm_120 due to these instruction-level differences. While most of the mainstream AI software stack remains focused on server-grade GPUs like the B200 and B300, there has been significant recent progress toward improving portability of low-bit workloads. Notably, Triton has introduced support for MXFP on sm_120 devices, enabling greater flexibility and cross-device compatibility for low-bit Triton kernels.
In this article, we explored several quantization techniques that are widely adopted across the industry to accelerate AI workloads. These approaches unlock substantial gains in efficiency and throughput, making it possible to deploy increasingly large and capable models within practical hardware, cost, and energy constraints.
At Dropbox, these considerations are central to how we build and operate products like Dash. Dash relies on large-scale models for experiences such as conversational AI, multimodal search, document understanding, and speech processing, all of which must meet strict latency, reliability, and cost requirements. To satisfy these constraints in production, we already employ a range of quantization strategies to optimize model deployment and fully utilize modern accelerators. The techniques discussed here reflect the kinds of trade-offs we evaluate when deciding how and where to run models across our infrastructure.
Despite the progress, important limitations remain. In real-world deployments, adoption of formats such as MXFP and NVFP is still evolving, and support for FP4 quantization remains incomplete across popular frameworks and model stacks. For example, many open-source runtimes don’t yet provide full support across different GPU architectures, and FP4 models are not yet widely available.
As hardware continues to evolve and the industry pushes toward lower-bit compute, these challenges will only become more pronounced. In our view, making low-bit inference viable for production systems like Dash will require tighter software design, more mature framework support, and new quantization techniques that preserve model quality at scale. We view this as an active area of exploration, one that will directly shape how we deliver fast, reliable, and efficient AI-powered experiences to Dropbox users in the years ahead.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
Improving engineering productivity is crucial to the work we do at Dropbox. The more quickly we can deliver high-quality features to our customers, the more value they can get from our products. This rapid iteration has been key to developing tools like Dropbox Dash, context-aware AI that connects to all your work apps, so you can search, ask questions about, and organize all your content.
In the process of building Dash, we’ve become big adopters of AI tools in our own work, from Claude Code to Cursor. The early results have been promising, but there are still a lot of open questions about how to work with these tools most effectively and where they can have the most impact. To push this conversation forward, Dropbox CTO Ali Dasdan hosted an executive roundtable on December 11, 2025, at our San Francisco studio. We brought together a small group of technology leaders from top companies for an afternoon of open discussion, idea-sharing, and a deep dive into the evolving world of engineering productivity and AI. Here’s how it went.
Adopting AI tooling for the sake of AI is meaningless; it must be tied to tangible business results. As we navigate this shift, we’ve had to ask ourselves: Which approach is the right one? What existing processes need to be upgraded in light of AI workflows? To kick off the event—and show attendees how we’ve been thinking through these questions at Dropbox—Uma Namasivayam, Senior Director of Engineering Productivity, took a closer look at our own experimentation, adoption, and enablement cycle to accelerate engineering productivity with AI.
We started by working with Dropbox leadership to gain buy-in and establish the importance of AI tooling, together making AI adoption a company-level priority. This turned AI from a grassroots experiment into an urgent organizational focus and helped everyone get aligned. Teams were empowered to experiment with tooling, and we reduced the overhead associated with getting contracts approved to pilot new tools at Dropbox.
In our experimentation, Dropbox saw impact across the entire software development life cycle, from code review and documentation to debugging and testing. Like other large organizations, Dropbox has its own unique challenges. Off-the-shelf AI tools don’t always fit our scale constraints—we have a very large, multi-language monorepo—so we’ve had to be deliberate about where to adopt, where to extend, and where to build our own capabilities. For example, Dropbox built our own AI tooling that listens for failed builds on pull requests and uses our AI platform to propose fixes.
As a result of our efforts, most Dropbox developers are now using at least one AI tool in their workflows. We track pull request (PR) throughput per engineer, per month as a core metric, and engineers who engage more with AI coding tools have an outsized impact on the code we ship, as measured by that throughput.
We also closely monitor the sentiment of engineers internally regarding AI tooling. As strong positive sentiment increases, we’re seeing the share of negative sentiment go down.
Most importantly, developers feel less friction using AI to accelerate their work because we’ve made it easier to adopt tooling according to what they feel works best for their team.
The heart of the evening was a roundtable discussion designed to cross-pollinate ideas across different industries. To facilitate this, we divided attendees into three cohorts, rotating the groups for each question so that every leader could learn from three different peer groups.
The discussion centered around three core pillars:
Following the structured session, the conversation continued over a cocktail hour, where leaders shared further insights into the commitment to craft required to lead in the age of AI.
The overarching themes that emerged from the roundtable discussions centered around the following:
Still, there are a number of open questions, such as: If AI is giving us more capacity, where is that capacity actually going? For Dropbox, this capacity is currently being channeled into areas like addressing tech debt, executing migrations, and improving reliability.
However, a key challenge remains in effectively connecting these productivity gains to tangible business outcomes—a challenge also voiced by many attendees during the roundtable. Therefore, the focus for 2026 will be on mapping productivity directly to specific outcomes, extending operational rigor beyond engineering teams, and ultimately driving end-to-end product velocity.
A huge thank you to everyone who made the trip to our San Francisco studio and contributed to such a memorable event. If you missed out this time, keep an eye on our events page for future opportunities to connect!
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
I was recently a guest speaker in Jason Liu’s online course on RAG offered by the education platform Maven. I did some mini deep-dives into what we’ve been doing at Dropbox with knowledge graphs; how we’re thinking about indexes, MCP, and tool calling in general; some of the work we do with LLM as a judge; and how we use prompt optimizers like DSPy. This is an edited and condensed version of my talk. Visit Maven to watch the full video and hear my Q&A with Jason and his students. — Josh Clemm, vice president of engineering for Dropbox Dash
~ ~ ~
I don't know about you, but I probably have about 50 tabs open right now—and at least another 50 accounts for other SaaS apps. It’s completely overwhelming. It means your content is all over the place, and that makes it very, very hard to find what you're looking for. The good news is we have all these amazing LLMs coming out every day that can tell you about quantum physics. But the bad news is they don’t have access to your content. All of your work content is proprietary. It's within your walled garden. It means most LLMs can’t help when it comes to your work.
That’s why we’ve been building Dropbox Dash. It doesn't just look at your Dropbox content. It connects to all your third-party apps and brings your content into one place, so you can search, get answers, and do the agentic queries that you want to do at work.
Here’s a brief primer on our tech stack and how Dash works.
First, we have our connectors. This is where we're building custom crawlers and connecting to all these different third-party apps. It’s not easy. Everything has its own rate limit, each has its own unique API quirks, each has its own ACL and permission system, etc. But getting that right is essential and getting all that content in one place is the goal.
Next, we're doing a lot of content understanding—and in certain cases, enriching the content itself. First, we normalize the different files that come in and get them into a format like markdown. Then we extract key information: we look at titles and metadata, try to extract links, and generate different embeddings.
For documents, this is fairly straightforward. Just grab the text, extract it, throw it in the index, and you're done. Images require media understanding. CLIP-based models are a good start, but complex images need true multimodal understanding. Then you get to PDFs, which might have text and figures and more. Audio clips need to be transcribed. And then finally you get to videos. What if a client has a video like a famous scene from Jurassic Park? How would you find this later? There's no dialogue, so you can't really rely on pure transcription. This is where you would need to use a multimodal model, extract certain scenes, generate understanding for each one, and then store that.
After we understand the incoming content, we take it a step further to model all these pieces of information together as a graph. Meetings may have associated documents, associated people, transcripts, or prior notes. Building that cross-app intelligence is essential to providing better context for our users. This is where we're going to start to do the knowledge graph bundle that I'll talk more about later in depth.
From there, all that information (embeddings, chunks, contextual graph representations) flows into our highly secure data stores. Today we use both a lexical index—using BM25—and then store everything as dense vectors in a vector store. While this allows us to do hybrid retrieval, we found BM25 was very effective on its own with some relevant signals. It’s an amazing workhorse for building out an index.
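To make the hybrid retrieval idea concrete, here is a toy sketch that blends BM25 scores (using the open source rank_bm25 package) with dense cosine similarity. The documents are toy examples, the embed function is a stand-in for a real embedding model, and the alpha mixing weight is an arbitrary illustrative choice.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # assumption: the open source rank_bm25 package

docs = [
    "quarterly planning notes for the identity project",
    "jurassic park clip shared in the design channel",
    "retrieval platform architecture overview",
]
tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice this would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

doc_vecs = np.stack([embed(d) for d in docs])

def hybrid_search(query: str, alpha: float = 0.5):
    """Blend lexical (BM25) and semantic (cosine) scores; alpha weights the mix."""
    lexical = np.array(bm25.get_scores(query.split()))
    lexical = lexical / (lexical.max() + 1e-9)            # normalize to [0, 1]
    semantic = doc_vecs @ embed(query)
    scores = alpha * lexical + (1 - alpha) * semantic
    return sorted(zip(docs, scores), key=lambda t: -t[1])

print(hybrid_search("identity project status"))
```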
Finally, we apply multiple ranking passes on any retrieved results so they are personalized and ACL’d to you.
Altogether, this is what we call our context engine. And once you have that, you can introduce APIs on top of it and build entire products like Dash.
Okay, but why build an index? Why did we even go down this route in the first place? Well, there's a bit of a choose-your-fighter kind of mentality in the world right now between federated retrieval and indexed retrieval. The difference is very classic software engineering. Are you going to process everything on the fly? That’s federated retrieval. Or are you going to try to pre-process it all at ingestion time? That's index-based retrieval. And there are pros and cons to each approach.
Federated retrieval is very easy to get up and running. You don't have to worry about storage costs. The data is mostly fresh. You can keep adding more MCP servers and new connectors. But there are some big-time weaknesses here. You're at the mercy of all these different APIs or MCP servers which are going to differ in speed, quality, and ranking. You’re also limited in what you can access. You can access your information, but you probably don’t have access to company-wide connectors—meaning you can’t access content that’s shared across the whole company. And you have to do a lot of work on-the-fly in the post-processing. Once the data comes back, you have to merge information and potentially do re-ranking. And if you're using a lot of chatbots today with MCP, you're going to see that token count go up and up. It takes a lot of tokens to reason over this amount of information.
On the flip side, with index-based retrieval, you do now have access to those company connectors. And because you have time on your side, you can pre-process that content and create these really interesting enriched data sets that don't exist on their own. You can also do a lot more offline ranking experiments. You can try different methods to improve your recall, and it’s very, very fast. But it's also a ton of work—and a lot of custom work. This is not for the faint of heart. You have to write a lot of custom connectors. As for ingestion time, you're going to have freshness issues if you're not good with understanding rate limits. It can also be extremely expensive to host this information, and then you have to decide how to store it. Am I using a vector database, like classic RAG from many years ago? Am I going the BM25 route? Do I want to do hybrid? Do I want to do a full graph RAG, which is what we ended up going with? There are a lot of decisions you have to make.
Now what about MCP? There was a lot of hype when MCP burst onto the scene about a year ago. Everybody was talking about it: “You don't need any of these APIs anymore, you just add MCP to your agent.” Sounds great, right? But there are some major challenges with how MCP is typically implemented.
MCP tool definitions, in particular, take up valuable real estate in your context window. We’re noticing quite a bit of degradation in the effectiveness of our chat and agents (very classic context rot). So with Dash, we're trying to cap things to about 100,000 tokens, but those tool definitions fill up quickly. Tool results are also significant, especially if you're doing retrieval: you get a lot of content back, and you immediately fill up that context window. It's going to be very problematic. It’s also incredibly slow. If you’re using MCP with some agents today, even a simple query can take up to 45 seconds—whereas with the raw index, you're getting all the content back very quickly, within seconds.
Here are some of the ways we’ve solved for that:
The next question that comes up a lot: are knowledge graphs worth it? Well, let’s look at how a knowledge graph works.
You start by modeling these different relationships across these various apps. For example, say you've got a calendar invite. It might have attachments, meeting minutes, a transcript. Of course, it also has all the attendees, and maybe there's even a Jira project associated. Every app that we connect with has its own concept or definition of people, and so coming up with a canonical ID for who someone is is very, very impactful for us overall. Being able to model something like that is incredibly powerful. You can go view somebody's profile on Dash today, but it also helps a ton in relevance and retrieval.
Say that I want to find all the past context engineering talks from Jason. But who is Jason? How do you know that? Well, if you have this graph—this people model—you can fetch that and add it to the context, without having to do a ton of different retrievals. Fantastic. We use normalized discounted cumulative gain (NDCG) a lot to score the results we retrieve, and just by adding this people-based context we saw some really nice wins.
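For reference, NDCG compares the discounted gain of the ranking you actually returned against the gain of the ideal ordering. A minimal sketch, with made-up relevance labels:

```python
import numpy as np

def dcg(relevances, k: int) -> float:
    """Discounted cumulative gain over the top-k results."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2**rel - 1) / discounts))

def ndcg(relevances, k: int = 10) -> float:
    """NDCG: DCG of the ranked list divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Judged relevance (0-3) of results in the order the ranker returned them
print(ndcg([3, 2, 0, 1, 2], k=5))   # ~0.97
```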
The architecture itself is complicated. I won't go too deep here, but it's important to realize we're not just storing a one-to-one mapping of source doc to end doc. We do want to derive and create more unique characteristics. And the other key insight here is we're not storing these graphs in a graph database. We did experiment with that, but the latency and query pattern were a challenge, and figuring out hybrid retrieval on top of it was a challenge too. So we ended up building these graphs in a more unique way. We stage them more asynchronously, build out these relationships, and then create these knowledge bundles. So again, it's not necessarily a graph, but think of it almost like an embedding—like a summary of that graph. It becomes these little contexts that contain all this information. And with that context, we just send it through the exact same index pipeline that we have for all the other content, so things get chunked and embeddings get generated for both lexical and semantic retrieval.
Alright, we've indexed all this content. We've got content understanding. We've done a ton of work on trying to model these relationships. But did the retrieval quality actually improve? How do we know?
Take Google search, for example. You have your 10 blue links, and the audience for those results are humans. If your results are high quality, the humans will tell you by clicking. You can quickly get some amazing signal this way. The model is either working or it isn’t.
In the world of chat, you're still retrieving results, but it's not for the human. It's for the large language model. So you no longer have those humans to help you out. What do you do? That's where you want to use an LLM as a judge. Broadly speaking, you're trying to judge how relevant a piece of information is on a scale of, say, one to five, and then use that score to improve over time.
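In practice, this looks something like the sketch below, where call_llm is a placeholder for whatever model client you use and the rubric wording is purely illustrative.

```python
import json

JUDGE_PROMPT = """You are judging search relevance.
Query: {query}
Document: {document}
Rate how relevant the document is to the query on a 1-5 scale,
then explain your reasoning. Respond as JSON: {{"score": <1-5>, "reason": "..."}}"""

def judge_relevance(query: str, document: str, call_llm) -> dict:
    """Score a (query, document) pair with an LLM judge; call_llm is any text-in/text-out client."""
    raw = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    return json.loads(raw)  # assumes the model returns valid JSON

def disagreement_rate(pairs, human_scores, call_llm) -> float:
    """Fraction of judged scores that differ from human labels; drives prompt iteration."""
    judged = [judge_relevance(q, d, call_llm)["score"] for q, d in pairs]
    return sum(j != h for j, h in zip(judged, human_scores)) / len(human_scores)
```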
Humans can still help here. Sometimes they give you a thumbs up or thumbs down on the quality of your results. You can also bring in human evaluators to help you. When we started these experiments, we asked ourselves: How closely can we get our judge to match what a human would do? So we had a bunch of our engineers label a ton of documents to see how much disagreement there was between the human and the LLM judge. The first prompt for our judge wasn’t bad—an 8% disagreement rate—but the lower, the better.
Next, we continued to refine the prompt. You know, classic prompt tuning like “provide explanations for what you're doing.” And sure enough, disagreements went down. Then, we just upgraded the model itself to OpenAI’s o3. It's a reasoning model, far more powerful, and guess what? Disagreements with the humans went down further.
Finally, a big problem with using an LLM as a judge in a work context is that it doesn't know things like acronyms. If I were to say, “What is RAG?”—and hopefully it knows what RAG is—what if it hasn’t been trained on that? Sometimes, the judge needs to go get that context. And so, this is a little tongue-in-cheek, but we call this RAG as a judge. It can't just be using pre-computed information. Sometimes it has to go fetch some context itself. And with that, we dropped disagreements even further.
There's a growing community around prompt optimizers, and one of the technologies in particular we've been using is DSPy. It helps optimize your prompts, tuning them against a set of evals to get the most accurate results. By bringing in DSPy, we got even better results overall.
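Here's a rough sketch of what an LLM-as-a-judge module might look like in DSPy. The model name, optimizer choice, and field definitions are assumptions, and the exact APIs vary across DSPy versions.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumed model; swap in your own

class JudgeRelevance(dspy.Signature):
    """Rate how relevant a retrieved document is to the query on a 1-5 scale."""
    query: str = dspy.InputField()
    document: str = dspy.InputField()
    score: int = dspy.OutputField(desc="1 (irrelevant) to 5 (perfectly relevant)")

judge = dspy.ChainOfThought(JudgeRelevance)

# Human-labeled (query, document, score) examples anchor the optimization
trainset = [
    dspy.Example(query="what is RAG",
                 document="Retrieval-augmented generation combines search with an LLM...",
                 score=5).with_inputs("query", "document"),
]

def agrees_with_human(example, prediction, trace=None):
    return int(prediction.score) == int(example.score)

# BootstrapFewShot is one of the simpler DSPy optimizers; others work similarly
optimizer = BootstrapFewShot(metric=agrees_with_human)
optimized_judge = optimizer.compile(judge, trainset=trainset)
```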
It might be impossible to get to zero disagreements. Even humans—multiple humans—will disagree on the relevance set. But we're going to keep grinding on this. And even if we can't get to zero, we're actually quite pleased with some of the results we're getting with DSPy.
One thing to note: We saw some really interesting emergent behavior with DSPy. Instead of simply having it tell us what the improvements could be, we noticed we could turn the different disagreements into bullet points and then have DSPy try to optimize the bullets themselves. So if there were multiple disagreements, it would try to reduce those disagreements overall, and we started to create this really nice flywheel and ended up getting some nice results.
There are some other benefits of DSPy. The first, obviously, is prompt optimization. It helped us quite a bit in our LLM-as-a-judge work. Again, that's a prime place to think about DSPy right now, because LLM-as-a-judge setups have crystal-clear rubrics and evals. You know exactly what the outcome should be; you just need to have the ultimate prompt, and it’s really good for that. We're going to start to experiment with DSPy across our entire stack. We have over 30 different prompts today throughout the Dash stack, whether that's in the ingest path, LLM-as-a-judge, offline evals, or our online agentic platform.
The next one is prompt management at scale. I mentioned we've got about 30 prompts overall, and at any given time we might have 5 to 15 different engineers tweaking these prompts and trying to get more improvements. And it's a little silly if you think about it. You've got this text string that you've checked in to your code repository; but then there's an edge case, this chat session didn't work. So you go in and fix it, but then something else breaks, and it becomes a bit of a whack-a-mole. And so it's very powerful to just define things in a more of a programmatic way and let these tools spit out the actual prompt themselves. It just works better at scale.
And the last really great benefit we like is around model switching. So, every model out there is a bit unique. They have their own quirks, and there's always different ways to prompt them. And anytime you bring in a new model, you have to spend a bunch of time optimizing the prompt again. But with DSPy, you just plug the model in, define your goals, and out spits the prompt that works. So you can do this model switching far more rapidly—and this is really beneficial for modern agentic systems, because you just don't have one giant LLM. You're going to have a planning LLM, you're going to have all these smaller sub-agents, and those sub-agents might be very narrowly focused. You probably want to pick a model that's highly tuned to that particular task, so having something like a prompt optimizer is really powerful.
To wrap things up, here are some key takeaways:
My final, overall takeaway is the classic software engineering concept of: make it work, then make it better. A lot of the techniques and things I've described here are things that we've been doing over the last few years with a big engineering team working on this day-in and day-out. If you're just getting started, absolutely invest in those MCP tools and everything on the real-time side. And then, over time, as you start to see what your customers are doing and you start to get some more scale, look for opportunities to optimize overall.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
Dropbox Dash uses AI to understand questions about your files, work chats, and company content, bringing everything together in one place for deeper, more focused work. With tens of thousands of potential work documents to consider, both search and agents rely on a ranking system powered by real-time machine learning to find the right files fast. At the core of that ranking in Dash is our feature store, a system that manages and delivers the data signals (“features”) our models use to predict relevance.
To help users find exactly what they need, Dash has to read between the lines of user behavior across file types, company content, and the messy, fragmented realities of collaboration. Then it has to surface the most relevant documents, images, and conversations when and how they’re needed. The feature store is a critical part of how we rank and retrieve the right context across your work. It’s built to serve features quickly, keep pace as user behavior changes, and let engineers move fast from idea to production. (For more on how feature stores connect to context engineering in Dash, check out our deep dive on context engineering right here.)
In this post, we’ll walk through how we built the feature store behind Dash’s ranking system, why off-the-shelf solutions didn’t fit, how we designed for speed and scale, and what it takes to keep features fresh as user behavior changes. Along the way, we’ll share the tradeoffs we made and the lessons that shaped our approach.
Building a feature store for Dash wasn’t just a matter of picking something off the shelf, and there are a few reasons why. For one, our infrastructure is split across two very different worlds: an on-premises ecosystem designed for low-latency service-to-service communication, and a Spark-native cloud environment where feature engineering and large-scale data processing happens. This split ruled out standard cloud-native feature stores and forced us to find a way to bridge both systems without slowing down development velocity.
On top of that, Dash’s search ranking system brought its own scaling challenge. A single user query doesn’t just pull up one document. Instead, it triggers our ranker to evaluate many files, each requiring dozens of behavioral and contextual features. What starts as one search quickly fans out into thousands of feature lookups across interaction history, metadata, collaboration patterns, and real-time signals. Ultimately, our feature store had to handle those kinds of massive parallel reads while still meeting strict, sub-100ms latency budgets.
Relevance also depends on speed and capturing user intent in real-time. If a user opens a document or joins a Slack channel, that signal should show up in their next search—within a few seconds—which meant building an ingestion pipeline that could keep up with user behavior at scale.
Finally, we had to reconcile two very different computation patterns. Some features naturally fit real-time streaming, while others depend on batch processing of historical data. We needed a unified framework that could support both efficiently, thereby reducing cognitive load for engineers and giving them a faster path from idea to production-ready features.
After surveying the feature store landscape—Feast, Hopsworks, Featureform, Feathr, Databricks, and Tecton—Feast stood out for two reasons. First, its clear separation between feature definitions and infrastructure concerns meant our machine learning engineers could focus purely on writing PySpark transformations rather than the serving, storage, or orchestration complexity. Second, Feast’s modular architecture and extensive adapter ecosystem made it straightforward to integrate with our existing infrastructure. (An adapter refers to a Feast-provided interface that integrates its framework with different backend systems.) Its AWS DynamoDB adapter was particularly crucial, allowing us to leverage Dynovault—our in-house DynamoDB-compatible storage solution—to meet latency requirements while lowering costs.
Our Feast-based architecture combines three key components, each optimized for its role.
Feast gave us the orchestration layer and serving APIs, but we swapped out its Python online serving path for our own Go service so we could actually hit the concurrency and latency numbers we needed.
Cloud-based storage took care of the heavy lifting of offline indexing and storage, while Spark jobs handled feature ingestion and computation.
Dynovault handled the instant feature lookups needed for each search query. Co-located with inference workloads and leveraging Dropbox’s hybrid cloud infrastructure, Dynovault avoids the delay of public internet calls and reliably delivers ~20ms client-side latency while balancing cost and geographic scalability.
Around this core architecture, we added observability through job failure monitoring, freshness tracking, and data lineage visibility. The result is a streamlined experience: engineers choose a data source, write PySpark transformations, and request features where needed, while the infrastructure abstracts away offline and online data management, pipeline orchestration, low-latency serving, and data freshness guarantees.
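To give a flavor of that developer experience, here is a hedged sketch of a Feast feature view definition. The entity names, feature names, and source path are hypothetical, and in our stack the underlying tables are produced by Spark jobs rather than a static file.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical source; in practice this is backed by Spark-produced tables
user_doc_source = FileSource(
    path="s3://features/user_doc_interactions.parquet",
    timestamp_field="event_timestamp",
)

user = Entity(name="user", join_keys=["user_id"])
document = Entity(name="document", join_keys=["doc_id"])

user_doc_interactions = FeatureView(
    name="user_doc_interactions",
    entities=[user, document],
    ttl=timedelta(days=7),
    schema=[
        Field(name="opens_last_7d", dtype=Int64),
        Field(name="collab_score", dtype=Float32),
    ],
    source=user_doc_source,
    online=True,  # materialized to the online store (Dynovault via the DynamoDB adapter)
)
```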
With the architecture in place, the next challenge was meeting Dash’s sub-100ms latency requirements. Feature retrieval sits directly on the critical path of search and LLM answer generation, so even small delays compound quickly at scale and degrade Dash’s snappy search retrieval experience.
Our initial feature-serving implementation was built in Python using the Feast SDK. While parallelism helped at moderate scale, profiling revealed that CPU-bound JSON parsing and Python’s Global Interpreter Lock became the dominant bottlenecks under higher concurrency. Moving to multiple processes temporarily improved latency, but introduced coordination overhead that limited scalability.
To remove these constraints, we rewrote the feature serving layer in Go. Using lightweight goroutines, shared memory, and faster JSON parsing, the Go service delivers true concurrency without the coordination costs we hit in Python. Today, it handles thousands of requests per second while adding only ~5–10ms of processing overhead on top of Dynovault’s client latency, consistently achieving p95 latencies in the ~25–35ms range.
This shift allowed us to meet Dash’s latency targets reliably and ensured that feature serving wouldn’t become the limiting factor as search traffic and feature complexity continued to grow.
Speed matters only when the data itself is fresh. Stale features can lower ranking quality and hurt user experience, so our feature store had to reflect new signals as soon as possible, often within minutes of user actions.
The challenge was scale. Many of Dash’s most important features depend on large joins, aggregations, and historical context, which makes fully real-time computation impractical. We needed an ingestion strategy that balanced freshness with reliability, without overwhelming our infrastructure or slowing development. To do that, we built a three-part ingestion system.
Batch ingestion handles complex, high-volume transformations built atop the medallion architecture (a layered data model that organizes data from raw to refined stages). Rather than rewriting every feature on each run, we added intelligent change detection so only modified records are written to the online store. This reduced write volumes from hundreds of millions to under one million records per run and cut update times from more than an hour to under five minutes.
Streaming ingestion captures fast-moving signals such as collaboration activity or content interactions. By processing unbounded datasets in near-real time, it ensures features stay aligned with what users are doing in the moment.
Direct writes handle lightweight or precomputed features by bypassing batch pipelines entirely. For example, relevance scores produced by a separate LLM evaluation pipeline can be written directly to the online store in seconds instead of waiting for the next batch cycle.
Together, these approaches allow Dash to keep feature values fresh without forcing all computation onto a single ingestion path, maintaining ranking quality while scaling to real-world usage.
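The change-detection step in the batch path can be sketched in PySpark roughly as follows, assuming a hashed fingerprint over the feature columns; the column and key names here are illustrative.

```python
from pyspark.sql import DataFrame, functions as F

FEATURE_COLS = ["opens_last_7d", "collab_score"]  # illustrative feature columns

def changed_rows(current: DataFrame, previous: DataFrame, key: str = "entity_id") -> DataFrame:
    """Return only rows that are new or whose feature values changed since the last run."""
    fingerprint = F.sha2(
        F.concat_ws("|", *[F.col(c).cast("string") for c in FEATURE_COLS]), 256
    )
    cur = current.withColumn("fp", fingerprint)
    prev = previous.withColumn("fp", fingerprint).select(key, F.col("fp").alias("prev_fp"))
    return (
        cur.join(prev, on=key, how="left")
           .where(F.col("prev_fp").isNull() | (F.col("fp") != F.col("prev_fp")))
           .drop("fp", "prev_fp")
    )
```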
Building a feature store at Dropbox scale reinforced a few hard-earned lessons about systems design. On the serving side, Python’s concurrency model became a limiting factor for high-throughput, mixed CPU and I/O workloads. Even with careful parallelism, the Global Interpreter Lock capped performance for CPU-bound work like JSON parsing, and moving to multiple processes introduced new coordination bottlenecks. Rewriting the serving layer in Go allowed us to remove those tradeoffs and scale concurrency more predictably.
On the data side, infrastructure changes mattered, but understanding access patterns mattered just as much. By recognizing that only 1–5% of feature values change in a typical 15-minute window, we were able to dramatically reduce write volumes and ingestion time. This shift turned hour-long batch cycles into five-minute updates, improving freshness without increasing system load.
These optimizations came together in a hybrid architecture that balances flexibility and performance: Feast for orchestration and consistency, Spark for large-scale computation, and Dynovault for low-latency online serving. Rather than relying on a single vendor solution, this approach let us tune each layer to its strengths while keeping training and serving aligned.
Ultimately, this work underscored the value of a middle path between building everything from scratch and adopting off-the-shelf systems wholesale. By combining open source foundations with internal infrastructure and tailoring them to real constraints, we were able to build a feature store that fits the needs of Dash today and, ultimately, can evolve with it in the future.
Acknowledgments: Special thanks to all current and past members of the AI/ML Platform and Data Platform teams for their contributions, as well as our lovely machine learning engineers who spin up the magic with the tooling we build.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
This summer, the Emerging Talent team proudly welcomed 43 interns to Dropbox as part of our 2025 Camp Dropbox Intern Program. Representing 27 colleges and universities—including six international institutions in Canada, Poland, and Ireland—this year’s cohort brought a wealth of diverse perspectives and experiences. Of the group, 28 interns joined our Engineering teams, and over the course of 12 weeks (May through September), they immersed themselves in meaningful work, continuous learning, and our Virtual First culture.
The Dropbox Intern Program is thoughtfully designed to cultivate growth, spark innovation, and build lasting connections. Interns benefited from more than 6,000 hours of dedicated one-on-one mentorship, tackled high-impact projects aligned with team and company goals, and explored hands-on applications of AI. Many of these projects supported the development of Dropbox Dash, our AI-powered universal search product. Robust programming—including Virtual First events, ERG activities, and the in-person Emerging Talent Summit—created further opportunities for connection and community. By the end of the summer, these interns had made meaningful contributions across our engineering organization.
Below, our interns share what they worked on this summer, from big technical wins to moments of creativity, collaboration, and growth that shaped their time at Dropbox.
“I tackled the Dropbox file history tracking system. As an engineering intern working in a large production database for the first time, I learned the importance of strongly tested, verifiable code and thoughtful system design. I really aligned with the core Dropbox values of Be Worthy of Trust and Keep It Simple at the software level. This solution simplifies our metadata infrastructure, significantly reduces operational costs, and shows how thoughtful refactoring of legacy systems can deliver both technical elegance and substantial business value.”
—Rhea Rai, Filesystem Data
“During my internship with the ML Platform team, I worked on AI Sentinel, a system that monitors the health of ML model deployments. By integrating with internal inference services, AI Sentinel gives machine learning engineers real-time operational visibility they previously had to gather manually. The result is greater deployment confidence and faster iteration cycles, ensuring reliable ML model deployments that power Dash’s intelligent features at scale.”
—Ben Juntilla, ML Platform
“I worked on reducing front end latency in Magic Pocket. Elevated PUT latencies during scheduled disk restarts can delay updates in workflows like Dash connectors, leaving users with outdated or missing content. To address this, I built a cache to track storage health and added a filtering option to skip degraded volumes. This health-aware routing reduces slow writes and gives operators greater control, ensuring Dropbox delivers timely, accurate search results.”
—Albert Joon Sung, Storage Core
“I worked on an AI-powered tool built on top of our internal migration platform that automates code migrations. Developers can launch auto-migration jobs on selected folders for specific migration types. Successful runs open a pull request automatically; otherwise, you can run the command locally and submit changes manually. The tool is fully customizable via the CLI or as part of an automated workflow. With it, I completed two major migrations.”
—Ahmed Ibrahim, Web Developer Experience
“I built tools that give machine learning engineers access to the most up-to-date information in the Dash persistence store. With this, downstream teams can train models on fresher data and pull in additional metadata fields from third-party systems without waiting for the Connector Platform team to redownload or repackage anything.”
—Eddie Ormseth, Connector Platform
“I worked on expanding the unified search platform (USP) to support more than 20 languages. The USP powers search across Dropbox products like Replay, and my project integrated a language detection pipeline into both indexing and retrieval. This enables accurate, efficient multilingual search without the overhead of traditional solutions. By delivering native language support ahead of this year’s Dash launch, my work helps Dropbox scale globally, improve the developer experience, and unlock richer search for international customers, bringing us closer to our vision of an AI-first, universally accessible search platform.”
—Rishi Peddakama, Retrieval Platform
“I explored advanced anomaly detection techniques for our Vortex2 metrics system. Traditional static alerting can miss sudden, meaningful shifts in data that don’t cross predefined thresholds (or trigger too often when changes are expected). To address this, I developed adaptive detection methods that adjust to evolving patterns. These improvements streamline alert creation and reduce alert fatigue, enhancing the on-call experience. By accounting for seasonality, the new anomaly detection functions also enable faster response times and improve the overall developer experience.”
—Yonatan Ginsburg, Metrics
“I developed a seamless document preview experience within Dropbox Dash, allowing users to quickly view file content without leaving their search context. This enhancement supports the Dropbox mission to accelerate workflows by reducing context switching and increasing engagement. I built interactive UI components, integrated PDF viewing, and implemented dynamic follow-up features linking to AI-powered chat.”
—Francesca Venditti, Find & Discover
“This summer on the Analytics Platform team, I worked on optimizing large-scale Databricks queries and ETL pipelines to reduce compute cost and latency. I developed an optimization recommendation system that flagged high-cost query patterns, expensive table-column filters, and under-allocated compute resources, complete with actionable sourcing information. I also prototyped and documented an Airflow pipeline to migrate a 500 TB mobile events log to liquid clustering, paving the way for broader adoption of modern data layout techniques.”
—Sanjith Udupa, Analytics Platform
“I built an extensible AI web-automation agent for Dropbox. I also connected Dropbox backend APIs via searchFile and uploadFile actions to fetch and upload files, using open-source foundations. By keeping tool sets small and modular, developers can quickly compose reliable, task-specific automations, like form filling or proofreading. As demand for automating repetitive web tasks continues to grow, integrating automation tools into Dash will significantly improve the user experience.”
—Alan Zhu, Conversational AI
Responses have been lightly edited for length and clarity.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
When we first built Dash, it looked like most enterprise search systems: a traditional RAG pipeline that combined semantic and keyword search across indexed documents. It worked well for retrieving information and generating concise answers. But as teams began using Dash for more than just finding content—for example, asking it to interpret, summarize, and even act on what it found—we realized that retrieval alone wasn’t enough. The natural progression from “what is the status of the identity project” to “open the editor and write an executive summary of the projects that I own” required Dash to evolve from a search system into an agentic AI.
That shift introduced a new kind of engineering challenge: deciding what information and tools the model actually needs to see to reason and act effectively. This has been popularized as context engineering, the process of structuring, filtering, and delivering just the right context at the right time so the model can plan intelligently without getting overwhelmed. We started thinking about how these ideas applied inside Dash itself, including how the model planned, reasoned, and took action on a request. Instead of simply searching and summarizing results, it now plans what to do and carries out those steps.
At the same time, adding tools into Dash’s workflow created new tradeoffs around how context is managed. Precision in what you feed the model is critical in any RAG system, and the same lesson applies to agentic systems. Supplying the model with only the most relevant context, and not just more of it, consistently leads to better results. Below, we’ll walk through how we’ve been building better context into Dash.
As Dash gained new capabilities—like contextual search and assisted editing—we noticed something unexpected: more tools often meant slower, less accurate decision making. A “tool” here is any external function the model can call, such as search, look-up, or summarization. Each new capability expanded the model’s decision space, creating more choices and room for confusion. Even well-designed tools made the model spend more time deciding how to act instead of acting. The problem wasn’t broken tools; it was too many good ones. In human terms, Dash was facing analysis paralysis.
The Model Context Protocol (MCP), an open standard for defining and describing the tools a server provides, helps with this by outlining what each tool does and what inputs it takes. But as we experimented with MCP servers, we ran into limitations. Each tool we added came with its own description and parameters, which all have to fit inside the model’s context window (the space it uses to read and reason about information). In practice, these definitions also consume a significant number of tokens, a resource that directly impacts both cost and performance. Further, we noticed that the overall accuracy of Dash degraded for longer-running jobs; the tool calls were adding a lot of extra context. We were seeing patterns similar to what’s been popularized as context rot.
This led us to rethink context. Building effective, agentic AI isn’t just about adding more; it’s about helping the model focus on what matters most. In Dash, that means curating context so the model can make faster, better decisions through three core strategies: retrieval consolidation, relevant context filtering, and specialized task agents.
Our principle is simple: better context leads to better outcomes. It’s about giving the model the right information, at the right time, in the right form.
Our first insight was that giving the model too many options for calling tools led to worse results. Dash connects to many of the apps our customers use to get work done, and each of those apps provides its own retrieval tools, such as search, find by ID, or find by name.
Although we have the Dash Search index—our server-based search index that stores and manages documents and messages for fast and reliable retrieval—we did experiment with using other tools for retrieval. For example, Dash might need to consult Confluence for documentation, Google Docs for meeting notes, and Jira for project status to service one request. In our experiments with those other retrieval tools, we found that the model often needed to call all of them to answer a single request, yet it didn’t do so reliably.
We solved this by replacing all of those retrieval options with a single, purpose-built tool backed by the Dash universal search index. Instead of expecting the model to understand and choose between dozens of APIs, we created one interface that handles retrieval across all services. The key idea is simple: Giving the model one consistent way to retrieve information makes its reasoning clearer, its plans more efficient, and its context use more focused.
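To make this concrete, here’s a minimal sketch of what a single consolidated retrieval tool might look like. The names (`dash_search`, the in-memory index) and the tool-spec shape are illustrative assumptions, not Dropbox’s actual APIs.

```python
from dataclasses import dataclass

# Minimal, self-contained sketch. The tool name `dash_search`, the in-memory
# index, and the tool spec below are illustrative stand-ins, not Dropbox APIs.

@dataclass
class SearchResult:
    title: str
    source: str   # e.g., "confluence", "gdocs", "jira"
    snippet: str
    score: float

# Stand-in for the unified Dash index: one ranked store instead of one
# retrieval API per connected app.
_UNIFIED_INDEX = [
    SearchResult("Identity project status", "jira", "On track for Q3 launch.", 0.92),
    SearchResult("Identity design doc", "gdocs", "The identity service handles...", 0.87),
    SearchResult("Identity runbook", "confluence", "Escalation steps for on-call.", 0.71),
]

def dash_search(query: str, top_k: int = 5) -> list[SearchResult]:
    """The single retrieval tool exposed to the model."""
    terms = query.lower().split()
    hits = [r for r in _UNIFIED_INDEX
            if any(t in (r.title + " " + r.snippet).lower() for t in terms)]
    return sorted(hits, key=lambda r: r.score, reverse=True)[:top_k]

# The model sees exactly one tool definition instead of dozens:
TOOL_SPEC = {
    "name": "dash_search",
    "description": "Search across all connected work apps; returns ranked results.",
    "parameters": {"query": "string", "top_k": "integer"},
}

if __name__ == "__main__":
    for hit in dash_search("identity project status"):
        print(f"[{hit.source}] {hit.title} ({hit.score:.2f})")
```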
These learnings also influenced our design of the Dash MCP server, which brings Dash’s retrieval to MCP-compatible apps like Claude, Cursor, and Goose with just one tool. It connects to the systems people already use and securely searches inside their apps. By keeping descriptions lean, more of the context window stays focused on the user’s request.
Our next insight was that not everything retrieved from multiple APIs is actually useful for the task at hand. When we tried pulling data from several tools at once, we still needed a way to rank and filter the results so that only the most relevant information reached the model.
We built the Dash index to combine data from multiple sources into one unified index, then layered a knowledge graph on top to connect people, activity, and content across those sources. (A knowledge graph maps relationships between these sources so the system can understand how different pieces of information are connected.) These relationships help rank results based on what matters most for each query and each user. As a result, the model only sees content our platform has already determined to be relevant, which makes every piece of context meaningful. Building the index and graph in advance means Dash can focus on retrieval at runtime instead of rebuilding context, which makes the whole process faster and more efficient.
The key lesson is that everything retrieved shapes the model’s reasoning, so relevance is critical to guiding it efficiently. Sending only what’s essential improves both performance and the quality of the entire agentic flow.
Our third discovery was that some tools are so complex that the model needs extra context and examples to use them effectively. We saw this firsthand as we continued to expand the Dash Search tool. Query construction turned out to be a difficult task on its own. It involves understanding user intent, mapping that intent to index fields, rewriting queries for better semantic matching, and handling edge cases such as typos, synonyms, and implicit context.
As the search tool grew more capable, the model needed more instruction to use it correctly. Those details started to take up a significant portion of the context window, leaving less room for reasoning about the overall task. In other words, the model was spending more of its attention on how to search than on what to do with the results.
We solved this by moving search into its own agent. The main planning agent decides when a search is needed and delegates the actual query construction to a specialized agent with its own prompt. This separation allows the main agent to stay focused on planning and execution while the search agent handles the specifics of retrieval. The key lesson is that when a tool demands too much explanation or context to be used effectively, it’s often better to turn it into a dedicated agent with a focused prompt.
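Below is a rough sketch of that split. The prompts and the `call_llm` stub are hypothetical placeholders; the point is only that query-construction instructions live in the search agent’s prompt, not the planner’s.

```python
# Hypothetical sketch: a lean planner that delegates query construction to a
# dedicated search agent. `call_llm` is a stand-in for a real chat-completion
# client, and the prompts are illustrative, not the actual Dash prompts.

SEARCH_AGENT_PROMPT = (
    "You are a query-construction specialist. Given a user request, produce a "
    "single search query for the index: map intent to index fields, expand "
    "synonyms, and fix typos. Respond with the query only."
)

PLANNER_PROMPT = (
    "You are a planning agent. Break the user's request into steps. When a step "
    "needs information, delegate to the search agent instead of constructing "
    "queries yourself."
)

def call_llm(system_prompt: str, user_message: str) -> str:
    """Placeholder LLM call; swap in a real client."""
    if "query-construction" in system_prompt:
        return 'owner:me type:doc "executive summary" project:identity'
    return "PLAN: 1) search for the projects I own  2) draft an executive summary"

def search_agent(request: str) -> str:
    # All query-construction detail lives here, out of the planner's context.
    return call_llm(SEARCH_AGENT_PROMPT, request)

def planner(request: str) -> str:
    plan = call_llm(PLANNER_PROMPT, request)
    if "search" in plan.lower():
        query = search_agent(request)   # delegation keeps the planner prompt lean
        plan += f"\n(search agent produced query: {query})"
    return plan

if __name__ == "__main__":
    print(planner("Write an executive summary of the projects that I own"))
```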
Context engineering for agentic AI systems is still an emerging discipline. While the strategies we’ve outlined—retrieval consolidation, relevant context filtering, and specialized task agents—work well for our use cases, we’re continuing to learn and iterate. As we continue to build the best tools for knowledge workers, we’ve found that the Dash index is a powerful resource for managing relevant context and helps us use other tools more effectively.
The work we’ve shared here focuses on one piece of the puzzle: Learning how to trim context down to what really matters, both in tool selection and retrieval. But context is expensive in more ways than one. It affects cost, speed, and how much attention a model can give to the task at hand. We’ve found that leaner contexts don’t just save resources; they also make the model smarter.
Next, we’re turning these lessons toward other parts of Dash’s context, like user and company profiles, as well as long and short-term memory. We think there’s even more performance to unlock by refining these areas, especially as we experiment with smaller and faster models.
Although our discussion centered on retrieval-based tools, action-oriented tools exhibit many of the same limitations. MCP continues to serve as a robust protocol, but effective scaling depends on reducing tool proliferation, investing in specialized agents, and enabling the LLM to generate code-based tools when appropriate, an approach that parallels our consolidation of retrieval tools into the Dash retrieval system. We’ve covered how Dash uses code-based tools in a previous blog post, and we see that other companies are approaching this problem with a similar mindset.
Moving forward, our focus is on making context even more efficient so the model can spend its attention where it matters most.
Acknowledgments: Rene Schmidt, Josh Clemm, Marta Mendez, Nishchal Arya, Roland Hui, Noorain Noorani, Tony Xu
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
Teams today create and share more types of content than ever before. Their work might span text, images, audio, and video, and that content is likely scattered across countless apps and tools. That can make it hard to find answers and insights fast—which is why we built Dropbox Dash. Its context-aware AI brings all your content and tools together, ensuring you always have everything you need to know to get work done. With Dash, you get an AI assistant and search engine that actually understands you and your team.
A key part of this understanding is Dash’s multimodal processing capabilities. This is what enables Dash’s intelligent features to work across content types—including photos and videos. To help us push these capabilities even further, we recently welcomed AI startup Mobius Labs to Dropbox. Their multimodal models, collectively named Aana, offer an ultra-efficient architecture for understanding rich media at Dropbox scale with significantly lower computational requirements than conventional architectures.
Aana does more than just make multimedia content searchable; it enables applications to analyze and interpret complex scenes, recognize objects and actions, and more accurately put content in the context of related work. That means teams—whether in creative, technology, media, or other fields—can get the answers they need to do their jobs more quickly, without manually digging through folders, timelines, or tools.
Here’s a closer look at how Aana works and why we’re excited to be bringing Mobius Labs’ Aana models into Dash.
Audio, video, and images often hold valuable context—like design critiques, product demos, or client feedback—but they’re notoriously hard to search and organize. Understanding a one-hour video, for instance, means parsing scene changes, speaker shifts, on-screen text, objects, actions, and audio cues. Interpreting a collection of images poses similar challenges: recognizing who’s in them, what’s happening, and where and when it took place requires a nuanced grasp of visual detail. Think of the moment in Jurassic Park when the dinosaurs are revealed; all the essential information is communicated visually, but it’s the music, pacing, and brief dialogue that make it unforgettable.
These modalities interact in intricate and often different ways, with transcripts, shots, and other cues each following their own timelines and semantic boundaries. Extracting meaningful information from this rich but fragmented mix poses a significant challenge. It requires not only understanding each modality on its own—like what’s being said in the audio or shown in the video—but also understanding how those modalities relate to one another. In other words, systems need to capture how sound, visuals, and language combine to create meaning within a scene. Furthermore, doing that across exabytes of content quickly becomes cost-prohibitive without carefully designed systems.
This is where Aana comes in. It takes in files of all kinds—demo videos, audio interviews, photo libraries—and analyzes them together. Unlike systems that treat text, images, audio, and video as separate streams, Aana looks at how they relate to one another, revealing patterns and insights that emerge only when these modalities are combined.
Under the hood, Aana combines open-source, fine-tuned foundation models for speech, vision, and language—continuously evaluated and updated as new releases emerge. For audio, it uses inference-optimized Whisper-like models developed with open-source collaborators, such as the faster-whisper-large-v3-turbo model. Its vision and language systems rely on transformer-based and mixture-of-experts (MoE) architectures engineered for fast, cost-effective inference on off-the-shelf GPUs. The team works closely with the open-source community to benchmark and integrate the latest advances, improving performance and efficiency over time. The entire system is built to strike an optimal balance, delivering high-quality multimodal understanding while keeping compute demands low.
With this foundation in place, Aana can do more than just recognize what’s happening in a scene—it can understand how that scene evolves. Aana follows how objects move, actions unfold, and layouts change over time. It can even connect insights across modalities, like pinpointing the exact moment in a video when someone walks to a whiteboard to explain a diagram. All of this information is distilled into a shared vector space, enabling fast, multimodal search. The result is a system that understands context. You can ask for “the part where the presenter explains the API flow” instead of scrubbing through timestamps or relying on basic metadata.
Behind this capability is an efficiency-focused architecture. Aana employs advanced inference optimizations that make large-scale multimodal understanding feasible. Its HQQ system enables low-bit (8-bit and 4-bit) inference for dramatically lower compute and memory costs, while Gemlite accelerates core AI operations like matrix multiplications and attention layers with custom GPU kernels.
These optimizations are orchestrated by the Aana SDK, which handles batching, model coordination, and efficient GPU utilization. The SDK also serves as a flexible framework for building and deploying multimodal applications, allowing multiple AI models to collaborate seamlessly while maintaining performance and scalability. Teams can configure, compose, and deploy different model setups and processing pipelines into production, making it easy to experiment, optimize, and scale new multimodal workflows with minimal overhead.
Together, these optimizations mean Aana can analyze exabytes of information with just a fraction of the compute footprint of traditional architectures. For teams working with large volumes of rich media, that opens the door to entirely new possibilities—from surfacing a specific visual motif across a company’s creative archive to summarizing years of client meetings into concise, searchable highlights.
Dropbox has long been trusted by innovators and creative professionals who bring ideas to life, from musicians and filmmakers to designers, engineers, and marketers. As teams work across more formats and tools, their creative process depends on understanding content in context. That’s what multimodal tools like Dash make possible: When AI understands your work—wherever that work happens, and whatever format it’s in—you can spend less time managing your content and more time actually creating it.
We’re excited to welcome Mobius Labs’ team of multimodal experts to Dropbox. Bringing Aana’s capabilities into Dash won’t just help us make visual and audio content more searchable; it’ll provide foundational support for agentic workflows that can analyze and interpret multimedia data, surface insights automatically, and even take action on behalf of teams. For marketing, creative, and technical organizations alike, this means turning large collections of media into connected, searchable knowledge that helps teams find answers, generate ideas, and move work forward.
To learn more about Dash, visit dropbox.com/dash.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
Editor's note: We are republishing a blog post from the Mobius team, originally published in 2023, that introduced a now widely used quantization algorithm. We plan to continue this line of work by collaborating with the open source community on inference optimization and will be sharing more updates soon.
~ ~ ~
Large Language Models (LLMs) have revolutionized various subfields of machine learning like natural language processing, speech recognition and computer vision, enabling machines to understand and generate outputs with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their expensive memory requirements, for both training and inference. Quantization methods such as bitsandbytes, GPTQ and AWQ have made it possible to use large models such as the popular Llama-2 with significantly less memory, enabling the machine learning community to conduct remarkable research using a single consumer-grade GPU.
In this article, we propose a new quantization technique called Half-Quadratic Quantization (HQQ). Our approach, requiring no calibration data, significantly speeds up the quantization of large models, while offering compression quality competitive with that of calibration-based methods. For instance, HQQ takes less than 5 minutes to process the colossal Llama-2-70B; that’s over 50x faster than the widely adopted GPTQ. Our Llama-2-70B quantized to 2-bit outperforms the full-precision Llama-2-13B by a large margin at comparable memory usage.
Model quantization is a crucial step to deploy large models with limited resources and save costs, which is particularly relevant to LLMs for both training and inference. Software packages such as bitsandbytes have made it possible to utilize large models on consumer-grade GPUs, which has been a game-changer for the machine learning community.
When it comes to weight-only quantization, there are two classes of approaches: calibration-free techniques such as bitsandbytes, which use only the weights themselves, and calibration-based methods such as GPTQ and AWQ, which rely on an external dataset. While calibration-based methods offer better quantization quality, they suffer from two main issues: the calibration step itself can take hours for larger models, and the resulting quality depends on the calibration data, which can bias the quantized model when inference data looks different.
Wouldn’t it be great if we could achieve the quality of calibration-based methods at the speed of calibration-free quantization? That’s exactly what we propose with our method, Half-Quadratic Quantization (HQQ).
Basic quantization often results in a loss of model accuracy. This is because the weights in these models can have a wide range of values that can be significantly altered by the quantization process. Weights that deviate from the rest of the distribution, commonly known as outliers, pose a particular challenge. GPTQ and Activation-aware Weight Quantization (AWQ) are algorithms that try to overcome this issue by relying on calibration data to minimize the error on layer outputs.
Unlike these approaches, our method focuses specifically on minimizing errors in the weights rather than the layer activations. Additionally, by incorporating a sparsity-promoting loss, such as the \( {l_{p<1}} \)-norm, we effectively model outliers through a hyper-Laplacian distribution. This distribution more accurately captures the heavy-tailed nature of outlier errors compared to the squared error, resulting in a more nuanced representation of error distribution.
We propose a robust optimization formulation to find the quantization parameters (zero-point \( z \) and scaling \( s \)). More specifically, we use a sparsity-promoting loss function \( \phi() \) such as the \( {l_{p}} \) norm between the original weights \( W \) and their dequantized version:
$$\underset{z,s}{\text{argmin}}\,\phi(W-Q_{z,s}^{-1}(Q_{z,s}(W))),$$
where \( Q_{z,s}() \) is the quantization operator which depends on the \( z \) and \( s \) parameters and generates the quantized weights \( W_{q} \). \( Q_{z,s}^{-1}() \) is the de-quantization operator:
$$\begin{array}{c} Q_{z,s}(W)=\text{round}(W/s+z)=W_{q}\\ Q_{z,s}^{-1}(W_{q})=s(W_{q}-z) \end{array}$$
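In code, these two operators are a direct transcription of the formulas above; the only addition is clamping to the representable b-bit range, which the equations leave implicit.

```python
import torch

# Direct transcription of the operators above. The only addition is clamping
# to the representable b-bit range, which the equations leave implicit.

def quantize(W: torch.Tensor, z: torch.Tensor, s: torch.Tensor, nbits: int = 4) -> torch.Tensor:
    """Q_{z,s}(W) = round(W/s + z), clipped to [0, 2^nbits - 1]."""
    return torch.clamp(torch.round(W / s + z), 0, 2**nbits - 1)

def dequantize(W_q: torch.Tensor, z: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Q_{z,s}^{-1}(W_q) = s * (W_q - z)."""
    return s * (W_q - z)

if __name__ == "__main__":
    W = torch.randn(4, 8)
    s = (W.max() - W.min()) / 15        # 4-bit range has 2^4 - 1 = 15 steps
    z = -W.min() / s                    # map the minimum weight to level 0
    W_q = quantize(W, z, s, nbits=4)
    print((W - dequantize(W_q, z, s)).abs().max())   # worst-case rounding error
```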
The use of the \( {l_{p<1}} \)-norm makes the problem non-convex. To find a solution, we adopt a Half-Quadratic solver by introducing an extra variable \( W_{e} \). This additional parameter allows us to split the main problem into sub-problems that are easier to solve. Moreover, to make the problem simpler, we fix the scaling parameter \( s \) and only optimize for the zero-point \( z \).
$$\underset{z,W_{e}}{\text{argmin}}\,\phi(W_{e})+\frac{\beta}{2}||W_{e}-(W-Q_{z}^{-1}(Q_{z}(W)))||_{2}^{2}$$ We then form sub-problems which are solved via alternate optimization: $$\begin{array}{cc} \text{(sp}_{1}) & W_{e}^{(t+1)}\leftarrow\underset{W_{e}}{\text{argmin}}\,\phi(W_{e})+\frac{\beta^{(t)}}{2}||W_{e}-(W-Q_{z}^{-1}(Q_{z}(W)))||_{2}^{2}\\ \text{(sp}_{2}) & z^{(t+1)}\leftarrow\underset{z}{\text{argmin}}\,\frac{1}{2}||Q_{z}^{-1}(Q_{z}(W))-(W-W_{e}^{(t+1)})||_{2}^{2}\\ & \beta^{(t+1)}\leftarrow\kappa\beta^{(t)},\end{array}$$ where \( \beta \) and \( \kappa \) are strictly positive parameters.
This problem takes the form of a Proximal Operator. When \( \phi() \) is the \( l_{1} \) norm, the solution is the soft-thresholding operator. There exists a more general thresholding solution for the \( l_{p} \)-norm with \( 0 \leq p \leq 1 \) that we adopt, known as the generalized soft-thresholding operator:
$$\begin{array}{c} W_{e}^{(t+1)}\leftarrow\text{shrink}_{l_{p}}\left(W-Q_{z}^{-1}(Q_{z}(W)),\beta\right)\\ \text{shrink}_{l_{p}}(x,\beta)=\text{sign}(x)\text{relu}(|x|-\frac{|x|^{p-1}}{\beta}) \end{array}$$
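A direct implementation of this operator might look like the following; the small epsilon clamp is our addition to avoid a division by zero when an error term is exactly zero.

```python
import torch

# Generalized soft-thresholding:
# shrink_lp(x, beta) = sign(x) * relu(|x| - |x|^(p-1) / beta).
# The small clamp is our addition to avoid a division by zero at x = 0.

def shrink_lp(x: torch.Tensor, beta: float, p: float = 0.7) -> torch.Tensor:
    return torch.sign(x) * torch.relu(x.abs() - x.abs().clamp_min(1e-8).pow(p - 1) / beta)

if __name__ == "__main__":
    x = torch.tensor([-3.0, -0.4, 0.0, 0.2, 2.5])
    # With beta=1 and p=0.7, entries with |x| <= 1 are zeroed; larger ones are shrunk.
    print(shrink_lp(x, beta=1.0))
```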
The second sub-problem can be rewritten as follows:
$$\begin{array}{c} z^{(t+1)}\leftarrow\underset{z}{\text{argmin}}\,\frac{1}{2}||z-\left(W_{q}^{(t+1)}-\frac{(W-W_{e}^{(t+1)})}{s}\right)||_{2}^{2}\\ W_{q}^{(t+1)}=\text{round}(W/s+z^{(t)}) \end{array}$$
The solution is simply the average over the axis the quantization grouping is performed on:
$$z^{(t+1)}\leftarrow\langle W_{q}^{(t+1)}-\frac{(W-W_{e}^{(t+1)})}{s}\rangle$$
In our implementation, we work with the inverse of the scale \( 1/s \) instead of \( s \) which we found to be a bit more stable with the half-precision calculations.
Note that, contrary to using gradient descent with autograd, the approach we propose relies on closed-form solutions, which means no gradients are calculated. This allows us to run all the calculations in inference mode with half-precision. Moreover, it only takes a few iterations for the solver to converge. Conversely, using the AdamW optimizer and PyTorch’s autograd takes thousands of iterations to achieve good results, and it fails with \( p < 1 \), which is what we actually use to promote sparsity. Thanks to the Half-Quadratic solution, our quantization method achieves a significant speed-up (over 100x faster than autograd to quantize Llama-2-7B) and can process even the largest models in only a few minutes.
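Putting the pieces together, here is a simplified sketch of the alternating solver in PyTorch. It restates the helpers above so it runs standalone, fixes the scale from the per-group min/max range, and omits details of the real implementation such as early stopping, zero-point quantization, and the 1/s parameterization.

```python
import torch

@torch.inference_mode()          # closed-form updates only; no gradients needed
def hqq_solve(W: torch.Tensor, nbits: int = 4, group_size: int = 64,
              p: float = 0.7, beta: float = 1.0, kappa: float = 1.01, iters: int = 20):
    """Simplified HQQ solver sketch: alternate the W_e and zero-point updates.

    Assumes W.numel() is divisible by group_size; the scale s is fixed from the
    per-group min/max range and only the zero-point z is optimized.
    """
    Wg = W.reshape(-1, group_size)
    w_min = Wg.min(dim=1, keepdim=True).values
    w_max = Wg.max(dim=1, keepdim=True).values
    s = (w_max - w_min) / (2**nbits - 1)
    s = torch.where(s == 0, torch.ones_like(s), s)
    z = -w_min / s                                    # initial zero-point

    def q(x, z):   return torch.clamp(torch.round(x / s + z), 0, 2**nbits - 1)
    def dq(wq, z): return s * (wq - z)
    def shrink_lp(x, b):
        return torch.sign(x) * torch.relu(x.abs() - x.abs().clamp_min(1e-8).pow(p - 1) / b)

    for _ in range(iters):
        W_r = dq(q(Wg, z), z)                         # current de-quantized weights
        W_e = shrink_lp(Wg - W_r, beta)               # (sp1): sparse outlier error
        W_q = q(Wg, z)
        z = (W_q - (Wg - W_e) / s).mean(dim=1, keepdim=True)   # (sp2): closed-form zero-point
        beta *= kappa
    return q(Wg, z).reshape(W.shape), z, s

if __name__ == "__main__":
    W = torch.randn(256, 256)
    W_q, z, s = hqq_solve(W, nbits=3, group_size=64)
    W_dq = (s * (W_q.reshape(-1, 64) - z)).reshape(W.shape)
    print("mean abs error:", (W - W_dq).abs().mean())
```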
We report the processing time to quantize the Llama-2 models. We noticed that the processing time for GPTQ and AWQ drastically changes from one machine to another. Our method performs the whole quantization on the GPU with half-precision and only uses the CPU to transfer data to the GPU once the solver is finished. HQQ takes only a few minutes to quantize the largest Llama-2-70B model, which is over 50x faster compared to GPTQ.
To measure the quantization quality of our method, we use the perplexity metric (PPL) on the widely adopted wikitext2 dataset. We also report the runtime GPU memory in GB (MEM) required to run the quantized model (additional memory is required for prediction, depending on the sequence length). We compare against the popular approaches widely used by the community: BNB (bitsandbytes), GPTQ via AutoGPTQ, and AWQ via AutoAWQ.
Regarding the parameters, we fix the Half-Quadratic solver with the following: p=0.7, beta=1, kappa=1.01, iterations=20. Additionally, we use early-stopping to exit the solver when the error doesn’t improve. We haven’t experimented much with the parameters, so different settings might actually yield better results. Similar to the other approaches, we use grouping to quantize the weights into buffers (_g128 means we use a group-size of 128). We also quantize the zero-point into 8-bit without grouping or optimization.
| Method | nBits | Llama-2-7B PPL ↓ | Llama-2-7B MEM (GB) ↓ | Llama-2-13B PPL ↓ | Llama-2-13B MEM (GB) ↓ | Llama-2-70B PPL ↓ | Llama-2-70B MEM (GB) ↓ |
|---|---|---|---|---|---|---|---|
| FP | 16 | 5.18 | 13.5 | 4.63 | 25.6 | OOM | OOM |
| BNB | 8 | 5.22 | 7.9 | 4.67 | 14.4 | 3.17 | 68.15 |
| GPTQ_g128 | 8 | 5.19 | 7.8 | 4.63 | 14.8 | 3.12 | 74.87 |
| HQQ_g128 | 8 | 5.19 | 7.6 | 4.63 | 14 | 3.12 | 69.32 |
| BNB_g64 | 4 | 5.43 | 4.7 | 4.79 | 8.2 | 3.29 | 39.11 |
| GPTQ_g128 | 4 | 5.41 | 5 | 4.74 | 8.9 | 3.24 | 40 |
| GPTQ_g64 | 4 | 5.38 | 5 | 4.73 | 9.1 | 3.23 | 41.13 |
| AWQ_g128 | 4 | 5.32 | 4.6 | 4.71 | 8.2 | 3.21 | 35.78 |
| AWQ_g64 | 4 | 5.28 | 4.6 | 4.7 | 8.5 | 3.2 | 37.08 |
| HQQ_g128 | 4 | 5.35 | 4.6 | 4.74 | 7.9 | 3.21 | 35.97 |
| HQQ_g64 | 4 | 5.3 | 4.6 | 4.7 | 8.2 | 3.19 | 37.52 |
| GPTQ_g128 | 3 | 6.3 | 3.9 | 5.25 | 7 | 3.85 | 33.7 |
| GPTQ_g64 | 3 | 6.1 | 4 | 5.16 | 7.3 | 3.7 | 33.47 |
| HQQ_g128 | 3 | 6.2 | 3.8 | 5.15 | 6.8 | 3.58 | 30.11 |
| HQQ_g64 | 3 | 5.82 | 4.5 | 4.98 | 7.4 | 3.45 | 33.46 |
| GPTQ_g64 | 2 | nan | 3.5 | 13 | 6 | 9.44 | 24.5 |
| HQQ_g32 | 2 | 15.61 | 3.5 | 7.63 | 5.9 | 4.82 | 24.2 |
| HQQ_g16 | 2 | 7.3 | 4.1 | 6.36 | 6.9 | 4.12 | 30.27 |
| HQQ_g16_s* | 2 | 7.31 | 3.7 | 6.37 | 6.1 | 4.13 | 26.37 |
*: the scaling is also quantized to 8-bits with a group-size of 128.
As illustrated in the table above, our method showcases strong performance without the need for calibration data. When applied to larger models like the Llama-2-70B, 2-bit quantization via HQQ achieves a lower perplexity than the full-precision Llama-2-13B, all while requiring a comparable level of memory usage.
We evaluate the effectiveness of our quantization method on vision models as well. More specifically, we quantize various OpenCLIP models from the Vision Transformer (ViT) family trained on the LAION dataset. Since Auto-GPTQ and Auto-AWQ calibration only works with text inputs, we can only evaluate against bitsandbytes by replacing all the linear layers inside the transformer blocks with their quantized versions.
We conduct two sets of benchmarks and report the top-1 and top-5 accuracy on the ImageNet dataset. The first benchmark consists of measuring the zero-shot performance of the quantized models. We use the OpenAI prompts to generate zero-shot classifiers by averaging the text features over all the templates. This benchmark directly measures the quality of the quantized models since there's no training involved in the evaluation process. The second benchmark uses the quantized models as a frozen backbone and trains a linear Softmax classifier on top of the features. This is referred to as Linear Probing and measures the quality of the quantized model as a frozen backbone. All results can be found in the table below:
| Method | nBits | Model | Linear (top-1) | Linear (top-5) | 0-shot (top-1) | 0-shot (top-5) |
|---|---|---|---|---|---|---|
| FP | 16 | ViT-B-32 | 0.764 | 0.941 | 0.664 | 0.896 |
| FP | 16 | ViT-L-14 | 0.82 | 0.964 | 0.731 | 0.93 |
| FP | 16 | ViT-H-14 | 0.841 | 0.973 | 0.772 | 0.949 |
| BNB | 8 | ViT-B-32 | 0.762 | 0.94 | 0.663 | 0.896 |
| HQQ | 8 | ViT-B-32 | 0.763 | 0.941 | 0.663 | 0.896 |
| BNB | 8 | ViT-L-14 | 0.82 | 0.964 | 0.731 | 0.93 |
| HQQ | 8 | ViT-L-14 | 0.82 | 0.964 | 0.731 | 0.93 |
| BNB | 8 | ViT-H-14 | 0.84 | 0.972 | 0.771 | 0.949 |
| HQQ | 8 | ViT-H-14 | 0.841 | 0.973 | 0.772 | 0.95 |
| BNB | 4 | ViT-B-32 | 0.733 | 0.925 | 0.608 | 0.859 |
| HQQ | 4 | ViT-B-32 | 0.75 | 0.933 | 0.639 | 0.881 |
| BNB | 4 | ViT-L-14 | 0.815 | 0.961 | 0.718 | 0.925 |
| HQQ | 4 | ViT-L-14 | 0.815 | 0.962 | 0.721 | 0.926 |
| BNB | 4 | ViT-H-14 | 0.837 | 0.971 | 0.766 | 0.947 |
| HQQ | 4 | ViT-H-14 | 0.839 | 0.973 | 0.769 | 0.948 |
| HQQ | 3 | ViT-B-32 | 0.664 | 0.881 | 0.481 | 0.753 |
| HQQ | 3 | ViT-L-14 | 0.799 | 0.954 | 0.689 | 0.909 |
| HQQ | 3 | ViT-H-14 | 0.831 | 0.969 | 0.755 | 0.943 |
| HQQ | 2 | ViT-B-32 | 0.318 | 0.551 | 0.04 | 0.106 |
| HQQ | 2 | ViT-L-14 | 0.731 | 0.917 | 0.559 | 0.815 |
| HQQ | 2 | ViT-H-14 | 0.808 | 0.96 | 0.716 | 0.924 |
As we can see, our method produces high-quality quantized models despite not using any calibration data. It outperforms 4-bit bitsandbytes (BNB) by a large margin on zero-shot performance (+3.1% top-1 accuracy with ViT-B-32). For extreme low-bit quantization, our ViT-H-14 quantized to 3-bit outperforms the full-precision ViT-L-14 (+2.4% top-1 zero-shot accuracy), and the 2-bit version outperforms the full-precision ViT-B-32 by a large margin (+5.2% top-1 zero-shot accuracy).
This article demonstrates that calibration-free quantization via our proposed Half-Quadratic Quantization (HQQ) method can achieve quality competitive with popular data-dependent methods like GPTQ and AWQ. We showed that HQQ remains effective even for extreme low-bit quantization across different model sizes and applications. Moreover, by leveraging efficient optimization techniques such as Half-Quadratic splitting, our method cuts quantization time to only a few minutes, even for the biggest models available such as Llama-2-70B.
We provide the code to reproduce all the results presented in this article: https://github.com/mobiusml/hqq
Ready-to-use quantized models can be found on our Hugging Face 🤗 page: https://huggingface.co/mobiuslabsgmbh
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
LLM applications present a deceptively simple interface: a single text box. But behind that minimalism runs a chain of probabilistic stages, including intent classification, document retrieval, ranking, prompt construction, model inference, and safety filtering. A tweak to any link in this chain can ripple unpredictably through the pipeline, turning yesterday’s perfect answer into today’s hallucination. Building Dropbox Dash taught us that in the foundation-model era, AI evaluation—the set of structured tests that ensure accuracy and reliability—matters just as much as model training.
In the beginning, our evaluations were somewhat unstructured—more ad-hoc testing than a systematic approach. Over time, as we kept experimenting, we noticed that the real progress came from how we shaped the processes: refining how models retrieved info, tweaking prompts, and striking the right balance between consistency and variety in answers. So we decided to make our approach more rigorous. We designed and built a standardized evaluation process that treated every experiment like production code. Our rule was simple: Handle every change with the same care as shipping new code. Every update had to pass testing before it could be merged. In other words, evaluation wasn’t something we simply tacked on at the end. It was baked into every step of our process.
We captured these lessons in a playbook that covers the full arc of datasets, metrics, tooling, and workflows. And because people don’t just work in text, evaluation must ultimately extend to images, video, and audio to reflect how work really happens. We’re sharing those findings here so that anyone working with LLMs today can replicate our evaluation-first approach for themselves.
To kick off our evaluations, we started with publicly available datasets to establish a baseline for retrieval and question answering performance. For question answering, we drew on Google’s Natural Questions, Microsoft Machine Reading Comprehension (MS MARCO), and MuSiQue. Each brought something different to the table: Natural Questions tested retrieval from very large documents, MS MARCO emphasized handling multiple document hits for a single query, and MuSiQue challenged models with multi-hop question answering. Choosing the right mix of datasets gave us early, useful signals on how our system and parameter choices would hold up.
But public datasets alone aren’t enough. To capture the long tail of real-world phrasing, we turned to internal datasets by collecting production logs from Dropbox employees dogfooding Dash. We built two kinds of evaluation sets from this data. Representative query datasets mirrored actual user behavior by anonymizing and ranking top internal queries, with annotations provided through proxy labels or internal annotators. And representative content datasets focused on the types of material our users rely on most: widely shared files, documentation, and connected data sources. From this content, we used LLMs to generate synthetic questions and answers spanning diverse cases: tables, images, tutorials, and factual lookups.
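As a rough illustration of that synthetic-generation step, the sketch below prompts an LLM for question/answer pairs and keeps only pairs whose quoted evidence appears verbatim in the source document. The prompt wording and the `call_llm` placeholder are assumptions, not our production pipeline.

```python
import json

# Illustrative sketch of synthetic Q&A generation from internal content.
# `call_llm` is a placeholder for whatever completion client is in use, and
# the prompt and output format are assumptions, not the actual Dash pipeline.

QA_PROMPT = """You are generating evaluation data.
Given the document below, write {n} question/answer pairs that can be answered
from the document alone. Cover at least one factual lookup and, if present,
one table or step-by-step instruction. Return JSON:
[{{"question": ..., "answer": ..., "evidence": <quoted span>}}]

Document:
{document}
"""

def call_llm(prompt: str) -> str:
    """Placeholder; swap in a real chat-completion call."""
    return json.dumps([{
        "question": "What is the retention period for audit logs?",
        "answer": "90 days.",
        "evidence": "Audit logs are retained for 90 days.",
    }])

def synthesize_qa(document: str, n: int = 3) -> list[dict]:
    raw = call_llm(QA_PROMPT.format(n=n, document=document))
    pairs = json.loads(raw)
    # Keep only pairs whose quoted evidence actually appears in the source,
    # a cheap guard against hallucinated answers.
    return [p for p in pairs if p.get("evidence", "") in document]

if __name__ == "__main__":
    doc = "Audit logs are retained for 90 days. Admins can export them as CSV."
    print(synthesize_qa(doc))
```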
Together, these public and internal datasets gave us a stack of carefully curated queries and answers that mirrored real-world chaos. Great! But datasets alone are just an inert mass until you wrap them in scoring logic. The next step was to turn these examples into a live alarm system, where each run clearly signals success or failure, with success defined through metrics, budget limits, and automated checks before the first experiment even begins.
When evaluating outputs from conversational AI systems, it’s tempting to reach for the usual suspects like BLEU, ROUGE, METEOR, BERTScore, and embedding cosine similarity. These offline metrics are well understood, quick to compute, and have been the backbone of benchmarking for natural language processing for years. But when applied to real-world tasks—for example, retrieving a source-cited answer, summarizing an internal wiki, or parsing tabular data—they quickly run out of steam.
Here’s what traditional metrics can (and can’t) tell you:
| Metric | Does well on | Fails on |
|---|---|---|
| BLEU | Exact word overlap | Paraphrasing, fluency, factuality |
| ROUGE | Recall-heavy matching | Source attribution, hallucination |
| BERTScore | Semantic similarity | Granularity of errors, citation gaps |
| Embedding sim | Vector-space proximity | Faithfulness, formatting, tone |
We used these metrics early on for quick checks, and they were useful for catching egregious cases where the model drifted wildly. But they couldn’t enforce deployment-ready correctness. We’d see high ROUGE scores even when an answer skipped citing its source, strong BERTScore results alongside hallucinated file names, and fluent Markdown outputs that still buried factual errors in the middle of a paragraph. These failures aren’t rare; they’re the norm when deploying AI in production. So we asked a better question: What if we used LLMs themselves to grade the outputs?
Enter the LLM as a judge
Using one LLM to evaluate another may sound recursive, but it unlocks real flexibility. A judge model can check for factual correctness against ground truth or context, assess whether every claim is properly cited, enforce formatting and tone requirements, and scale across dimensions that traditional metrics simply ignore. The key insight is that LLMs are often better at scoring natural language when you frame the evaluation problem clearly.
Just as important, we learned that rubrics and judge models themselves need evaluation and iteration. Prompts, instructions, and even the choice of judge model can change outcomes. In some cases, like evaluating specific languages or technical domains, we rely on specialized models to keep scoring fair and accurate. In other words, evaluating the evaluators became part of our own quality loop.
How we structure LLM-based evaluation
We approached our LLM judges as if they were software modules: designed, calibrated, tested, and versioned. At the core sits a reusable template. Each evaluation run takes in the query, the model’s answer, the source context (when available), and occasionally a hidden reference answer. The judge prompt then guides the process through a structured set of questions, such as whether every claim is supported by the provided context, whether sources are cited correctly, and whether the answer meets our clarity and formatting guidelines.
The judge responds with both a justification and a score that’s either scalar or categorical, depending on the metric. For example, a rubric output might look like this:
{
  "factual_accuracy": 4,
  "citation_correctness": 1,
  "clarity": 5,
  "formatting": 4,
  "explanation": "The answer was mostly accurate but referenced a source not present in context."
}
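A minimal sketch of how such a judge can be wrapped behind a reusable template is shown below; the prompt text and the `call_llm` placeholder are illustrative, not the actual Dash judge.

```python
import json

# Sketch of wrapping a judge model behind a reusable template. The prompt,
# the `call_llm` placeholder, and the rubric keys mirror the example above
# but are otherwise assumptions.

JUDGE_TEMPLATE = """You are grading an AI answer.
Query: {query}
Context: {context}
Answer: {answer}

Score factual_accuracy, citation_correctness, clarity, and formatting from 1-5,
and add a short explanation. Respond with JSON only."""

def call_llm(prompt: str) -> str:
    """Placeholder; replace with a real judge-model call."""
    return json.dumps({"factual_accuracy": 4, "citation_correctness": 1,
                       "clarity": 5, "formatting": 4,
                       "explanation": "Referenced a source not present in context."})

def judge(query: str, answer: str, context: str) -> dict:
    scores = json.loads(call_llm(JUDGE_TEMPLATE.format(
        query=query, context=context, answer=answer)))
    # Keep the justification alongside the scores so failures are debuggable.
    assert set(scores) >= {"factual_accuracy", "citation_correctness",
                           "clarity", "formatting", "explanation"}
    return scores

if __name__ == "__main__":
    print(judge("Where is the API flow documented?",
                "In api-flow.md [1].", "api-flow.md describes the API flow."))
```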
Every few weeks, we ran spot-checks on sampled outputs and labeled them manually. These calibration sets gave us a way to tune the judge prompts, benchmark agreement rates between humans and models, and track drift over time. Whenever a judge’s behavior diverged from the gold standard, we updated either the prompt or the underlying model.
While LLM judges automated most of the coverage, human spot-audits remained essential. For each release, human engineers manually reviewed 5–10% of the regression suite. Any discrepancies were logged and traced back to either prompt bugs or model hallucinations, and recurring issues triggered prompt rewrites or more fine-grained scoring.
To make this system enforceable, we defined three types of metrics, each with a clear role in the development pipeline:
| Metric type | Examples | Enforcement logic |
|---|---|---|
| Boolean gates | “Citations present?”, “Source present?” | Hard fail; changes can’t move forward |
| Scalar budgets | Source F1 ≥ 0.85, p95 latency ≤ 5s | Block deploying any changes that affect the test |
| Rubric scores | Tone, formatting, narrative quality | Logged in dashboards; monitored over time |
Every new model version, retriever setting, or prompt change was checked against these dimensions. If performance slipped below the thresholds, the change didn’t move forward. And because metrics only matter when they’re built into the workflow, we wired them into every stage of development. Fast regression tests ran automatically on every pull request, the full suite of curated datasets ran in staging, and live traffic was continuously sampled and scored in production. Dashboards consolidated the results, and that made it easy to see key metrics, pass/fail rates, and shifts over time.
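As a sketch of how these three metric types can be enforced in code, assume each evaluation run produces a flat dictionary of metrics; the thresholds below simply mirror the examples in the table and are not real production values.

```python
from dataclasses import dataclass, field

# Illustrative enforcement of the three metric types above; thresholds mirror
# the examples in the table, not real production values.

@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)

BOOLEAN_GATES = ["citations_present", "source_present"]
SCALAR_BUDGETS = {"source_f1": (0.85, ">="), "p95_latency_s": (5.0, "<=")}
RUBRIC_METRICS = ["tone", "formatting", "narrative_quality"]   # logged, never gating

def check_gates(metrics: dict) -> GateResult:
    result = GateResult(passed=True)
    for gate in BOOLEAN_GATES:                        # hard fails
        if not metrics.get(gate, False):
            result.failures.append(f"boolean gate failed: {gate}")
    for name, (limit, op) in SCALAR_BUDGETS.items():  # budgets block deploys
        value = metrics.get(name)
        ok = value is not None and (value >= limit if op == ">=" else value <= limit)
        if not ok:
            result.failures.append(f"budget missed: {name}={value} (need {op} {limit})")
    for name in RUBRIC_METRICS:                       # dashboard-only metrics
        print(f"rubric[{name}] = {metrics.get(name)}")
    result.passed = not result.failures
    return result

if __name__ == "__main__":
    run = {"citations_present": True, "source_present": True,
           "source_f1": 0.88, "p95_latency_s": 4.2, "tone": 4}
    print(check_gates(run))
```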
With this setup, the same evaluation logic gated every prompt tweak and retriever update. The result is consistency, traceability, and reliable quality control.
Once we had datasets and metrics in place and had gone through a few cycles of building, testing, and shipping, it became clear we needed more structure. Managing scattered artifacts and experiments wasn’t sustainable. That’s when we adopted Braintrust, an evaluation platform we’ll dive into shortly. It brought structure to our workflows by helping us manage datasets, scorers, experiments, automation, tracing, and monitoring.
At its core, the platform gave us four key capabilities. First, it gave us a central store, meaning a unified, versioned repository for datasets and experiment outputs. Second, it provided us with an experiment API where each run was defined by its dataset, endpoint, parameters, and scorers, producing an immutable run ID. (We built lightweight wrappers to make managing these runs simple.) Third, it offered dashboards with side-by-side comparisons that highlighted regressions instantly and quantified trade-offs across latency, quality, and cost. And finally, it gave us trace-level debugging, where one click revealed retrieval hits, prompt payloads, generated answers, and judge critiques.
Spreadsheets were fine for quick demos, but they broke down fast once real experimentation began. Results were scattered across files, hard to reproduce, and nearly impossible to compare side by side. If two people ran the same test with slightly different prompts or model versions, there was no reliable way to track what changed or why. We needed something more structured, and we needed a shared place where every run was versioned, every result could be reproduced, and regressions surfaced automatically. That’s what an evaluation platform gave us: the ability to reproduce, compare, and debug together without slowing down.
We treated prompts, context selection settings, and model choices just like any other piece of application code, meaning they had to pass the same automated checks. Every pull request kicked off about 150 canonical queries, which were judged automatically and returned results in under 10 minutes. Once a pull request was merged, the system reran the full suite along with quick smoke checks for latency and cost. If anything crossed a red line, the merge was blocked.
| Dev event | Trigger | What runs | SLA |
|---|---|---|---|
| Pull request opened | GitHub Action | ~150 canonical queries, judged by scorers | Results return in under ten minutes |
| Pull request merged | GitHub Action | Canonical suite plus smoke checks for latency and cost | Merge blocked on any red‑line miss |
These canonical queries were small in number but carefully chosen to cover critical scenarios: multiple document connectors, “no-answer” cases, and non-English queries. Each test recorded the exact retriever version, prompt hash, and model choice to guarantee reproducibility. If scores dropped below a threshold—for example, if too many answers were missing citations—the build stopped. Thanks to this setup, regressions that once slipped into staging were caught at the pull-request level.
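A stripped-down version of that per-pull-request check might look like the following; the function names, thresholds, and metadata fields are illustrative, not our actual CI code.

```python
import hashlib, json, time

# Sketch of a per-pull-request regression check: run the canonical queries,
# record exactly what was tested, and fail the build below a threshold.
# Names and thresholds are illustrative.

def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

def run_regression(canonical_queries, answer_fn, judge_fn,
                   retriever_version: str, model: str, prompt: str,
                   min_pass_rate: float = 0.9) -> dict:
    results = []
    for q in canonical_queries:
        answer = answer_fn(q)
        results.append(judge_fn(q, answer))           # e.g., 1.0 pass / 0.0 fail
    report = {
        "timestamp": time.time(),
        "retriever_version": retriever_version,       # recorded for reproducibility
        "model": model,
        "prompt_hash": prompt_hash(prompt),
        "pass_rate": sum(results) / len(results),
    }
    if report["pass_rate"] < min_pass_rate:
        raise SystemExit(f"regression gate failed: {json.dumps(report)}")
    return report

if __name__ == "__main__":
    queries = ["status of identity project", "no-answer: unicorn budget 2031"]
    report = run_regression(queries,
                            answer_fn=lambda q: "stub answer [1]",
                            judge_fn=lambda q, a: 1.0,
                            retriever_version="retriever-v42",
                            model="model-x", prompt="system prompt v7")
    print(report)
```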
On-demand synthetic sweeps
Large refactors or engine updates could hide subtle regressions, so we ran end-to-end evaluation sweeps to catch them early. These sweeps began with a golden dataset and could be dispatched as a Kubeflow DAG, running hundreds of requests in parallel. (A Kubeflow DAG is a workflow built in Kubeflow Pipelines, an open-source ML platform, where the steps are organized as a directed acyclic graph.) Each run was logged under a unique run_id, making it easy to compare results against the last accepted baseline.
We focused on RAG-specific metrics such as binary answer correctness, completeness, source F1—an F1 score applied to retrieved sources, measuring how well the system balances precision (retrieving only relevant documents) and recall (retrieving all relevant ones)—and source recall. Any drift beyond predefined thresholds was flagged automatically. From there, LLMOps tools let us slice traces by retrieval quality, prompt version, or model settings, helping pinpoint the exact stage that shifted so we could fix it before the change ever reached staging.
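For clarity, here is the source F1 calculation spelled out on sets of retrieved versus labeled source documents; the file names are illustrative.

```python
# Worked example of the "source F1" metric described above: precision and
# recall computed over the set of retrieved source documents versus the
# labeled relevant set.

def source_f1(retrieved: set[str], relevant: set[str]) -> float:
    if not retrieved or not relevant:
        return 0.0
    true_pos = len(retrieved & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(retrieved)   # only relevant documents retrieved
    recall = true_pos / len(relevant)       # all relevant documents retrieved
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    retrieved = {"design-doc.gdoc", "identity.jira", "old-roadmap.paper"}
    relevant = {"design-doc.gdoc", "identity.jira"}
    print(f"source F1 = {source_f1(retrieved, relevant):.2f}")   # 0.80
```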
Live-traffic scoring
Offline evaluation is critical, but real user queries are the ultimate test. To catch silent degradations as soon as they happened, we continuously sampled live production traffic and scored it with the same metrics and logic as our offline suites. (All of our work at Dropbox is guided by our AI principles.) Each response, along with its context and retrieval trace, was logged and routed through automated judgment, measuring accuracy, completeness, citation fidelity, and latency in near real time.
Dashboards visible to both engineering and product teams tracked rolling quality and performance medians over, for example, one-hour, six-hour, and 24-hour intervals. If metrics drifted beyond a set threshold, such as a sudden drop in source F1 or a spike in latency, alerts were triggered immediately so the team could respond before end users were affected. Because scoring ran asynchronously in parallel with user requests, production traffic saw no added latency. This real-time loop let us catch subtle issues quickly, close the gap between code and user experiences, and maintain reliability as the system evolved.
Layered gates
To control risk as changes moved through the pipeline, we used layered gates that gradually tightened requirements and brought the evaluation environment closer to real-world usage. The merge gate ran curated regression tests on every change, and only those meeting baseline quality and performance passed. The stage gate expanded coverage to larger, more diverse datasets and applied stricter thresholds, checking for rare edge cases. Finally, the production gate continuously sampled real traffic and scored it to catch issues that only emerged at scale. If metrics dipped below thresholds, automated alerts were fired and rollbacks could be triggered immediately.
By progressively scaling dataset size and realism at each gate, we blocked regressions early while ensuring that staging and production evaluations stayed closely aligned with real-world behavior.
Evaluation isn’t a phase; it’s a feedback loop. A system that learns from its own mistakes evolves faster than any roadmap allows. Gates and live-traffic scoring provide safeguards, but to build resilient AI systems, evaluation also has to drive continuous learning. Every low-scoring output, flaky regression, or drifted metric isn’t just a red flag. Rather, it’s a chance to improve the system end to end. This is where the loop closes and the next cycle begins.
Every poorly scored query carries a lesson. By mining low-rated traces from live traffic, we uncovered failure patterns that synthetic datasets often missed: retrieval gaps on rare file formats, prompts cut off by context windows, inconsistent tone in multilingual inputs, or hallucinations triggered by underspecified queries. These hard negatives flowed directly into the next dataset iteration. Some became labeled examples in the regression suite, while others spawned new variants in synthetic sweeps. Over time, this built a virtuous cycle where the system was stress-tested on exactly the edge cases where it once failed.
Not every change was ready for production gates, especially riskier experiments like a new chunking policy, a reranking model, or a tool-calling approach. To explore these safely, we built a structured A/B playground where teams could run controlled experiments against consistent baselines. Inputs included golden datasets, user cohorts, or synthetic clusters. Variants covered different retrieval methods, prompt styles, or model configurations. Outputs spanned trace comparisons, judge scores, and latency and cost budgets. This safe space let tweaks prove their value, or fail fast, without consuming production bandwidth.
LLM pipelines are multi-stage systems, and when an answer failed, guessing was costly. To speed up debugging, we invested in playbooks that guided engineers straight to the likely cause. Was the document never retrieved? Check the retrieval logs. Was context included but ignored? Review the prompt structure and truncation risk. Did the answer fail because the judge mis-scored it? Re-run against the calibration set and human labels. These playbooks became part of triage, ensuring regressions were traced systematically rather than debated.
Finally, the cultural piece: Evaluation wasn’t owned by a single team. Instead, it was embedded into everyday engineering practice. Every feature pull request linked to evaluation runs. Every on-call rotation had dashboards and alert thresholds. Every piece of negative feedback was triaged and reviewed. And every engineer owned the impact of their changes on quality, not just correctness. Speed mattered when shipping new products, but the cost of mistakes could be high. Predictability came from guardrails, and those guardrails were evaluation.
When we first set out, our prototypes were stitched together with whatever evaluation data we had available. That was fine for quick demos, but once real users started asking real questions, the cracks showed.
Tiny prompt tweaks led to surprise regressions. Product managers and engineers debated whether an answer was good enough, each using their own mental scoreboard. And the worst part? Problems slipped past staging and into production because nothing was catching them.
The solution wasn’t more heroics; it was structure. We created a central repository for datasets and ran every change through the same Braintrust-powered evaluation flows. Automated checks became our first line of defense, catching missing citations or broken formatting before code could merge. Shared dashboards replaced hallway debates with real numbers, visible to both engineering and product teams.
One of the biggest surprises was how many regressions came not from swapping models but from editing prompts. A single word change in an instruction could tank citation accuracy or formatting quality. Formal gates, not human eyeballs, became the only reliable safety net. We also learned that judge models and rubrics aren’t set-and-forget assets. Their own prompts need versioning, testing, and recalibration. In some cases, like evaluating responses in other languages or niche technical domains, we found that a specialized judge was the only way to keep scoring fair and accurate.
The takeaway is that evaluation isn’t a sidecar to development. Treat your evaluation stack with the same rigor you give production code, and you’ll ship faster, safer, and with far fewer “how’d this get through?” moments.
Our current stack catches regressions and keeps quality steady, but the next frontier is making evaluation proactive rather than purely protective. That means moving beyond accuracy to measure things like user delight, task success, and confidence in answers. It means building self-healing pipelines that can suggest fixes when metrics drop, shortening the debug loop. And it means extending coverage beyond text to images, audio, and low-resource languages, so evaluation reflects the way people actually work.
The goal is simple: Keep raising the bar so evaluation doesn’t just guard the product but pushes it forward. By treating evaluation as a first-class discipline—anchored in rigorous datasets, actionable metrics, and automated gates—we can turn probabilistic LLMs into dependable products.
Acknowledgments: Dhruvil Gala, Venkata Prathap Reddy Sudha, April Liu, and Dongjie Chen
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles and learn about creating a more enlightened way of working.
Hack Week 2025 at Dropbox centered on the theme “Keep It Simple,” offering opportunities for innovation, experimentation, and finding smart solutions to complex challenges. With in-person hubs in San Francisco, Seattle, and Warsaw—as well as the option to hack virtually—the July event brought together Dropbox developers to explore new ideas and build projects that could shape future products and workflows for tools like Dropbox Dash.
One standout effort, “Liquid Cooling CPU/GPU Hardware,” earned the Learn Fast award for accelerating learning and innovation. The team—Bobby Woolweaver, Daniel Coultas, Eddie del Rio, Eric Shobe, and Daniel Parker-Focht—designed a custom liquid cooling system for high-powered GPU servers to tackle the rising thermal demands of AI workloads. They built a lab setup, tested core components, and demonstrated significant benefits: 20–30°C lower operating temperatures under stress, quieter performance than air cooling, and the potential for power savings and environmental benefits.
Forward-looking in scope, the project explores next-generation GPU servers that may require liquid cooling due to increases in power consumption and heat generation. The team plans to expand testing with more liquid cooling labs in multiple data centers. We sat down with systems engineer Bobby Woolweaver and data center engineer Daniel Coultas to discuss their award-winning project and what it could mean for the future of infrastructure at Dropbox.
Your experimental lab took home our Learn Fast award this year. Walk us through what you built and how.
Daniel: For the Hack Week project, we built our own liquid cooling system from scratch. Normally, these systems come pre-assembled with pumps, radiators, and fans, and you just plug them in. But since we had trouble sourcing a complete system in time, we decided to put one together ourselves. We used scaled-down versions of the same core components you’d see in a data center liquid cooling setup: radiators to exhaust heat, fans, a pump, a reservoir, tubing, manifolds, and some basic sensors. The sensors were key so we could monitor performance and make sure everything was pumping correctly before we connected any expensive GPUs. Once that was in place, we hooked it up to the server itself.
What thermal performance observations did you make while working on this project?
Bobby: In terms of immediate thermal benefits, we saw a big difference. When running workloads on the liquid-cooled setup compared to our current air-cooled production system, temperatures were around 20–30°C lower under heavy stress tests. Though these were torture-style tests—even harsher than what we’d normally see in production.
Another key part of Hack Week was having the dedicated time to experiment with fan configurations. Since liquid cooling handled the CPUs and GPUs, we were able to remove or run many fans at lower speeds. We still needed some airflow for other components like DIMMs and the network card at the back, but those draw much less power and run at lower thermals compared to the GPUs and CPUs. Daniel even suggested building a specific airflow baffle to direct cooling to exactly where it’s needed.
The liquid cooling team’s lab at Hack Week 2025
Liquid cooling has been around for years. What first sparked your interest in exploring it as a potential solution for Dropbox?
Daniel: Liquid cooling has been around for a while, and the industry has been actively experimenting with it. We’ve followed the technology closely, including by attending conventions like the Open Compute Project summit where it’s a big topic. Bobby and I have seen these setups before and thought, this is really interesting—how could we apply it to Dropbox? We’ve had that question in the back of our minds for years now, and now we’re finally turning it into something concrete.
Bobby: Right. But it’s not as simple as just plugging in a liquid-cooled server. We need the right infrastructure in place so that if future high-performance servers require it, we’ll be ready. This project was about building that foundation.
So the challenge you’re solving for is future-focused—preparing for next-gen hardware and higher power needs?
Daniel: Exactly. It’s about handling both individual server power draw and the overall data center footprint. As new servers demand more power, sticking with only air cooling would force us to spread them out over more space. With liquid cooling, we can stay efficient—using less space, less energy, and potentially lowering costs.
How might this technology fit into our current and future infrastructure strategy, particularly with respect to our focus on supporting AI workloads?
Bobby: We’re seeing a greater need today for new solutions, especially with GPUs and AI workloads. These systems draw a huge amount of power and generate significant heat. While vendors aren’t yet requiring liquid cooling for their top-tier GPUs, we know it’s on the horizon. Air cooling may soon only support mid-range options.
Daniel: And with Dropbox focusing more on AI initiatives, it gave us the push we needed. As we expand into GPU-heavy systems, it’s important to evaluate higher-powered setups. Hack Week was the perfect opportunity to explore that.
The prototype liquid-cooled GPU server
Hack Week is a self-driven initiative where engineers are encouraged to explore projects independently. What resourcing or support were you given to explore this project?
Daniel: Dropbox has always made it possible for us to experiment. I’ve always felt supported to try new ideas. Our team was really interested in liquid cooling, and since Bobby’s team shared that interest, we were able to secure some funding to kick the project off. It gave us the chance to dive in ourselves and really have fun with it. Of course, we still have to balance it with our regular work, but we’ve been empowered to make the time and space to do that.
Bobby: I’d say the same. Both of our teams strongly believe this is an area we need to invest in—to research, lay the groundwork, and be ready for what’s coming. Once we showed that, we received support all the way up to move forward. So it’s been great to have that backing and be able to push ahead.
In-person events like Hack Week are an important part of the Virtual First experience at Dropbox. You both had the chance to attend in person for the first time this year. What did you enjoy about the experience? What were some of the benefits of working together in a physical space?
Bobby: I enjoyed getting to hack with the team and connect with people across the company that I don’t usually see. Being in the same room made it easy to bounce ideas off each other and solve issues quickly. For us in physical infrastructure, we usually kick off projects or bring-ups on site at a data center so we can quickly work through challenges and issues that are a normal part of any new project.
Daniel: My experience is very similar to Bobby’s. Being in person, we have easy access to the minds of our peers. We can bounce ideas off them, pick up workflow improvements, and problem-solve very quickly.
Additional contributors to this project include Eric Shobe, Eddie del Rio, and Daniel Parker-Focht.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles, and follow @LifeInsideDropbox on Instagram and Facebook to see what it's like to create a more enlightened way of working.