Latency, the Ultimate Gen AI Constraint

Written by
Kevin Lu, Nick Jiang, Devin Chotzen-Hartzell, Douglas Chen
Time is very valuable.

Why isn’t this suggestion right? Why is the model hallucinating? This code assistant has access to my entire codebase, why are there so many “errors”?

The short answer is an under-discussed constraint on AI assistant tools: latency.

In an ideal world, for every suggestion a tool like Codeium makes to a developer, we would feed the largest possible LLM with the entire codebase as context (plus metadata about where the cursor is), thereby getting the highest quality suggestions possible. However, a baseline usability requirement of “low enough latency” complicates things. If a developer has to wait even a second for every autocomplete suggestion, then we might as well not be making a suggestion at all.

We will focus on autocomplete because, as we mentioned in other blog posts, it is a dense, passive AI that gives developers thousands of suggestions a day by running on every keystroke. So while chat is useful and has a much cooler UX than autocomplete, the sheer magnitude of value that can be driven by autocomplete and other passive AI modalities dwarfs that of instructive modalities like chat. Because we focus on autocomplete, we can also quantify the effect of latency on system performance using the golden metric of Characters per Opportunity introduced in our previous post. As we will see, latency is the primary driver of a low feedback rate factor.

Autocomplete Latency Breakdown

To make sense of the latency constraint, we need to analyze what happens under the hood when an AI autocomplete suggestion is made:

  1. There is some logic to collect the context from the IDE (and maybe elsewhere) that we want to pass to the model for inference
  2. This context is passed over the network to a remote GPU that is “powerful enough”
  3. The actual inference happens
  4. The inference result is passed back over the network to the client machine
  5. There is some logic to merge the inference result with the existing text properly

Let’s look at where latency comes up in each stage, though not in order. For the actual inference (step 3), we need to look at the model. The larger the model, the longer an inference takes; latency scales essentially linearly with the number of parameters in the model. These transformer models are also “autoregressive,” which means that latency also scales linearly with the length of the input text plus the length of the output text. So we cannot use the most gigantic model available, and we must also limit the context we pass in.
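
For intuition, here is a toy latency model in Python that follows this framing. Every constant in it is a made-up assumption for illustration, not a measurement of any real model or of Codeium’s systems:

```python
# A toy latency model for autoregressive inference. All constants are
# made-up assumptions for intuition; real costs depend on hardware,
# batching, and implementation details.

def estimated_latency_ms(
    num_params_b: float,   # model size in billions of parameters
    prompt_tokens: int,    # context passed in
    output_tokens: int,    # tokens to generate
    ms_per_token_per_b: float = 0.2,  # assumed cost per token per billion params
) -> float:
    # Following the framing above: latency grows with model size and with
    # the combined length of the input and output text.
    return num_params_b * ms_per_token_per_b * (prompt_tokens + output_tokens)

# A huge model fed a huge context blows far past an interactive budget,
# while a smaller model with trimmed context stays usable.
print(estimated_latency_ms(num_params_b=70, prompt_tokens=4000, output_tokens=64))  # ~56,896 ms
print(estimated_latency_ms(num_params_b=3, prompt_tokens=1500, output_tokens=64))   # ~938 ms
```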

With limited context, there will be latency in step 1 as well. It is no longer as simple as saying “pass it all in”; there will likely need to be some logic to figure out exactly which context is most relevant to pass to the inference. Sure, the code in the same file as the cursor is important, but what about the rest of the directory? Imports? Open tabs? Code somewhere else entirely? Given how “distributed” code is as a source of knowledge, you will likely need some sort of advanced embedding-based retrieval mechanism, so there is meaningful latency in both embedding the local context and retrieving candidates against a precomputed embedding index.
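
As a rough illustration, embedding-based context retrieval can look something like the following minimal sketch. The `embed` function and the snippet index here are hypothetical stand-ins, not Codeium’s actual retrieval system:

```python
import numpy as np

# Hypothetical stand-in for a real embedding model; in practice this is
# itself a neural network call with its own latency.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

# Precomputed index: snippets from across the codebase, embedded offline.
snippets = ["def parse_config(path): ...", "class HttpClient: ...", "import os, sys"]
index = np.stack([embed(s) for s in snippets])  # shape: (num_snippets, dim)

def retrieve_context(local_context: str, top_k: int = 2) -> list[str]:
    # Embed the text around the cursor, score every indexed snippet by
    # cosine similarity, and keep the best matches to send to the model.
    query = embed(local_context)
    scores = index @ query
    best = np.argsort(-scores)[:top_k]
    return [snippets[i] for i in best]

print(retrieve_context("client = HttpClient(base_url)"))
```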

And then the network latency, steps 2 and 4. It luckily turns out that sending 25-100KB of text from California to India takes only ~250ms. It is fortunate we send quite “light” data rather than images or videos. But a client may themselves have a poor internet connection, incurring latency that the application cannot control.

The final step 5, merging the inference result, is pretty much instantaneous. Sure, the logic can get a little complicated to make suggestions integrate seamlessly with the surrounding text, but it isn’t anything computationally intensive.
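
For a flavor of what this merge logic involves, here is a hypothetical simplification (not Codeium’s actual client code) that trims a suggestion so it does not duplicate text already sitting to the right of the cursor:

```python
def merge_suggestion(suggestion: str, text_after_cursor: str) -> str:
    # If the suggestion ends with text the user already has after the cursor
    # (e.g. a closing bracket auto-inserted by the IDE), trim the overlap so
    # accepting the suggestion does not duplicate characters.
    for overlap in range(min(len(suggestion), len(text_after_cursor)), 0, -1):
        if suggestion.endswith(text_after_cursor[:overlap]):
            return suggestion[:-overlap]
    return suggestion

# ")" already exists after the cursor, so the suggested ")" is dropped.
print(merge_suggestion("base_url, timeout=30)", ")"))  # -> "base_url, timeout=30"
```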

Why You Shouldn’t Use a Third Party API

That already looks like a lot of problematic sources of latency, but it gets worse if you do not own both the application and server layers. A very common decision LLM applications make today is to use a third party API rather than build a whole inference engine and manage hardware.

However, this means you add extra latency around the inference. You still send the request over the network to the API, but if the provider is doing any sort of rate limiting or scheduling logic (likely), you may add even more latency before you get a response. This is one of the main issues Sourcegraph Cody is running into: it uses third party LLMs for inference (and very large LLMs at that), so its autocomplete suggestions take seconds to produce, rendering the product borderline unusable.

And this is all without mentioning the cost of using third party APIs. Speaking of costs…

Other Latency Constraints as You Scale

Even if you have decided to manage your own hardware and models for the reasons above, once you are running a production application with lots of users, you start hitting a cost problem on that hardware. If you have high concurrency and want no additional latency between a request coming in and an inference being performed, you will need a lot of hardware, and that is expensive. Therefore, given realistic hardware constraints, you may have to queue up requests, which is another, potentially nondeterministic, source of latency.
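
To see why queueing shows up as a latency source, here is a toy simulation with made-up numbers: when requests arrive faster than a fixed pool of GPUs can serve them, queue time grows even though each individual inference stays fast.

```python
import heapq

def simulate_queue_waits_ms(arrivals_ms: list[float], service_ms: float, num_gpus: int) -> list[float]:
    # Each heap entry is the time a GPU becomes free; a request waits
    # whenever every GPU is still busy at its arrival time.
    free_at = [0.0] * num_gpus
    heapq.heapify(free_at)
    waits = []
    for t in arrivals_ms:
        earliest = heapq.heappop(free_at)
        start = max(t, earliest)
        waits.append(start - t)
        heapq.heappush(free_at, start + service_ms)
    return waits

# 200 requests arriving every 10ms, each taking 50ms, on 4 GPUs:
# arrival rate (100/s) exceeds capacity (80/s), so waits keep growing.
waits = simulate_queue_waits_ms([i * 10.0 for i in range(200)], service_ms=50.0, num_gpus=4)
print(f"median wait: {sorted(waits)[100]:.0f}ms, max wait: {max(waits):.0f}ms")
```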

Effects of Latency

Using the Characters per Opportunity metric, we can actually point to the effect of latency on the value of an AI autocomplete tool. How? We will look at just feedback rate, since that is the factor most tied to latency. As a reminder, feedback rate is the percentage of suggestions that the tool attempts to provide that are actually seen by the developer. Feedback rate is primarily affected by latency (the suggestion isn’t produced before the developer makes the next keystroke) and suggestion confidence (the suggestion is produced in time, but the tool deems it not useful enough to show to the developer). With quality held constant, as latency gets higher, the feedback rate drops.
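
As a toy illustration of this relationship (with made-up numbers, not Codeium data), feedback rate can be thought of as the probability that a suggestion arrives before the developer’s next keystroke:

```python
import random

def simulated_feedback_rate(latency_ms: float, mean_keystroke_gap_ms: float = 250.0, trials: int = 100_000) -> float:
    # A suggestion is "seen" only if it arrives before the next keystroke.
    # Keystroke gaps are modeled as exponential purely for illustration.
    random.seed(0)
    seen = sum(1 for _ in range(trials) if random.expovariate(1.0 / mean_keystroke_gap_ms) > latency_ms)
    return seen / trials

for latency in (50, 150, 400, 1000):
    print(f"{latency}ms latency -> feedback rate ~{simulated_feedback_rate(latency):.0%}")
```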

We know product decisions require tradeoffs that may involve actually changing the model or its execution. For example, if you are producing longer suggestions, the added latency may drop the feedback rate, but perhaps the increase in average tokens per acceptance makes up for it. However, we are interested in the effect of latency without any confounding variables, and are therefore interested in what can be done generically to reduce latency and increase the feedback rate.

Solutions to Latency While Maintaining Quality

It turns out that working under these latency constraints while still maximizing quality becomes entirely an ML infrastructure problem. For those unaware of Codeium’s origin story, we started as an ML infrastructure company building large-scale GPU virtualization and optimization software, managing tens of thousands of GPUs in public clouds. We realized that we could leverage this infrastructure knowledge to build a highly performant AI application, and thus Codeium was born. This multi-year head start on ML infrastructure is our not-so-secret edge over other products in the market, since it allows us to deliver much higher quality products under the constraints imposed by real usage patterns.

Here is a list of different things we have done to cut down on latency, in no particular order:

  • Smart model compilation: We have built our own GPU compiler for model inference to cleverly fuse kernels across layers and operations to make them run faster. We have been developing this for years, as we used it even in our previous business to maximize utilization and minimize latency for generic GPU workloads.
  • Model architecture that makes inference faster: We have made many tweaks to standard transformer architectures to make inference faster by leveraging our smart compilation methods even more.
  • Quantization: Most models store their weights as fp16 values (floating point 16, meaning each weight takes 16 bits to store). However, arithmetic in fp16 is much more expensive latency-wise than in int8 or int4 (8 and 4 bits respectively). In addition, given that GPUs have a fixed amount of memory, using 16 bits per weight caps the size of a model that you can even serve on a single GPU, because swapping a model’s weights in and out of GPU memory during inference is a very large latency hit. While fp16 is very useful in training so that gradients are stable, this level of precision is not as necessary for inference, so we have developed ways to “quantize” our models (i.e. map fp16 values to lower precision values) post-training to increase model throughput without losing much in quality (a minimal sketch of this mapping appears after this list).
  • Speculative decoding: Because sampling during inference is heavily memory-bound, there are tricks that use a smaller “draft” model to generate a sequence of tokens and then use that sequence as a reference when sampling from the larger model, speeding up its inference. Andrej Karpathy has a great, quick, intuitive explanation here, so check that out.
  • Model parallelism: As mentioned under quantization, a GPU has a fixed amount of memory, so even with every other trick, there is still an upper limit on the size of a model that can be served on a single GPU, since latency constraints force storing all of the weights in memory simultaneously. Or maybe not. We have built infrastructure and architected our models in such a way that we can parallelize different parts of the inference across multiple GPUs, allowing us to split the storage of weights across different GPUs, each with its own memory. This ends up giving similarly stellar latency with even larger models, which means higher quality suggestions.
  • Streaming: Most LLMs generate tokens until reaching a special “stop” token and then return the entire output. Instead, we stream tokens so that if the system detects a desired early termination condition (e.g. we have generated code up to the end of the current scope), we can return a response sooner, minimizing latency for this request, and cancel the rest of the generation, freeing up compute for other requests (see the streaming sketch after this list).
  • Context caching: Context collection can be expensive in large codebases, but usually much of the relevant context does not change from one keystroke to the next. We have built complex caching systems to minimize latency in the context retrieval stage of generation while still maintaining high retrieval quality.
  • Smart batching: This is very important once you reach the scale of even just hundreds of concurrent users (for perspective, we often have tens of thousands of concurrent users on our individual plan). Handling every request individually, without paying a fortune for hardware, will incur lots of latency in request queue time. We have built infrastructure that can take multiple requests arriving at similar times and run inference on all of them in parallel, even with variable-length outputs.
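
To make the quantization idea concrete, here is a minimal sketch of symmetric post-training int8 quantization; it shows the memory halving and the small rounding error, but it is an illustrative simplification and not Codeium’s actual quantization scheme.

```python
import numpy as np

def quantize_int8(weights_fp16: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric post-training quantization: map fp16 weights onto the int8
    # range [-127, 127] using a single per-tensor scale factor.
    scale = float(np.max(np.abs(weights_fp16))) / 127.0
    q = np.clip(np.round(weights_fp16 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights at inference time.
    return q.astype(np.float16) * np.float16(scale)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float16)
q, scale = quantize_int8(w)
print("bytes before:", w.nbytes, "after:", q.nbytes)  # 2048 -> 1024
print("max abs error:", float(np.max(np.abs(w - dequantize(q, scale)))))
```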

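And here is a minimal sketch of the streaming idea: consuming tokens as they are generated and stopping as soon as an early-termination condition fires. The token stream and the end-of-scope check are hypothetical stand-ins.

```python
from typing import Callable, Iterable, Iterator

def stream_until(token_stream: Iterable[str], should_stop: Callable[[str], bool]) -> Iterator[str]:
    # Yield tokens as the model produces them; stop consuming the moment the
    # termination condition fires, so the remaining generation can be
    # cancelled and its compute reused for other requests.
    generated = ""
    for token in token_stream:
        generated += token
        yield token
        if should_stop(generated):
            return

# Hypothetical stand-ins: a fake token stream and an "end of current scope" check.
fake_stream = iter(["return ", "a ", "+ ", "b", "\n", "\n", "def ", "next_function", "():"])
ends_current_scope = lambda text: text.endswith("\n\n")
print("".join(stream_until(fake_stream, ends_current_scope)))  # stops before "def next_function"
```
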
And this is not even the full list! The rise of open source models has made it much more tantalizing to build in-house as opposed to buying an application, but we often see folks underestimate the infrastructure side of things and quickly hit barriers and ceilings to performance.

Conclusion

By now it should be clear that we at Codeium take latency very seriously. And while no one should use Codeium simply because it is powered by the best infrastructure, that infrastructure is what allows us to serve the best product at a fraction of the cost. You may have heard that GitHub Copilot spends over $20 per developer per month in serving costs. We don’t, and that is why we can offer a product for free. Codeium is not free because it is worse than other products; in fact, quite the opposite. Give it a shot:
