Published on

Productionizing Context Aware Everything

Written by
Nick Moy, Anshul Ramachandran

TL;DR We at Codeium are always looking for ways to improve on our enterprise product to give our customers the best suggestions to maximize their productivity. Here, we discuss how we productionized our industry-leading code context collection and prompt building, in a way that actually makes sense for companies. Codeium for Enterprises is the only AI assistant tool that properly productionizes using realtime context.

Background on Realtime Context

In earlier posts, we first discussed why context collection is an important, yet tricky, part of an AI code assistance stack, and then showed that, if done properly, advanced context awareness leads to dramatically better performance, pushing the quality of Codeium’s code suggestions far past other industry leaders like GitHub Copilot.

So now that we have proven how well context retrieval can work, let us get into the details of what such a retrieval stack looks like. Given all of the popular discourse on LLMs, most people would converge on a stack that looks a little like this:

  • Indexing (ahead-of-time): First, split your knowledge base into context-sized chunks, overlapping the chunks somewhat to account for badly placed splits. Then embed these chunks using some third-party embedding API (e.g. OpenAI Ada embeddings), and store the embeddings in a vector database (e.g. Pinecone).
  • Retrieval (real-time): First, use your remote embedding API to embed the user chat message/query. Then, use cosine similarity to match the resulting embedding to the K nearest items in your vector database, and put those items into the prompt in addition to the original input.
  • Profit.
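As a rough illustration of the stack just described (not Codeium's implementation), the whole thing fits in a few lines of Python. Everything here is an assumption made for the example: `toy_embed` is a bag-of-words stand-in for a real embedding API call, a plain list stands in for the vector database, and all function names are made up.

```python
import math
import re
from collections import Counter

def chunk_lines(lines, size=4, overlap=1):
    """Split lines into chunks of `size` lines that share `overlap` lines."""
    step = size - overlap  # assumes size > overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # this chunk already reaches the end of the input
    return chunks

def toy_embed(text):
    """Stand-in for an embedding API: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Ahead of time: embed every chunk and keep (embedding, chunk) pairs."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]

def retrieve(index, query, k=2):
    """At request time: embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

source = [
    "def parse_stack_trace(text):",
    "    return text.splitlines()",
    "",
    "def render_report(rows):",
    "    return ', '.join(rows)",
]
index = build_index(chunk_lines(source, size=3, overlap=1))
top = retrieve(index, "parse_stack_trace text", k=1)
```

The retrieved chunks in `top` would then be pasted into the prompt next to the user's input. A sketch like this is enough for a demo, which is exactly the problem discussed next.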

Naturally, indexing + retrieval is the first thing we tried as well, and it worked! Well, it worked often enough if all we wanted to do was record some cool demo videos of cherry-picked examples. But at Codeium, we have actual users (hundreds of thousands of them), and our features have to work without cherry-picking. So we dug in to understand how to build a system that would address the unique challenges of context for coding.

Challenges with Context Retrieval for Coding Tasks

So let’s walk through some of these challenges, in no particular order.

Note that we are focused here on the retrieval system itself and how to get its accuracy high enough to be useful. None of this addresses training the autocomplete or chat model to actually use retrieved context, as opposed to just shoving in the retrieved chunks and hoping the LLM can figure out what to do with them. But that’s a topic for another post.


Cross Domain

Take a chat functionality for code assistance. Users often ask questions in natural language, but a lot of the relevant context is in code! It also goes the other way: if you want to explain some code, the relevant context will be in documentation, which is natural language. The user-specified input is in a different domain from the context that you want to retrieve. Any production codebase in itself has a mix of natural language and programming languages, and off-the-shelf embedding models don’t capture this cross-modal relationship well.

Context Fragmentation

In many applications and most academic benchmarks (e.g. WikiQA), the “correct answer” is in exactly one place in one document. In code this is almost never the case. In a common, decently structured codebase, often 80% of the nominal functionality of a file is imported from libraries that exist all over the codebase. Those libraries will be hard to find with a naive embedding search, because a general-purpose library contains very little information about all the ways in which it ends up being used. Somehow your retrieval system has to trace the relationships between files that have little overlapping content, to ensure it pulls all the right items.

To further complicate the context fragmentation problem, a question about the same overall topic, asked at two different levels of detail, might require completely different fragments of context to answer. “How does our debugger work” vs. “What library do we use to parse error stack traces in our debugger” touch on the same topic, but at very different scopes, requiring different levels of summarization. The former might involve pulling the summary of a few classes that interact with each other. The latter might require pulling a specific method from within one of those classes.

Entity Rarity

Prior to embedding search, so-called “sparse” retrieval (a fancy word for word-matching algorithms) was state of the art for document retrieval. The reason why embeddings are hugely helpful to context retrieval systems is that they are capable of fuzzy matching, where there’s a fundamental underlying relationship between the search query and the target entity, even when none of the words used are the same. This way, a user can ask very vague questions without any reference to a specific filename, and still get reasonable responses.

However, research shows that for rare entities, and especially zero-shot entities that the embedding model did not see during training, embedding-based retrieval systems still significantly underperform traditional “sparse” (word-matching) retrievers. The research goes into more detail on why, but as some intuition, imagine studying for a test: if you need to know a formula well enough to actually use it to solve a problem, you need to have seen the pattern more times than if you just want to know the “gist” of an idea. In a similar way, rarer items will be less well represented by an embedding model.

Unfortunately, rare entities are very important for any coding-related task. Code is full of them: unlike natural language, code lets you use arbitrary strings of characters for an entity, or more interestingly, redefine what an entity means within the context of your repository, even if that differs from what it means in a different repository. In a very fundamental sense, this is the entire point of code. And once you define that new entity, its name may appear only a handful of times in a codebase: once when the function is defined and just a few more times when that function is called.

But to simultaneously raise the stakes, it is also very important for our system to get the specific entity exactly right in order to provide a useful suggestion. For example, we need to know the exact parameter list of the function in question, not that of a different function that does something similar.
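To make the word-matching intuition concrete, here is a minimal sketch of a sparse scorer that weights tokens by inverse document frequency (IDF), so that an identifier appearing in only one place counts for more than a word that appears everywhere. This is a generic textbook-style illustration, not the retriever we ship; the function names and the invented `frobnicate_ledger` identifier are assumptions for the example.

```python
import math
import re

def tokens(text):
    """Identifier-like tokens, kept whole (underscores included)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def idf_table(docs):
    """Inverse document frequency: tokens found in few documents weigh more."""
    n = len(docs)
    doc_freq = {}
    for doc in docs:
        for tok in tokens(doc):
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(1 + n / count) for tok, count in doc_freq.items()}

def sparse_score(query, doc, idf):
    """Sum the IDF weights of query tokens that literally appear in the doc."""
    return sum(idf.get(tok, 0.0) for tok in tokens(query) & tokens(doc))

docs = [
    "def load_config(path): return read(path)",
    "def load_cache(path): return read(path)",
    "def frobnicate_ledger(path, strict): return read(path)",
]
idf = idf_table(docs)
query = "frobnicate_ledger path"
best = max(docs, key=lambda doc: sparse_score(query, doc, idf))
```

Here `best` is the third snippet: all three match the common token `path`, but only one contains the rare identifier, and that exact match dominates the score. An embedding model that has never seen `frobnicate_ledger` has no comparable signal to lean on.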

Latency Constraints

For autocomplete, completion requests are produced with every keystroke. The typical autocompletion takes 100ms, and you cannot really get a whole lot slower if you want to keep developers in the flow. So, if a context retrieval system is going to be used, it can’t add on much more than 20ms on average or it will start to negatively affect the user experience.

Performing an embedding of the input takes about 100ms, and if you were to use a remote service or embedding API, you would incur another 100ms or so in network latency to the remote service. The actual similarity calculation is pretty fast since it boils down to just a matrix multiplication, but the embedding step is too slow to run on every autocomplete request.

Sparse heuristics, on the other hand, have less up-front compute cost, but this cost scales with the number of items that we need to retrieve and rank, and quickly becomes unmanageable over millions of records.
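To put the embedding numbers side by side, here is the back-of-the-envelope arithmetic as a tiny script. The constants are the approximate figures quoted above; the names are made up for the example.

```python
# Approximate figures quoted in the text above.
RETRIEVAL_BUDGET_MS = 20      # extra latency retrieval may add per request
EMBED_INPUT_MS = 100          # embedding the input
NETWORK_ROUND_TRIP_MS = 100   # added cost of calling a remote embedding API

def embedding_overhead_ms(remote_api: bool) -> int:
    """Per-request cost of embedding the input, optionally via a remote API."""
    return EMBED_INPUT_MS + (NETWORK_ROUND_TRIP_MS if remote_api else 0)

for remote in (False, True):
    cost = embedding_overhead_ms(remote)
    label = "remote API" if remote else "local model"
    print(f"{label}: {cost}ms, {cost / RETRIEVAL_BUDGET_MS:.0f}x the budget")
```

Embedding on every keystroke costs roughly 5x the budget with a local model and 10x with a remote API, before any similarity search has even started.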

Productionizing Context Aware Everything

Of course, we won’t just list out challenges without providing a little bit of insight into how we solved these issues to get the results demonstrated in the previous post.

  • Cross Domain: We trained our own embedding model specifically to map between natural language and code using docstrings as the natural language. This is actually what underlies our search capability, which, as an aside, we are slowly going to be folding directly into chat as chat becomes more “context aware” to be able to answer these direct search questions. With this, it is important for us to semantically parse concepts like docstrings from the code itself, and this led to our codeium-parse library. When we do embedding retrieval, we might get a hit on the docstring specifically to match the natural language, and that lets us retrieve the code snippet, which is what the user actually needs. When indexing a codebase, we often index multiple parts of the same code item, allowing us more opportunities to get a hit since we have semantically linked these parts via parsing.
  • Context Fragmentation: The first step is multi-key retrieval. Whether the input is a chat question or autocomplete input, we don’t just search via embedding the entire input. We break the input up into logical chunks and perform retrieval ranking against each of them, allowing us to hit all of the necessary fragments. On the indexing side, where we can, we try to parse semantically meaningful chunks of the text, such as a class name or method definitions, but to complement this, we always parse overlapping chunks of text as well. All of this semantic understanding of the chunks really helps us piece together meaningful fragments. Yes, it is great if you can retrieve a 60-line chunk of code that contains the first two lines of the chunk of code you care about. But it is not ideal. If that piece of code is inside of another object, or class, or namespace, it is important that we capture that relationship as well. The more we can capture the understanding a human brings to a codebase, the better.
  • Entity Rarity: We do something called “Hybrid retrieval,” which runs both embedding-based (i.e. dense) and sparse retrieval systems and combines the results. Research shows that so-called Hybrid approaches can significantly improve accuracy, both because of better recall, and better ranking among recalled items. It appears that dense and sparse retrievers provide complementary document recall, and complementary ranking of recalled items. So by implementing both and using a custom combination of Sparse-Dense similarity to decide what to retrieve, we are able to address entity rarity while still making the most of the power of embeddings to do fuzzy matching.
  • Latency Constraints: We have implemented a two-stage retrieval system, where across millions of potential entities, we asynchronously build a smaller subset cache of 100s or 1000s of potentially relevant entities based on what the user has done recently. Then, synchronously for every autocomplete request, we re-rank from this local cache of 100s or 1000s of entities. In this way we get coverage of the whole codebase, but add very little latency at the time of an autocomplete request. Oh, and we don’t use a third-party embedding API, which saves us that network latency cost.
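The last two bullets can be sketched together. What follows is a simplified illustration under stated assumptions, not our production code: the weighted sum in `hybrid_rank` is one common way to combine dense and sparse scores (the actual combination we use is custom), `coarse_score` is a hypothetical stand-in for "what the user has done recently", and the dense and sparse scoring functions are passed in by the caller rather than implemented here.

```python
from typing import Callable, Dict, List

def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_rank(dense: Dict[str, float], sparse: Dict[str, float],
                alpha: float = 0.5, k: int = 3) -> List[str]:
    """Rank ids by alpha * dense + (1 - alpha) * sparse (assumed weighting)."""
    d, s = min_max(dense), min_max(sparse)
    ids = set(d) | set(s)
    combined = {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0) for i in ids}
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(ids, key=lambda i: (-combined[i], i))[:k]

class TwoStageRetriever:
    """Stage 1 shrinks the corpus off the hot path; stage 2 re-ranks the cache."""

    def __init__(self, corpus: Dict[str, str], cache_size: int = 100):
        self.corpus = corpus            # id -> text, potentially millions of items
        self.cache_size = cache_size
        self.cache: Dict[str, str] = {}

    def refresh_cache(self, coarse_score: Callable[[str], float]) -> None:
        """Stage 1 (would run asynchronously): keep the top `cache_size` items."""
        ranked = sorted(self.corpus, key=lambda i: (-coarse_score(self.corpus[i]), i))
        self.cache = {i: self.corpus[i] for i in ranked[:self.cache_size]}

    def retrieve(self, dense_fn: Callable[[str], float],
                 sparse_fn: Callable[[str], float], k: int = 3) -> List[str]:
        """Stage 2 (per request): hybrid re-rank over the small cache only."""
        dense = {i: dense_fn(text) for i, text in self.cache.items()}
        sparse = {i: sparse_fn(text) for i, text in self.cache.items()}
        return hybrid_rank(dense, sparse, k=k)
```

The point of the split is that `retrieve` only ever touches `cache_size` items, so its cost stays flat no matter how large the corpus grows, while `refresh_cache` can take its time in the background.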

There is a lot we can talk about along each of these avenues, with sub-complications and solutions, but for the sake of this blog post’s length, we will move along.

Bringing Context Aware Everything to Enterprises

This context retrieval system, which we have dubbed “Context Aware Everything”, has been a huge technical undertaking, and we have evidence that this will propel Codeium way past its competitors. However, just like any other technical improvement, if Context Aware Everything is not productionized in a way that satisfies the customer’s requirements, then it is nothing more than a passing technical interest. From a user’s point of view, context aware everything:

  • Has to work on their codebase: It is cool if it works on some other person’s repositories, but useful if it works on mine.
  • Has to work locally: Code IP is important, so ideally nothing would leave my control (and definitely not my company’s control).
  • Ideally works without buying extra software or requiring manual configuration: this system is just a way to make existing models perform better. If I have to pay some additional subscription, request a special embedding process, pay for a hosted vector database, pay for an OpenAI key, or manually configure things together, then that’s just unnecessary friction.

So, at Codeium, how did we address these needs? We compute the embeddings locally and do all of the ranked retrieval locally as well, which means it will work on your repository. This also adds protection over your IP (we are not shipping your codebase out to a third-party embedding service), but if your company doesn’t want to send any code snippets out even for the model inference, our Enterprise plan is fully self-hosted, either on-prem or in your VPC. And if you don’t want to add more paid software? Codeium is free, and context aware everything just comes with it automatically. No upselling, no configuration.

GitHub Copilot, Amazon CodeWhisperer, Tabnine: none of these tools do context retrieval at a level of sophistication similar to Codeium. The only tool out there today that has some semblance of this context awareness is Sourcegraph’s Cody, but it is not a solution that is productionized properly. It only operates on open-source repositories, uses third-party LLM APIs as a backend, and requires you to actually purchase Sourcegraph itself, which is a piece of software unrelated to these generative AI models.

Final Thoughts on Context Awareness

We are proud of the work we’ve done to continue to raise the bar on generative AI model quality in the coding space. By productionizing it properly, context aware everything just works for all developers immediately.

This announcement is the first in a series of features we’ll be launching whose entire goal is to make Codeium an expert on your codebase, so it can help you solve your problems even better. We will have more technical content related to context awareness coming out soon, such as how the reranking actually works, how we used this logic to also power natural language search capabilities, and what is next in creating “extra brains” that can help you throughout the software development life cycle.

If this sounds interesting to you and you want to work on these kinds of problems, do contact us. We are hiring!