What to Know About the Context Going Into Your Code LLM

Written by
Codeium Team

TL;DR: Real-time context selection is a key aspect of, and differentiator in, building a good AI coding assistant, for three big reasons: context length caps, the outsized effect of focused context, and the degree of control AI application developers have over this problem.

AI for Code

If you code, odds are you’ve at least tried some generative AI tool that promises to make you more productive and take away the parts of software engineering that we do because we have to (but don’t really want to). There are actually quite a few such tools - GitHub Copilot, ChatGPT, Amazon CodeWhisperer, Sourcegraph Cody, TabNine, Codeium, etc. Sure, there are considerations like cost and availability for your desired language and IDE, but those aside, which AI assistant actually has the best quality suggestions?

To answer this question, we need to ask what such a tool can do to give you better quality outputs. There are a bunch of things that can be done ahead of time, such as improving the base model architecture (ex. Fill-in-the-Middle) or finetuning the model on your existing code to give you personalized suggestions that no generic model could (read more).

But unless you’re working on an open source project, a generic model probably knows nothing about the code you are writing from a training perspective. At best it can guess, and when that works it is magical. Today, we want to talk about the aspect of AI code assistants that happens in real time to make the model more knowledgeable about your work - smart context collection and prompt building.

There is no way to know ahead of time exactly what a user will want to add to their code, or what information from the codebase would be needed to create good suggestions for that task, but it should be clear how improved context improves the outputs of a code LLM. Imagine you are a front-end engineer creating a new UI with no knowledge of the relevant shared components that already exist in some internal library, or a back-end engineer trying to write a SQL query with no knowledge of the database schemas described in an imported file. Even if the model has been finetuned on your entire codebase, passing particularly relevant context in as part of the model invocation will yield noticeably higher quality suggestions that conform to your codebase.
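To make this concrete, here is a minimal sketch of stitching retrieved context, like that database schema, into the prompt sent to the model. The function names and prompt layout are hypothetical, not any particular vendor’s actual format:

```python
# A minimal sketch of injecting retrieved context into an LLM prompt.
# The layout and names here are hypothetical, not any vendor's actual format.

def build_prompt(snippets: list[str], current_file: str, prefix: str) -> str:
    """Prepend relevant codebase snippets so the model can conform to them."""
    context_block = "\n\n".join(
        f"# Relevant snippet from the codebase:\n{s}" for s in snippets
    )
    return f"{context_block}\n\n# File: {current_file}\n{prefix}"

# e.g. handing the back-end engineer's model the schema it could not
# otherwise know about:
prompt = build_prompt(
    ["CREATE TABLE users (id INT PRIMARY KEY, email TEXT, created_at TIMESTAMP);"],
    "analytics/queries.py",
    'def recent_signups(conn):\n    query = "SELECT ',
)
```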

The Context Building Dream

OK, so relevant context is good. How do we actually build a system that delivers it?

Let’s start by analyzing what developers hold as “internal state” when they code. They can obviously see what else is in the open file, but they also have context on what is in imported files, what is in files in the same directory or nearby in the repository tree, and how similar tasks have been implemented elsewhere. Somehow our brains consolidate all of that context to write code that efficiently completes the task at hand, uses internal utilities and data structures, and complies with the repository’s best practices.

So in a perfect world, our tool would automatically extract all such relevant code (crawl imports and the file structure, use embeddings to find similar tasks, etc.) and shove all of it as context into the prompt sent to our LLM.
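As a toy illustration of that gathering step (a naive regex import crawl, and a bag-of-words overlap standing in for real embeddings), it might look something like this:

```python
# A toy sketch of the "dream" gathering step. The regex import crawl and
# Jaccard word overlap are crude stand-ins for real parsing and embeddings.
import re
from pathlib import Path

def words(text: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_]\w+", text))

def overlap(a: set[str], b: set[str]) -> float:
    return len(a & b) / (len(a | b) or 1)

def gather_candidates(current: Path, repo: Path) -> list[Path]:
    src_words = words(current.read_text())
    # 1. Crawl imports: map `import foo.bar` to a repo file if one exists.
    mods = re.findall(r"^(?:from|import)\s+([\w.]+)", current.read_text(), re.MULTILINE)
    imported = [p for m in mods if (p := repo / (m.replace(".", "/") + ".py")).is_file()]
    # 2. Neighbors in the repository tree (same directory).
    nearby = [p for p in current.parent.glob("*.py") if p != current]
    # 3. Everything else, ranked by similarity to the current file, to
    #    surface places where similar tasks were implemented before.
    seen = {current, *imported, *nearby}
    rest = sorted(
        (p for p in repo.rglob("*.py") if p not in seen),
        key=lambda p: overlap(words(p.read_text()), src_words),
        reverse=True,
    )
    return imported + nearby + rest  # far more than any prompt can hold
```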

The problem? These models have limited context lengths. An autocomplete model like GitHub Copilot’s has a context of only 2048 tokens (~150 lines of code), primarily for latency reasons (you want autocomplete suggestions to be near-instantaneous). Yes, models with larger and larger context lengths are coming out, but even something like GPT-4 might only be able to take a few files’ worth of context (32k tokens), not to mention the extreme latency and cost that come with increasing context.
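For a back-of-the-envelope check on that ~150-line figure, here is the arithmetic using the common ~4-characters-per-token heuristic (actual tokenizer counts vary):

```python
# Rough arithmetic behind "2048 tokens ~= 150 lines", using the common
# heuristic of ~4 characters per token (actual tokenizer counts vary).

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

CONTEXT_LIMIT = 2048  # tokens available to a Copilot-style autocomplete model
typical_line = "        result = compute_total(items, tax_rate)\n"
print(CONTEXT_LIMIT // approx_tokens(typical_line))  # ~170, on the order of 150 lines
```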

Also, recent research suggests that more context is not necessarily better! For example, this analysis finds that models supporting longer context lengths can actually show a drop in performance. While perhaps unintuitive at first, on deeper inspection it makes sense that added fluff or slightly unrelated content might confuse the model more than help it. While not a perfect analogy: if I asked someone with no prior knowledge to describe an elephant shrew given a description of a mouse, they might give a better answer than if I also included a description of an elephant. Backing up, what this research really supports is that passing in a small amount of focused context beats passing in a large amount of generic context that happens to contain one small, super relevant part.

The context length constraint is the primary reason why ahead-of-time finetuning of the LLM is a key part of improving suggestion quality - at least then the model has seen everything to some degree. But finetuning should absolutely be supplemented with a robust method for collecting and selecting context to pass in at inference time. While the dream is to pass in all context, the realistic dream is to assemble the best possible context set that obeys the token limit constraint of the particular LLM.
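One simple way to frame “best possible context set under a token limit” is as a knapsack-style selection problem. Here is a greedy sketch, assuming each candidate snippet already has a relevance score from some upstream ranker (a hypothetical illustration, not Codeium’s actual algorithm):

```python
# A greedy knapsack-style sketch of the "realistic dream": keep the most
# relevant snippets that still fit the model's token budget. The relevance
# scores are assumed to come from an upstream ranker (hypothetical here).

def select_context(candidates: list[tuple[float, str]], budget: int) -> list[str]:
    """candidates: (relevance, code) pairs; returns the chosen code blocks."""
    def cost(code: str) -> int:
        return max(1, len(code) // 4)  # same ~4 chars/token heuristic as above

    chosen, used = [], 0
    # Sort by relevance per token so one huge file cannot crowd out
    # several small, highly relevant snippets.
    for rel, code in sorted(candidates, key=lambda c: c[0] / cost(c[1]), reverse=True):
        if used + cost(code) <= budget:
            chosen.append(code)
            used += cost(code)
    return chosen
```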

Closing Thoughts on Context

Real-time context collection is a very exciting axis for coding AI assistant development because it is a system that builds prompts automatically - a requirement, since autocomplete is a passive AI. Even for chat applications with user input, automatically pulling in relevant context would be a huge improvement, shifting effort from developer to system. And unlike model retraining, which takes a lot of time and effort, there is potentially a lot of value to squeeze out of existing models simply by rapidly iterating on context collection. It is purely an engineering and experimentation problem.

Of course, to iterate on it quickly with statistically significant results, you do need hundreds of thousands of developers providing feedback via A/B testing. At Codeium, we already have one of the largest developer communities of any AI product, and the next article will go into exactly how we have built a robust context building engine, which has led our base system to generate significantly better suggestions than other tools, including industry leaders such as GitHub Copilot.