Code Completion Models
Large language models are trained on billions of bytes of data to perform exactly one task extremely well: given the preceding N characters, predict the next one. The driving force behind the AI revolution we're currently experiencing is that predicting the next character with high accuracy is an incredible superpower. It lets you build chatbots like Bing and ChatGPT, copywriting assistants like Jasper, and code completion tools like Codeium and Copilot.
The models powering code completion tools know how to complete entire functions just from their signatures:
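As an illustrative sketch (the function and completion here are hypothetical stand-ins for the tool's actual suggestion), a model given only the signature and docstring can produce the entire body:

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards."""
    # Everything below this line is the kind of completion a code model
    # can produce from the signature and docstring alone.
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]
```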
They can see your imports and predict what task you're trying to complete:
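For instance (a hypothetical sketch, not the tool's actual output), seeing only two imports at the top of a file gives the model a strong hint about what the file will do:

```python
# Given only these two imports at the top of a file...
import csv
import json

# ...a completion model can infer that the file likely converts CSV to
# JSON, and suggest a function along these lines:
def csv_to_json(csv_text: str) -> str:
    rows = list(csv.DictReader(csv_text.splitlines()))
    return json.dumps(rows)
```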
But there's a problem: the model only knows about the code before your cursor. What about everything that's after? The existing code there can be incredibly useful when programming, providing information about potential functions to call, coding practices to emulate, and approaches to take.
So, what's the solution? Enter Fill-in-the-Middle (FIM). Introduced by OpenAI in a 2022 paper, FIM is an under-discussed training technique that lets a language model incorporate the context that comes after the cursor.
How Fill-in-the-Middle works
It's quite simple: let's say we have a training example that looks like this:

```
The quick brown fox jumps over a lazy dog.
```

and we want the model to learn to predict the middle text `jumps over` from the prefix `The quick brown fox ` and the suffix ` a lazy dog.` First, we make two cuts to separate these sections, introducing the special tokens `<PRE>` (prefix), `<MID>` (middle), `<SUF>` (suffix), and `<EOM>` (end of middle):

```
<PRE>The quick brown fox <MID>jumps over<EOM><SUF> a lazy dog.
```

Then we simply transpose the middle and suffix:

```
<PRE>The quick brown fox <SUF> a lazy dog.<MID>jumps over<EOM>
```

Now, we train exactly like we did before, predicting the following text `jumps over<EOM>` from the earlier text `<PRE>The quick brown fox <SUF> a lazy dog.<MID>`. The model automatically learns the meaning of the special tokens and learns that it is expected to generate text that makes sense after the prefix but before the suffix!
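This rearrangement can be sketched as a small preprocessing function (the token strings match the example above; a real tokenizer would use dedicated token IDs):

```python
PRE, SUF, MID, EOM = "<PRE>", "<SUF>", "<MID>", "<EOM>"

def make_fim_example(doc: str, cut1: int, cut2: int) -> tuple[str, str]:
    """Split doc at two cut points and rearrange it into the FIM training
    format: the model sees `context` and must predict `target`."""
    prefix, middle, suffix = doc[:cut1], doc[cut1:cut2], doc[cut2:]
    context = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    target = f"{middle}{EOM}"
    return context, target

context, target = make_fim_example(
    "The quick brown fox jumps over a lazy dog.", 20, 30
)
# context == "<PRE>The quick brown fox <SUF> a lazy dog.<MID>"
# target  == "jumps over<EOM>"
```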
At inference time, if we're trying to infill a document like the following (with the blank marking the cursor):

```
The quick brown fox ___ a lazy dog.
```

we can present it to the model as

```
<PRE>The quick brown fox <SUF> a lazy dog.<MID>
```

and request characters until the model emits an `<EOM>` token, at which point it has successfully joined the prefix to the suffix.
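The inference loop can be sketched as follows, with a stub in place of a real model (the `generate_token` callable is hypothetical; a production system would sample from the network):

```python
EOM = "<EOM>"

def infill(prefix: str, suffix: str, generate_token) -> str:
    """Generate the middle of a document token by token until <EOM>."""
    context = f"<PRE>{prefix}<SUF>{suffix}<MID>"
    middle = ""
    while True:
        tok = generate_token(context + middle)  # next-token prediction
        if tok == EOM:
            break
        middle += tok
    return prefix + middle + suffix  # rejoin the completed document

# Stub "model" that emits a fixed middle, one token at a time.
canned = iter(["jumps", " ", "over", EOM])
doc = infill("The quick brown fox ", " a lazy dog.", lambda ctx: next(canned))
# doc == "The quick brown fox jumps over a lazy dog."
```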
FIM vs non-FIM models
With FIM, we can greatly improve the accuracy of code completion tools by providing context to the model that would otherwise be missing. Let's see some examples comparing two different code autocomplete tools, Codeium and Tabnine Pro.
Codeium is a free code completion product used by tens of thousands of developers around the world. Codeium's enterprise offering allows customers to self-host Codeium in their virtual private cloud or on-premise to ensure that no data is sent outside of the company. Tabnine is an AI code assistant that also offers self-hosting for enterprises.
Here are two suggestions for the same prompt from each tool. Codeium, on the left, uses a FIM model: it can see the usage of the `distance` function below the cursor and infer that it is supposed to compute the edit distance between `a` and `b`. Tabnine Pro, on the right, likely did not use FIM at the time of writing, and gives a worse suggestion as a result.
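To make the scenario concrete, here is a hedged reconstruction of the kind of file involved; the Levenshtein body below is one plausible completion a FIM model could produce from the call after the cursor:

```python
def distance(a: str, b: str) -> int:
    # A FIM model, seeing the call below the cursor, can infer that this
    # body should compute the Levenshtein (edit) distance between a and b.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# The usage after the cursor that gives the model its context:
print(distance("kitten", "sitting"))
```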
In this Golang code, Codeium understands that it needs to initialize the `messages` channel, while Tabnine does not.
Codeium can even generate an accurate docstring for an already-implemented function:
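A hypothetical illustration of that scenario: with the cursor sitting on an empty docstring line, the already-written body below the cursor tells a FIM model exactly what to describe:

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Clamp x to the inclusive range [lo, hi].

    (This docstring is the kind of text a FIM model can generate: the
    implemented body *below* the cursor tells it what the function does.)
    """
    return max(lo, min(x, hi))
```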
Software engineering is rarely a linear task: programs are usually not written in one shot from start to end. Most day-to-day programming involves adding functionality, refactoring code, and fixing bugs—all tasks that benefit greatly from context after the cursor.
It should be no surprise, then, that code completion models trained with FIM capabilities easily outperform simple left-to-right models. Indeed, when we deployed FIM for all Codeium users, we saw large increases in our acceptance rates and user satisfaction.
Off-the-shelf code completion models like Salesforce Codegen (which powers FauxPilot) have not been trained with FIM, so code completion tools that want to use FIM need to train their own models. This is harder than it may seem—there are some subtleties involved in choosing where to cut the document and in ensuring that your model's left-to-right performance does not suffer.
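One such subtlety is to apply the transformation only to a fraction of training documents (a "FIM rate"), so the model still sees plenty of ordinary left-to-right examples. The sketch below assumes random character-level cuts and a 50% rate; these are illustrative choices, and real systems may cut at token or line boundaries with different rates:

```python
import random

FIM_RATE = 0.5  # fraction of documents converted to FIM format (assumed value)

def to_training_text(doc: str, rng: random.Random) -> str:
    """Convert some documents to FIM format, leaving the rest untouched
    so that plain left-to-right ability is preserved."""
    if rng.random() >= FIM_RATE:
        return doc  # ordinary left-to-right training example
    # Choose two random cut points (character-level, for illustration).
    i, j = sorted(rng.randrange(len(doc) + 1) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}<EOM>"
```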