TL;DR We prove that fine-tuning Codeium’s generic base code model on unseen code leads to substantial, observable improvements in suggestion quality over other tools such as GitHub Copilot.
Hallucinations and Why Fine-tuning Matters
If you’ve played around with any LLM application, you have almost certainly encountered what is dubbed a “hallucination,” i.e. when the model is confident in its blatantly false or fabricated output. While acting on a hallucination may be slightly embarrassing in a personal setting, believing one in a professional setting can be downright dangerous. Probably the best-known example is a lawyer who used ChatGPT to submit briefs citing completely made-up precedent cases. While that is an egregious example of fabrication, it is far more common for these models to simply produce confidently wrong answers, which raises the question - how can a model that has been trained on essentially all public data be so confidently wrong?
A very common reason is context. These models have been trained only on public data up to a certain date, so if you want a reasonable answer about your private data or more recent information, you need to provide all relevant background information as context to the model at inference time to trust that the result is consistent with reality.
Getting context right is especially important and tricky for code autocomplete, for three main reasons:
- For code autocomplete applications like GitHub Copilot or Codeium, unlike something like ChatGPT, this context collection is handled by the application rather than the user.
- For cost and latency reasons, these models can only pass ~150 lines of code as context (Codeium actually has twice the context length of GitHub Copilot, but that will be the topic of a future post...). This is because cost and latency scale quadratically with context length, so increasing to even 10 files of context would cost approximately 50-100x what it costs to run today, and would be so slow as to be practically unusable, breaking developer flow.
- For code, the training data much more commonly has examples where the same term refers to different concepts (ex. think how many distinct schemas exist for a database table called “Users” in the public corpus - if you don’t specify the actual schema at inference time, the model may well confidently “pick” the wrong one).
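To make the quadratic scaling in the second point concrete, here is a back-of-the-envelope sketch. The file sizes are illustrative assumptions (roughly 100-150 lines per file), not measured numbers:

```python
# Back-of-the-envelope cost scaling for attention-based models:
# compute grows roughly with the square of the context length.

def relative_cost(new_context_lines: int, base_context_lines: int = 150) -> float:
    """Approximate cost multiplier vs. a ~150-line context window."""
    return (new_context_lines / base_context_lines) ** 2

# Assume ~10 files of ~100-150 lines each as the expanded context.
low = relative_cost(10 * 100)    # roughly 44x
high = relative_cost(10 * 150)   # exactly 100x
print(f"~{low:.0f}x to ~{high:.0f}x the cost of today")
```

This is where the 50-100x figure comes from: even a modest expansion of context blows up compute quadratically.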
Put all of these facts together and it becomes very clear why companies may be unsure whether a code autocomplete solution will work for their private codebase. If the generic system too often hallucinates entire schemas, utility functions, classes, etc., then it may end up being more of a time sink in later debugging than a help via autocomplete.
We knew at Codeium that, to build the highest quality product for a company, we would have to personalize our system on the company’s private codebases - essentially, take additional steps so that Codeium has context on what exists in the codebase and what different terms or names refer to. We can split this into two halves - improving context awareness and fine-tuning the model. The reality is that advanced realtime context awareness is what really solves most of these problems, but fine-tuning can provide an extra boost.
Theoretically, fine-tuning should work, but there are a lot of reasons why this is a hard problem. How do we make sure the model is actually learning the private data? What parts of the codebase should be learned from to meaningfully improve usefulness to the user? For example, we probably don’t want to spend thousands of cycles training on package manifests or raw data files, even if they are in the repository. How do we know we are not overfitting to the private data and degrading general coding performance? We had to think through all of these and more in building a robust fine-tuning system.
In this post, we show that our fine-tuning system improves the performance of Codeium on private repositories by reducing hallucinations, thereby surpassing competing products such as GitHub Copilot.
Experimental Setup: Fine-tuned Codeium vs GitHub Copilot
Currently, Codeium (in generic base model form) and GitHub Copilot are the two most admired AI coding tools, according to the most recent StackOverflow developer survey:
Given this, it makes sense for us to pit Codeium with fine-tuning against GitHub Copilot, which doesn’t offer a fine-tuning option. Codeium and Copilot are already comparable, and it seems obvious that fine-tuning will improve on Codeium’s performance, so the real question everyone is interested in is whether fine-tuning will clearly break the tie between Codeium and Copilot.
What we therefore want to compare is the performance of Copilot, untrained on a private repository, against Codeium fine-tuned on that private repository. While we obviously cannot show results on one of our enterprise customer’s codebases, we can use the new and popular Langchain library for demonstration. By using an extremely popular open source repository, we give Copilot the best chance of knowing about Langchain, giving it the benefit of the doubt on whether Langchain is in its training data. We will fine-tune Codeium on the Langchain repository code, and then try to follow this Jupyter notebook to see how badly the two models hallucinate when trying to use the Langchain library. We held out all of the ipynb notebooks from the fine-tuning set for this test, so we know Codeium won’t just be regurgitating training data - it will have to use the information it “learned” from the library to produce reasonable suggestions.
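The held-out split described above can be sketched very simply: every notebook is excluded from the fine-tuning set so the evaluation notebook is never seen during training. A minimal illustration (the file paths are hypothetical):

```python
from pathlib import PurePosixPath

def split_training_files(paths):
    """Exclude all .ipynb notebooks from the fine-tuning set."""
    train, held_out = [], []
    for p in paths:
        (held_out if PurePosixPath(p).suffix == ".ipynb" else train).append(p)
    return train, held_out

# Hypothetical repository contents for illustration.
repo_files = [
    "langchain/retrievers/time_weighted_retriever.py",
    "docs/use_cases/agent_simulations/characters.ipynb",
]
train, held_out = split_training_files(repo_files)
```

Only the `train` list would ever be fed into the fine-tuning pipeline; the notebooks stay out as the evaluation set.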
To further eliminate any accusations of bias, we will run Copilot with its full capabilities - we will let it pull in context however it wants to (imports, nearby files, etc). On the other hand, we will restrict Codeium as much as possible to eliminate any potential confounding variables in order to prove that the fine-tuning is the difference maker. We will artificially restrict Codeium’s realtime inference context to only include the current file so any embedded information about Langchain being used by the model to produce a suggestion is the direct result of the fine-tuning process.(1)
With all of this set up, let us see how Copilot and a fine-tuned Codeium fare!
Results: Fine-tuned Codeium Outperforms GitHub Copilot
We will see how well the two systems perform as we step through the tutorial.
Task 1: Using the Right Class Given Comment
A common task is knowing what API/method/class should be used given a description of what we want, so that we as developers can avoid searching through pages of documentation or directly reading source code. Here, we want to use a TimeWeightedVectorStoreRetriever - we will keep typing out the name until the model actually suggests the right answer:
Above: GitHub Copilot
Not only did Copilot keep hallucinating until we were all the way to the Store, we even had an occasion where it did not make a suggestion at all, likely because it was unconfident in any of its predictions. Codeium on the other hand? Nailed it on the first shot. Probably no real surprise here - given how many different kinds of retrievers exist in the Langchain repository (which is still much smaller than most proprietary codebases), how in the world would Copilot know which ones should be put in the 150 lines of context? Most likely none were, and it just got lucky with tokens when we reached the Store part of the name.
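For readers unfamiliar with the class in question: a TimeWeightedVectorStoreRetriever ranks documents by combining semantic similarity with a recency term. Here is a minimal sketch of that scoring idea - the decay formula follows the form described in Langchain's documentation as best we know, and the numbers are purely illustrative:

```python
def time_weighted_score(similarity: float, hours_since_access: float,
                        decay_rate: float = 0.01) -> float:
    # The recency bonus decays exponentially with time since last access:
    # decay_rate=0 means recency never fades; decay_rate near 1 fades fast.
    return similarity + (1.0 - decay_rate) ** hours_since_access

# A slightly less similar but much fresher document can outrank a stale one.
stale = time_weighted_score(similarity=0.9, hours_since_access=100)
fresh = time_weighted_score(similarity=0.8, hours_since_access=1)
```

The point is that a model that has never seen this class has no way to guess either its name or this behavior from 150 lines of unrelated context.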
Task 1 Take #2: Using the Right Class Given Comment
Let’s just make sure that the first example wasn’t a fluke. The next step is to load in the proper embedding model,
Above: GitHub Copilot
Ok, not a fluke. Copilot doesn’t know the codebase, so it likely is little help in generating new code that utilizes existing code properly.
Task 2: Populating Arguments
Ok, so maybe pulling the proper reference out of thin air is tough, but what about properly populating arguments to a constructor or method? Once you have the object, perhaps GitHub Copilot can then get all the arguments or attributes and help populate them without hallucinating. The next thing we have to do is create a FAISS vector store and initialize it.
Above: GitHub Copilot
Here, it turns out the fine-tuned Codeium also makes some mistakes, like not capitalizing all of the letters in FAISS and not knowing which way to map index to docstore. The AI is still not perfect, but that’s ok. This is still better than GitHub Copilot, which makes up a FaissVectorStore object, hallucinates an embedding_model argument, and, probably the biggest issue, does not know how to extract the embedding_function from the embedding object, even though by this point it should have enough context to pull this information directly from the OpenAIEmbeddings class. What this demonstrates is that even with a lot of information on exactly what methods and classes are being used, GitHub Copilot, with no intrinsic knowledge of the underlying library, is unable to piece together what the underlying objects actually mean or do.
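To see why the index-to-docstore mapping and the embedding_function wiring trip models up, here is a minimal in-memory sketch of the same constructor shape. This is a toy, not the real FAISS or Langchain API - the embedding function is a hypothetical stand-in for a real model:

```python
import math

def toy_embedding(text: str) -> list:
    # Hypothetical stand-in for a real embedding model: counts of a few
    # characters, normalized to unit length.
    vec = [float(text.count(c)) for c in "abcdefgh"]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class ToyVectorStore:
    # Mirrors the constructor shape discussed above: an embedding *function*
    # (not the embedding object itself), a vector index, a docstore, and a
    # mapping from index positions to docstore ids.
    def __init__(self, embedding_function, index, docstore, index_to_docstore_id):
        self.embedding_function = embedding_function
        self.index = index                          # list of vectors
        self.docstore = docstore                    # doc id -> document text
        self.index_to_docstore_id = index_to_docstore_id

    def add(self, doc_id: str, text: str):
        self.index_to_docstore_id[len(self.index)] = doc_id
        self.index.append(self.embedding_function(text))
        self.docstore[doc_id] = text

    def nearest(self, query: str) -> str:
        q = self.embedding_function(query)
        best = max(range(len(self.index)),
                   key=lambda i: sum(a * b for a, b in zip(q, self.index[i])))
        return self.docstore[self.index_to_docstore_id[best]]

store = ToyVectorStore(toy_embedding, [], {}, {})
store.add("doc-1", "aardvark abacus")
store.add("doc-2", "hedgehog highbrow")
```

Getting this wiring right requires knowing which argument is a callable and which direction the index-to-docstore mapping runs - exactly the kind of detail a model only learns from the library itself.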
Task 2 Take #2: Populating Arguments
Let’s try this task once more. The next step is to initialize an agent with some memory, which is done via GenerativeAgentMemory, and we’ll need an LLM, so let’s just use OpenAI’s GPT-3.5-Turbo:
Above: GitHub Copilot
Overall, the fine-tuned Codeium produced pretty much exactly what we wanted. The only nit was the lack of capitalization on OpenAIChat - there may be some further improvements we could make to data sanitization in the fine-tuning logic! GitHub Copilot on the other hand? Well, not only does it make up llm_model as an argument, it completely hallucinates a GPT2LM model, doesn’t know that it is OpenAIChat instead of OpenAIChatModel, passes t as the temperature argument instead of the actual full word temperature (which is probably common in many other ML libraries, but not this one), and never quite picks up on the full name gpt-3.5-turbo. Plainly, it again doesn’t seem like GitHub Copilot knows anything about what is actually going on in the Langchain repository.
Personalization is the Future of LLMs, Buy Accordingly
We can keep going, but by now it should be clear why fine-tuning can be another win for anyone trying to use generative AI for software development on private repositories. This is one of the reasons why in an earlier post, we claimed that the future of LLM applications was in personalization. Not just personalizing the UX and the realtime context, but also personalizing the model itself to the data and patterns that a generic model will never have access to. While many companies will tout numbers to make it seem like their products will help every developer, really ask yourself whether your code looks like public code, or if, instead, you have specific, proprietary code. If you are looking to buy, or even use, a solution, really make sure that it is valuable to you, otherwise you will spend more time debugging hallucinations than you nominally save by using autocomplete.
In a separate post, we talk about how we have taken this fine-tuning technology and integrated it into our Codeium for Enterprises offering in a way that makes it ideal for companies, even those with rapidly changing codebases, access control requirements, or budget concerns.
Contact us if you want to learn more:
(1) Of course, Codeium’s actual realtime context building is significantly more advanced than “current file” - look out for a future blog post about the complex intricacies of Codeium’s industry-leading context building system.