Published on

Fine-tuning with Codeium for Enterprise: Personalizing AI Tools to Your Code

Written by
Kevin Hou, Nick Jiang, Rahul Sridhar

TL;DR By nature of self-hosting, Codeium can personalize the system to your repositories and knowledge bases, which includes continuously fine-tuning on this internal data without additional compute or manual intervention. This is one part of how Codeium for Enterprises gives your company the highest quality system for your use cases, unlike any other AI coding assistance enterprise product.

Cute robot reading a lot of books.

Background on Personalization and Fine-tuning

About a month ago, we wrote a little bit about how personalizing LLMs to a user or company would be the next real bump in quality, just as fine-tuning LLMs to an application (ex. code) makes them better for that application. Specifically in the coding assistant space, the tl;dr of the argument was that you are the industry experts in what you do, and have more relevant, higher quality code in your private repository than can be found in public repositories. Therefore, a generic system, irrespective of deployment method, would not provide the same generation quality as personalized systems, and this includes the model itself.

The crux is that a single inference to a model like GitHub Copilot can only take up to ~150 lines of code as context (2048 tokens). Even GPT-4, the best reasoning engine of our day, can take maybe 3 to 4 files of code as context (32k tokens). So while Codeium offers twice the context length (4096 tokens), and while the more important half of personalization is creating advanced context collection methods to best fill that limited context length, these may not always be enough to understand a company’s internal libraries, use of APIs, semantics, and more. To further reduce hallucinations (suggestions that look right but really aren’t because they refer to methods or data structures whose existence the model incorrectly guesses), it helps if the model actually “sees” all of the methods and data structures that do exist.
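As a rough back-of-envelope (using the ~150-lines-per-2048-tokens figure above, i.e. roughly 13-14 tokens per line of code; actual tokenization varies by language and tokenizer), here is how little code actually fits in each context window:

```python
# Rough estimate of how many lines of code fit in a context window.
# Assumes ~13.5 tokens per line, derived from the ~150 lines / 2048
# tokens figure above; real tokenizers vary by language and style.
TOKENS_PER_LINE = 13.5

def lines_that_fit(context_tokens: int, reserved_for_output: int = 256) -> int:
    """Approximate lines of code usable as input context,
    after reserving some of the window for the generated output."""
    return int((context_tokens - reserved_for_output) // TOKENS_PER_LINE)

for window in (2048, 4096, 32_000):
    print(f"{window} tokens -> ~{lines_that_fit(window)} lines of context")
```

Even the 32k window amounts to only a few thousand lines - a tiny fraction of any enterprise codebase, which is why fine-tuning complements context collection rather than replacing it.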

A couple of days ago, we started publishing examples that show how fine-tuned Codeium models have massively better suggestion quality than generic model-based systems like GitHub Copilot.

There is corroborating evidence from other companies that matches our first-hand experience that fine-tuning models on existing proprietary code bumps suggestion quality, and therefore bumps developer productivity. Meta walked away from GitHub Copilot to fine-tune their own code model primarily for this reason. Meta knew that these models would need lots of context to produce reasonable results, but given that costs scale with input tokens, even if the context length was massively increased to get good results, the price tag would be prohibitive. The much more reasonable approach would be to train the model on Meta’s code, which would also keep latency of the model under control. Now, Meta, like any company, knew they would never send their entire codebase over the internet to another company just to fine-tune (let alone Microsoft), so they fine-tuned their own model, which they named CodeCompose, on their proprietary codebase. And the results were clear: there was “a 23-25% improvement brought about by fine-tuning the LLM on the [sic] Meta’s internal code.” In particular, the model improvements were most evident in Meta-centric languages such as Hack, for which there is relatively little public code available for pretraining. Thus, by fine-tuning on internal code, Meta researchers demonstrated the potential for a code assistant to acquire capabilities personalized for their company that were unavailable from purely pretraining on public code.

Naturally, every company is now asking the question - how do we get the same personalization?

Considerations for an Enterprise Fine-tuning System

First and foremost, a lot of companies are not comfortable sending their entire private repository to some third party for any form of personalization. Maybe you and your security team make an exception for sending code snippets to third party servers a la GitHub Copilot or Amazon CodeWhisperer, but the entire codebase? That’s too hard a sell. Even if you use a SaaS source code management tool like GitHub, allowing your source code to be used as training data on some infrastructure that you have zero control or visibility into feels wrong. This is the double whammy of our decision to make Codeium for Enterprises entirely self-hosted, on-prem or in your VPC - it’s not just a security win for inference, but also an absolute necessity for personalization.

But there are a lot of other considerations:

  • Code changes rapidly, more so the larger your company gets, so ideally the fine-tuned model will always have context on the most recent version of your code. A one-time job on a snapshot of your codebase will quickly go stale.
  • At many organizations, not every developer has access to every repository, often for security and confidentiality reasons. It would be a problem if a model fine-tuned on a repository were served to a developer who doesn’t have access to that repository, because that would leak information across access control lines. But at the same time, if a developer does have access to that repository, then the model they use should be fine-tuned on that repository for best quality. This immediately means you would ideally have multiple fine-tuned models to best serve every developer.
  • GPUs are expensive, and maintenance is costly. How many extra compute resources do you need to provision to do fine-tuning? How much work is it for someone to oversee this system? How often does the system need to go down to perform this fine-tuning? All of these are real costs, which is why SaaS has blown up as a business model.

Clearly, it is not easy for every company to develop a satisfactory system themselves, let alone handle the actual modeling and data ingestion intricacies. That’s why we instrumented it all in Codeium for Enterprises for you.

Fine-tuning with Codeium for Enterprises


Before diving into how it is instrumented in Codeium for Enterprises, first see examples for yourself that fine-tuning with Codeium does work by reading our previous post. Here’s just one of the examples that shows how GitHub Copilot (first) hallucinates when trying to find the right class to use given a comment, while a fine-tuned Codeium model (second) gets the right class, TimeWeightedVectorStoreRetriever, on the first try:

Above: GitHub Copilot

Above: Codeium

How Fine-tuning is Instrumented for You

As part of personalizing Codeium for your enterprise, we support fine-tuning our base generative models on your existing repositories and knowledge bases entirely within your self-hosted Codeium instance, which means none of your data will leave your management and no one will gain temporary or permanent access, including us. These models will automatically be deployed in your self-hosted Codeium instance - the model weights will never leave your instance either.

This fine-tuning will happen on the same hardware used to serve developers, so you will not need to provision extra compute resources, schedule fine-tuning passes, or cause temporary outages for your developers. The system will smartly use any idle cycles to perform fine-tuning steps continuously. GPUs are expensive, so this allows us to actually get you the best value for them - we will maximize resource utilization, which essentially means that we will always be fine-tuning in every millisecond of downtime. Whenever the system gets an inference request from a developer, the current update step will be preempted so that your developers get their suggestions immediately - the co-location of training and inference adds just a sub-10ms latency hit, imperceptible to humans. By taking this approach, we have the cheapest hardware requirements (zero extra cost!) as well as a guarantee that your fine-tuned model will always have the most up-to-date view of your knowledge bases.
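The preemption idea can be sketched roughly as follows. This is an illustrative toy, not Codeium’s actual scheduler; `fine_tune_step` and `serve` are hypothetical placeholders for a gradient update and a model inference:

```python
import queue

# Toy sketch of co-locating training and inference on one GPU: the
# training loop runs one small update step at a time and yields to any
# pending inference request between steps.
inference_requests: "queue.Queue[str]" = queue.Queue()

def fine_tune_step(step: int) -> None:
    """Hypothetical placeholder for one gradient update on a micro-batch."""
    pass

def serve(request: str) -> str:
    """Hypothetical placeholder for running model inference."""
    return f"completion for {request!r}"

def run_loop(total_steps: int) -> list:
    served = []
    step = 0
    while step < total_steps:
        # Preempt training whenever a developer request is waiting.
        try:
            req = inference_requests.get_nowait()
            served.append(serve(req))
            continue
        except queue.Empty:
            pass
        fine_tune_step(step)  # otherwise, use the idle cycle for training
        step += 1
    return served
```

Because inference is checked before every update step, a developer request never waits behind more than one (small) training step - which is where the sub-10ms bound on added latency comes from.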

In the case of security constraints on sensitive data or multiple sets of access controls, Codeium can run multiple fine-tuned models on the same hardware resource, each exposed to different sets of repositories as dictated by access controls. Depending on the hardware provisioned, we can host a large number of fine-tuned models, and this is because we have built some complex ML infrastructure to swap in weights at inference time based on request metadata (such as user). This allows us to dynamically serve multiple models on the same GPU even if the sum of all model sizes greatly exceeds the available memory on the chip.
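To illustrate the idea (a toy sketch, not our actual infrastructure), serving many fine-tuned variants from one GPU can be thought of as an LRU cache of fine-tuned weights keyed by access-control group, with variants loaded on demand and the least recently used one evicted when memory is full; the group names below are hypothetical:

```python
from collections import OrderedDict

class ModelSwapper:
    """Toy LRU cache of fine-tuned model variants on one GPU."""

    def __init__(self, max_resident: int):
        self.max_resident = max_resident          # variants that fit in GPU memory
        self.resident: OrderedDict = OrderedDict()

    def _load_weights(self, group: str) -> dict:
        """Hypothetical placeholder for loading a variant's weights onto the GPU."""
        return {"group": group}

    def get_model(self, group: str) -> dict:
        if group in self.resident:
            self.resident.move_to_end(group)      # mark as recently used
        else:
            if len(self.resident) >= self.max_resident:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[group] = self._load_weights(group)
        return self.resident[group]

swapper = ModelSwapper(max_resident=2)
swapper.get_model("platform-team")
swapper.get_model("payments-team")
swapper.get_model("platform-team")   # cache hit: stays resident
swapper.get_model("research-team")   # evicts "payments-team"
```

The request metadata (such as user) maps to an access-control group, which selects the variant - so the total size of all variants can exceed GPU memory as long as the working set at any moment fits.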

Codeium is able to fine-tune on data from any Source Code Management (SCM) tool, such as (but not limited to) GitHub, GitLab, and Bitbucket - and soon, also on non-code data relevant to software development (ex. Confluence documentation). We are SCM-agnostic. The only required investment is a one-time setup effort: generate read-access-only tokens for these SCMs/data sources, provide them to the self-hosted Codeium instance, and point to a branch to read from, often a main or master branch. Because of the continuous fine-tuning, that’s all the required management!
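For concreteness, that one-time setup might look something like the following. The shape and field names here are hypothetical illustrations, not Codeium’s actual configuration schema; the URLs are placeholders:

```python
import os

# Hypothetical shape of the one-time SCM setup: each entry names a
# provider, a repository, the environment variable holding a
# read-access-only token, and the branch to continuously ingest.
scm_sources = [
    {
        "provider": "github",
        "url": "https://github.example.com/acme/monorepo",
        "token_env_var": "GH_READ_ONLY_TOKEN",
        "branch": "main",
    },
    {
        "provider": "gitlab",
        "url": "https://gitlab.example.com/acme/infra",
        "token_env_var": "GL_READ_ONLY_TOKEN",
        "branch": "master",
    },
]

def resolve_token(source: dict) -> str:
    """Read the token from the environment rather than storing it in config."""
    return os.environ.get(source["token_env_var"], "")
```

Keeping the tokens read-access-only and out of the config file itself limits the blast radius of the integration: the instance can pull code but never push, and rotating a token is a one-line environment change.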

Oh, and we provide a handy dandy dashboard so you have transparency into the entire system:

Finetuning dashboard for Codeium.

How Is All of this Fine-tuning Possible?

Your first reaction might be that this all sounds a bit too good to be true. No extra hardware? Many fine-tuned models? Zero maintenance overhead? It just… works? How can a relatively new product like Codeium have such sophisticated infrastructure?

It might be useful to provide a bit of history - Codeium is actually not our first product. For a couple of years before Codeium, we built large-scale ML infrastructure for some of the world’s largest AV and robotics companies through an Infrastructure-as-a-Service model. The secret sauce was in GPU virtualization and other tricks such as batching and smart model compilation (we built our own GPU compiler for model inference!) to increase the normally low utilization of GPUs at inference time. This helped reduce the GPU requirements for some of our customers by almost 97%, and we were even managing around 20% of all of GCP’s GPU inference capacity in us-central1.

All this to say that this infrastructure is the result of many years of work, something that will just naturally take any other team a while to catch up on. Our expertise is in ML infrastructure, and this is what will make the Codeium experience magical for any company, even more than the state-of-the-art models.

What We've Found About Fine-tuning

Simply put, fine-tuning is a bit hit or miss depending on the nature of your codebase. Even if you have hundreds of thousands or millions of lines of code, if it is all code that looks roughly the same as public code (with perhaps specific internal naming), fine-tuning does not really provide much more personalization value on top of realtime advanced context awareness. But if you do have roughly a few hundred thousand lines of domain-specific language (DSL) code or other incredibly specific code in your codebase, you should be able to notice cases where the model gives more reasonable suggestions.

Also, we don't want to promise fine-tuning as a silver bullet that works right off the bat. Normally, enterprises have to go through a few iterations of configuring the proper repositories, choosing the right subset of code, etc., before actually seeing performance improve, if at all.

We’ve also found that more code is pretty much always better. A common question we get is whether it is ok to fine-tune on code that might not actually be good quality. Generally speaking, we haven’t (yet) found the base model “unlearning” good coding practices. Again, this is a numbers game - a large codebase may have 10-100M tokens of code while our base model has been trained on the order of trillions of tokens of code. The point of fine-tuning is not to teach the model good coding practices (that should already be encoded in the base model), but to teach the model how your company does things. We have found, however, that it is good to exclude parts of the codebase that may not be reflective of how the company does things today (ex. old parts of the codebase that may be using deprecated APIs). And in terms of speed of fine-tuning, we believe that companies with medium sized codebases will start seeing some effect of fine-tuning in just a couple of nights of idle time, provided that the existing code is amenable to fine-tuning.
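The advice above - exclude parts of the codebase that no longer reflect how the company writes code - amounts to a simple corpus filter. The path prefixes and file extensions below are hypothetical examples, not a recommended default list:

```python
from pathlib import PurePosixPath

# Hypothetical exclusion rules: skip stale or vendored directories so the
# fine-tuning corpus only reflects how the company writes code today.
EXCLUDED_PREFIXES = ("legacy/", "deprecated/", "third_party/")
CODE_SUFFIXES = {".py", ".go", ".ts", ".java"}

def include_in_corpus(path: str) -> bool:
    """Decide whether a repository file belongs in the fine-tuning corpus."""
    if any(path.startswith(prefix) for prefix in EXCLUDED_PREFIXES):
        return False  # stale or third-party code: not representative
    return PurePosixPath(path).suffix in CODE_SUFFIXES

files = ["src/api/handlers.py", "legacy/old_client.py", "docs/README.md"]
corpus = [f for f in files if include_in_corpus(f)]
```

Since more code is almost always better, the filter should stay coarse - drop whole stale directories rather than trying to judge individual files for quality.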

Comparisons to Other AI-powered Products

So how does this shape up against other tools? Well, pretty much all of the other tools, such as GitHub Copilot and Amazon CodeWhisperer, don’t even have a self-hosted offering. Their enterprise offerings are simply their individual offerings with seat management added on. As Meta found, there is no room for personalization.

The only other established self-hosted offering is Tabnine’s, and on paper, they also offer personalization. However, there are some very large differences. First, just from a base model quality perspective, Tabnine still had a lot of room for improvement when compared to Copilot or Codeium, and our analysis has shown they were missing a lot of basic LLM features for code. So even if their personalization improves the model, companies are starting at a much lower quality. However, other parts of the self-hosting and fine-tuning offering are also less than ideal:

  • The Tabnine system itself required multiple expensive GPUs to self-host (minimum A100s) even without fine-tuning, as opposed to Codeium, where many hundreds of developers can be served from a single A10 GPU, which is ~$6k/year on reserve in a cloud like AWS with no credits.
  • Tabnine’s fine-tuning was a single job at a snapshot of the codebase, not a continuous process like Codeium’s.
  • Tabnine’s fine-tuning was more expensive and hands-on, costing many tens of thousands of dollars per fine-tuning job, which probably requires additional, even more expensive, hardware (or if not, would require taking down the system to train on the provisioned hardware).
  • Given the ML infrastructure capabilities Tabnine has demonstrated, it was unlikely that their system could support multiple fine-tuned models on the same hardware.

We at Codeium are obsessed with getting our enterprise customers the best quality experience that they could theoretically achieve for themselves. This means building an up-to-date personalized system, not a generic one, that complies with all of the enterprise’s internal requirements. Oh, and not costing a fortune.

Reach out if this sounds interesting to you and your company: