A member of our sales team likes to use the quote “if you can’t measure it, you can’t manage it.” While this certainly applies to the real-time analytics we give customers to justify tool usage and adoption, the saying is equally, if not more, important for our product development process. Measuring and evaluating our systems over time is what allows us to steer in the direction of growth; without it, we could not be confident that we are shipping a great product.
Evaluation is a hot topic in the LLM world, with recent headlines on “drift” in products like ChatGPT and a never-ending barrage of papers all supposedly proposing the next big breakthrough. As an end-to-end product, we have to be very thoughtful about our process and, existentially, develop a thorough, convincing set of tests to continuously assure the quality of our AI code assistant.
Existing Evaluation Methods
Unfortunately, “industry benchmarks,” especially in the AI code assistant space, are generally problematic.
Take, for example, HumanEval, which OpenAI introduced to measure how well a model can synthesize functionally correct code from a docstring description of the intended functionality. The tasks look something like this:
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer
    to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    [COMPLETE THIS CODE]
Sounds like a good idea, but not very useful in practice for a number of reasons:
Hardly anyone ever codes like this. All HumanEval problems are structured as a single function header with a docstring in an otherwise empty file, in an otherwise empty workspace. By contrast, we’ve found in practice that developers in large codebases write code all over the place: in the middle of lines, switching between functions, and with a complex set of dependencies and libraries. These workflows require context awareness and types of reasoning that HumanEval tasks simply do not account for. Real developers work inside real repositories, where their current work may require an intricate understanding of contextual information.
These are the wrong problems to optimize for. HumanEval is filled with Leetcode-style algorithmic problems. Many open source models, such as Phind’s Code Llama 34B model or WizardCoder, are fine-tuned on problems very similar to HumanEval to improve HumanEval performance, with much less improvement on general tasks. To highlight just how unrealistic these problems are compared to a software engineer’s day-to-day work: Problem 36 in HumanEval is literally FizzBuzz!
It’s too popular. The unfortunate reality of today’s LLMs is that, too often, there is contamination between testing and training data. HumanEval has been around for years, and if you trained on public code, your model has likely memorized some of the solutions, invalidating the results. Goodhart’s Law strikes again.
There’s so much more to evaluate. When you use an AI code assistant, you don’t just care about the correctness of the code. Quality of IDE integration, latency, and ease of use can all make or break the experience, and any evaluation needs to take them into account.
Well, if HumanEval is not that great, what about the system-level metrics, namely acceptance rate, that everyone seems to care about? GitHub Copilot advertises its 35% acceptance rate, and other companies claim even higher numbers, but raw rates are misleading. If we really wanted to, we could hit a near-100% acceptance rate tomorrow by suppressing every completion except semicolons at the ends of lines in Java. The fixation on acceptance rate stems from a misreading of this paper, which suggests that, all else being equal, acceptance rate correlates with productivity; in practice, all else is nowhere near equal.
As an example of this inequality, Codeium is the only tool that supports inline “fill-in-the-middle” (inline FIM) suggestions (full blog post). These suggestions are produced when the cursor lies in the middle, rather than at the end, of a line. Inline FIM enables huge improvements in productivity and developer satisfaction that competing tools simply do not offer, since developers often perform such edits, for example when adding a new argument to a function header. While the acceptance rate on these completions is quite good, it is lower than that of non-inline completions. Removing inline FIM suggestions for the sake of maximizing acceptance rate would be the wrong optimization.
Other aggregate statistics, such as the overall percentage of checked-in code generated by an AI, are similarly misleading because not all code is created equal. Writing code in a large, legacy codebase, which is common for professional software developers, is very different from students completing standard Leetcode problems for homework. The difference in value of this statistic between these two groups is likely massive, but that won’t show up in the aggregate numbers.
One solution is to rely on qualitative feedback: look, we have the highest rating on our IDE plugins of any AI code assistant tool on the VSCode marketplace!
But we’ll be the first to admit that relying on such qualitative feedback is flawed. We as developers are generally proud of our craft and aren’t always the first to admit that we need new tools to help us do our jobs. Gut feelings are hit or miss, and ratings are too few and far between to be a statistically significant way of iterating on a product.
How We Do Evaluation Internally
Yet if you ask any of our users about Codeium’s performance over time, you will get a near-unanimous response that it has been steadily up and to the right. This post is the first in a series where we will deep dive on our evaluation methodology, so that you can be confident you are using a tool that goes through deep testing and vetting. We thought it most appropriate to begin with two important axes:
- What part of the system are we evaluating?
- How do we stage testing and rollout?
The second question has a shorter, simpler answer. We aren’t going to roll out a new model or a new feature to hundreds of thousands of developers as soon as we finish training a model or writing the code. We want to be very data driven through a rollout. At a high level this looks like:
- Pre-merge Eval: We run extensive automated evaluation to get quick feedback on how well our model is doing. This lets us answer dozens of questions a day, ranging from “did we break anything?” to “what are the best hyperparameters?”, without ever degrading production quality.
- Internal Dogfooding: We expose the new feature or model to our internal development team to test out the changes. The wonderful part about being software engineers making tools for software engineers is that we are just as invested in assuring that the quality of our product is top-notch.
- Beta Testing: For major, visible changes, such as new capabilities or IDEs (as opposed to model improvements), we ask members of our vibrant Discord community to be beta testers. Given the incredible variety in their development environments, from different operating systems and remote vs. local development to language preferences and VPN configurations, we can catch the large majority of bugs through this process.
- A/B Testing: By now, we have ironed out any major usability issues, and we want to be statistically confident that our system improved. Using Unleash, we A/B test across our hundreds of thousands of free users to gather metrics on improvements in productivity and performance.
- Prod Rollout: We have finally gotten enough feedback and measurements to confidently roll everything out to production and enterprise users. At this point, we also start working on porting these changes to our self-hosted customers, who can rest assured they are getting vetted and tested changes in this rapidly evolving generative AI space.
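To make the A/B testing step in the list above concrete, here is a minimal sketch of deterministic user bucketing. This is illustrative only: the function name, experiment name, and 50/50 split are hypothetical, not Codeium’s actual Unleash configuration.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing (experiment, user_id) keeps the assignment stable across
    sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # roughly uniform value in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# A given user always lands in the same bucket for a given experiment,
# so their metrics can be attributed cleanly to one variant.
variant = assign_variant("user-123", "new-autocomplete-model")
```

Stable assignment matters here because productivity metrics accumulate over days or weeks; a user who flip-flopped between variants would contaminate both groups.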
The first axis – ”what part of the system are we evaluating?” – is a much more complex, nuanced topic. We’ll go into more detail in future posts, but for now we’ll provide a high-level overview of the evaluation metrics and signals we gather for each part of the process. How we actually balance these metrics and understand how the product is evolving is something we will save for another post.
Evaluating the Autocomplete Model
In case you are new to Codeium, we train our own proprietary LLMs! For the core autocomplete LLM, built entirely from scratch, we use an evaluation method based on actually executing code, rather than misleading industry benchmarks like HumanEval.
The high level idea of what we do is:
- Use public repositories to find tested functions and corresponding unit tests.
- Automatically delete snippets of those functions.
- Simulate autocompleting the deleted snippet.
- Re-run the unit tests and check whether they still pass.
This way, we’re able to get a good measure of how our model does in writing the kind of code that developers actually write in their day-to-day use.
Alongside correctness, this process lets us measure a variety of other metrics, like latency, consistency, and the ability to write code that compiles – all of which can indicate eventual performance in production.
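As a heavily simplified, in-process sketch of the loop above (the real pipeline runs full repository test suites; `splice` and `passes_tests` are hypothetical helper names for illustration, not our actual harness):

```python
def splice(source: str, start: int, end: int, completion: str) -> str:
    """Rebuild a file with the model's completion in place of the
    deleted half-open line range [start, end)."""
    lines = source.splitlines(keepends=True)
    return "".join(lines[:start]) + completion + "".join(lines[end:])

def passes_tests(patched_source: str, unit_test) -> bool:
    """'Import' the patched file and run a unit test against it; the
    test callable raises AssertionError on failure."""
    namespace: dict = {}
    try:
        exec(patched_source, namespace)
        unit_test(namespace)
        return True
    except Exception:
        return False

# Delete the body of a tested function, "autocomplete" it, and re-check.
original = "def add(a, b):\n    return a + b\n"
patched = splice(original, 1, 2, "    return a + b\n")  # model's suggestion

def unit_test(ns):
    assert ns["add"](1, 2) == 3

assert passes_tests(patched, unit_test)
```

A wrong completion (say, `return a - b`) would fail the same unit test, which is what makes this an execution-based measure rather than a string-match one.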
You probably have a lot of questions. What determines what unit tests you use? What kind of deletions do you do? How do you simulate autocompleting? Why even go through all this trouble?
We promise to answer all of these and more in a follow up blog post.
Evaluating New Feature User Experience
Feature creep is a real problem for any software. Just because we build something doesn’t mean it is worth shipping: a feature that is barely used, or that adds confusion to the product, makes things worse. We want developers to be able to use Codeium intuitively, without hours of training to get up and running. This is why any UX change goes through vetting as well, primarily via interaction statistics.
Ever noticed those little “code lens” buttons above methods that interact with Codeium Chat, such as refactor, explain, or generate docstrings? All of the decisions around them, such as where to show these code lenses or which common defaults to present, were made by surveying and A/B testing different combinations until we saw the maximum number of quality interactions.
Evaluating the End-to-End Autocomplete System
One thing that often gets overlooked in this generative AI wave is that far more goes into the product than just the model. Sure, LLMs are the key that kickstarted this generative AI revolution, but there is a plethora of engineering in other parts of the shipped product – UX, IDE support, server locations – that significantly affects the usefulness of the product. We might tweak prompt construction, modify model fine-tuning, or change any number of other tunables. We can’t apply the same “delete part of a function and autocomplete it” approach to these, but we can test system-level changes on our hundreds of thousands of users to measure effects on metrics like acceptance rate.
Acceptance rate? This might be surprising given how much we bashed aggregate metrics earlier. But the reality is that acceptance rate, when balanced with a variety of other metrics such as latency, bytes completed, and suggestions proposed, can be a great directional indicator, with other factors kept constant. Differences between products are so large that comparing raw acceptance rates across them is meaningless, but as an internal signal of whether people are getting more value from a single autocomplete system as it evolves? Totally valid.
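A sketch of what “balanced with other metrics” could look like in practice; the event fields and metric names below are hypothetical, not Codeium’s actual telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class CompletionEvent:
    accepted: bool
    latency_ms: float
    bytes_suggested: int

def summarize(events: list[CompletionEvent]) -> dict:
    """Report acceptance rate alongside the metrics that keep it honest:
    how many suggestions were shown, how fast they arrived, and how much
    code was actually contributed."""
    if not events:
        return {"suggestions_shown": 0, "acceptance_rate": 0.0,
                "avg_latency_ms": 0.0, "bytes_completed": 0}
    accepted = [e for e in events if e.accepted]
    return {
        "suggestions_shown": len(events),
        "acceptance_rate": len(accepted) / len(events),
        "avg_latency_ms": sum(e.latency_ms for e in events) / len(events),
        "bytes_completed": sum(e.bytes_suggested for e in accepted),
    }
```

Suppressing hard completions (the semicolon trick above) would inflate `acceptance_rate` while tanking `suggestions_shown` and `bytes_completed`, which is exactly why these numbers have to be read together.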
How we Approach Comparative Evaluation
All this builds into a relevant, but separate, question that we frequently get asked by customers – how can we evaluate Codeium metrics against other tools to actually know which one is “best”?
This may not be a satisfactory answer, but in short, we don’t spend much time worrying about marketing ourselves against competitors. Sure, we continuously try out competing products and sometimes talk about our lower latency or create comparative videos on quality (like here, here, or here), but really, in this space, any numbers you see in marketing material are just that: marketing.
Marketing might help us start some conversations, but at the end of the day, developers want to know what tool works well for them, not for some other developer or company. Your software development is unique, and you should be looking at your own metrics, calculated on your own development, to make a judgment call on what is right for you.
This is why, realistically, the most important thing we can do in terms of comparative evaluation is build thorough analytics dashboards that give individuals and enterprises transparency into how valuable the tool is to them. We at Codeium have invested heavily in detailed real-time analytics dashboards at both the individual and team/enterprise level. We focus on metrics, like number of acceptances, that reflect productivity, with breakdowns (by day, by language, by IDE, etc.) so that any individual or team can take action to further improve usage. We strongly believe you should not pay us if we are not the right tool for you, but you can be confident that, if we are, we will show you the proof.
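As a toy illustration of such a breakdown (the event shape and keys are made up for this sketch, not our dashboard’s data model):

```python
from collections import Counter

def acceptance_breakdown(acceptances: list[dict], key: str) -> dict:
    """Count accepted completions per group, e.g. per language or per IDE,
    so a team can see where the tool is and isn't delivering value."""
    return dict(Counter(event[key] for event in acceptances))

events = [
    {"language": "python", "ide": "vscode"},
    {"language": "python", "ide": "jetbrains"},
    {"language": "go", "ide": "vscode"},
]
by_language = acceptance_breakdown(events, "language")  # {'python': 2, 'go': 1}
```

The same events sliced by a different key (here, `"ide"`) give a different view of where acceptances come from, which is the whole point of per-dimension breakdowns.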
Some Takeaways on Code Assistant Evaluation
Maybe this post was disappointing: after all, we could not provide you the silver bullet metric to evaluate all AI coding assistants. But that is precisely the point – systematically improving these tools is really hard and requires a complex, robust evaluation framework to do so properly. By performing staged rollouts and spending lots of resources thinking about how to evaluate any changes, we at Codeium have built a trusted iteration flow that will continue to let us out-pace and out-deliver everyone else. In future blog posts, we will be pulling back more of the curtain on the details of particular evaluation processes.
That being said, don’t just take our word for it. If you want to use a thoroughly tested code assistant for free, try Codeium out yourself: