The Golden Metrics: Characters per Opportunity and Percentage Code Written

Written by
Nick Jiang, Anshul Ramachandran, Mehul Raheja, Michael Li

tl;dr We introduce two metrics, Characters per Opportunity (CPO) and Percentage Code Written (PCW). We believe CPO should be the definitive metric for benchmarking the autocomplete capabilities and quality of different AI code assistants. Unlike acceptance rate, it is not susceptible to gameable confounding factors and is correlated with real tool value, so a higher CPO does mean a better product in terms of real value delivered. We believe PCW should be the definitive metric for assessing the value a code assistant drives for the end developer: it measures the actual result of using the assistant, and it is a relative metric that accounts for the fact that different developers code for different amounts of time to begin with. These metrics are related conceptually, but are not mathematically related in practice, as will be discussed. We make development decisions for Codeium based on CPO, and we currently have a CPO of 1.27 and a PCW of 44.6%, which means that 44.6% of new code committed by developers using Codeium is actually written by Codeium.

Why Do We Care About Autocomplete?

Most code assistants have many modalities, so why is a metric for autocomplete particularly important? Autocomplete is a dense, passive AI modality, which means that developers get suggestions on every keystroke without changing their behavior to invoke a tool. This means developers can expect to get thousands of suggestions a day if coding for even a few hours. So while chat and other functionalities are undeniably useful, the sheer magnitude of value that can be driven by autocomplete dwarfs all other effects. From Codeium usage data, a developer gets hundreds of autocomplete suggestions a day, while performing five to ten chats a day. So even if a helpful chat is an order of magnitude more useful than an autocomplete suggestion, we know where real value is driven.

Why Existing Benchmark Metrics Cannot be Trusted

We actually wrote an entire other post about existing metrics, such as acceptance rate, and how you cannot trust published numbers on these statistics as evidence for real value. Simply put, there are too many ways to game such a metric at the expense of the value actually delivered, so optimizing for it incentivizes the wrong product decisions. If we want to actually be able to compare one product to another on the basis of real value, we need a benchmark metric that cannot be gamed, where an increase in the metric actually corresponds to more value driven to a developer. And this brings us to our solution…

CPO: Characters per Opportunity

Straight to the chase, this is how we compute Characters per Opportunity:

Characters per Opportunity =
    Attempt rate *
    Feedback rate *
    Acceptance rate *
    (Avg Num Tokens / Acceptance) *
    (Avg Num Characters / Token)

Let us go through each factor:

  • Attempt Rate: Every time a user performs an action in-editor, such as typing a new character or deleting some newly-invalid code, the AI has an opportunity to attempt a suggestion. Attempt rate captures the fraction of opportunities for which the AI even tries to make a suggestion. Reasons for not attempting range from latency (debounce) to contextual filters the AI uses to determine whether or not to try.
  • Feedback Rate: There is obvious latency in making a suggestion, from context retrieval to network overhead to the actual model inference. If the latency is too high, the developer will move on and perform a new action in the editor, triggering a new opportunity and rendering the existing one useless. In addition, even after the suggestion is complete, the tool may decide not to show it to the developer for a variety of reasons - not high enough confidence, triggering a suppression filter, etc. Feedback rate captures the fraction of attempted suggestions that actually make it to the developer for there to be human “feedback” (i.e. accept or reject).
  • Acceptance Rate: Even if the suggestion gets to the developer, it still might not be good in the eyes of the developer. Acceptance rate captures what fraction of shown suggestions are accepted by the developer. This is the oft-publicly-cited metric.
  • Avg Num Tokens / Acceptance: A long suggestion and a short snippet deliver different amounts of value, all else kept equal. LLMs ingest input and create output in units of “tokens,” which are usually short sequences of characters, so the average number of tokens per acceptance captures the magnitude of value delivered per accepted suggestion.
  • Avg Num Characters / Token: At the end of the day, developers see text in characters, not tokens, and different LLMs can have different “tokenizers” (i.e. sets of character sequences), so an LLM that produces more characters per token actually writes more of the code, and average number of characters per token captures exactly this.

Put together, you get Characters per Opportunity (CPO).
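To make the formula concrete, here is a minimal sketch of the computation in Python. The function and its inputs are purely illustrative; the factor values below are invented and are not Codeium's internal numbers.

    # Minimal sketch of the CPO formula; example values are invented for
    # illustration and are not Codeium's internal numbers.
    def characters_per_opportunity(attempt_rate, feedback_rate, acceptance_rate,
                                   avg_tokens_per_acceptance, avg_chars_per_token):
        return (attempt_rate * feedback_rate * acceptance_rate *
                avg_tokens_per_acceptance * avg_chars_per_token)

    # Hypothetical factors that multiply out to a CPO of roughly 1.3
    print(characters_per_opportunity(0.98, 0.80, 0.20, 2.5, 3.3))  # ~1.29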

Why CPO Measures Real Value and Cannot be Gamed

So why CPO? Simply put, there is no way for us to make this metric substantially better by simply cutting corners instead of actually improving the product.

Let’s look at some examples of tools and how their decisions might make acceptance rate look really good at the expense of CPO:

  • GitHub Copilot: The most popular AI assistant boasts an acceptance rate of 30%. Incredible! However, what is not discussed is that Copilot has a lower attempt rate and feedback rate. They actually don’t show as many suggestions in the first place, which in turn juices the acceptance rate while actually reducing the amount of real value driven by the tool.
  • SourceGraph Cody: Cody is a great example of increasing acceptance rate at the expense of feedback rate. Cody uses third party APIs for their LLM layer and spends a lot of time in context retrieval, making the latency of autocomplete suggestions very high, frequently over a second. So even if these are high quality suggestions with a good acceptance rate, the developer is almost always onto the next action far before the suggestion is even shown. Perhaps ok acceptance rate, but poor real value delivered.
  • TabNine: TabNine is a good example of gaming acceptance rate via a low average number of tokens per acceptance. By heavily biasing toward single-line or shorter suggestions (i.e. the suggestion truncates at the end of the line), TabNine is able to deliver suggestions quickly (good feedback rate), but the value of each accepted suggestion is so much lower than competitors’ that even with a good acceptance rate, they cannot make up the gap in real value delivered.

In fact, a great example of gaming acceptance rate across all tools is the lack of inline fill-in-the-middle suggestions. From telemetry, we know that approximately 33% of all opportunities are inline FIM opportunities, where folks are writing code in the middle of a line, yet no tool other than Codeium provides suggestions for them, i.e. they have an attempt rate of 0% for 33% of all opportunities! Why? Well, inline FIM suggestions have a lower acceptance rate (especially if not done properly via an additional objective function at model training time), so by not providing these suggestions, other tools can boast a higher acceptance rate at the expense of real value delivered to the developer. Acceptance rate can be gamed; CPO cannot.
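For intuition, the distinction between end-of-line and inline FIM opportunities roughly comes down to whether there is already code to the right of the cursor on the same line. A simplified sketch of that heuristic (our own illustration, not any tool's actual classifier):

    # Simplified heuristic for illustration only, not any product's real logic:
    # an opportunity is "inline FIM" if non-whitespace text follows the cursor
    # on the same line, and "end-of-line" otherwise.
    def classify_opportunity(line: str, cursor_col: int) -> str:
        return "inline" if line[cursor_col:].strip() else "end-of-line"

    print(classify_opportunity("total = compute(, tax_rate)", 16))  # inline
    print(classify_opportunity("total = compute(", 16))             # end-of-line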

Codeium’s Current CPO

Inline suggestions and end-of-line suggestions are different enough situations that we analyze their CPOs separately. Our end-of-line suggestions have a CPO of 1.78, while our inline suggestions have a CPO of 0.30. This means that even after weighting by the percentage of opportunities that are inline (~33%), we have a weighted CPO of 1.27.(1)
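As a back-of-the-envelope check, assuming a clean 67/33 split between end-of-line and inline opportunities:

    # 0.67 * 1.78 + 0.33 * 0.30 ~ 1.29, close to the reported 1.27; the small
    # gap comes from opportunities we cannot classify (see footnote 1).
    weighted_cpo = 0.67 * 1.78 + 0.33 * 0.30
    print(round(weighted_cpo, 2))  # 1.29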

We do not know the end-of-line CPO of other tools, but as discussed in the previous section, we know the inline CPO of all other tools is 0, and we strongly believe the end-of-line CPOs of all other tools are lower than Codeium’s.

How We Have Developed Against CPO

We have been using CPO internally for a while. Since our attempt rate is close to 100% (we do have a debounce rate of < 2% due to latency, but we don’t have any pre-generation filters that would artificially lower our attempt rate) and our tokenizer rarely changes (average characters per token is effectively constant), we usually focus on tradeoffs between the three middle terms: feedback rate, acceptance rate, and average tokens per acceptance. We will walk through a recent change we’ve made where CPO guided our decision making.

Using a Larger Autocomplete Model

We recently trained an autocomplete model that has 60% more weights than our previous one, and with the additional capacity, A/B testing the two models showed a clear multi-percentage-point improvement in absolute acceptance rate. However, when we deployed it on the same hardware with no changes to our inference stack, the increase in latency dropped our feedback rate so much that it more than neutralized the effect of the acceptance rate increase on CPO. This kind of tradeoff is why we usually brush off questions from customers about our model size: even though size is correlated with acceptance rate, it is not necessarily correlated with real value. Anyway, we realized we had to do a lot of infra work, such as quantization, in order to keep the acceptance rate wins while minimizing the hit to feedback rate. We didn’t do the infra work just for the sake of being smart; we did it because we needed to in order to drive more real value.
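To make that tradeoff concrete, here is a toy comparison with invented numbers (not our actual A/B results): the larger model's acceptance rate gain is outweighed by the feedback rate drop caused by added latency.

    # Toy A/B comparison; all numbers are invented for illustration.
    # Factors: attempt * feedback * acceptance * avg tokens/accept * avg chars/token
    baseline_cpo = 0.98 * 0.80 * 0.20 * 3.0 * 3.0   # smaller, faster model
    larger_cpo   = 0.98 * 0.62 * 0.24 * 3.0 * 3.0   # higher acceptance, worse latency
    print(round(baseline_cpo, 2), round(larger_cpo, 2))  # 1.41 1.31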

Real vs Perceived Value

So if CPO captures real value, why does acceptance rate get all the press? Simply put, it is a discrepancy between real and perceived value. While CPO does capture the totality of value driven by autocomplete, developers “feel” acceptance rate. People get a little dopamine boost every time they hit that tab key to accept a suggestion, and the more often they feel like the AI is giving them a chance to press that tab key, the happier they are and the more useful they think the tool is. This isn’t a knock on developers, this is just human psychology!

So what does this mean for CPO? Even if CPO is the golden metric, a secondary optimization of acceptance rate is important. After all, there are an infinite number of combinations of the values of the factors of CPO, with a bunch of different acceptance rates, that will multiply out to the same CPO. Conditional on a CPO, we should always optimize for picking the combination that has the maximal acceptance rate to also optimize perceived value.
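As a toy illustration of that point (numbers invented): if we lump every factor other than acceptance rate into a single “rest” term, very different acceptance rates can multiply out to the same CPO.

    # Toy decomposition; numbers invented. CPO = acceptance_rate * rest, where
    # "rest" is attempt rate * feedback rate * tokens/accept * chars/token.
    for acceptance_rate, rest in [(0.10, 13.0), (0.20, 6.5), (0.40, 3.25)]:
        print(acceptance_rate, round(acceptance_rate * rest, 2))  # CPO = 1.3 each time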

In fact, it is probably fine to take a small hit on CPO if it means a large increase in acceptance rate since perceived value is important, but it cannot be taken too far (which most other products have done). We can demonstrate this graphically with some extreme numbers, by extracting acceptance rate (y-axis) from the CPO, with the rest of the terms combined being on the x-axis:

[Figure: acceptance rate (y-axis) plotted against the product of the remaining CPO factors (x-axis), with a shaded performance region and example points A, B, and C]

The shaded region captures an example performance space of a tool achieved by tuning various constants, without doing any real work to inherently improve the system. Point A, even though it has the highest CPO, has an abysmal acceptance rate, and changing the system in a way that gets to point B increases the acceptance rate by an absolute 25% with just a unit decrease in CPO. However, the marginal value of this falls off quickly: moving on to point C costs the same unit decrease in CPO but buys only a 10% absolute increase in acceptance rate, so if you keep optimizing for acceptance rate, you will quickly find yourself incurring a drastic drop in CPO, i.e. real value. If you can move along a constant-CPO (isometric) curve towards a higher acceptance rate, by all means do so, but you usually don’t get such tradeoffs via simple hacks, corner cutting, or constant tuning. At the end of the day, to move up and to the right, you have to increase the CPO.

Percentage Code Written

Ok, with all of this talk of CPO, you may have forgotten that we are also introducing a second metric, which is Percentage Code Written (PCW). It is not hard to describe what this means - for any new code added to a codebase, you attribute each character as originating either from an accepted AI suggestion or not, and the percentage of such “positive matches” becomes PCW.
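A minimal sketch of that attribution, assuming each newly added character has already been tagged with whether it came from an accepted AI suggestion (the tagging itself is the hard part and is not shown here):

    # Minimal PCW sketch; `new_chars` is a hypothetical list of
    # (character, from_ai) pairs for code newly added to the codebase.
    def percentage_code_written(new_chars):
        if not new_chars:
            return 0.0
        ai_chars = sum(1 for _, from_ai in new_chars if from_ai)
        return 100.0 * ai_chars / len(new_chars)

    print(round(percentage_code_written([("f", True), ("o", True), ("o", False)]), 2))  # 66.67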

While reading about CPO, you may have wondered why we could not just use CPO to compute PCW. After all, CPO intuitively captures how many characters an AI writes for every action a developer takes. So if an AI has a CPO of 1.5, shouldn’t this mean that for every character of code written by a developer, the AI is writing 1.5, meaning that the AI is writing 60% (1.5 / 2.5) of the new code being committed to the codebase? Well, in a world where the only edits made by a human were single-character additions, this would be right. But in reality, people edit and delete parts of accepted completions, there are other ways to get blocks of non-AI-written code (e.g. Intellisense), and more.

We want such actions to affect a metric that captures the end value driven by the AI, which is why we measure PCW by exact character attribution, but we do not want them to affect the metric we use to benchmark the assistant itself. For example, if someone accepts an Intellisense suggestion, that should not be interpreted as a negative signal for the AI assistant - for all we know, if given the opportunity, the AI assistant would have perfectly suggested that same block of code! This is why we separate CPO and PCW - CPO measures how good the tool is when given the opportunity, and PCW measures how much value is driven for the developer by integrating the tool into their workflow.

Conclusion

At Codeium, we are laser-focused on creating products that are actually better and actually drive more value. We are not interested in optimizing our product for marketing purposes or chasing metrics and benchmarks that distract from our focus. We take our evaluation seriously, and have likely put more thought into how we do so than any other team working on these problems, which is why we can confidently say that the average Codeium user has 44.6% of their new, committed code written by Codeium. With our evaluation methodology, we are confident that when we roll out features to our users, especially our paying enterprise customers, we are actually driving more real value.


(1) The numbers don’t quite add up here because there are opportunities that we are unable to classify as inline, multiline, or single line; we exclude these from the inline and end-of-line CPO values but include them in the aggregate CPO.
