UPDATE (3/27/23): This post was written in January 2023. Since then, we have made large improvements to the quality and latency of suggestions, including crucial features like Fill-in-the-Middle. We have updated the scores to reflect the current performance, but the videos have not been re-recorded. The other tools have not displayed a material change in performance. Codeium has also added much more functionality such as Codeium Chat, which the other tools do not offer.
UPDATE (4/24/23): We performed a separate analysis on Amazon CodeWhisperer once it was released from Beta.
Github Copilot. Tabnine. Replit Ghostwriter. There's a handful of AI-powered code assistants out there, and maybe you've tried a few. Maybe you have a favorite, maybe you're still looking around. But with every product showcasing handpicked, edited examples for PR purposes, it is hard for anyone to truly know which is the best.
That ends today.
We will assess the four leading AI code assistants (Github Copilot, Tabnine, Replit Ghostwriter, and Codeium) along multiple axes, each of which provides a different perspective of the word "best":
- Features and price: We decided to group these together since most provide the same base feature set, but at various price points and with slightly different capabilities and details.
- Latency: How fast does the product generate suggestions?
- Suggestion quality: How good / accurate / bug-free are the AI generated suggestions?
First, this is written by the team at Codeium, but we will be as objective as possible, choosing non-controversial metrics, highlighting where there may be potential confounding factors, and especially stressing areas of improvement for Codeium.
Second, as we stressed in our blog post with swyx, we are still in the early stages of AI powered code acceleration. All of these products are constantly evolving and improving, so this comparison should be viewed as such, not as an indictment of future quality (not to mention new products popping up!). We know for a fact Codeium will be rapidly improving!
OK, enough disclaimers. Let's jump into it.
And if you don't want to scroll all the way to the bottom, here's the tl;dr for the cumulative scores (out of 30, 10 for each axis):
Github Copilot
- Pros: High quality on complex tasks, generally low latency, available on wide set of IDEs
- Cons: Price, occasional latency and quality issues

Tabnine
- Pros: Very fast, available on common IDEs
- Cons: Low quality across all tasks, price

Replit Ghostwriter
- Pros: Decent quality, generally low latency
- Cons: Only available in Replit IDE, price

Codeium
- Pros: Free, generally high quality and low latency across tasks, available on the widest set of IDEs
- Cons: Occasional latency and quality issues
Features & Price
This axis is the most objective. We will assume that you are an active software developer who codes at least every work day and are looking for a long term solution. This will remove highly capped offerings and limited time trials from our comparisons. Using information from each product's website, we can compile all aspects where the offerings differ:
| | Github Copilot | Tabnine Pro | Replit Ghostwriter | Codeium |
|---|---|---|---|---|
| Price | $10/mo or $100/yr (free for students / open source contributors) | $12/mo | $10/mo (cycles equivalent) | Free |
| Functionality | Single + multi line codegen | Single + multi line codegen | Single + multi line codegen, Code explanation | Single + multi line codegen, In-IDE integrated chat and search |
| Supported IDEs | VSCode, Visual Studio, Vim/Neovim, JetBrains | VSCode, JetBrains, Neovim, Eclipse, Sublime Text | Replit only | VSCode, JetBrains, Visual Studio, Jupyter / Colab / Deepnote / Databricks Notebooks, Vim / Neovim, Emacs, Eclipse, Sublime Text, VSCode Web IDEs (ex. Gitpod), Chrome Extension |
| Stated Security & Privacy Policies | Opt-out for code snippet telemetry, Filter to reduce public code matches | Never train generative model on private code (unless for enterprise), SOC 2 compliance, No training on non-permissively licensed code | Unclear | Opt-out for code snippet telemetry, Never train generative model on private code, SOC 2 compliance, No training on non-permissively licensed code |
Of course, these are company claims. What one company considers "good enough" performance on a particular programming language to call it "supported" probably differs strongly from another's. The same goes for functionality: a company may have implemented a particular feature, but we cannot make a statement on how useful or functional it actually is in practice (without assessing latency and quality).
One thing is certain though: all products offer single and multi-line code completion, and this is the feature we will put to the test when comparing latency and quality.
Features & Price Overall Ratings
Our goal at Codeium is to democratize this tech for all developers, so we have set up our features and pricing to be the best possible: free, on more IDEs than anyone else, respecting developer privacy, and supporting the most languages. We also provide significantly more functionality than other tools, such as in-IDE chat (think ChatGPT, but integrated tightly where developers operate), that the other tools simply do not provide. Therefore, by construction, we knew Codeium would rank as high as possible on this axis (again, this is no statement on latency or quality). From conversations with developers around the world, we have come to realize that the pricing of all the other options places these tools out of reach for most global developers. This is the major reason for the drop-off after Codeium, though the difference can be adjusted based on how much you value $100-140 a year. Smaller reasons, such as the number of supported IDEs and languages, explain the discrepancies between Github Copilot and Tabnine, while Replit Ghostwriter is at the lower end due to forcing the Replit IDE and a lack of clarity around security & privacy.
Latency
Latency for code completion is very important. The user gets value by staying in the "flow" state, and cannot wait a couple of seconds per inference as you do with ChatGPT, DALL-E, Jasper, etc. Average human visual reaction time is ~150ms, which is a legitimate constraint when serving multi-billion-parameter models.
Theoretically, there is a latency vs quality tradeoff since serving a larger model is generally slower but generally results in better quality suggestions. However, given differences in model architectures, training data, and model optimizations (ex. quantization), we believe latency in model inference may not be perfectly correlated to quality, which is why we separate these two axes. Once we add product-level latency such as network data-passing overhead (where are the servers with respect to the user?) and smart caching of inference results client-side, there are a bunch of other confounding reasons for the latency that a user experiences.
The fairest comparison seems to be to take a task with a relatively standard solution, use a fresh IDE instance (to clear caches), and observe how long it takes to construct the solution with each tool. We pause when we expect suggestions but don't receive them, and spend slightly longer when a long suggestion needs to be read and verified, to mimic what a true programming experience with these products would be like.
So what task? Well, if AI is truly the end of Leetcode-style coding interviews, let's use a classic: create a linked list class in Python that supports adding new nodes and searching for data, and then write a test for good measure. The answer here is pretty standard, so let's see how long it takes for each:
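For reference, the target solution looks roughly like the following minimal sketch. The exact method names (`insert`, `search`) and test are our own reconstruction of the benchmark task, not copied from any tool's output.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None

    def insert(self, data):
        """Append a new node holding `data` at the end of the list."""
        node = Node(data)
        if self.head is None:
            self.head = node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = node

    def search(self, data):
        """Return True if any node holds `data`, else False."""
        current = self.head
        while current is not None:
            if current.data == data:
                return True
            current = current.next
        return False


# A simple test, as the task asks for
ll = LinkedList()
for value in (1, 2, 3):
    ll.insert(value)
assert ll.search(2)
assert not ll.search(4)
```

A solution of this shape is what we typed toward with each tool, accepting suggestions where offered.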
Copilot ended up not providing suggestions in some pretty expected places (like `__init__` after the `LinkedList` class definition, especially when the init was provided for the `Node` class). Also, while it is true that `insert` is a vague function name, so it wasn't unreasonable for Copilot to suggest just adding a node at the head rather than the end, we did have to cycle through a few suggestions to get what we wanted, and each scan of a multi-line function increased time spent. Otherwise, Copilot was pretty snappy.
Tabnine was fast, but it added code in suggestions that we didn't expect, such as code out of scope of the current method (e.g. well past the current `__init__` definition) or print statements rather than asserts. That being said, Tabnine was easily as fast as the others on this example. Tabnine does claim to run a hybrid of a small model on the client and a larger model remotely, so on other examples, when the local small model is hit, there would be no network latency overhead, which could potentially be faster. This of course will come at a cost of quality.
Ghostwriter had generally reasonable latency. The one place that felt comparatively slow was when it decided to create the `insert` function line-by-line rather than entirely or in chunks. Latency also increased when we expected suggestions, like a negative test case, and didn't receive them until we had mostly typed them out. Finally, there were some minor quality issues, like adding the definition of an unwanted `delete` function to the completion for `search` (which had to be deleted post-acceptance) and returning the node from `search` rather than a boolean, but these can be ignored for a latency assessment.
Like Copilot, Codeium also ended up not suggesting `__init__` on the `LinkedList` class when we would expect it, adding latency to the overall experience. Otherwise, Codeium provided accurate multi-line completions with reasonable latency, and single-line completions for testing were rapid. Update: Codeium no longer has these latency issues due to improved infrastructure and extension logic.
Latency Overall Ratings
Overall, all of the tools seemed to have reasonable suggestion-generation latencies given human visual reaction time. A lot of the discrepancies in the overall latency of the experience actually occurred because results did not match expectations, such as not receiving suggestions where we would expect them or having to edit or delete parts of accepted suggestions. Given there were no egregiously slow suggestion-generation latencies, quality will likely be a much larger factor in discrepancies in overall value to a developer.
Suggestion Quality
Quality is a very subjective axis, but a required one. Anyone can build a fast feature that is completely useless.
We are well aware that we will raise suspicions of bias with any example we propose, so instead we will use the handpicked examples advertised on the front pages of the non-Codeium products. These are likely some of the more impressive examples, for which each of these alternatives creates accurate, bug-free code. While this puts Codeium at a disadvantage in all cases, by doing the full cross product of tools and examples, we will also be able to assess the other alternatives on non-handpicked examples.
We chose four examples from different languages, autocomplete use cases, and problem domains:
- Github Copilot: External APIs (TypeScript)
- Github Copilot: Databases (Go / SQL)
- Tabnine: Machine learning / Common packages (Python)
- Replit Ghostwriter: Boilerplate / Unit tests (JavaScript)
External APIs (TypeScript)
The task: Find and use an external API that would compute whether the sentiment of some input text is positive. Here's what it looks like on the Github Copilot front page:
We would expect Copilot to do well here given this is a Copilot example, and it was able to generate multiple fully functioning method bodies (via cycling between options) with different external endpoints. Bravo.
Tabnine was just not great. Top suggestions were localhost URLs, thereby failing the task of suggesting APIs without the user needing to know any such APIs. Tabnine also added a bunch of unnecessary fields in the header, suggested comment versions of previous lines, and more.
Ghostwriter did a good job finding a URL, but needed to be explicitly prompted to extract the sentiment rather than just return the raw response.
Codeium works as expected with a reasonable external endpoint, and even adds error handling of the response!
Databases (Go / SQL)
The task: Given a table schema and a descriptive function name, compute the aggregate statistics via SQL query. Again, from the Github Copilot front page:
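To make the task concrete, here is a minimal sketch of it in Python with `sqlite3` (the original example is Go / SQL). The table name (`tasks`), its columns, and the `CategorySummary` shape with a `category` and a `tasks` count are our reconstruction from the example's description, not the exact schema on Copilot's front page.

```python
import sqlite3


def create_category_summaries(conn):
    """Return one {category, tasks} summary per category, where `tasks`
    is the number of rows in the tasks table with that category."""
    cur = conn.execute(
        "SELECT category, COUNT(*) AS tasks FROM tasks GROUP BY category"
    )
    return [{"category": cat, "tasks": n} for cat, n in cur.fetchall()]


# Build a small in-memory table to exercise the query
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, category TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (category, title) VALUES (?, ?)",
    [("home", "laundry"), ("home", "dishes"), ("work", "report")],
)
summaries = create_category_summaries(conn)
```

The subtlety the tools had to get right is that the `tasks` field of the summary is a `COUNT(*)` over rows, not a column of the table.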
Again, we would expect Copilot to do well since this is one of their handpicked examples, and as expected, Copilot completely nails it without any edits necessary.
Tabnine struggled mightily with the query and did not demonstrate any understanding of the semantics of the table, suggesting fields in creating the `CategorySummary` object that simply don't exist. It also missed error handling in relevant spots, while trying to add it in places where it was unnecessary. Overall, we do not have high confidence in Tabnine in this context.
Ghostwriter first tried to create a new table, and then its suggested query wasn't even valid, let alone a correct interpretation of the semantics of `createCategorySummaries` (it tried selecting `tasks` when there was no such column). Once we wrote out the whole query, Ghostwriter needed some manual changes to the first error handling, but after that the results looked good.
Codeium understood the concept of aggregating by categories, but missed the `COUNT` aggregation (it likely did not understand that the count of rows in the `tasks` table equals the `tasks` field in the summary object). Otherwise, everything else looked good with no modifications. Update: Codeium has seen an increase in quality on Go and other "rarer" languages since this post launched, due to improvements in training data sanitization and training data sampling logic.
Machine Learning / Common Packages (Python)
The task: Classic ML training pipeline of loading data, test/train splits, training using common packages (`sklearn`), and producing eval results. Here is what it looks like on Tabnine's front page:
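The pipeline the tools are asked to write looks roughly like the following sketch. We substitute a small synthetic DataFrame for the CSV in the original demo, so the column names (`feature_a`, `feature_b`, `label`) are illustrative, not Tabnine's.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the CSV load in the demo
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
# Label depends on feature_a so the model has signal to learn
df["label"] = (df["feature_a"] > 0).astype(int)

# Explicit label column selection -- the step several tools fumbled by
# assuming the label is simply the last column
X = df[["feature_a", "feature_b"]]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Note that both the imports (only what is used) and which column is the label are exactly the details the tools tripped over below.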
Copilot got into import hell with sklearn metrics, and even re-entered it after manually escaping it. It also assumed the label column was the last one in the train-test split, which isn't necessarily wrong, but definitely an assumption. Otherwise, the results look good.
We would expect Tabnine to do well since it is their own example. That being said, Tabnine had the same issue of importing too many sklearn metrics, assumed the label column was the last one, and even used imports that were not actually imported (`sklearn.metrics.mean_squared_error`). Overall, not super impressive for a handpicked example.
Ghostwriter also kept importing sklearn metrics, and at the end it tried to use the `mean_squared_error` utility from `sklearn.metrics` without actually importing it. The train-test split wasn't correct either, since `pd.read_csv` does not return an object with `target` attributes. Ghostwriter was the least impressive on this example.
Codeium didn't have the same import issue or use imports that did not exist, but it assumed that `Survived` was the label column, which is a different assumption from Copilot's or Tabnine's. Otherwise, the results were exactly what we would expect.
Boilerplate / Unit Tests (JavaScript)
The task: Rapidly spit out boilerplate and unit test code to save typing time. As shown in the Replit Ghostwriter demo:
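The shape of the task is a small function plus repetitive test boilerplate. The original demo is in JavaScript; here is an analogous sketch in Python with `unittest`. The function under test (`slugify`) is our own illustrative choice, not the one from Ghostwriter's demo.

```python
import unittest


def slugify(title):
    """Lowercase a title and replace runs of whitespace with hyphens."""
    return "-".join(title.lower().split())


class TestSlugify(unittest.TestCase):
    # Repetitive per-case boilerplate of exactly the kind the tools
    # are asked to autocomplete
    def test_lowercases(self):
        self.assertEqual(slugify("Hello"), "hello")

    def test_replaces_spaces(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_runs_of_spaces(self):
        self.assertEqual(slugify("a  b"), "a-b")


# Run the tests programmatically so the sketch is self-checking
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSlugify)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test case is a short, predictable block, which is why scope-aware completion (stopping at the end of one case before starting the next) matters so much here.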
Copilot surprisingly struggled here. Suggestions often repeated the earlier part of the line, which is completely nonsensical, and when accepted, the inserted code looked nothing like the suggestion. Even when it gave reasonable suggestions, the suggestion was improperly merged with the existing code, generating additional characters that needed to be manually deleted. We expected the closing parenthesis and semicolon of the tests to be suggested, but ended up having to add them manually. Using Copilot to generate this relatively simple boilerplate was shockingly more work than just typing it out manually.
(As a caveat, the `TextDecoderStream` replacement is not an issue with Tabnine; that was just our local Intellisense.)
Tabnine added test cases that were not relevant to the test description, and some of the indentation was off. Also, as we saw in the linked list latency example, Tabnine does not seem to understand any concept of "function scope," where suggestions should end at the end of the current scope. For example, we expected suggestions to end at the end of a test case, with Tabnine providing another suggestion afterwards to start a new test, rather than putting the two together (forcing us to choose between typing out the part we want and accepting and then deleting the part we don't).
Ghostwriter surprisingly made some very small mistakes on their own example, like adding `done` as an argument in the second test case, but was otherwise clean.
Codeium missed semicolons at the end of the test cases, but was otherwise exactly what we should expect.
Quality Overall Ratings
This is a subjective assessment, but Github Copilot and Codeium appear roughly similar in consistency in addressing the goals across the tasks, with similar rates of manual intervention necessary. Replit Ghostwriter seemed a slight rung below, and Tabnine just didn't seem able to solve the majority of tasks, with enough errors that it felt like more of a distraction than an assistant.
Adding up the ratings in the individual axes (very scientifically inaccurate, but a consolidated metric):
Really it just comes down to Copilot vs Codeium, and a lot of the difference is currently rooted in whether you care about $100/year or use IDEs like notebooks or web IDEs like Gitpod where Copilot is not available. The other options, while exciting, are just too far behind at this time.
One thing that we did not assess is the potential growth of these products. Copilot has been around for a year and a half, and is the result of two companies (OpenAI and Github) that may not have aligned incentives in continual development. Codeium, on the other hand, has caught up in just a few months, in large part because we are fully vertically integrated and can make decisions and optimizations across the stack for a silky-smooth end-to-end experience. We are also actively listening to and acting on user feedback, unlike Copilot, with a vibrant Discord community. Update: Since this post, Codeium has also launched natural-language Search capabilities, with more functionality coming soon. Copilot initiatives such as Copilot for X have yet to move from beta to production.
At the end of the day, this assessment is inherently limited. While this might shift your opinion slightly, we know there is only one way that you can be truly convinced that Codeium is and will continue to be the #1 in-IDE AI-powered code assistance tool for your use case - you trying it for yourself. Again, we are free and take less than 2 minutes to install, so don't take our word for it - try it in our web Playground without any installation, and if you like what you see, click below!
We're just getting started.