GitLab Code Suggestions: Why You Should Not Build Your Own Generative AI Product for Code

Written by
Codeium Team

TL;DR: It is incredibly hard to build a competitive generative AI product for code from scratch. GitLab is the most recent example of a company that has had an underwhelming product launch, with its AI autocomplete solution far behind established products in many respects.

Difficulties of Building a Code AI Product

Generative AI is moving fast, with rapid advances happening simultaneously across research, open source, and applications. Meanwhile, plenty of open source platforms, from LangChain to Hugging Face, have made it nominally easier to spin up an LLM application. Given the speed of development and the availability of tools, companies have increasingly decided to build their own LLM applications for a task rather than buy from, or partner with, an established product in that space. Perhaps this is a good decision for some industries and tasks, but in ours (software development), we keep seeing companies make the fallacious assumption that it is easy to spin up a robust LLM application for code. On this blog, because we have already invested a ton of engineering effort into building an industry-leading product, we tend to focus on cutting-edge code LLM capabilities such as Fill-in-the-Middle (FIM) and inline FIM - so much so that we sometimes understate how much work it takes to even get to a reasonable starting product, a gap that keeps growing as leaders such as ourselves and GitHub Copilot continue to build. Earlier, we discussed how Amazon CodeWhisperer is extremely far behind, even after months of private beta and iteration; today we will look at GitLab’s solution.


Why do companies like Amazon and GitLab think they should build their own LLM-powered autocomplete models? Well, on paper, it seems easy! Autocomplete is a constrained problem, isn’t tied to any particular spoken language, and, because of latency considerations, the model has a practical size limit, making it financially reasonable to train from scratch. There is also a lot of open source work being done specifically in the code LLM space, such as StarCoder. And to some degree, these companies are likely afraid people will switch off their core products because of products like GitHub Copilot (i.e. will people stop using the AWS or GitHub web IDEs because they don’t have native AI autocomplete support?). In short, there are a lot of reasons why creating their own autocomplete product could look like low effort, high reward.

But here’s what they forget, or more likely, are unaware of. There is the obvious engineering work of building IDE integrations with the expected UX, but from experience, we know there is a lot more to think about than just the model. How do you pull context into the inference? How do you pull in and use metadata, such as language (both programming and natural), to get baseline relevant suggestions? How do you merge the generated snippets with the existing code so that the result makes sense and is not distracting? This is a small preview of the extensive engineering work required on top of the modeling work, and we have noticed companies dive in without thinking any of it through.

But most importantly, they forget the chicken-and-egg problem - to keep iterating and improving a product, you need users to give you feedback. But to get users, you need a product that people would consider using. And you don’t have a product that people would consider using if you are objectively behind a competitor like Codeium or GitHub Copilot in every respect.

GitLab is the Latest to Unsuccessfully Try AI

GitLab Solution Overview

So, let us look at GitLab. GitLab is known by developers as a GitHub alternative for source code management. With respect to building with AI, they actually started by trying to build their own code assistant tool from scratch, bootstrapping off of the Salesforce CodeGen model. However, they did not have the expertise in house and had no control over the set of languages or model quality (CodeGen supported only 5 languages), so they gave up on that approach and started using Google Codey as their model backend. The following analysis of GitLab’s autocomplete is therefore a reflection of both Google Codey’s model and GitLab’s use of it.

GitLab’s autocomplete has been in beta for 8 months, a substantial amount of time in the generative AI space, and is only generally available on VS Code and their web IDE (not even JetBrains yet!) for a very limited set of languages. To put this in context, we at Codeium GA’d our VS Code-integrated, cloud-hosted, non-personalized autocomplete 8 months ago, and have since grown to hundreds of thousands of users and shipped significantly more capabilities (70+ languages, chat, search), features (FIM, non-permissive license removal, enterprise self-hosting + fine-tuning), and integrations (40+ IDEs). So, with 8 months of feedback and development, we would expect GitLab’s solution to at least give reasonable suggestions.

So, let us run through some tests, starting easy and ramping up.

Test 1: Toposort

If you have read other posts on this blog, you will know we love this as a baseline test - write a Python function that topologically sorts the nodes in an input directed graph. Let’s start:
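For reference, a correct completion here would look something like the following - a minimal sketch of Kahn's algorithm, assuming the input graph is a dict mapping each node to its outgoing neighbors:

```python
from collections import deque

def toposort(graph):
    # graph: dict mapping each node to a list of its outgoing neighbors
    in_degree = {node: 0 for node in graph}
    for neighbors in graph.values():
        for n in neighbors:
            in_degree[n] += 1
    # Start with the nodes that have no incoming edges
    queue = deque(node for node, deg in in_degree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for n in graph[node]:
            in_degree[n] -= 1
            if in_degree[n] == 0:
                queue.append(n)
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle")
    return order
```

Nothing exotic - exactly the kind of function a code model has seen many times in training.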

OK, something is completely busted. We are getting comments in Korean, not English, along with bad Unicode characters. When we try to soldier on with the for loop, no suggestions are produced at all. This is probably not even an issue with GitLab, but rather with the underlying Codey model from Google! Likely something is completely off with tokenization, and the fact that even Google is getting this wrong shows just how difficult it is to build a code LLM that even works, let alone one that is good.

Test 2: Fibonacci in Python

Let us back it up to something even simpler - a function to generate the Nth Fibonacci number in Python. This might be the most classic of classic tests. Let’s see what happens:
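For context, the completion we would expect is the textbook recursive form, which every code model has seen countless times:

```python
def fib(n):
    # Nth Fibonacci number, naive recursion (fib(0) = 0, fib(1) = 1)
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)
```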

Alright, at least we get suggestions. But this example highlights just how poor GitLab’s model-surrounding logic is. First off, it is generating Rust code instead of Python in a Python file. But even past that, there is poor stopping logic - instead of completing the function, the suggestion stops in the middle of the recursive call. In addition, a closing parenthesis is added after this incomplete statement, so when the next suggestion comes in to complete the function, we have an extra trailing parenthesis just hanging out. Of course, that next suggestion also has poor stopping logic and keeps rambling into a fib2 function. GitLab should have added post-processing to remove any code after the starting scope is complete, which is AI autocomplete 101, but they have not.
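Such post-processing need not be sophisticated. Below is a minimal, hypothetical sketch of indentation-based truncation for Python-like code - not GitLab's or anyone's actual implementation - that drops suggested lines once the starting scope has closed:

```python
def truncate_to_scope(suggestion: str, base_indent: int) -> str:
    """Drop suggested lines that fall outside the scope the completion
    started in, e.g. a rambling trailing fib2 function."""
    kept = []
    for line in suggestion.splitlines():
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        # A non-empty line at or below the base indentation means the
        # original scope has closed; stop emitting further lines.
        if stripped and indent <= base_indent and kept:
            break
        kept.append(line)
    return "\n".join(kept)
```

Real implementations also handle brace-scoped languages, string literals, and partial lines, but even this sketch would have stopped the fib2 rambling we saw.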

So we have suggestions, but it is clear that with GitLab, the suggestions are more distracting than useful.

Test 3: Code LLM Specifics: Fill-in-the-Middle

OK, so clearly the processing is not there, but let’s also look at whether the model has the capabilities and features necessary for high quality suggestions on code specifically. One of these capabilities is Fill-in-the-Middle, which matters because the code after the cursor is often as important as, if not more important than, the code before the cursor. If the model only sees preceding code, you are putting a theoretical ceiling on suggestion quality.
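To make the mechanics concrete: FIM-trained models take the code on both sides of the cursor and generate the missing middle, typically via special sentinel tokens in the prompt. A sketch using StarCoder's token convention (other models use different sentinels, and production systems do much more prompt assembly than this):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # StarCoder-style FIM sentinels; the model is trained to emit
    # the missing middle after the <fim_middle> token.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The cursor sits between prefix and suffix, e.g. inside a function body:
prompt = build_fim_prompt("int fib(int n) {\n    ", "\n}\n")
```

A model wired up this way can see that the suffix already contains code, which is exactly the signal GitLab's product appears to be missing below.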

On paper, Google Codey does seem to have this capability - the Codey API lets you pass in both a prefix and a suffix. So, to check whether GitLab is using this properly, let’s try Fibonacci again, this time in C++ to get suggestions in the right programming language, and try some FIM tasks, such as seeing whether it will regenerate existing suffix code and whether it can write an appropriate docstring:

GitLab could not solve the FIM tasks. Ignoring indentation errors and poor stopping conditions, when we tried getting suggestions at the beginning of the function a second time through, it regenerated the function, which would not have happened if it had context of the following text (and saw that the code already existed!). And when we tried to generate a docstring, a model with FIM capabilities would have been able to reason about the function body and write a reasonable comment; instead we got a generic “write your code here.”

This also highlights the need to own your model and be fully vertically integrated, so that you can make these kinds of improvements without being at the mercy of a third party provider (Codey released the suffix capability very recently, with minimal description of what is happening under the hood). Unfortunately for GitLab, it already tried doing this and likely did not have the AI expertise to pull it off, which will likely keep it forever slightly behind competitors.

Test 4: Context Building Capabilities

Alright, clearly the output is not well formatted and the model itself is not great. A single autocomplete call consists of collecting input, running model inference, and processing output, so let us complete the investigation and look at context building, i.e. does GitLab pull relevant context into the inference as input?

The basic context is the code in the same file, but very often we need context from other files to avoid hallucinated suggestions. The prototypical example is writing a source file given a header file in a language such as C++. So let us try to write a Time source file given a header file, nothing complicated:

It did not take long to find evidence that GitLab was not pulling any context from the corresponding header file, since it got the arguments of the constructor in the wrong order, probably hallucinated from examples used to train the base model. This is especially surprising because the header file was open at the same time in a different tab, so no separate analysis of the repository directory was required. With this, we can safely guess that more advanced context collection does not exist either.
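For illustration, even a naive version of this context collection is simple to sketch. The helper below (hypothetical names, illustrative character budget - real engines rank and trim context far more carefully) prepends the contents of other open tabs, such as a matching header, to the inference prompt:

```python
def build_context(current_code: str, open_files: dict, budget_chars: int = 4000) -> str:
    """Naively prepend open-tab files (e.g. a matching .h header)
    to the prompt, trimming to a rough character budget."""
    context_parts = []
    remaining = budget_chars
    for path, contents in open_files.items():
        snippet = contents[:remaining]
        context_parts.append(f"// File: {path}\n{snippet}")
        remaining -= len(snippet)
        if remaining <= 0:
            break
    # The code being completed always comes last, closest to the cursor
    return "\n\n".join(context_parts + [current_code])
```

With even this much in place, the model would have seen the constructor signature from Time.h and had no reason to invent its own argument order.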

Looking Forwards in the AI for Coding Space

It is clear that even with an extended beta, GitLab’s solution today is many steps behind what developers now expect from such a tool in every respect - IDE and programming language availability, context collection, suggestion quality, integration of suggestions into existing code, and baseline model capabilities. Even worse, it is artificially restricted to work only on GitLab repositories, creating a massive adoption burden for the product. Unless you are for some reason forced to use the GitLab web IDE, there does not seem to be a reason to use GitLab over alternatives such as Codeium and GitHub Copilot.

So, what comes next? If there is anything we have learned, it is that this space is moving incredibly quickly, so GitLab’s AI suggestions could absolutely improve; we won’t count anyone out. The catch is that improving would require a lot of users to get feedback from, and that feels unlikely because (a) they are late, (b) there is no reason for developers, even GitLab customers, to use GitLab’s AI instead of competitors, and (c) GitLab’s AI relying on third party services and servers actually goes against their basic ethos and one of their main differentiators from GitHub. Keying on this last point, the security implications are huge: you are not just sending your code IP to third party servers, but to third party servers not owned and managed by GitLab. This is something we have covered in depth in our discussion of why self-hosting is the way to go if you care about security.

Here’s the thing - we have a lot of respect for the GitLab team and what they have built so far in source code management and other developer tooling. They are a very strong technical team that has built, and continues to build, stellar products. Generative AI, however, is one of those areas that simultaneously demands unique technical expertise and shows visible gaps between leaders and newcomers today. Generative AI absolutely can, and will, improve their product in the long term, as we are seeing with other companies with IDEs, such as Deepnote, who recently launched their AI autocomplete by partnering with us at Codeium. We are always excited to see new development in the generative AI for code space!
