Published on

Codeium for Enterprises: On-Prem GitHub Copilot.

Written by
Codeium Team

TL;DR You shouldn’t have to choose between best security practices and improving developer productivity with AI-powered code assistance. Codeium for Enterprises is purpose-built to run on-prem or in your VPC - no data or telemetry ever leaves.

AI in a Datacenter.

Background

It would be a massive understatement to say that software engineers want to use AI tools as part of their development process. The most recent Stack Overflow Developer Survey found that 70% of surveyed developers already use or are planning to use these tools, and 77% view AI tools favorably, primarily for increasing productivity.

And the potential productivity gains are no slouch. From surveys and analysis of usage telemetry that our users have shared with us, developers are consistently saving hours of typing time every month with Codeium’s autocomplete functionality. This doesn’t include the cognitive load of thinking of what to type or searching elsewhere to get answers. Or how much more “in the flow state” a developer is. Or how much of other developers’ time is saved by not needing to ask questions. That figure comes purely from typing time, computed from the number of characters accepted and typing speed.
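To make that calculation concrete, here is a minimal sketch of how typing time saved could be estimated from characters accepted and typing speed. The function name, the 5-characters-per-word convention, and the sample numbers are illustrative assumptions, not Codeium’s actual telemetry formula.

```python
def typing_hours_saved(chars_accepted: int,
                       words_per_minute: float,
                       chars_per_word: float = 5.0) -> float:
    """Estimate hours of typing avoided by accepting suggestions.

    Assumes (hypothetically) that every accepted character would
    otherwise have been typed at the developer's usual speed, and
    uses the common 5-characters-per-word convention for WPM.
    """
    if words_per_minute <= 0 or chars_per_word <= 0:
        raise ValueError("typing speed and word length must be positive")
    chars_per_minute = words_per_minute * chars_per_word
    minutes_saved = chars_accepted / chars_per_minute
    return minutes_saved / 60.0


# Hypothetical month: 60,000 accepted characters at 40 words per minute.
hours = typing_hours_saved(chars_accepted=60_000, words_per_minute=40)
print(f"Estimated typing time saved: {hours:.1f} hours")
```

With these made-up inputs the estimate works out to 5 hours for the month; real numbers depend entirely on each developer’s acceptance volume and typing speed.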

So it is no surprise that there is a bunch of organic interest from enterprises to have their developers use these AI tools. More productivity means shorter development cycles which means better features which means stronger business. Business leaders go online to search for tools and the first thing that pops up is - purely from being around the longest and backed by a great distribution channel - GitHub Copilot for Enterprises.

But before swiping the credit card and purchasing a bunch of licenses, usually one simple question comes up...

Is GitHub Copilot for Enterprises Safe to Use?

The short answer is no.

GitHub Copilot for Enterprises simply adds seat management and admin controls on top of the individual GitHub Copilot plan. This means that code snippets still leave your control and go to third-party servers managed by GitHub and OpenAI. Even though these are large companies that surely take security seriously, they have had major security incidents in the last few months, such as those described in this post and this post. Companies like Samsung and Apple have banned tools like ChatGPT and GitHub Copilot. Even Google is telling its employees not to put confidential information into Bard, its own AI. There are clear security concerns with sending code out, and we don’t think any company should be exposed to that.

GitHub Copilot for Enterprises sometimes convinces enterprises to make exceptions in their security posture by (a) promising to delete some data post-inference and (b) having legal terms that make it seem like they completely indemnify companies that use Copilot. This makes it look like the security isn’t as big of an issue while there are more legal protections than other tools, which explicitly state that the user is responsible for the code generated by the tool. However, both of these break down on a deeper look.(1)

So no, from a security perspective, GitHub Copilot does not follow the gold standard for security, and you don’t get any extra legal protection in exchange for taking this additional security risk.

On-Prem GitHub Copilot? Codeium for Enterprises

Codeium is an AI-powered toolkit that is as good as, if not better than, Copilot. Copilot was simply the first AI code assistance tool, so it dominates the mindshare of the market, but Codeium has been actively outpacing Copilot, introducing autocomplete capabilities that Copilot doesn’t have, such as in-line FIM, and productionizing features like Chat that Copilot has so far only promised as part of Copilot X.

Codeium for Enterprises is a purpose-built solution that, besides not training on non-permissively licensed code to reduce legal concerns, focuses on giving unbeatable security and quality for each individual customer. We recognize that companies such as yours have strict requirements on data security and code standards. Codeium is the only high-quality AI-powered code acceleration toolkit that has a high-security self-hosted enterprise offering, which can be deployed entirely on-prem or in the customer’s Virtual Private Cloud (VPC). This means no code or telemetry leaves your tenant, no use of external APIs, and no questions on privacy and infosec. We recognize that every company has different data handling and management policies, as well as hardware setups, so we are flexible and offer a wide range of methods to deploy Codeium for Enterprises in a self-hosted manner.

Codeium for Enterprises has even been deployed in environments such as AWS GovCloud, an entirely airgapped cloud (no connection to the internet) used by the US government and government-adjacent companies. These entities have trusted Codeium, which gives a sense of the level of security we provide.

Self-hosting Options

We support over 100k developers on Codeium’s individual plan, so we have had to optimize the hardware cost per developer like crazy for our own wallet’s sake. We guarantee that our hardware requirements will be the cheapest per-developer out of any such self-hosted products, and the per-developer cost decreases with scale.

For VPC, we are cloud agnostic, having supported deployments in AWS, Azure, and GCP so far, and we can deploy on as little as a single A10 GPU system. In AWS, a g5 instance (1xA10 GPU) can support ~100 developers. This is ~$6k/year with reserved pricing, assuming no cloud credits. The g5 instances scale with the number of developers (e.g. 8xA10 can support ~800 developers). These numbers could be larger if developers operate in different time zones, and we have seen some international companies support upwards of 300 developers on a single A10 GPU. Other beefier GPU SKUs that we support (and that would give better performance) are A100s and soon-to-come H100s, which will support ~500 and ~1000 developers per GPU respectively. Given that AWS currently has 8xA100 instances, this would support ~4000 developers on a single machine. Therefore, we will be able to scale to 4000-5000 developers with a single-node system, but you can start with a single A10 instance for pilot purposes to build confidence in the ROI of Codeium. We support similar SKUs on all other cloud providers, such as Azure and GCP.
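The sizing arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope helper built from the approximate per-GPU figures quoted in this post; the function names and the lookup table are our own illustration, and real capacity will vary with usage patterns and time zones.

```python
import math

# Approximate developers supported per GPU, taken from the rough
# figures quoted in this post (not a guarantee or a benchmark).
DEVS_PER_GPU = {"A10": 100, "A100": 500, "H100": 1000}


def gpus_needed(developers: int, gpu: str = "A10") -> int:
    """Number of GPUs of the given type to cover a team, rounded up."""
    return math.ceil(developers / DEVS_PER_GPU[gpu])


def node_capacity(gpu: str, gpus_per_node: int) -> int:
    """Approximate developers supported by one multi-GPU machine."""
    return DEVS_PER_GPU[gpu] * gpus_per_node


def annual_cost_per_developer(annual_instance_cost: float,
                              developers: int) -> float:
    """Spread a yearly instance cost evenly across its developers."""
    return annual_instance_cost / developers


print(node_capacity("A10", 8))    # 8xA10 machine
print(node_capacity("A100", 8))   # 8xA100 machine
print(gpus_needed(250, "A10"))    # hypothetical 250-developer pilot
print(annual_cost_per_developer(6_000, 100))  # ~$6k/year g5 for ~100 devs
```

For example, the ~$6k/year single-A10 instance spread across ~100 developers comes to roughly $60 per developer per year in hardware cost.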

For on-prem, Codeium would require a box with a compatible GPU. We could whitelist a Dell box for you, but it is quicker to get to a pilot if you already have compatible hardware available on-prem. We have verified compatibility with a bunch of different GPU SKUs, and can support around 4000 developers in your company for less than $80k in one-time hardware cost.

Again, these are all single-node costs, and as you can imagine with Codeium supporting 100k developers (and rapidly growing) on our individual product, we are able to horizontally scale out the software to multi-node systems as well.

Deployment Process

We recognize that there might be a feeling that a self-hosted system, either on-prem or in VPC, would require a lot more deployment, maintenance, and upgrade work than a SaaS offering such as GitHub Copilot’s. We get it, and probably the highest compliment we have received from our existing customers is that Codeium for Enterprises “feels like a SaaS offering.”

For deployment of a single-node system, which can scale to thousands of developers, we have simplified the process to a Docker Compose app on a virtual machine. We don’t even need Kubernetes (although that is also possible). Internally, we have been able to set up Codeium on a fresh VPC with only 15 minutes of active working time. Upgrades are similarly easy - just apply a new set of images via a new Docker Compose file. And maintenance? Pretty much non-existent. We have built enough robustness into the system that even if a machine goes down, there is underlying persistent storage that will let you pick right back up without losing any state.

What an On-Prem GitHub Copilot Allows For

You might still not be convinced. And hey, GitHub is a big company, and there is a difference between trusting GitHub and trusting a startup like Codeium. Maybe the extra hardware cost isn’t worth it.

Well, it is actually what being self-hosted enables us to do that is way more interesting than being self-hosted in itself.

The first big, obvious capability this enables is analytics. We can actually store information and telemetry about the suggestions that your Codeium instance is providing and how developers are using the tool. How much time is being saved? What languages are getting the most value? How does usage this week compare to last week? We provide all of these analytics in real time on a portal spun up as part of the Codeium instance, to both individual users and system administrators. We are all software engineers, we like to be data-driven, and GitHub Copilot won’t ever be able to provide this granularity because, to promise some baseline security, it can’t persist information about your prompts and suggestions!

The second major capability this enables is learning from your existing code and knowledge bases. This is a little subtle, but it relies on the fact that since the Codeium instance is within your tenant, it can actually be allowed to see all of your code and documentation all of the time. This is never possible for a solution that is not self-hosted - that would literally require letting some other party continuously see all of your raw IP, not just snippets for a fleeting inference moment. We can actually use your existing code and knowledge to finetune our base model locally within your Codeium instance (again, nothing ever leaves!) to create a model that is significantly better for you than any generic model can be. So there is actually a theoretical limit on GitHub Copilot given the data it cannot see, a limit Codeium for Enterprises is able to shatter by construction. We will cover finetuning and how it is done in a follow-up blog post!

We are still discovering all of the ways that Codeium for Enterprises can provide more value to a particular company due to its nature of being self-hosted, so these are real wins on top of the security guarantees.

Self-Host Today On-Prem or in VPC

If any of this sounds even remotely interesting for your company, or if you just want a second opinion before paying a bunch of money for an AI tool, we would love to chat. Just drop us a message.


(1) From a security perspective, there are guarantees on code snippets being deleted, but GitHub will keep and use your usage data, and “some of this information may include personal data.” They are actually quite open about using this data to train ranking and sorting algorithms, improve the experience, and conduct experiments.

The legal argument is a bit trickier. The tl;dr is that these terms do not actually provide the indemnification that you may expect. First, the defense obligations don’t apply if the customer has not enabled all filtering features, which, as we analyzed in this post, are incredibly conservative and greatly degrade performance, while also failing to filter out cases that they should! Second, within those same terms, they also state “any GitHub defense obligations related to your use of GitHub Copilot do not apply if (i) the claim is based on Code that differs from a Suggestion provided by GitHub Copilot”, which means that even if all filtering features are enabled, the defense obligations do not apply if the Customer makes any changes to the Suggestions. The underlying GitHub customer agreement also limits defense to GitHub products “not combined with anything else.” In the event of a lawsuit, GitHub would absolutely argue that the incorporation of a Suggestion into a customer’s code would be both a “change” and a combination. Finally, the Copilot terms explicitly say: “You retain all responsibility for your code, including Suggestions you include in Your Code.” So taken together, what GitHub is really saying (albeit in a slightly tricky legal manner) is that “if you enable all filtering features, we’ll defend you against a claim that our unmodified suggestions, used in a vacuum and not incorporated into your own code base, infringe.” This is obviously impractical, and Microsoft very much knows this - given the far-reaching implications of such legal cases (if they do arise), even a company like Microsoft/GitHub will not want to be held liable for suits brought against individual companies.