Disclaimer: This post does briefly mention Codeium’s Enterprise offering, but the majority of the content is a general roadmap on how LLM development should get more personalized.
What's an LLM application?
Let’s massively simplify how we end up with an LLM application. First, a large corpus of data is used to train an LLM. Then, at runtime, the application’s logic extracts the relevant information from the current state, adds this time- and user-specific information to a standard prompt, and invokes the LLM with that prompt to generate some output. This output is then processed and integrated back into the user’s view in a seamless way.
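To make the flow concrete, here is a minimal sketch of that loop in Python. Everything in it is hypothetical and for illustration only: the `AppState` fields, the prompt template, and `fake_llm`, which stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical "standard prompt" for the application.
PROMPT_TEMPLATE = "You are a code assistant.\n\nContext:\n{context}\n\nTask:\n{task}\n"


@dataclass
class AppState:
    """Illustrative snapshot of what the user is currently looking at."""
    open_file: str
    cursor_prefix: str


def extract_context(state: AppState) -> str:
    # Step 1: pull out the time- and user-specific information.
    return f"# file: {state.open_file}\n{state.cursor_prefix}"


def build_prompt(context: str, task: str) -> str:
    # Step 2: add that information to the standard prompt.
    return PROMPT_TEMPLATE.format(context=context, task=task)


def postprocess(raw_output: str) -> str:
    # Step 4: clean up the raw output before showing it to the user.
    return raw_output.strip()


def run_llm_app(state: AppState, task: str, llm: Callable[[str], str]) -> str:
    prompt = build_prompt(extract_context(state), task)
    raw_output = llm(prompt)  # Step 3: invoke the model.
    return postprocess(raw_output)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same completion.
    return "  return a + b\n"


if __name__ == "__main__":
    state = AppState(open_file="math_utils.py", cursor_prefix="def add(a, b):")
    print(run_llm_app(state, "Complete the function body.", fake_llm))
```

The `llm` argument is just a function from prompt to text, so the same application code works whether that function wraps an external API or a model you host yourself.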
Building an LLM application on OpenAI / GPT-4
On the model side, all the rage nowadays is GPT-4 and the likely impending training of GPT-5, which may increase the model size by another couple of orders of magnitude. The GPT-N family of models is meant to be generic, able to reason about any topic and inch towards AGI. This is powerful because AI product builders can then think solely about the “application” side without constraining themselves to particular industries or use cases. The application-specific problems simply boil down to: (1) figuring out how to coax the LLM into producing outputs in line with the application in question (i.e. “prompt-building”), (2) identifying what information is relevant for this particular invocation and should be appended as “context” to the prompt, and (3) processing the output back into the application. The actual LLM invocation simply becomes an API abstraction.
Issues with Relying on External APIs
But for a model to be the best at everything, it needs to be significantly bigger than any model that is the best at one thing (both in number of parameters and in amount of training data), and this raises many practical issues. For an LLM application, relying on such a model cuts into revenue (these APIs cost non-trivial money) and typically means higher latency than a smaller, focused model would have. Then there is the existential risk: you are building an LLM application with zero control over the LLM half, which is the half with the potential for a deep technical moat, knowing full well that someone could create a model that has higher quality on just your application.
Fine-tuning Models to Application
These issues have already been recognized - it is more and more common for companies to bootstrap with GPT-N but then train an application-specific model from scratch (or fine-tune / distill an open source model). This adds complexity to the training pipeline, but it gives you a lot more control. For example, we at Codeium build LLMs for code, and all of our models are trained in-house with a combination of natural language and code-specific data and training stages. This gives us complete control over our data, which means we can do things to improve the quality of the code suggestions (e.g. guarantee enough presence of rarer languages in the training corpus) as well as things that we think are the right thing to do that someone like OpenAI might not follow (e.g. removing GPL and other copyleft code from the training data).
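As a rough illustration of what that kind of data control can look like, here is a small Python sketch of the two example steps: dropping copyleft-licensed files and making sure rarer languages are present. This is not Codeium's actual pipeline. The `Sample` record, the license tags, and the duplication-based upsampling are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative set of license tags to exclude; a real pipeline would need
# a much more careful license detection step.
COPYLEFT_LICENSES = {"GPL-2.0", "GPL-3.0", "AGPL-3.0", "LGPL-3.0"}


@dataclass(frozen=True)
class Sample:
    """Hypothetical training record: one source file with its metadata."""
    language: str
    license: str
    text: str


def drop_copyleft(samples: list[Sample]) -> list[Sample]:
    # Keep only files whose license tag is not in the excluded set.
    return [s for s in samples if s.license not in COPYLEFT_LICENSES]


def upsample_rare(samples: list[Sample], min_count: int) -> list[Sample]:
    # Repeat files from under-represented languages (round-robin) until
    # every language present has at least `min_count` samples.
    counts = Counter(s.language for s in samples)
    result = list(samples)
    for language, count in counts.items():
        if count < min_count:
            pool = [s for s in samples if s.language == language]
            for i in range(min_count - count):
                result.append(pool[i % len(pool)])
    return result


if __name__ == "__main__":
    corpus = [
        Sample("python", "MIT", "def a(): ..."),
        Sample("python", "MIT", "def b(): ..."),
        Sample("python", "GPL-3.0", "def c(): ..."),
        Sample("ocaml", "Apache-2.0", "let d = ..."),
    ]
    curated = upsample_rare(drop_copyleft(corpus), min_count=2)
    print(Counter(s.language for s in curated))
```

Duplicating files is the crudest possible way to rebalance a corpus; it is used here only to keep the sketch short.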
User and Enterprise Level Personalization at the Model Level
But to the crux of this post - why stop there? Why stop the model customization at the application level? Why not go all the way to the user level?
If you could further personalize your application-specific model using the “data examples” that a specific customer has, then you would create the theoretically best-performing model for that application for that user. It doesn’t get better than that. For code, many companies are the industry leaders in what they do, have massive internal libraries of utilities, and adhere to specific conventions across millions of lines of code. Personalization here could take the form of fine-tuning, retrieval, or anything else that tailors the model to the user’s codebase.
It may be less clear why two users of the same application would benefit from this than why two users of different applications benefit from the existing fine-tuning step. We will give some intuition for why going through this extra complexity will always be better than a generic model with the proper context, using the code LLM space as an example.
It boils down to how conventions are taken into account for any inference invocation - for a generic code LLM to adhere to syntactic conventions or to use libraries and utilities present in the particular codebase, we need to pass all of that code in as context. But LLMs have a limited context length (especially for low-latency applications like code autocomplete)! Say you want to write a unit test for a new function you added. For a generic code LLM, you probably want the function you are testing as context, but what about other tests, to pick up test structure conventions or utility functions and keep the codebase consistent? There is so much nuance here that it is close to impossible to write heuristics that know exactly what to pull in. But if the model were personalized on the existing code? Then you wouldn’t need to waste the invocation context on snippets just to adhere to the existing codebase. That gives a massive boost in performance that no generic model could ever reach.
Of course, the complexities lie in building that training infrastructure in a way that maintains the security of the user’s or company’s data, and in running inference across so many different variations of the same model. But from a buyer’s perspective, this is the holy grail of AI tooling.
A Short Plug: Codeium for Enterprise
At Codeium, we have built local personalization into our Enterprise offering. Combined with being fully self-hosted, this gives enterprises the most advanced AI-powered code acceleration toolkit for their codebases without sacrificing anything with regard to code security. Your codebase and the trained models never leave your on-prem servers or VPC, however you choose to deploy.