Published on

Using Code Syntax Parsing for Generative AI

Written by
Prem Nair, Anshul Ramachandran

Parsing syntax is fundamental to modern programming. It’s an important step in running high-level programming languages such as Python, Javascript, and HTML as well as in protocols like HTTP and database languages like SQL. It’s also used in applications such as syntax highlighting or code indexing. If you want to write an application that works on source code, you’ll need to parse code syntax, but it’s tricky to do so across multiple languages in a consistent and useful way. In this post, we:

  • Discuss some of the details of syntax parsing
  • Walk through one of the many small challenges we dealt with
  • Introduce an open source tool, codeium-parse, to use in your own applications

Why do we need to parse syntax?

Our product at Codeium started with AI-powered code autocomplete, and we realized that having some understanding of syntax actually would help us produce better autocomplete blocks. We could recognize when people are starting blocks of code, truncate model results at the appropriate boundaries (e.g., end of current scope), and more. In general, we realized that for generative AI, we could use syntax understanding to improve both how we compose the model's limited contextual input as well as our logic that cleans up model outputs.

Then, we built a novel capability – AI-powered repo-wide semantic search, i.e., ask a question in natural language such as “Where do we do X?” and get pointed to the exact code snippet that does that. Under the hood, it works by computing embeddings of code snippets and then performing a proprietary embedding search algorithm with the queries. For data security reasons, we wanted to compute and store the embeddings for the entire codebase locally (your entire codebase should not leave your machine), but that meant we would have to be picky about what we index like function signatures, global variables, etc. Syntax parsing all of a sudden became a requirement rather than a nice quality bump.

In the future, we plan on using code syntax understanding for a bunch more as a structured, deterministic complement to the AI that we are developing. Ideally, we would have an easily extendable framework to parse code syntax across languages in a consistent manner, rather than just throwing in custom logic for each of the many, many languages out there. And that ideal is what led us to the work in this post.

Code syntax parsing under the hood

Parsing and using code syntax for any application can be simplified into two steps:

  • Computing a structured intermediate representation of the code
  • Applying queries on this intermediate representation to extract useful information

One of the most common intermediate representations used for code is a concrete syntax tree (CST). By unrolling code into a tree where the nodes have metadata and the edges have semantic meaning, we can query for subtrees that match certain patterns to extract the information we want. For example, we might convert the following simple JavaScript code block:

// Adds two numbers.
function add(a, b) {
  return a + b;
}

Into the following CST (note: this format is what is produced by the codeium-parse binary that we introduce later):

program [0, 0] - [4, 0] "// Adds two numbers.\n…"
  comment [0, 0] - [0, 20] "// Adds two numbers."
  function_declaration [1, 0] - [3, 1] "function add(a, b) {\n…"
    name: identifier [1, 9] - [1, 12] "add"
    parameters: formal_parameters [1, 12] - [1, 18] "(a, b)"
      identifier [1, 13] - [1, 14] "a"
      identifier [1, 16] - [1, 17] "b"
    body: statement_block [1, 19] - [3, 1] "{\n…"
      return_statement [2, 4] - [2, 17] "return a + b;"
        binary_expression [2, 11] - [2, 16] "a + b"
          left: identifier [2, 11] - [2, 12] "a"
          right: identifier [2, 15] - [2, 16] "b"

or visually:

drawing

Then, looking for something like arguments to each function just becomes querying for a subtree that looks like a parameters node with a number of identifier subnodes, and extracting the identifier nodes (as shown in the green nodes in the diagram).

One of the best and most popular libraries to produce CSTs is tree-sitter, authored by Max Brunsfeld. It’s used by Github’s website to do things such as syntax highlighting and code symbol navigation. There are two main technical advantages of tree-sitter:

  • It can incrementally parse and update the CST when there are edits
  • It’s error-tolerant, so it’s good for in-progress typing

In addition to these, tree-sitter does not need to execute any compiler or other tools, and importantly for our applications, applies consistently to all languages if you have a grammar for the language.

With tree-sitter, we have a pretty good framework for the first half of parsing and using code syntax. tree-sitter actually makes each of its language bindings implement their own grammars and predicates (essentially filters and post-processing steps). This is probably to avoid having to build it with dependencies like a regex library, so we knew that this was something we had to build, but the framework was in place.

Now, on to the second half of using code syntax — the queries. tree-sitter has its own scheme-like query language and execution engine built in. However, it turns out that the queries in the tree-sitter repos are often stale, and queries in places like nvim-treesitter are mostly used for syntax highlighting and navigation. We wanted much more powerful queries to be used in real code analysis for core applications (as opposed to prettifiers and nice-to-have-but-not-required tricks). So, we realized there wasn’t an existing library of queries with similar power to what tree-sitter has done for parsing. This is where we would have to build the most.

Technical challenges in code syntax parsing

The most interesting technical challenge was actually writing the queries. Originally we hand-wrote code that navigated the CST. But this was brittle, so we switched to tree-sitter query language, which lets you write patterns to capture subtrees of the CST. But to see why it's tricky, let’s try writing a query that would capture any function docstring (or identify if the function doesn’t have a docstring). Consider the following code snippets:

// Comment
// Second line
function foo() {}

and

// Comment
export function foo() {}

The only difference is an export keyword! How hard will writing a query be that could capture both?

Let’s write out the CSTs (flattened, so the brackets represent hierarchy):

  • Code snippet one: [doc] [doc] [function def]
  • Code snippet two: [doc] [export [function def]]

Well, given the first code snippet CST, clearly we would need something that looks for some number of [doc] nodes followed by a [function def] node, and then return the [doc] nodes, so maybe something like:

[doc]* [function def]

(Note: this is not actual tree-sitter query syntax, but we are using this shorthand for legibility - consider the asterisk like a regex asterisk, which translates to “any number of [doc] type nodes”)

This will definitely work for the first snippet, but not for the second. The query would match, but not where we want it to! We would match the [function def] subtree and say there are no [doc] nodes (zero repeats for the asterisk): [doc] [export {[doc]^0 [function def]} ] (match is between the braces).

Okay, to match the proper doc string in the second example, we could add another option to our query:

[doc]* [function def]
OR
[doc]* [export [function def]]

Well now, we actually match the second code snippet twice, once incorrectly like before, and once correctly. So now we need to find a way for the first option to not match the second code snippet while still matching the first.

The solution to this is a [#not-has-parent? export] predicate on the [function def] node. This means that we will match on the [function def] node only if it is not a child of an export node (which is true in the first code snippet, but not the second). So the final query looks like:

[doc]* [function def][#not-has-parent? export]
OR
[doc]* [export [function def]]

The first condition will match only functions similar to the first code snippet (without the export keyword) and the second condition will match only functions similar to the second code snippet (with the export keyword). In practice we have to do something even more complicated.

Beyond writing powerful queries, there was an assortment of other technical issues we faced while building a library that provided language-specific tree-sitter grammars and useful queries:

  • We wanted all of the parsers to integrate into one cross platform binary so that anyone could use them anywhere. We used a lot of cgo and wrote our own tree-sitter bindings in Go. Other bindings do exist like this one, but they don’t implement enough predicates and don’t play well with our bazel monorepo.
  • tree-sitter is great, but to provide end-to-end parsing and querying, we hit a bunch of OOMs and timeouts at the parsing stage since we’ve stress-tested tree-sitter on a vast amount of code during our work indexing entire codebases. We’ve started to use fuzz testing to detect and improve parser issues, and will upstream any fixes we’re able to find.
  • We wanted to make sure that the tool could play nicely as a portable unix-like tool with other tools. This meant writing JSON outputs and creating a single statically linked binary, among other things.

Open sourcing a syntax parsing command line tool

After working on this code syntax parsing and querying project for a while, we realized (a) we had done a nontrivial amount of work that we should share with the broader community to use and (b) it would be best for us to also get the contributions from the broader community moving forwards. So we open sourced it at codeium-parse, an end-to-end command line tool for parsing and querying code syntax!

We have created some high quality queries for the community and hope to crowdsource it for rarer languages. Anyone can contribute in a few different ways:

  • Find/think of useful applications of code syntax parsing that can use codeium-parse
  • Tell us what grammars you want to see supported
  • Contribute queries in new languages to help us share useful information extracted about code
  • Add unimplemented attributes we’re not yet detecting, such as imports, enums, and constants

We would love to work with you to keep pushing code syntax parsing forwards!

About Codeium

Here at Codeium, we are building the most powerful AI-powered code acceleration toolkit. We started with autocomplete, similar to tools like GitHub Copilot, getting it to a point where we are confident in saying we are as good, if not better, than all competitors. While we are continuing to improve autocomplete, we are also building novel AI applications to the software development workflow, such as natural language based, repo-wide code search.

We are making Codeium free forever for all individual developers, and will monetize by working with enterprises (there are some AI superpowers that are only viable at organizational, rather than individual, levels). We already have thousands of developers using Codeium daily to accelerate their development cycles, and we’d love for you to give it a spin as well. Codeium is available on all major IDEs, including as a Chrome extension for notebooks and websites, and works on over 40 of the most popular programming languages.