Querying your code

Tree-sitter is amazing. I’ve used it in many cases (and it’s likely you have too, even if only indirectly). Here are a few examples of problems I’ve faced and how Tree-sitter helped me solve them.

Escaping Logseq

I’ve been migrating away from Logseq. For those unfamiliar, it’s a journal and note-taking tool. While you can edit notes in a plain text editor, it primarily uses a WYSIWYG interface.

Logseq’s Markdown is unique and weird. When they say “everything is a block,” they really mean it. For example:

- # heading
    - ## subheading
        - content
        - more content

This looks fine within Logseq and is recognized by its plugins, but once you move away, this structure doesn’t play well with standard Markdown tools. For instance, most Table of Contents generators won’t recognize these indented headings.

I wanted to transform the structure into something standard:

# heading
## subheading
- content
- more content

Because I have a massive number of notes, doing this manually was out of the question. Regex isn’t the right tool here either, you’d have to write a terrifyingly complex one. We need a syntax tree. This is where Tree-sitter comes in.

Tree-sitter is commonly used in modern editors like Neovim, Helix, and Zed. While IDEs like VS Code might use their own solutions, Tree-sitter addresses a very specific set of problems. I highly recommend checking it out. For the scope of this post, just think of it as a high-performance code-parsing library.

The Solution

You could try to do this without structured parsing, but using Tree-sitter makes it far more convenient. Once we query the nodes, we know exactly which byte offsets and position ranges to target for transformation.

Here is the Python script I used:

from tree_sitter import Parser, Query, Node
from tree_sitter_languages import get_language
from pathlib import Path
import sys

# Remove indents from the line
# indent: number of spaces to remove. Typically calculated before this function is called
def unindent_line(line: bytes, indent: int, tab_width: int = 4) -> bytes:
    removed = 0
    i = 0

    while i < len(line) and removed < indent:
        if line[i] == ord(' '):
            removed += 1
            i += 1
        elif line[i] == ord('\t'):
            removed += tab_width
            i += 1
        else:
            break

    return line[i:]

# how many leading spaces the line has
def leading_spaces(line):
    count = 0
    for b in line:
        if b == ord(' '):
            count += 1
        elif b == ord('\t'):
            count += 4  # or treat tabs separately
        else:
            break
    return count

FILE_PATH = sys.argv[1]

def depth(node: Node):
    depth = 0
    while node:
        depth += 1
        node = node.parent
    return depth

MARKDOWN_LANGUAGE = get_language('markdown')

parser = Parser()
parser.set_language(MARKDOWN_LANGUAGE)

# nodes that looks like this: `- # heading`
query: Query = MARKDOWN_LANGUAGE.query("""
(
    list_item
        (list_marker)
        (atx_heading)
) @list_item
""")

bytes = Path(FILE_PATH).read_bytes()

tree = parser.parse(bytes)
captures = query.captures(tree.root_node)

while captures:
    edits = []

    first_node, _ = captures[0]
    first_node_depth = depth(first_node)

    # We may found those `- # heading` at different depths. We want to deal with each depth one at a time.
    # Here we only take those nodes whose has the same depth as that of the first node.
    captures = [capture for capture in captures if depth(capture[0]) == first_node_depth]

    for node, name in captures:
        # remove the `- `
        edits.append(
            (node.children[0].start_byte, node.children[1].start_byte, b""))

        # If "the markdown section has a body".
        # There is a chance that the code below works even without this if statement. I haven't checked.
        if len(node.children) > 2:
            start_byte = node.children[1].end_byte+1
            end_byte = node.children[-1].end_byte
            block = bytes[start_byte:end_byte]
            lines = block.splitlines(keepends=True)

            # how many leading spaces we should remove from each line in the section body
            indent = min(
                leading_spaces(line)
                for line in lines
                if line.strip()  # ignore blank lines
            )

            new_lines = []
            for line in lines:
                if line.strip(): # ignore blank lines
                    new_lines.append(unindent_line(line,indent))
                else:
                    new_lines.append(line)

            replacement = b"".join(new_lines)
            edits.append(
                (start_byte, end_byte, replacement))

    # Apply edits.
    # It's important to apply them backwards, so the offset of the first edits are not affected.
    for start, end, replacement in sorted(edits, reverse=True):
        bytes = bytes[:start] + replacement + bytes[end:]

    # Parse the document again. Get ready to process another set of headings.
    # It is possible to just process all levels of headings in one go, with only certain parts of the tree updated,
    # for efficency, but we will just do it the simple way.
    tree = parser.parse(bytes)
    captures = query.captures(tree.root_node)

print(bytes.decode("utf-8"))
Path(FILE_PATH).write_bytes(bytes)

That’s quite a bit of code. A lot of it is just specific to the format of my notes. The most important piece is the query. The heavy lifting is done by this query:

(
    list_item
        (list_marker)
        (atx_heading)
) @list_item

This tells Tree-sitter: “Find list items containing a marker followed by an ATX heading (the # heading style), and capture them as @list_item.”

Once the query is constructed, 99% of the problem is solved. The rest of the code simply handles the byte-range manipulation and formatting.

Confusing commits

I spend a significant amount of my time doing code reviews. Occasionally, I receive patches that look like this:

diff --git a/a b/b
index 687997f..2e3028a 100644
--- a/a
+++ b/b
@@ -15,27 +15,27 @@ public class SmartWatch
          return result;
      }
 
-    public int DrinkWater(int glassesMorning, int glassesEvening)
+    public int SleepHours(int night, int nap)
      {
-        int result = glassesMorning;
-        result = result + glassesEvening;
+        int result = night;
+        result = result + nap;
 
-        if (result < 8)
+        if (result > 9)
          {
-            result = result + 1;
+            result = result - 1;
          }
 
          return result;
      }
 
-    public int SleepHours(int night, int nap)
+    public int DrinkWater(int glassesMorning, int glassesEvening)
      {
-        int result = night;
-        result = result + nap;
+        int result = glassesMorning;
+        result = result + glassesEvening;
 
-        if (result > 9)
+        if (result < 8)
          {
-            result = result - 1;
+            result = result + 1;
          }
 
          return result;

Can you guess what changed? Nothing. It’s just a function reorder.

And sometimes, it’s even trickier:

diff --git a/a b/c
index 687997f..32d5198 100644
--- a/a
+++ b/c
@@ -15,27 +15,27 @@ public class SmartWatch
          return result;
      }
 
-    public int DrinkWater(int glassesMorning, int glassesEvening)
+    public int SleepHours(int night, int nap)
      {
-        int result = glassesMorning;
-        result = result + glassesEvening;
+        int result = night;
+        result = result + nap;
 
-        if (result < 8)
+        if (result > 8)
          {
-            result = result + 1;
+            result = result - 1;
          }
 
          return result;
      }
 
-    public int SleepHours(int night, int nap)
+    public int DrinkWater(int glassesMorning, int glassesEvening)
      {
-        int result = night;
-        result = result + nap;
+        int result = glassesMorning;
+        result = result + glassesEvening;
 
-        if (result > 9)
+        if (result < 8)
          {
-            result = result - 1;
+            result = result + 1;
          }
 
          return result;

Can you guess what the code change is? It’s a function reorder plus an actual logic change. It gets worse when a reorder is mixed with an actual logic change. If the commit message is vague or misleadingly says “just a refactor”, you’re left hunting for the needle in the haystack.

Most of the time, you shouldn’t perform meaningless refactors for no reason. But occasionally, it makes sense. In those cases, I could just reject the patch and ask the author for a cleaner version, but why not do both? It’s frustrating and enjoyable at the same time to solve the puzzle. What if you have to figure this out in a legacy codebase where the author is long gone? You shouldn’t be defeated by this. We’re skilled programmers, remember?

You might be tempted to ask AI to detect the changes. That’s an option, but how can you be sure the AI isn’t hallucinating?

Let’s go back to our friend, Tree-sitter.

The query is as simple as:

(method_declaration) @method

Once you have captured the list of methods, you can simply sort them by name and write them to a temporary file. Do this for both versions of the source code, and you end up with two files that can be diffed cleanly. Any real code change will immediately come to light.

Simple queries like this can be adapted to almost any language. Other grammars might not use the exact type method_declaration, but there is almost always an equivalent, as most languages share the concept of a “function” or “method.”. This is one of the reasons why Tree-sitter is so amazing: all you have to adjust is the query, and the rest of the code you wrote still works. Unlike other AST solutions that require a complete rewrite for different languages.

I personally find Tree-sitter useful in many cases, and I bet you will, too.