This is part 3 of the codetect series:
- Part 1: Building an MCP Code Search Tool (v0)
- Part 2: When Better Models Aren't Enough (v1)
- Part 3: From Line Chunks to AST-Based Understanding (v2) ← You are here
- Part 4: When Every Improvement Made Things Worse
The Problem with Naive Chunking
Picture this: you're searching for authentication logic in your codebase. Your semantic search returns a chunk that starts mid-function, cutting off the function signature and half the context. The embedding captured some relevant keywords, but the chunk boundary destroyed the very structure that makes code comprehensible.
This was the reality of codetect v0 and v1. Despite increasingly powerful embedding models, the fundamental problem remained: line-based chunking doesn't understand code structure.
Here's what that looked like:
Chunk 1 (lines 1-512):

```javascript
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}

function processOrder(order) { // ← Function split here!
  const total = calculateTotal();
  const tax = total * 0.08;
// [CHUNK BOUNDARY - Context lost!]
```

Chunk 2 (lines 463-975, 50-line overlap):

```javascript
  const tax = total * 0.08; // ← Starts mid-function
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
No amount of model sophistication can fix chunks that split functions in half.
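The naive strategy is easy to pin down in code. Here's a minimal sketch of a v0/v1-style fixed-window chunker (an illustration, not codetect's actual implementation):

```python
def chunk_lines(lines, size=512, overlap=50):
    """Fixed-window chunker: slide a 512-line window with 50 lines of
    overlap, completely blind to function and class boundaries."""
    chunks, start, step = [], 0, size - overlap
    while start < len(lines):
        chunks.append(lines[start:start + size])
        start += step
    return chunks
```

Nothing in this loop knows where a function ends, so any function straddling a window boundary gets cut in half, overlap or not.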
Quick Context: The Journey to v2
codetect started in November 2025 when I noticed my Claude API costs climbing and coworkers saying Cursor felt faster than Claude Code. The key difference? Codebase indexing.
So I built an MCP-native code search tool that works with any LLM supporting the Model Context Protocol—not just one vendor. Local-first, no token costs, open source.
(For the full origin story, see Part 1: Building an MCP Code Search Tool)
The Evolution: v0 → v1 → v2
v0 (November 2025): MVP with line-based chunking (512 lines, 50-line overlap). Simple, shipped fast, validated the MCP-native approach. But functions got split awkwardly across chunks.
v1 (January 2026): Added PostgreSQL + pgvector (60x faster search at scale), better embedding models (bge-m3), multi-repo support, and an eval framework. The surprise? Better models barely improved quality. The eval revealed that roughly 60% of search results contained incomplete functions, because line-based chunking doesn't respect code structure.
Key insight from v1: Better models don't fix bad chunks. We needed to chunk by semantic units (functions, classes) instead of arbitrary lines.
v2: AST-Based Intelligence (February 2026)
Philosophy: Chunk code the way developers think about it.
This is where everything changed. Instead of treating code like text, v2 understands it:
- AST traversal: Parse code into syntax trees, chunk by semantic units (functions, classes, modules)
- Tree-sitter integration: Support for 10+ languages (Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Ruby, PHP)
- Merkle tree change detection: Sub-second incremental updates (detect what changed at chunk level)
- Content-addressed embedding cache: 95% cache hit rate on incremental updates (only re-embed changed chunks)
- Dimension-grouped tables: Multiple repos can use different models without conflicts
- Parallel embedding: 3.3x faster with configurable workers (`-j` flag)
Result: Code search that understands structure—though as we'd learn, precision and usefulness aren't always the same thing.
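The content-addressed cache in that list is worth a closer look. The idea can be sketched in a few lines: key each embedding by the hash of the chunk's text, so an unchanged chunk is a guaranteed cache hit (a minimal sketch; `embed_fn` is a stand-in for the model call, not codetect's real API):

```python
import hashlib

class EmbeddingCache:
    """Content-addressed embedding cache: unchanged chunks hash to the
    same key and are never re-embedded. Sketch only, assumed interface."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk: str):
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(chunk)
        return self.store[key]
```

Because the key is derived from content rather than file path or line number, moving a function within a file (or renaming the file) still hits the cache.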
The Breakthrough: AST-Based Chunking
The core innovation in v2 is simple but powerful: parse code before chunking it.
What is AST traversal?
An Abstract Syntax Tree (AST) represents code's grammatical structure. Instead of seeing code as lines of text, we see it as a tree of functions, classes, and statements.
Tree-sitter—a parser generator used by GitHub, Neovim, and others—makes this practical. It's fast, incremental, and supports 10+ languages out of the box.
Why it matters for code search:
When you chunk by functions instead of lines:
- Embeddings capture complete semantic units
- Search results include full function signatures and bodies
- Context is preserved (no mid-function splits)
- Smaller chunks mean faster search and better retrieval
Before (v0/v1): Arbitrary 512-line chunks with 50-line overlap.
After (v2): Each function, class, or module is its own chunk.
Chunk 1 (function: calculateTotal):

```javascript
// v2: Each function is a complete chunk
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}
```

Chunk 2 (function: processOrder):

```javascript
function processOrder(order) {
  const total = calculateTotal();
  const tax = total * 0.08;
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
Clean boundaries. Complete context. Better embeddings.
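To make the approach concrete, here's a dependency-free sketch using Python's stdlib `ast` module. codetect itself uses tree-sitter (so the same idea works across 10+ languages), but the core move is identical: parse first, then emit one chunk per semantic unit:

```python
import ast

def chunk_by_function(source: str):
    """Split Python source into one chunk per top-level function or
    class. Illustration of AST-based chunking; codetect uses
    tree-sitter rather than the stdlib ast module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # get_source_segment recovers the exact source span of the
            # node, so each chunk is a complete, compilable unit.
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

A chunk produced this way always starts at a signature and ends at the unit's closing line, which is exactly the property the fixed-window chunker could not guarantee.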
Performance Wins
The numbers tell the story:
Incremental Indexing Performance (v1 → v2)
| Repo Size | v1.x (line-based) | v2.0 (Merkle + AST) | Speedup |
|---|---|---|---|
| 100 files | 30s | 2s | 15x faster |
| 1,000 files | 5m | 20s | 15x faster |
| 5,000 files | 25m | 1m 40s | 15x faster |
Embedding Performance (v1 → v2)
| Operation | v1.13.0 | v2.0.0 (sequential) | v2.0.0 (-j 10) | Speedup |
|---|---|---|---|---|
| 100 files | 45s | 45s | 12s | 3.75x |
| 1,000 files | 7m 30s | 7m 30s | 2m 15s | 3.3x |
| 5,000 files | 37m 30s | 37m 30s | 11m 15s | 3.3x |
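The `-j` parallelism behind this table reduces to a worker pool over independent chunks. A minimal sketch (assumed shape, not codetect's implementation; `embed_fn` stands in for the model call):

```python
from concurrent.futures import ThreadPoolExecutor

def embed_all(chunks, embed_fn, workers=10):
    """Embed chunks concurrently with a bounded worker pool, mirroring
    `codetect embed -j 10`. pool.map preserves input order, so results
    line up with their chunks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed_fn, chunks))
```

Embedding is I/O-bound (waiting on the model server), which is why a thread pool is enough to get near-linear speedups until the embedding backend saturates.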
Search Quality (v0 → v1 → v2)
| Metric | v0 (line chunks) | v1 (better models) | v2 (AST chunks) |
|---|---|---|---|
| Retrieval accuracy (F1) | ~60% | ~63% | ~68% |
| Context preserved | Poor | Poor | Good |
| Function completeness | ~40% | ~40% | ~80% |
| Token overhead vs baseline | ~-10% | not measured | ~-6.5% |
Accuracy measured via eval framework with enriched context enabled. Token overhead measured against Claude Code's built-in tools as baseline.
Key takeaway: Merkle tree change detection (15x faster incremental indexing) + content-addressed caching (95% hit rate) + parallel embedding (3.3x faster) = a tool that actually keeps up with your development workflow. But as we'd soon discover, these precision improvements came with tradeoffs for agent consumption (Part 4).
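The Merkle-tree half of that equation can be sketched as a recursive hash over the repo tree (an illustration under an assumed structure, not codetect's schema):

```python
import hashlib

def sha(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_hash(node) -> str:
    """node is either file contents (str) or a dict mapping child names
    to nodes. A directory's hash depends only on its children's hashes,
    so an unchanged subtree yields an identical hash and everything
    under it can be skipped without reading a single file."""
    if isinstance(node, str):
        return sha(node)
    return sha("\n".join(
        name + ":" + merkle_hash(child)
        for name, child in sorted(node.items())
    ))
```

Comparing the old and new root hashes answers "did anything change?" in one comparison; descending only into subtrees whose hashes differ localizes the change, which is where the sub-second incremental updates come from.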
Multi-Repo Architecture
v2 introduces dimension-grouped embedding tables, enabling a critical capability: multiple repositories can use different embedding models without conflicts.
Why this matters:
- Organizations have diverse codebases (Python microservices, Go services, JavaScript frontends)
- Different languages benefit from different embedding models
- Teams want centralized search infrastructure without forcing model uniformity
How it works:
- Embeddings are stored in tables grouped by dimension (e.g., `embeddings_768`, `embeddings_1024`)
- Each repo tracks its embedding model in metadata
- Search queries automatically route to the correct table
- Migration from v1 is automatic—the first index run detects and upgrades your schema
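The routing step boils down to a tiny lookup. A sketch (the `embedding_dim` metadata field name is an assumption for illustration; only the `embeddings_<dim>` table naming comes from the doc):

```python
def table_for_dimension(dim: int) -> str:
    """Dimension-grouped table name, e.g. embeddings_768."""
    return f"embeddings_{dim}"

def route_query(repo_metadata: dict) -> str:
    """Each repo records its model's output dimension in metadata;
    a search routes to the matching table, so repos using different
    models never collide."""
    return table_for_dimension(repo_metadata["embedding_dim"])
```

Grouping by dimension rather than by model also means two different models with the same output size can share a table, keeping the schema small.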
Deployment options:
- Local SQLite: Perfect for individual developers
- Shared PostgreSQL: Team-wide search infrastructure
- litellm adapter: Optional cloud LLM integration for embedding generation
Developer Experience
Performance is only half the story. v2 includes UX improvements that make it feel like a mature tool:
- Zero breaking changes: v1.x indexes auto-upgrade on first v2 run
- Automatic dimension migration: Switching embedding models "just works"
- Short flag aliases: `-f` for `--force`, `-j` for `--parallel` (Unix-style UX)
- Config preservation: Reinstalls no longer overwrite user settings
- Better error messages: Clearer diagnostics when something goes wrong
- Model selection in eval runner: Choose `sonnet`, `haiku`, or `opus` with cost-aware defaults
Getting Started
```shell
# Install codetect v2.0.0
git clone https://github.com/brian-lai/codetect.git
cd codetect
./install.sh

# In your project
cd /path/to/your/project
codetect init

# v2.0: Use AST-based indexer with parallel embedding
codetect index --v2   # AST chunking + Merkle tree
codetect embed -j 10  # Parallel embedding (10 workers)

# Start Claude Code (or any MCP-compatible LLM)
claude
```
That's it. codetect runs as an MCP server in the background, providing semantic search to your LLM tool.
Documentation:
What's Next
v2 shipped with real infrastructure wins: 15x faster incremental indexing, 3.3x faster embedding, and precise function-level chunks. The engineering was solid.
But shipping is just the beginning. After running v2 in daily development for a week, I started noticing something unexpected: the agent was making more searches to get the same answers it used to find in one shot. The precise chunks that looked great on eval metrics were missing something that the old sloppy line-based chunks provided for free.
That's the story of Part 4: When Every Improvement Made Things Worse.
Interested in contributing? Check out the GitHub issues.
Lessons So Far: v0 → v2 in 3 Months
Building codetect from zero to v2 in three months taught me a few things:
- Start simple (v0): Line-based chunking got something working fast. Perfect is the enemy of shipped.
- Measure what matters (v1): Adding better models didn't improve results. The eval framework (added in v1.x) revealed chunking as the bottleneck.
- Fix the root cause (v2): AST-based chunking addressed function completeness. But as I'd soon learn, the "root cause" was more nuanced than I thought.
- Structure matters: Respecting code structure improves precision—but precision isn't the only thing that matters for agent workflows.
- Iterate based on data: Without the eval framework, I would've kept throwing models at the problem instead of fixing chunking. But you also need to measure the right thing.
What I'd do differently:
- Add the eval framework from day 1 (hard to improve what you don't measure)
- Measure end-to-end agent performance (tokens-to-answer), not just retrieval precision
- Consider tree-sitter earlier (AST parsing is a solved problem—don't reinvent it)
Try codetect v2.0.0
codetect is open source (MIT license) and built for developers who want fast, local-first code search for any MCP-compatible LLM.
Links:
If you're using Claude Code, Continue, or any MCP-compatible tool, give codetect a try. And if you find it useful, star the repo and share your feedback.
Happy coding.