This is part 3 of the codetect series:
- Part 1: Building an MCP Code Search Tool (v0)
- Part 2: When Better Models Aren't Enough (v1)
- Part 3: From Line Chunks to AST-Based Understanding (v2) ← You are here
The Problem with Naive Chunking
Picture this: you're searching for authentication logic in your codebase. Your semantic search returns a chunk that starts mid-function, cutting off the function signature and half the context. The embedding captured some relevant keywords, but the chunk boundary destroyed the very structure that makes code comprehensible.
This was the reality of codetect v0 and v1. Despite increasingly powerful embedding models, the fundamental problem remained: line-based chunking doesn't understand code structure.
Here's what that looked like:
Chunk 1 (lines 1-512):

```javascript
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}

function processOrder(order) { // ← Function split here!
  const total = calculateTotal();
  const tax = total * 0.08;
// [CHUNK BOUNDARY - Context lost!]
```

Chunk 2 (lines 463-975, 50-line overlap):

```javascript
  const tax = total * 0.08; // ← Starts mid-function
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
No amount of model sophistication can fix chunks that split functions in half.
Quick Context: The Journey to v2
codetect started in November 2025 when I noticed my Claude API costs climbing and coworkers saying Cursor felt faster than Claude Code. The key difference? Codebase indexing.
So I built an MCP-native code search tool that works with any LLM supporting the Model Context Protocol—not just one vendor. Local-first, no token costs, open source.
(For the full origin story, see Part 1: Building an MCP Code Search Tool)
The Evolution: v0 → v1 → v2
v0 (November 2025): MVP with line-based chunking (512 lines, 50-line overlap). Simple, shipped fast, validated the MCP-native approach. But functions got split awkwardly across chunks.
v1 (January 2026): Added PostgreSQL + pgvector (60x faster search), better embedding models (bge-m3), multi-repo support, and an eval framework. The surprise? Better models barely improved quality. The eval revealed that ~40% of search results were incomplete functions, because line-based chunking doesn't respect code structure.
Key insight from v1: Better models don't fix bad chunks. We needed to chunk by semantic units (functions, classes) instead of arbitrary lines.
v2: AST-Based Intelligence (February 2026)
Philosophy: Chunk code the way developers think about it.
This is where everything changed. Instead of treating code like text, v2 understands it:
- AST traversal: Parse code into syntax trees, chunk by semantic units (functions, classes, modules)
- Tree-sitter integration: Support for 10+ languages (Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Ruby, PHP)
- Merkle tree change detection: Sub-second incremental updates (detect what changed at chunk level)
- Content-addressed embedding cache: 95% cache hit rate on incremental updates (only re-embed changed chunks)
- Dimension-grouped tables: Multiple repos can use different models without conflicts
- Parallel embedding: 3.3x faster with configurable workers (`-j` flag)
Result: Actually good code search that understands structure.
The Breakthrough: AST-Based Chunking
The core innovation in v2 is simple but powerful: parse code before chunking it.
What is AST traversal?
An Abstract Syntax Tree (AST) represents code's grammatical structure. Instead of seeing code as lines of text, we see it as a tree of functions, classes, and statements.
Tree-sitter—a parser generator used by GitHub, Neovim, and others—makes this practical. It's fast, incremental, and supports 10+ languages out of the box.
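To make the idea concrete, here's a minimal sketch of AST-based chunking. codetect does this with tree-sitter so it works across 10+ languages; the sketch below uses Python's built-in `ast` module instead, purely to stay dependency-free, but the principle is the same: parse first, then emit one chunk per top-level function or class.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split a Python file into one chunk per top-level function or class.

    Illustrative only: codetect uses tree-sitter for multi-language support;
    the stdlib `ast` module is used here to keep the sketch self-contained.
    """
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,   # FunctionDef, ClassDef, ...
                "start_line": node.lineno,
                "end_line": node.end_lineno,   # Python 3.8+
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk carries its own complete signature and body, which is exactly what the embedding model needs to see.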
Why it matters for code search:
When you chunk by functions instead of lines:
- Embeddings capture complete semantic units
- Search results include full function signatures and bodies
- Context is preserved (no mid-function splits)
- Smaller chunks mean faster search and better retrieval
Before (v0/v1): Arbitrary 512-line chunks with 50-line overlap.
After (v2): Each function, class, or module is its own chunk.
In v2, each function is a complete chunk:

Chunk 1 (function: calculateTotal):

```javascript
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}
```

Chunk 2 (function: processOrder):

```javascript
function processOrder(order) {
  const total = calculateTotal();
  const tax = total * 0.08;
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
Clean boundaries. Complete context. Better embeddings.
Performance Wins
The numbers tell the story:
Incremental Indexing Performance (v1 → v2)
| Repo Size | v1.x (line-based) | v2.0 (Merkle + AST) | Speedup |
|---|---|---|---|
| 100 files | 30s | 2s | 15x faster |
| 1,000 files | 5m | 20s | 15x faster |
| 5,000 files | 25m | 1m 40s | 15x faster |
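The incremental speedup comes from the Merkle tree: every file gets a content hash, and every directory gets a hash of its children's hashes, so an unchanged subtree can be recognized from a single comparison. Here's a rough sketch of that structure (files only, to keep it short; codetect tracks changes down to the chunk level):

```python
import hashlib
from pathlib import Path

def build_merkle(path: Path, hashes: dict[str, str]) -> str:
    """Hash a file by its content, a directory by its children's hashes."""
    h = hashlib.sha256()
    if path.is_file():
        h.update(path.read_bytes())
    else:
        for child in sorted(p for p in path.iterdir() if p.name != ".git"):
            h.update(child.name.encode())
            h.update(build_merkle(child, hashes).encode())
    digest = h.hexdigest()
    hashes[str(path)] = digest
    return digest

def changed_files(root: Path, previous: dict[str, str]) -> list[str]:
    """Compare the fresh tree against the last indexed one; any path whose
    hash matches can be skipped without re-chunking or re-embedding."""
    current: dict[str, str] = {}
    build_merkle(root, current)
    return [p for p, d in current.items()
            if Path(p).is_file() and previous.get(p) != d]
```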
Embedding Performance (v1 → v2)
| Repo Size | v1.13.0 | v2.0.0 (sequential) | v2.0.0 (`-j 10`) | Speedup |
|---|---|---|---|---|
| 100 files | 45s | 45s | 12s | 3.75x |
| 1,000 files | 7m 30s | 7m 30s | 2m 15s | 3.3x |
| 5,000 files | 37m 30s | 37m 30s | 11m 15s | 3.3x |
Search Quality (v0 → v1 → v2)
| Metric | v0 (line chunks) | v1 (better models) | v2 (AST chunks) |
|---|---|---|---|
| Retrieval accuracy | 60% | 65% | 85% |
| Context preserved | Poor | Poor | Excellent |
| Function completeness | 40% | 40% | 95% |
Tested on a 1,000-query eval suite across 10 open-source repos
Key takeaway: Merkle tree change detection (15x faster incremental indexing) + content-addressed caching (95% hit rate) + parallel embedding (3.3x faster) = a tool that actually keeps up with your development workflow.
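A minimal sketch of the other two pieces, assuming a generic `embed_fn` callable (the function names and cache layout here are illustrative, not codetect's actual internals): each chunk's cache key is a hash of its content plus the model name, so only cache misses are sent to the embedding model, and those misses go through a worker pool much like `codetect embed -j 10`.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def chunk_key(chunk_text: str, model: str) -> str:
    """Content-addressed cache key: same chunk + same model => same key."""
    return hashlib.sha256(f"{model}\n{chunk_text}".encode()).hexdigest()

def embed_chunks(chunks, model, cache, embed_fn, workers=10):
    """Embed only the chunks whose content hash isn't cached, in parallel.

    `embed_fn(text) -> list[float]` stands in for whatever embedding
    backend is configured (local model, litellm adapter, etc.).
    """
    keys = [chunk_key(c, model) for c in chunks]
    misses = [(k, c) for k, c in zip(keys, chunks) if k not in cache]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = [c for _, c in misses]
        for (k, _), vector in zip(misses, pool.map(embed_fn, texts)):
            cache[k] = vector
    return [cache[k] for k in keys]
```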
Multi-Repo Architecture
v2 introduces dimension-grouped embedding tables, enabling a critical capability: multiple repositories can use different embedding models without conflicts.
Why this matters:
- Organizations have diverse codebases (Python microservices, Go services, JavaScript frontends)
- Different languages benefit from different embedding models
- Teams want centralized search infrastructure without forcing model uniformity
How it works:
- Embeddings are stored in tables grouped by dimension (e.g., `embeddings_768`, `embeddings_1024`)
- Each repo tracks its embedding model in metadata
- Search queries automatically route to the correct table
- Migration from v1 is automatic—the first index run detects and upgrades your schema
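As a sketch of how that routing could look (hypothetical table and column names, assuming psycopg 3 against a pgvector-enabled PostgreSQL; not codetect's actual schema):

```python
# Hypothetical illustration of dimension-grouped routing, not codetect's schema.
MODEL_DIMENSIONS = {"bge-m3": 1024, "nomic-embed-text": 768}

def search_repo(conn, query_vector: list[float], model: str, limit: int = 10):
    """Pick the embeddings table matching the model's dimension, then run a
    nearest-neighbour query with pgvector's cosine-distance operator."""
    table = f"embeddings_{MODEL_DIMENSIONS[model]}"   # e.g. embeddings_1024
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    sql = (
        f"SELECT chunk_id, content, embedding <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT %s"
    )
    with conn.cursor() as cur:                        # psycopg 3 connection
        cur.execute(sql, (vector_literal, limit))
        return cur.fetchall()
```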
Deployment options:
- Local SQLite: Perfect for individual developers
- Shared PostgreSQL: Team-wide search infrastructure
- litellm adapter: Optional cloud LLM integration for embedding generation
Developer Experience
Performance is only half the story. v2 includes UX improvements that make it feel like a mature tool:
- Zero breaking changes: v1.x indexes auto-upgrade on first v2 run
- Automatic dimension migration: Switching embedding models "just works"
- Short flag aliases: `-f` for `--force`, `-j` for `--parallel` (Unix-style UX)
- Config preservation: Reinstalls no longer overwrite user settings
- Better error messages: Clearer diagnostics when something goes wrong
- Model selection in eval runner: Choose `sonnet`, `haiku`, or `opus` with cost-aware defaults
Getting Started
```bash
# Install codetect v2.0.0
git clone https://github.com/brian-lai/codetect.git
cd codetect
./install.sh

# In your project
cd /path/to/your/project
codetect init

# v2.0: Use AST-based indexer with parallel embedding
codetect index --v2   # AST chunking + Merkle tree
codetect embed -j 10  # Parallel embedding (10 workers)

# Start Claude Code (or any MCP-compatible LLM)
claude
```
That's it. codetect runs as an MCP server in the background, providing semantic search to your LLM tool.
What's Next: v3.0 Roadmap
v2 laid the foundation. Here's where we're headed:
- LSP integration: Real-time indexing as you type (no manual `codetect index` needed)
- Graph-based navigation: Call graphs, dependency graphs, semantic relationships
- Distributed indexing: Horizontal scaling for monorepos (think Google-scale codebases)
- Smart chunking strategies: Language-specific optimizations (e.g., treating React components differently than utility functions)
Interested in contributing? Check out the GitHub issues.
Lessons Learned: v0 → v2 in 3 Months
Building codetect from zero to v2 in three months taught me a few things:
- Start simple (v0): Line-based chunking got something working fast. Perfect is the enemy of shipped.
- Measure what matters (v1): Adding better models didn't improve results. The eval framework (added in v1.x) revealed chunking as the bottleneck.
- Fix the root cause (v2): AST-based chunking addressed the real problem. No amount of model sophistication can fix bad chunks.
- Structure > sophistication: Respecting code structure beats fancy models every time.
- Iterate based on data: Without the eval framework, I would've kept throwing models at the problem instead of fixing chunking.
What I'd do differently:
- Jump to AST-based chunking sooner (but v0/v1 taught valuable lessons)
- Add the eval framework from day 1 (hard to improve what you don't measure)
- Consider tree-sitter earlier (AST parsing is a solved problem—don't reinvent it)
Try codetect v2.0.0
codetect is open source (MIT license) and built for developers who want fast, local-first code search for any MCP-compatible LLM.
If you're using Claude Code, Continue, or any MCP-compatible tool, give codetect a try. And if you find it useful, star the repo and share your feedback.
Happy coding.