This is part 3 of the codetect series:
- Part 1: Building an MCP Code Search Tool (v0)
- Part 2: When Better Models Aren't Enough (v1)
- Part 3: From Line Chunks to AST-Based Understanding (v2) ← You are here
- Part 4: When Every Improvement Made Things Worse
The Problem with Naive Chunking
Picture this: you're searching for authentication logic in your codebase. Your semantic search returns a chunk that starts mid-function, cutting off the function signature and half the context. The embedding captured some relevant keywords, but the chunk boundary destroyed the very structure that makes code comprehensible.
This was the reality of codetect v0 and v1. Despite increasingly powerful embedding models, the fundamental problem remained: line-based chunking doesn't understand code structure.
Here's what that looked like:
Chunk 1 (lines 1-512):

```javascript
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}

function processOrder(order) { // ← Function split here!
  const total = calculateTotal();
  const tax = total * 0.08;
// [CHUNK BOUNDARY - Context lost!]
```

Chunk 2 (lines 463-975, 50-line overlap):

```javascript
  const tax = total * 0.08; // ← Starts mid-function
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
No amount of model sophistication can fix chunks that split functions in half.
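The naive strategy is easy to pin down in code. Here's a minimal sketch of a v0/v1-style fixed-window chunker (an illustration, not codetect's actual implementation):

```python
def chunk_lines(lines, size=512, overlap=50):
    """Fixed-window chunker: slide a 512-line window with 50 lines of
    overlap, completely blind to function and class boundaries."""
    chunks, start, step = [], 0, size - overlap
    while start < len(lines):
        chunks.append(lines[start:start + size])
        start += step
    return chunks
```

Nothing in this loop knows where a function ends, so any function straddling a window boundary gets cut in half, overlap or not.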
Quick Context: The Journey to v2
codetect started in November 2025 when I noticed my Claude API costs climbing and coworkers saying Cursor felt faster than Claude Code. The key difference? Codebase indexing.
So I built an MCP-native code search tool that works with any LLM supporting the Model Context Protocol—not just one vendor. Local-first, no token costs, open source.
(For the full origin story, see Part 1: Building an MCP Code Search Tool)
The Evolution: v0 → v1 → v2
v0 (November 2025): MVP with line-based chunking (512 lines, 50-line overlap). Simple, shipped fast, validated the MCP-native approach. But functions got split awkwardly across chunks.
v1 (January 2026): Added PostgreSQL + pgvector (60x faster search at scale), better embedding models (bge-m3), multi-repo support, and an eval framework. The surprise? Better models barely improved quality. The eval revealed that roughly 60% of search results contained incomplete functions, because line-based chunking doesn't respect code structure.
Key insight from v1: Better models don't fix bad chunks. We needed to chunk by semantic units (functions, classes) instead of arbitrary lines.
v2: AST-Based Intelligence (February 2026)
Philosophy: Chunk code the way developers think about it.
This is where everything changed. Instead of treating code like text, v2 understands it:
- AST traversal: Parse code into syntax trees, chunk by semantic units (functions, classes, modules)
- Tree-sitter integration: Support for 10+ languages (Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Ruby, PHP)
- Merkle tree change detection: Sub-second incremental updates (detect what changed at chunk level)
- Content-addressed embedding cache: 95% cache hit rate on incremental updates (only re-embed changed chunks)
- Dimension-grouped tables: Multiple repos can use different models without conflicts
- Parallel embedding: 3.3x faster with configurable workers (`-j` flag)
Result: Code search that understands structure—though as we'd learn, precision and usefulness aren't always the same thing.
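The content-addressed cache in that list is worth a closer look. The idea can be sketched in a few lines: key each embedding by the hash of the chunk's text, so an unchanged chunk is a guaranteed cache hit (a minimal sketch; `embed_fn` is a stand-in for the model call, not codetect's real API):

```python
import hashlib

class EmbeddingCache:
    """Content-addressed embedding cache: unchanged chunks hash to the
    same key and are never re-embedded. Sketch only, assumed interface."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk: str):
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(chunk)
        return self.store[key]
```

Because the key is derived from content rather than file path or line number, moving a function within a file (or renaming the file) still hits the cache.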
The Breakthrough: AST-Based Chunking
The core innovation in v2 is simple but powerful: parse code before chunking it.
What is AST traversal?
An Abstract Syntax Tree (AST) represents code's grammatical structure. Instead of seeing code as lines of text, we see it as a tree of functions, classes, and statements.
Tree-sitter—a parser generator used by GitHub, Neovim, and others—makes this practical. It's fast, incremental, and supports 10+ languages out of the box.
Why it matters for code search:
When you chunk by functions instead of lines:
- Embeddings capture complete semantic units
- Search results include full function signatures and bodies
- Context is preserved (no mid-function splits)
- Smaller chunks mean faster search and better retrieval
Before (v0/v1): Arbitrary 512-line chunks with 50-line overlap.
After (v2): Each function, class, or module is its own chunk.
Chunk 1 (function: calculateTotal):

```javascript
// v2: Each function is a complete chunk
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}
```

Chunk 2 (function: processOrder):

```javascript
function processOrder(order) {
  const total = calculateTotal();
  const tax = total * 0.08;
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
Clean boundaries. Complete context. Better embeddings.
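To make the approach concrete, here's a dependency-free sketch using Python's stdlib `ast` module. codetect itself uses tree-sitter (so the same idea works across 10+ languages), but the core move is identical: parse first, then emit one chunk per semantic unit:

```python
import ast

def chunk_by_function(source: str):
    """Split Python source into one chunk per top-level function or
    class. Illustration of AST-based chunking; codetect uses
    tree-sitter rather than the stdlib ast module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # get_source_segment recovers the exact source span of the
            # node, so each chunk is a complete, compilable unit.
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

A chunk produced this way always starts at a signature and ends at the unit's closing line, which is exactly the property the fixed-window chunker could not guarantee.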
Performance Wins
The numbers tell the story:
Incremental Indexing Performance (v1 → v2)
| Repo Size | v1.x (line-based) | v2.0 (Merkle + AST) | Speedup |
|---|---|---|---|
| 100 files | 30s | 2s | 15x faster |
| 1,000 files | 5m | 20s | 15x faster |
| 5,000 files | 25m | 1m 40s | 15x faster |
Embedding Performance (v1 → v2)
| Operation | v1.13.0 | v2.0.0 (sequential) | v2.0.0 (-j 10) | Speedup |
|---|---|---|---|---|
| 100 files | 45s | 45s | 12s | 3.75x |
| 1,000 files | 7m 30s | 7m 30s | 2m 15s | 3.3x |
| 5,000 files | 37m 30s | 37m 30s | 11m 15s | 3.3x |
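The `-j` parallelism behind this table reduces to a worker pool over independent chunks. A minimal sketch (assumed shape, not codetect's implementation; `embed_fn` stands in for the model call):

```python
from concurrent.futures import ThreadPoolExecutor

def embed_all(chunks, embed_fn, workers=10):
    """Embed chunks concurrently with a bounded worker pool, mirroring
    `codetect embed -j 10`. pool.map preserves input order, so results
    line up with their chunks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed_fn, chunks))
```

Embedding is I/O-bound (waiting on the model server), which is why a thread pool is enough to get near-linear speedups until the embedding backend saturates.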
Search Quality (v0 → v1 → v2)
| Metric | v0 (line chunks) | v1 (better models) | v2 (AST chunks) |
|---|---|---|---|
| Retrieval accuracy (F1) | ~60% | ~63% | ~68% |
| Context preserved | Poor | Poor | Good |
| Function completeness | ~40% | ~40% | ~80% |
| Token overhead vs baseline | ~-10% | not measured | ~-6.5% |
Accuracy measured via eval framework with enriched context enabled. Token overhead measured against Claude Code's built-in tools as baseline.
Key takeaway: Merkle tree change detection (15x faster incremental indexing) + content-addressed caching (95% hit rate) + parallel embedding (3.3x faster) = a tool that actually keeps up with your development workflow. But as we'd soon discover, these precision improvements came with tradeoffs for agent consumption (Part 4).
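The Merkle-tree half of that equation can be sketched as a recursive hash over the repo tree (an illustration under an assumed structure, not codetect's schema):

```python
import hashlib

def sha(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_hash(node) -> str:
    """node is either file contents (str) or a dict mapping child names
    to nodes. A directory's hash depends only on its children's hashes,
    so an unchanged subtree yields an identical hash and everything
    under it can be skipped without reading a single file."""
    if isinstance(node, str):
        return sha(node)
    return sha("\n".join(
        name + ":" + merkle_hash(child)
        for name, child in sorted(node.items())
    ))
```

Comparing the old and new root hashes answers "did anything change?" in one comparison; descending only into subtrees whose hashes differ localizes the change, which is where the sub-second incremental updates come from.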
Multi-Repo Architecture
v2 introduces dimension-grouped embedding tables, enabling a critical capability: multiple repositories can use different embedding models without conflicts.
Why this matters:
- Organizations have diverse codebases (Python microservices, Go services, JavaScript frontends)
- Different languages benefit from different embedding models
- Teams want centralized search infrastructure without forcing model uniformity
How it works:
- Embeddings are stored in tables grouped by dimension (e.g., `embeddings_768`, `embeddings_1024`)
- Each repo tracks its embedding model in metadata
- Search queries automatically route to the correct table
- Migration from v1 is automatic—the first index run detects and upgrades your schema
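The routing step boils down to a tiny lookup. A sketch (the `embedding_dim` metadata field name is an assumption for illustration; only the `embeddings_<dim>` table naming comes from the doc):

```python
def table_for_dimension(dim: int) -> str:
    """Dimension-grouped table name, e.g. embeddings_768."""
    return f"embeddings_{dim}"

def route_query(repo_metadata: dict) -> str:
    """Each repo records its model's output dimension in metadata;
    a search routes to the matching table, so repos using different
    models never collide."""
    return table_for_dimension(repo_metadata["embedding_dim"])
```

Grouping by dimension rather than by model also means two different models with the same output size can share a table, keeping the schema small.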
Deployment options:
- Local SQLite: Perfect for individual developers
- Shared PostgreSQL: Team-wide search infrastructure
- litellm adapter: Optional cloud LLM integration for embedding generation
Developer Experience
Performance is only half the story. v2 includes UX improvements that make it feel like a mature tool:
- Zero breaking changes: v1.x indexes auto-upgrade on first v2 run
- Automatic dimension migration: Switching embedding models "just works"
- Short flag aliases: `-f` for `--force`, `-j` for `--parallel` (Unix-style UX)
- Config preservation: Reinstalls no longer overwrite user settings
- Better error messages: Clearer diagnostics when something goes wrong
- Model selection in eval runner: Choose `sonnet`, `haiku`, or `opus` with cost-aware defaults
Getting Started
```shell
# Install codetect v2.0.0
git clone https://github.com/brian-lai/codetect.git
cd codetect
./install.sh

# In your project
cd /path/to/your/project
codetect init

# v2.0: Use AST-based indexer with parallel embedding
codetect index --v2   # AST chunking + Merkle tree
codetect embed -j 10  # Parallel embedding (10 workers)

# Start Claude Code (or any MCP-compatible LLM)
claude
```
That's it. codetect runs as an MCP server in the background, providing semantic search to your LLM tool.
Documentation:
What's Next
v2 shipped with real infrastructure wins: 15x faster incremental indexing, 3.3x faster embedding, and precise function-level chunks. The engineering was solid.
But shipping is just the beginning. After running v2 in daily development for a week, I started noticing something unexpected: the agent was making more searches to get the same answers it used to find in one shot. The precise chunks that looked great on eval metrics were missing something that the old sloppy line-based chunks provided for free.
That's the story of Part 4: When Every Improvement Made Things Worse.
Interested in contributing? Check out the GitHub issues.
Lessons So Far: v0 → v2 in 3 Months
Building codetect from zero to v2 in three months taught me a few things:
- Start simple (v0): Line-based chunking got something working fast. Perfect is the enemy of shipped.
- Measure what matters (v1): Adding better models didn't improve results. The eval framework (added in v1.x) revealed chunking as the bottleneck.
- Fix the root cause (v2): AST-based chunking addressed function completeness. But as I'd soon learn, the "root cause" was more nuanced than I thought.
- Structure matters: Respecting code structure improves precision—but precision isn't the only thing that matters for agent workflows.
- Iterate based on data: Without the eval framework, I would've kept throwing models at the problem instead of fixing chunking. But you also need to measure the right thing.
What I'd do differently:
- Add the eval framework from day 1 (hard to improve what you don't measure)
- Measure end-to-end agent performance (tokens-to-answer), not just retrieval precision
- Consider tree-sitter earlier (AST parsing is a solved problem—don't reinvent it)
Try codetect v2.0.0
codetect is open source (MIT license) and built for developers who want fast, local-first code search for any MCP-compatible LLM.
Links:
If you're using Claude Code, Continue, or any MCP-compatible tool, give codetect a try. And if you find it useful, star the repo and share your feedback.
Happy coding.