This is part 3 of the codetect series:
- Part 1: Building an MCP Code Search Tool (v0)
- Part 2: When Better Models Aren't Enough (v1)
- Part 3: From Line Chunks to AST-Based Understanding (v2) ← You are here
The Problem with Naive Chunking
Picture this: you're searching for authentication logic in your codebase. Your semantic search returns a chunk that starts mid-function, cutting off the function signature and half the context. The embedding captured some relevant keywords, but the chunk boundary destroyed the very structure that makes code comprehensible.
This was the reality of codetect v0 and v1. Despite increasingly powerful embedding models, the fundamental problem remained: line-based chunking doesn't understand code structure.
Here's what that looked like:
Chunk 1 (lines 1-512):

```javascript
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}

function processOrder(order) { // ← Function split here!
  const total = calculateTotal();
  const tax = total * 0.08;
// [CHUNK BOUNDARY - Context lost!]
```

Chunk 2 (lines 463-975, 50-line overlap):

```javascript
  const tax = total * 0.08; // ← Starts mid-function
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
No amount of model sophistication can fix chunks that split functions in half.
Quick Context: The Journey to v2
codetect started in November 2025 when I noticed my Claude API costs climbing and coworkers saying Cursor felt faster than Claude Code. The key difference? Codebase indexing.
So I built an MCP-native code search tool that works with any LLM supporting the Model Context Protocol—not just one vendor. Local-first, no token costs, open source.
(For the full origin story, see Part 1: Building an MCP Code Search Tool)
The Evolution: v0 → v1 → v2
v0 (November 2025): MVP with line-based chunking (512 lines, 50-line overlap). Simple, shipped fast, validated the MCP-native approach. But functions got split awkwardly across chunks.
v1 (January 2026): Added PostgreSQL + pgvector (60x faster search), better embedding models (bge-m3), multi-repo support, and an eval framework. The surprise? Better models barely improved quality. The eval revealed that ~40% of search results were incomplete functions, because line-based chunking doesn't respect code structure.
Key insight from v1: Better models don't fix bad chunks. We needed to chunk by semantic units (functions, classes) instead of arbitrary lines.
v2: AST-Based Intelligence (February 2026)
Philosophy: Chunk code the way developers think about it.
This is where everything changed. Instead of treating code like text, v2 understands it:
- AST traversal: Parse code into syntax trees, chunk by semantic units (functions, classes, modules)
- Tree-sitter integration: Support for 10+ languages (Go, Python, JavaScript, TypeScript, Rust, Java, C, C++, Ruby, PHP)
- Merkle tree change detection: Sub-second incremental updates (detect what changed at chunk level)
- Content-addressed embedding cache: 95% cache hit rate on incremental updates (only re-embed changed chunks)
- Dimension-grouped tables: Multiple repos can use different models without conflicts
- Parallel embedding: 3.3x faster with configurable workers (`-j` flag)
Result: Actually good code search that understands structure.
The Breakthrough: AST-Based Chunking
The core innovation in v2 is simple but powerful: parse code before chunking it.
What is AST traversal?
An Abstract Syntax Tree (AST) represents code's grammatical structure. Instead of seeing code as lines of text, we see it as a tree of functions, classes, and statements.
Tree-sitter—a parser generator used by GitHub, Neovim, and others—makes this practical. It's fast, incremental, and supports 10+ languages out of the box.
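To make the idea concrete, here's a minimal sketch of AST-based chunking. codetect does this with tree-sitter so it works across 10+ languages; the sketch below uses Python's built-in `ast` module instead, purely to stay dependency-free, but the principle is the same: parse first, then emit one chunk per top-level function or class.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split a Python file into one chunk per top-level function or class.

    Illustrative only: codetect uses tree-sitter for multi-language support;
    the stdlib `ast` module is used here to keep the sketch self-contained.
    """
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,   # FunctionDef, ClassDef, ...
                "start_line": node.lineno,
                "end_line": node.end_lineno,   # Python 3.8+
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk carries its own complete signature and body, which is exactly what the embedding model needs to see.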
Why it matters for code search:
When you chunk by functions instead of lines:
- Embeddings capture complete semantic units
- Search results include full function signatures and bodies
- Context is preserved (no mid-function splits)
- Smaller chunks mean faster search and better retrieval
Before (v0/v1): Arbitrary 512-line chunks with 50-line overlap.
After (v2): Each function, class, or module is its own chunk.
In v2, each function is a complete chunk:

Chunk 1 (function: calculateTotal):

```javascript
function calculateTotal() {
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    sum += items[i].price;
  }
  return sum;
}
```

Chunk 2 (function: processOrder):

```javascript
function processOrder(order) {
  const total = calculateTotal();
  const tax = total * 0.08;
  return {
    subtotal: total,
    tax: tax,
    total: total + tax
  };
}
```
Clean boundaries. Complete context. Better embeddings.
Performance Wins
The numbers tell the story:
Incremental Indexing Performance (v1 → v2)
| Repo Size | v1.x (line-based) | v2.0 (Merkle + AST) | Speedup |
|---|---|---|---|
| 100 files | 30s | 2s | 15x faster |
| 1,000 files | 5m | 20s | 15x faster |
| 5,000 files | 25m | 1m 40s | 15x faster |
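The incremental speedup comes from the Merkle tree: every file gets a content hash, and every directory gets a hash of its children's hashes, so an unchanged subtree can be recognized from a single comparison. Here's a rough sketch of that structure (files only, to keep it short; codetect tracks changes down to the chunk level):

```python
import hashlib
from pathlib import Path

def build_merkle(path: Path, hashes: dict[str, str]) -> str:
    """Hash a file by its content, a directory by its children's hashes."""
    h = hashlib.sha256()
    if path.is_file():
        h.update(path.read_bytes())
    else:
        for child in sorted(p for p in path.iterdir() if p.name != ".git"):
            h.update(child.name.encode())
            h.update(build_merkle(child, hashes).encode())
    digest = h.hexdigest()
    hashes[str(path)] = digest
    return digest

def changed_files(root: Path, previous: dict[str, str]) -> list[str]:
    """Compare the fresh tree against the last indexed one; any path whose
    hash matches can be skipped without re-chunking or re-embedding."""
    current: dict[str, str] = {}
    build_merkle(root, current)
    return [p for p, d in current.items()
            if Path(p).is_file() and previous.get(p) != d]
```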
Embedding Performance (v1 → v2)
| Repo Size | v1.13.0 | v2.0.0 (sequential) | v2.0.0 (`-j 10`) | Speedup |
|---|---|---|---|---|
| 100 files | 45s | 45s | 12s | 3.75x |
| 1,000 files | 7m 30s | 7m 30s | 2m 15s | 3.3x |
| 5,000 files | 37m 30s | 37m 30s | 11m 15s | 3.3x |
Search Quality (v0 → v1 → v2)
| Metric | v0 (line chunks) | v1 (better models) | v2 (AST chunks) |
|---|---|---|---|
| Retrieval accuracy | 60% | 65% | 85% |
| Context preserved | Poor | Poor | Excellent |
| Function completeness | 40% | 40% | 95% |
Tested on a 1,000-query eval suite across 10 open-source repos
Key takeaway: Merkle tree change detection (15x faster incremental indexing) + content-addressed caching (95% hit rate) + parallel embedding (3.3x faster) = a tool that actually keeps up with your development workflow.
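A minimal sketch of the other two pieces, assuming a generic `embed_fn` callable (the function names and cache layout here are illustrative, not codetect's actual internals): each chunk's cache key is a hash of its content plus the model name, so only cache misses are sent to the embedding model, and those misses go through a worker pool much like `codetect embed -j 10`.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def chunk_key(chunk_text: str, model: str) -> str:
    """Content-addressed cache key: same chunk + same model => same key."""
    return hashlib.sha256(f"{model}\n{chunk_text}".encode()).hexdigest()

def embed_chunks(chunks, model, cache, embed_fn, workers=10):
    """Embed only the chunks whose content hash isn't cached, in parallel.

    `embed_fn(text) -> list[float]` stands in for whatever embedding
    backend is configured (local model, litellm adapter, etc.).
    """
    keys = [chunk_key(c, model) for c in chunks]
    misses = [(k, c) for k, c in zip(keys, chunks) if k not in cache]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = [c for _, c in misses]
        for (k, _), vector in zip(misses, pool.map(embed_fn, texts)):
            cache[k] = vector
    return [cache[k] for k in keys]
```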
Multi-Repo Architecture
v2 introduces dimension-grouped embedding tables, enabling a critical capability: multiple repositories can use different embedding models without conflicts.
Why this matters:
- Organizations have diverse codebases (Python microservices, Go services, JavaScript frontends)
- Different languages benefit from different embedding models
- Teams want centralized search infrastructure without forcing model uniformity
How it works:
- Embeddings are stored in tables grouped by dimension (e.g., `embeddings_768`, `embeddings_1024`)
- Each repo tracks its embedding model in metadata
- Search queries automatically route to the correct table
- Migration from v1 is automatic—the first index run detects and upgrades your schema
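As a sketch of how that routing could look (hypothetical table and column names, assuming psycopg 3 against a pgvector-enabled PostgreSQL; not codetect's actual schema):

```python
# Hypothetical illustration of dimension-grouped routing, not codetect's schema.
MODEL_DIMENSIONS = {"bge-m3": 1024, "nomic-embed-text": 768}

def search_repo(conn, query_vector: list[float], model: str, limit: int = 10):
    """Pick the embeddings table matching the model's dimension, then run a
    nearest-neighbour query with pgvector's cosine-distance operator."""
    table = f"embeddings_{MODEL_DIMENSIONS[model]}"   # e.g. embeddings_1024
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    sql = (
        f"SELECT chunk_id, content, embedding <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT %s"
    )
    with conn.cursor() as cur:                        # psycopg 3 connection
        cur.execute(sql, (vector_literal, limit))
        return cur.fetchall()
```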
Deployment options:
- Local SQLite: Perfect for individual developers
- Shared PostgreSQL: Team-wide search infrastructure
- litellm adapter: Optional cloud LLM integration for embedding generation
Developer Experience
Performance is only half the story. v2 includes UX improvements that make it feel like a mature tool:
- Zero breaking changes: v1.x indexes auto-upgrade on first v2 run
- Automatic dimension migration: Switching embedding models "just works"
- Short flag aliases: `-f` for `--force`, `-j` for `--parallel` (Unix-style UX)
- Config preservation: Reinstalls no longer overwrite user settings
- Better error messages: Clearer diagnostics when something goes wrong
- Model selection in eval runner: Choose `sonnet`, `haiku`, or `opus` with cost-aware defaults
Getting Started
```bash
# Install codetect v2.0.0
git clone https://github.com/brian-lai/codetect.git
cd codetect
./install.sh

# In your project
cd /path/to/your/project
codetect init

# v2.0: Use AST-based indexer with parallel embedding
codetect index --v2   # AST chunking + Merkle tree
codetect embed -j 10  # Parallel embedding (10 workers)

# Start Claude Code (or any MCP-compatible LLM)
claude
```
That's it. codetect runs as an MCP server in the background, providing semantic search to your LLM tool.
What's Next: v3.0 Roadmap
v2 laid the foundation. Here's where we're headed:
- LSP integration: Real-time indexing as you type (no manual `codetect index` needed)
- Graph-based navigation: Call graphs, dependency graphs, semantic relationships
- Distributed indexing: Horizontal scaling for monorepos (think Google-scale codebases)
- Smart chunking strategies: Language-specific optimizations (e.g., treating React components differently than utility functions)
Interested in contributing? Check out the GitHub issues.
Lessons Learned: v0 → v2 in 3 Months
Building codetect from zero to v2 in three months taught me a few things:
- Start simple (v0): Line-based chunking got something working fast. Perfect is the enemy of shipped.
- Measure what matters (v1): Adding better models didn't improve results. The eval framework (added in v1.x) revealed chunking as the bottleneck.
- Fix the root cause (v2): AST-based chunking addressed the real problem. No amount of model sophistication can fix bad chunks.
- Structure > sophistication: Respecting code structure beats fancy models every time.
- Iterate based on data: Without the eval framework, I would've kept throwing models at the problem instead of fixing chunking.
What I'd do differently:
- Jump to AST-based chunking sooner (but v0/v1 taught valuable lessons)
- Add the eval framework from day 1 (hard to improve what you don't measure)
- Consider tree-sitter earlier (AST parsing is a solved problem—don't reinvent it)
Try codetect v2.0.0
codetect is open source (MIT license) and built for developers who want fast, local-first code search for any MCP-compatible LLM.
If you're using Claude Code, Continue, or any MCP-compatible tool, give codetect a try. And if you find it useful, star the repo and share your feedback.
Happy coding.