
codetect v1: When Better Models Aren't Enough

Scaling challenges and the chunking revelation

This is part 2 of the codetect series.

The Scaling Wall

After shipping v0 in November (still called repo-search back then), I started using it daily on real projects. Small codebases (~500 files) worked great. But try it on a larger codebase—5,000+ files—and cracks appeared:

  • Search was slow. Scanning 10,000+ embeddings in SQLite meant 200-500ms queries. Not terrible, but noticeably laggy.
  • Indexing was slow. Embedding 5,000 files took 7+ minutes. Every code change meant waiting.
  • Semantic search quality was... fine? Sometimes great, often mediocre. Hard to tell if it was helping or hurting.

The obvious answer: better infrastructure, better models.

So that's what I built for v1.


What We Added in v1

1. PostgreSQL + pgvector + HNSW

SQLite is fantastic for small-scale vector search, but it doesn't have specialized indexing for high-dimensional vectors. PostgreSQL with pgvector does.

The upgrade:

  • Replaced SQLite with PostgreSQL + pgvector extension
  • Added HNSW (Hierarchical Navigable Small World) indexing
  • Result: 60x faster search on 10K+ vectors (500ms → 8ms)

HNSW is an approximate nearest neighbor algorithm that trades a tiny bit of accuracy for massive speed gains. For code search, "99% accurate in 8ms" beats "100% accurate in 500ms" every time.
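
For readers who haven't used pgvector, here's a minimal sketch of the idea in TypeScript with the pg client. The table and column names are illustrative, not codetect's actual schema; the dimension (768) assumes nomic-embed-text.

import { Client } from "pg";

// Illustrative schema only; codetect's real tables may differ.
async function setupVectorStore(client: Client) {
  await client.query("CREATE EXTENSION IF NOT EXISTS vector");
  await client.query(`
    CREATE TABLE IF NOT EXISTS chunks (
      id        BIGSERIAL PRIMARY KEY,
      file_path TEXT NOT NULL,
      content   TEXT NOT NULL,
      embedding vector(768)  -- must match the embedding model's dimension
    )`);
  // HNSW index: approximate nearest neighbors under cosine distance.
  await client.query(
    "CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks USING hnsw (embedding vector_cosine_ops)"
  );
}

// Top-k semantic search; <=> is pgvector's cosine-distance operator.
async function searchChunks(client: Client, queryEmbedding: number[], k = 10) {
  const { rows } = await client.query(
    "SELECT file_path, content FROM chunks ORDER BY embedding <=> $1::vector LIMIT $2",
    [JSON.stringify(queryEmbedding), k]
  );
  return rows;
}

Without the USING hnsw clause, Postgres falls back to an exact sequential scan over every embedding, which is essentially what the v0 SQLite setup was doing.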

2. Better Embedding Models (and a Performance Surprise)

v0 used nomic-embed-text (768 dimensions). Good for an MVP, but newer models promised better semantic understanding:

  • bge-m3: 1024 dimensions, optimized for retrieval tasks
  • mxbai-embed-large: 1024 dimensions, strong performance on code

I added support for multiple models and dimension sizes, letting users choose based on their hardware and accuracy needs.
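
As a rough sketch of what "choose your model" means in practice: the preset names below mirror the models listed above, but the config shape and the local Ollama-style endpoint are my assumptions, not something codetect necessarily exposes this way.

// Hypothetical presets; codetect's real config keys may differ.
const EMBEDDING_MODELS = {
  "nomic-embed-text":  { dimensions: 768 },   // fastest, the v0 default
  "bge-m3":            { dimensions: 1024 },  // retrieval-optimized, much slower locally
  "mxbai-embed-large": { dimensions: 1024 },  // strong on code
} as const;

type ModelName = keyof typeof EMBEDDING_MODELS;

// Assumes an Ollama-compatible server on localhost; swap in whatever hosts your model.
async function embed(model: ModelName, text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: text }),
  });
  const { embedding } = await res.json();
  return embedding;
}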

But there was a catch.

Better models meant slower embedding. Much slower. With nomic-embed-text, embedding a medium-sized codebase (~1,000 files) took 2-3 minutes. With bge-m3? Nearly 30 minutes.

The model was more accurate, but 10x slower for local embedding. For developers running codetect on their laptops, this was a dealbreaker.

The solution: Keep both options. Users with access to cloud servers or beefy hardware could use bge-m3 for better quality. Everyone else could stick with nomic-embed-text for speed. (Later, in v2, parallel embedding would solve this—but we're getting ahead of ourselves.)

This was my first hint that "better models" came with real tradeoffs. Quality vs. speed. Cloud vs. local. The best model isn't always the right choice.

3. Multi-Repo Database Architecture

v0 was single-repo: one database per project. For individuals, fine. For teams? Pain.

v1 introduced a centralized database schema:

  • Multiple repos in one database
  • Repo-scoped search queries
  • Shared infrastructure (PostgreSQL server for the whole team)

This meant one codetect server could index dozens of repos and serve searches across all of them.
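
Conceptually, the centralized schema looks something like this. This is a sketch, not codetect's literal DDL: a repos table, plus a repo_id on the chunks table so every query can be scoped.

// Illustrative DDL: the chunks table from the earlier sketch gains a repo_id.
const MULTI_REPO_SCHEMA = `
  CREATE TABLE repos (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    root TEXT NOT NULL
  );

  CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    repo_id   BIGINT NOT NULL REFERENCES repos(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(1024)
  );
`;

// Repo-scoped search: filter by repo before ranking by vector distance.
const SCOPED_SEARCH = `
  SELECT file_path, content
  FROM chunks
  WHERE repo_id = $1
  ORDER BY embedding <=> $2::vector
  LIMIT $3
`;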

4. Eval Framework

This was the most important addition, even if it wasn't user-facing.

I built a small eval framework:

  • 1,000 test queries across 10 open-source repos
  • Ground truth: manually verified "correct" results for each query
  • Metrics: retrieval accuracy, context completeness, function completeness

Now I could measure whether changes actually improved search quality. No more guessing.
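
The data shapes below are hypothetical (the post doesn't show the harness), but retrieval accuracy boils down to "did at least one ground-truth result land in the top k":

// Hypothetical shapes; the real eval harness isn't shown here.
interface EvalCase {
  query: string;
  repo: string;
  expectedPaths: string[]; // manually verified "correct" results
}

interface SearchHit { filePath: string; }

// Retrieval accuracy: fraction of queries where at least one
// ground-truth result appears in the top-k hits.
function retrievalAccuracy(
  cases: EvalCase[],
  runSearch: (c: EvalCase) => SearchHit[],
  k = 10
): number {
  let hits = 0;
  for (const c of cases) {
    const top = runSearch(c).slice(0, k);
    if (top.some(h => c.expectedPaths.includes(h.filePath))) hits++;
  }
  return hits / cases.length;
}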

5. A New Name

With all these improvements—multi-repo support, multiple embedding models, three distinct search modes—the name repo-search felt too generic.

The tool wasn't just searching repos. It was detecting code patterns through keyword search, symbol navigation, and semantic embeddings. Three complementary ways to find what you need.

So repo-search became codetect.


Performance Wins

The infrastructure upgrades delivered:

Metric                    | v0 (SQLite) | v1 (PostgreSQL + HNSW) | Improvement
Search time (1K vectors)  | 50ms        | 5ms                    | 10x faster
Search time (10K vectors) | 500ms       | 8ms                    | 60x faster
Multi-repo support        | No          | Yes                    | New in v1

Great! We'd solved the performance problem. Time to celebrate, right?


The Surprise: Better Models Didn't Help

Here's what I expected:

"bge-m3 is a better model than nomic-embed-text, so semantic search quality should improve significantly."

Here's what the eval framework showed:

Metric                 | v0 (nomic-embed-text) | v1 (bge-m3) | Change
Retrieval accuracy     | 60%                   | 65%         | +5%
Function completeness  | 40%                   | 40%         | No change
Context preservation   | Poor                  | Poor        | No change

A five-percentage-point improvement in retrieval accuracy. But function completeness—whether search results included full functions instead of fragments—didn't change at all.

Why?


The Revelation: Chunking Was the Bottleneck

Digging into the eval results, a pattern emerged:

~40% of search results were incomplete functions.

Example:

// Query: "find authentication middleware"
// Retrieved chunk (lines 463-975):

    const token = req.headers.authorization?.split(' ')[1];
    if (!token) {
      return res.status(401).json({ error: 'No token provided' });
    }
    // ... rest of function body
  }
}

This chunk is mid-function. No function signature. No context about what this code does. Just a body.

Why? Because line-based chunking doesn't respect function boundaries.

The function started at line 430. The chunk started at line 463. The embedding captured part of the function, but not the semantically meaningful part (the signature, parameters, return type).

And no amount of model sophistication could fix this. A better embedding model can't magically reconstruct context that was lost during chunking.


The Insight: We Were Treating Code Like Text

The problem wasn't the model. It was our assumptions.

Text documents (like articles or books) are mostly linear. Splitting by lines or paragraphs is reasonable. Context flows naturally.

Code is hierarchical. Functions, classes, modules. Splitting by lines ignores this structure.

Example: a 600-line file with 15 functions. Line-based chunking (512 lines per chunk) might produce:

  • Chunk 1: Functions 1-10 (complete)
  • Chunk 2: Functions 11-15, but function 11 starts in chunk 1 (split)

Now when you search for function 11, you get incomplete results. The signature is in chunk 1, the body is in chunk 2.
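
To make the failure mode concrete, this is roughly what a naive line-based chunker does (a sketch, not the literal v1 code):

// A minimal line-based chunker, the kind of splitting v1 relied on.
// It knows nothing about where functions begin or end.
function chunkByLines(source: string, linesPerChunk = 512): string[] {
  const lines = source.split("\n");
  const chunks: string[] = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    // A function straddling the boundary gets its signature in one
    // chunk and its body in the next.
    chunks.push(lines.slice(i, i + linesPerChunk).join("\n"));
  }
  return chunks;
}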

The realization: We needed to chunk by semantic units (functions, classes) instead of lines.


What We Learned from v1

  1. Measure what matters. Without the eval framework, I would've assumed better models = better results. The data revealed the real problem.
  2. Better models ≠ better results if input quality is bad. Garbage in, garbage out—even with state-of-the-art embeddings.
  3. Better models come with tradeoffs. bge-m3 was 10x slower than nomic-embed-text. Quality vs. speed. The "best" model depends on your constraints (local vs. cloud, time vs. accuracy).
  4. Scale and quality are different problems. PostgreSQL + HNSW solved the performance problem. But it didn't solve the quality problem.
  5. Structure matters more than sophistication. Respecting code structure (functions, classes) is more important than using the fanciest model.

Setting the Stage for v2

By early January, the path forward was clear:

We needed AST-based chunking.

Instead of splitting code by lines, we needed to:

  1. Parse code into an Abstract Syntax Tree (AST)
  2. Traverse the AST to identify semantic units (functions, classes, methods)
  3. Chunk by these semantic units instead of lines
  4. Embed complete functions, not arbitrary line ranges

This would ensure:

  • Search results are complete (full function signatures + bodies)
  • Embeddings capture semantic meaning (what a function does, not random fragments)
  • Context is preserved (no mid-function splits)

The tools existed: tree-sitter, a parser generator used by GitHub, Neovim, and others. Fast, incremental, supports 10+ languages.
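
A minimal sketch of the idea, using the Node tree-sitter bindings for JavaScript. The real v2 implementation is part 3's story; the node type names here come from the tree-sitter-javascript grammar.

import Parser from "tree-sitter";
import JavaScript from "tree-sitter-javascript";

// Sketch only: emit each function/class as one chunk instead of
// splitting on arbitrary line boundaries.
const SEMANTIC_NODE_TYPES = new Set([
  "function_declaration",
  "class_declaration",
  "method_definition",
]);

function chunkByAst(source: string): string[] {
  const parser = new Parser();
  parser.setLanguage(JavaScript);
  const tree = parser.parse(source);

  const chunks: string[] = [];
  const walk = (node: Parser.SyntaxNode) => {
    if (SEMANTIC_NODE_TYPES.has(node.type)) {
      chunks.push(node.text); // complete signature + body
      return;                 // don't descend into an already-captured unit
    }
    for (const child of node.namedChildren) walk(child);
  };
  walk(tree.rootNode);
  return chunks;
}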

And while we were at it? We'd solve the embedding performance problem too. Parallel embedding with configurable workers would make even bge-m3 usable for local development.

The questions for v2: How much would AST-based chunking improve quality? And could we make better models fast enough for local use?

That's the story of v2.


Try codetect v1

codetect v1 is available on GitHub.

If you want PostgreSQL-backed semantic search with multi-repo support, v1 is production-ready.

But if you want actually good code search that understands structure...

Next: Part 3 - From Line Chunks to AST-Based Understanding (v2)