
When Every Improvement Made Things Worse

How optimizing for engineering elegance destroyed codetect's performance

I built codetect—a local MCP server that indexes codebases with embeddings so Claude Code can search semantically instead of grepping and reading whole files. Version 1 used the simplest possible stack: ctags for symbols, naive line-based chunking with overlap (512 lines, 50-line overlap), nomic-embed-text for embeddings, SQLite for storage. It showed a measurable 10-20% token efficiency improvement over Claude Code's built-in tools.
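
For the curious, that v1 chunker fits in about a dozen lines. A minimal sketch of the idea in Python (illustrative names, ID format, and return shape, not codetect's actual implementation):

# Sketch of v1-style chunking: fixed-size line windows with overlap.
# CHUNK_LINES and OVERLAP match the values above; everything else
# (names, chunk ID format, return shape) is illustrative.
CHUNK_LINES = 512
OVERLAP = 50

def chunk_file(path: str) -> list[dict]:
    lines = open(path, encoding="utf-8").read().splitlines()
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + CHUNK_LINES, len(lines))
        chunks.append({
            "id": f"{path}:{start + 1}:{end}",   # 1-based line range
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
        start = end - OVERLAP                    # step back to create the overlap
    return chunks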

Then I made it better.

AST-based chunking for precise function-level boundaries. bge-m3 for higher-quality embeddings. PostgreSQL with pgvector for scalable vector search. Reciprocal Rank Fusion to combine multiple search signals. Cross-encoder reranking for precision. Seven MCP tools instead of four.

Performance got worse. Not marginally—I was now performing below the baseline of not using codetect at all.

Here's what happened.


Precise Chunking Removed the Context That Made Results Useful

The old chunker split files into 512-line blocks with 50 lines of overlap. This was sloppy—chunks would contain half a function, or a struct definition jammed together with an unrelated helper.

But from the LLM's perspective, this sloppiness was a feature.

A chunk would often contain a function and its caller, or a type and the method that constructs it. When the agent searched for "how does auth work," the overlapping chunks provided enough surrounding context to answer the question in one shot.

AST chunking surgically removed this.

Each function became its own perfectly bounded chunk. Imports got split into separate "gap" chunks. The connective tissue between related pieces of code—the thing that made search results actionable—was gone.

Results became more precise and less useful.

// v1 (line-based chunking):
// Chunk includes auth middleware + calling function
package auth

import "net/http"

func AuthMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        token := r.Header.Get("Authorization")
        if token == "" {
            http.Error(w, "unauthorized", 401)
            return
        }
        // validate token...
        next.ServeHTTP(w, r)
    })
}

func HandleLogin(w http.ResponseWriter, r *http.Request) {  // ← Context!
    // Uses AuthMiddleware ↑
}

// v2 attempt (AST chunking):
// Chunk 1: Just AuthMiddleware (no context about usage)
// Chunk 2: Just HandleLogin (no context about middleware)
// Chunk 3: Gap chunk with imports (useless noise)

When an agent asked "how does authentication work," the old v1 returned one chunk with both the middleware definition and its usage. My new AST-based v2 returned isolated functions that required multiple follow-up queries to understand.


The Bigger Model Added Latency Without Meaningful Quality Gains

bge-m3 is a 567M parameter model, roughly 4x larger than nomic-embed-text. Every search query embeds at request time, so this directly increases latency.

The quality gains come from training on MS MARCO and other natural-language benchmarks, and they don't translate well to code search. Agent queries tend to be fairly literal: "error handling in auth module," "database connection pooling."

The smaller model was good enough, and the extra capacity was wasted on latency.

Embedding Model        | Parameters | Query Latency | Quality Gain
nomic-embed-text (v1)  | 137M       | ~50ms         | baseline
bge-m3 (v2 attempt)    | 567M       | ~200ms        | +2-3%

I added 150ms of latency per query for a 2-3% quality improvement on code that agents could already find with literal keyword matches.
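
This kind of per-query number is easy to reproduce. A measurement sketch, assuming the embedder is served by a local Ollama instance (one common way to run nomic-embed-text locally; codetect's actual embedding setup may differ):

# Rough per-query embedding latency against a local Ollama server.
# Assumes the model has been pulled and Ollama is on its default port;
# this is a measurement sketch, not codetect's embedding code.
import time
import requests

def embed_latency_ms(model: str, query: str, runs: int = 20) -> float:
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        resp = requests.post(
            "http://localhost:11434/api/embeddings",
            json={"model": model, "prompt": query},
            timeout=30,
        )
        resp.raise_for_status()
        timings.append((time.perf_counter() - t0) * 1000)
    return sum(timings) / len(timings)

print(embed_latency_ms("nomic-embed-text", "error handling in auth module"))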


PostgreSQL Solved a Problem We Didn't Have

SQLite brute-force vector search is O(n), but for a typical codebase of 5-20k chunks, "O(n) dot products in memory" takes single-digit milliseconds.
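
Concretely, the whole v1-style search path is a table scan plus one matrix-vector product. A sketch of the shape, assuming embeddings are stored as float32 BLOBs in SQLite (illustrative schema and names, not codetect's actual ones):

# Brute-force semantic search: load every chunk embedding into memory
# once, then score a query with a single matrix-vector product.
# Table name, column names, and ID format are illustrative.
import sqlite3
import numpy as np

def load_index(db_path: str):
    rows = sqlite3.connect(db_path).execute(
        "SELECT id, embedding FROM chunks"
    ).fetchall()
    ids = [row[0] for row in rows]
    vecs = np.stack([np.frombuffer(row[1], dtype=np.float32) for row in rows])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # pre-normalize once
    return ids, vecs

def search(ids, vecs, query_vec, top_k=10):
    q = query_vec / np.linalg.norm(query_vec)
    scores = vecs @ q                                     # cosine similarity
    best = np.argsort(-scores)[:top_k]
    return [(ids[i], float(scores[i])) for i in best]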

PostgreSQL added:

  • TCP connection overhead
  • Query planning
  • HNSW index maintenance
  • Docker as a dependency

All to serve a workload that fits comfortably in a single SQLite file opened in microseconds.

Worse, HNSW is an approximate nearest neighbor algorithm. It can miss results that brute-force would have found. We traded guaranteed correctness for scalability we didn't need.

Storage                        | Search Time (10K chunks) | Dependencies     | Accuracy
SQLite (v1)                    | 8ms                      | None             | 100%
PostgreSQL + HNSW (v2 attempt) | 12ms                     | Docker, pgvector | ~98%

I added complexity, dependencies, and approximate results to make 8ms queries run in 12ms.


RRF Fusion Had a Fatal Bug Hiding in Plain Sight

Reciprocal Rank Fusion (RRF) boosts results that appear in multiple search signals. It's the theoretical justification for combining keyword and semantic search.

But my keyword results had IDs like auth.go:47 (single line), while semantic results had IDs like auth.go:40:60 (chunk range).

The same function, found by both signals, got two different IDs. RRF never recognized them as the same result. The fusion boost—the entire point of RRF—essentially never fired.

What I actually had was weighted interleaving with keyword results suppressed to 0.3 weight. This was strictly worse than the v1 approach of simple concatenation.

# What I thought RRF was doing:
# Function "authenticate" found by both keyword + semantic
# → Boosted to top of results

# What actually happened:
keyword_results = [
    {"id": "auth.go:47", "score": 0.9},  # Line 47
]
semantic_results = [
    {"id": "auth.go:40:60", "score": 0.85},  # Lines 40-60
]

# RRF treats these as DIFFERENT results
# → No fusion boost
# → keyword result gets 0.3 weight (suppressed)
# → Worse than v1's simple concat

I spent weeks getting the RRF math right, but the ID mismatch meant the fusion it exists to provide never fired.
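
For reference, RRF itself is tiny: each list contributes 1/(k + rank) per result ID, and contributions are summed. A minimal implementation (illustrative, not codetect's code) makes the failure obvious, because unless both signals emit the same ID for the same chunk, no result ever collects more than one term:

# Minimal Reciprocal Rank Fusion. The bug was never in this function --
# it was in the IDs being fed to it. k=60 is the conventional constant;
# names are illustrative.
from collections import defaultdict

def rrf(result_lists, k=60):
    scores = defaultdict(float)
    for results in result_lists:
        for rank, result_id in enumerate(results, start=1):
            scores[result_id] += 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# With mismatched IDs, the "same" function never accumulates two terms:
print(rrf([["auth.go:47"], ["auth.go:40:60"]]))
# {'auth.go:47': 0.0164, 'auth.go:40:60': 0.0164}  -- no fusion boost.
# The fix: canonicalize both signals to the same chunk ID (e.g. map a
# keyword hit's line number to the chunk containing it) before fusing.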


More Tools Meant More Tokens Spent Deciding

Seven MCP tools—several overlapping—meant the agent spent tokens on every search just choosing which tool to use.

Should it call search_keyword, search_semantic, hybrid_search, or hybrid_search_v2?

The agent doesn't know, so it reasons about it, picks one, sometimes picks wrong, and tries another. This decision overhead is pure waste that compounds across a session.

Version    | MCP Tools | Tokens per Search Decision
v1         | 4         | ~50
v2 attempt | 7         | ~120

Over a 20-search session, that's 1,400 extra tokens just deciding which tool to use—tokens that could have been spent on actual reasoning.


Per-Request Initialization Added Invisible Overhead

Every search call opened a database, instantiated caches, created index structures, initialized an embedder, ran the query, and tore everything down.

The MCP server is long-lived—these should have been initialized once at startup.

With the v2 stack's added complexity, this fixed cost per call grew from negligible to noticeable:

Version    | Init Overhead | Impact on 10 Searches
v1         | ~5ms          | 50ms
v2 attempt | ~80ms         | 800ms

Nearly a full second wasted across 10 queries just reinitializing the same resources.
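
The fix is boring: build the expensive pieces once when the MCP server starts and reuse them for every request. A sketch of the shape, reusing the load_index and search sketches from the SQLite section; the embedder helper is hypothetical:

# Server-lifetime initialization: build the expensive objects once when
# the process starts, then reuse them for every tool call.
from functools import lru_cache

@lru_cache(maxsize=1)
def get_search_stack():
    """Runs once per server process, not once per request."""
    ids, vecs = load_index("index.db")   # embeddings into memory, once
    embedder = load_embedder()           # hypothetical: warm the model once
    return ids, vecs, embedder

def handle_search(query: str, top_k: int = 10):
    ids, vecs, embedder = get_search_stack()   # cached after the first call
    return search(ids, vecs, embedder.embed(query), top_k)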


The Meta-Lesson

Every change was individually defensible:

  • AST chunking is more correct
  • bge-m3 scores higher on benchmarks
  • PostgreSQL is more scalable
  • RRF is theoretically superior
  • Content-addressed caching is architecturally cleaner

But the metric that matters is total tokens consumed to go from question to correct answer.

None of the improvements were evaluated against that metric. I optimized for engineering elegance—precision, scalability, correctness—when I should have optimized for the actual consumer: an LLM that benefits from over-fetching, is harmed by latency, and loses tokens to every unnecessary decision point.

Version 1 worked because it was fast, dumb, and returned too much context rather than too little.

The upgrades made it slow, smart, and precise—exactly wrong for agent consumption.


What We Should Have Measured

The right metric was staring us in the face the whole time:

Tokens-to-answer: Total tokens consumed (including failed searches, retries, and follow-ups) to successfully complete a task.
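
Measuring it needs nothing fancy: log every tool call and model turn for a task and sum the tokens. A sketch, with a hypothetical event schema:

# Tokens-to-answer: every token spent between asking a question and
# confirming a correct answer -- failed searches, retries, and follow-ups
# included. The event schema here is hypothetical.
def tokens_to_answer(events: list[dict]) -> int:
    return sum(e["input_tokens"] + e["output_tokens"] for e in events)

# e.g. the "Find auth middleware" task below: ~3.2K under v1, ~8.1K under
# the v2 attempt.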

Here's what actually happened in production:

Task                                  | v1                      | v2 attempt               | Change
"Find auth middleware"                | 1 search, 3.2K tokens   | 3 searches, 8.1K tokens  | +153%
"How does DB connection pooling work" | 2 searches, 5.8K tokens | 5 searches, 14.2K tokens | +145%
"Fix error handling in API routes"    | 3 searches, 9.1K tokens | 7 searches, 22.3K tokens | +145%

Every "improvement" increased the token cost by roughly 2.5x.

v1's sloppy chunks meant agents got their answers in one shot. My v2 attempt's precise chunks forced agents to make multiple queries to reconstruct the context that v1 provided automatically.


Where We Are Now: v3 and the Path to v4

After v2's failures became clear, I doubled down in the wrong direction.

v3 (beta): Instead of reverting, I pushed forward. v3 removed all v1 code—the backwards compatibility, the simple approaches that actually worked. The idea was architectural purity: one clean AST-based system without the legacy cruft.

But removing v1 meant removing the escape hatch. v3 is easier to maintain and more elegant to reason about, yet it inherited all of v2's problems: precise chunks that destroy context, and design decisions optimized for engineering elegance instead of agent consumption.

The realization: I've been optimizing the wrong metric this entire time.

What actually matters isn't chunking precision, model quality, or architectural cleanliness. It's tokens-to-answer—how many tokens does an agent burn to go from question to correct, actionable answer?

v1 won on this metric because it was sloppy in exactly the right ways. Overlapping chunks meant over-fetching context. Line-based boundaries meant functions came with their callers. The "mess" was actually signal.

The path forward: v4

I'm now considering a v4 that takes v2 and v3's hard-won lessons and applies them to v1's philosophy:

  • Generous overlap - Not 50 lines. Maybe 100-150. Over-fetch context by default.
  • Line-based chunking - But smarter: expand chunk boundaries to include complete function signatures (see the sketch after this list)
  • Fast embeddings - nomic-embed-text or similar. Speed > precision for agent workflows
  • SQLite - Zero dependencies, brute-force search is fast enough
  • Three tools, not seven - keyword, semantic, hybrid. Clear choices, less decision overhead
  • Server-lifetime initialization - No per-request overhead
  • Measure tokens-to-answer - The only KPI that actually matters for agent consumption
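
The boundary expansion mentioned above could be as simple as a cut-point nudge: never end a chunk while inside an open brace. A speculative sketch, since v4 doesn't exist yet (braces-only heuristic, Go-style code assumed, names illustrative):

# Speculative v4 heuristic: keep line-based chunks, but push a proposed
# cut point forward until brace depth returns to zero, so no function is
# split mid-body. Naive about braces inside strings and comments.
def expand_boundary(lines: list[str], end: int, max_extra: int = 80) -> int:
    depth = sum(line.count("{") - line.count("}") for line in lines[:end])
    limit = min(len(lines), end + max_extra)
    while depth > 0 and end < limit:
        depth += lines[end].count("{") - lines[end].count("}")
        end += 1
    return end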

The goal: combine v1's agent-friendly "sloppiness" with v2's performance optimizations and v3's clean architecture—but only the parts that reduce tokens-to-answer.

If it doesn't help agents find answers faster with fewer tokens, it doesn't ship.


Lessons for Building LLM Tools

1. Optimize for the consumer, not the abstraction

LLMs benefit from redundant context. Over-fetching is a feature, not a bug. Precision is overrated when the consumer can filter noise better than you can provide signal.

2. Latency compounds in agent workflows

A human might tolerate 200ms. An agent making 20 queries in a session loses 3 seconds to latency alone—3 seconds where no reasoning happens.

3. Measure tokens-to-answer, not search quality

Traditional IR metrics (precision, recall, MRR) don't capture agent workflows. An 85% precision search that requires 3 follow-up queries is worse than a 60% precision search that over-fetches context.

4. Simplicity scales better than sophistication

Seven overlapping tools create decision paralysis. Three clear tools with obvious use cases keep agents focused on reasoning, not tool selection.

5. Infrastructure for scale you don't have is overhead

PostgreSQL is great at 100K+ chunks. At 10K chunks, it's slower and harder to deploy than SQLite. Build for the scale you have, not the scale you imagine.

6. Every abstraction has a tax

Content-addressed caching, dimension-grouped tables, cross-encoder reranking—each adds initialization cost, decision complexity, and failure modes. The tax is worth it only if the benefit is measurable.


Final Thoughts

This wasn't a story about bad engineering. The v2 implementation was correct, well-tested, and architecturally sound.

It was a story about optimizing for the wrong thing.

I optimized for engineering values—precision, scalability, correctness—when I should have optimized for agent values—context, speed, simplicity.

The hardest part of building tools for LLMs isn't the infrastructure. It's remembering that the consumer isn't human.

Humans want precision. Agents want context.

Humans tolerate latency. Agents compound it.

Humans appreciate choices. Agents waste tokens deciding.

Version 1 worked because it was built for what agents actually need: fast, dumb, and generous with context.

That's the version I should have kept improving. v2 and v3 were architectural exercises. v4 will be a return to first principles—optimizing for the only metric that matters: helping agents find answers with fewer tokens.

Sometimes the path forward is backward.


Try codetect

codetect is open source and available on GitHub.

If you're building LLM tools and want to talk about optimization tradeoffs, I'm on GitHub and LinkedIn.