This is part 4 of the codetect series:
- Part 1: Building an MCP Code Search Tool (v0)
- Part 2: When Better Models Aren't Enough (v1)
- Part 3: From Line Chunks to AST-Based Understanding (v2)
- Part 4: When Every Improvement Made Things Worse ← You are here
I built codetect—a local MCP server that indexes codebases with embeddings so Claude Code can search semantically instead of grepping and reading whole files. The original version (v0) used the simplest possible stack: ctags for symbols, naive line-based chunking with overlap (512 lines, 50-line overlap), nomic-embed-text for embeddings, SQLite for storage, six MCP tools. It showed a measurable ~10% token efficiency improvement over Claude Code's built-in tools.
Then I made it better. Twice.
v1 added PostgreSQL with pgvector for scalable vector search and better embedding models (bge-m3). v2 went further: AST-based chunking for precise function-level boundaries. Reciprocal Rank Fusion to combine multiple search signals. Cross-encoder reranking for precision. Seven MCP tools instead of six.
Performance got worse. Not marginally—I was now performing below the baseline of not using codetect at all.
Here's what happened.
Precise Chunking Removed the Context That Made Results Useful
v0's chunker split files into 512-line blocks with 50 lines of overlap. This was sloppy—chunks would contain half a function, or a struct definition jammed together with an unrelated helper.
But from the LLM's perspective, this sloppiness was a feature.
A chunk would often contain a function and its caller, or a type and the method that constructs it. When the agent searched for "how does auth work," the overlapping chunks provided enough surrounding context to answer the question in one shot.
AST chunking surgically removed this.
Each function became its own perfectly bounded chunk. Imports got split into separate "gap" chunks. The connective tissue between related pieces of code—the thing that made search results actionable—was gone.
Results became more precise and less useful.
```go
// v0 (line-based chunking):
// Chunk includes auth middleware + calling function
package auth

func AuthMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := r.Header.Get("Authorization")
		if token == "" {
			http.Error(w, "unauthorized", 401)
			return
		}
		// validate token...
		next.ServeHTTP(w, r)
	})
}

func HandleLogin(w http.ResponseWriter, r *http.Request) { // ← Context!
	// Uses AuthMiddleware ↑
}

// v2 (AST chunking):
// Chunk 1: Just AuthMiddleware (no context about usage)
// Chunk 2: Just HandleLogin (no context about middleware)
// Chunk 3: Gap chunk with imports (useless noise)
```
When an agent asked "how does authentication work," v0 returned one chunk with both the middleware definition and its usage. v2's AST chunking returned isolated functions that required multiple follow-up queries to understand.
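For reference, a v0-style chunker is only a few lines. Here's a minimal sketch (the function name and parameters are illustrative, not codetect's actual code):

```python
def chunk_lines(source: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split a file into fixed-size line blocks with overlap, v0-style.

    The overlap is what accidentally kept callers next to callees:
    the last 50 lines of one chunk reappear at the top of the next.
    """
    lines = source.splitlines()
    chunks = []
    step = size - overlap
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):  # last chunk reached end of file
            break
    return chunks
```

On a 1,000-line file this produces three chunks, each sharing a 50-line window with its neighbor. That redundancy is the "sloppiness" that AST chunking eliminated.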
The Bigger Model Added Latency Without Meaningful Quality Gains
bge-m3 is a 567M-parameter model, roughly 4x larger than nomic-embed-text. Every search query is embedded at request time, so the larger model directly increases latency.
And the quality gains don't transfer: bge-m3's strength on MS MARCO and other natural-language benchmarks doesn't translate well to code search. Agent queries tend to be fairly literal: "error handling in auth module," "database connection pooling."
The smaller model was good enough, and the extra capacity bought nothing but latency.
| Embedding Model | Parameters | Query Latency | Quality Gain |
|---|---|---|---|
| nomic-embed-text (v0) | 137M | ~50ms | baseline |
| bge-m3 (v1/v2) | 567M | ~200ms | ~+3% |
I added 150ms of latency per query for a ~3% quality improvement on code that agents could already find with literal keyword matches.
PostgreSQL Solved the Wrong Problem
Yes, PostgreSQL + HNSW was dramatically faster at scale—60x faster on 10K vectors (~58ms brute-force in SQLite vs ~1ms with HNSW). That's a real win on paper.
But for a typical individual developer's codebase of 5-20K chunks, SQLite's brute-force ~58ms was already fast enough. The bottleneck was never search latency—it was the quality of what search returned.
PostgreSQL added:
- TCP connection overhead
- Query planning
- HNSW index maintenance
- Docker as a dependency
All to serve a workload where the difference between 58ms and 1ms is imperceptible to an agent that spends seconds on each reasoning step.
Worse, HNSW is an approximate nearest-neighbor algorithm. It can miss results that brute-force search would have found. I traded guaranteed correctness for scalability that most users didn't need.
| Storage | Search Time (10K chunks) | Dependencies | Accuracy |
|---|---|---|---|
| SQLite (v0) | ~58ms | None | 100% |
| PostgreSQL + HNSW (v1/v2) | ~1ms | Docker, pgvector | ~98% |
PostgreSQL was genuinely faster. But the speed gain was irrelevant to agent workflows where reasoning—not search latency—is the bottleneck. I added complexity and dependencies for a speedup that didn't matter.
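To make "fast enough" concrete: exact search at this scale is a single matrix-vector product. Here's a sketch of brute-force cosine search, assuming the embeddings have already been loaded from SQLite into a NumPy array (names are illustrative):

```python
import numpy as np

def brute_force_search(query: np.ndarray, vectors: np.ndarray,
                       top_k: int = 5) -> list[tuple[int, float]]:
    """Exact nearest-neighbor search by cosine similarity over every chunk.

    At ~10K vectors this is one matrix-vector product, which is why
    SQLite brute force stayed in the tens of milliseconds.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                       # similarity per chunk
    top = np.argsort(-scores)[:top_k]    # exact ranking, no ANN misses
    return [(int(i), float(scores[i])) for i in top]
```

No index to build, no index to maintain, and 100% recall by construction.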
RRF Fusion Had a Fatal Bug Hiding in Plain Sight
Reciprocal Rank Fusion (RRF) boosts results that appear in multiple search signals. It's the theoretical justification for combining keyword and semantic search.
But my keyword results had IDs like auth.go:47 (single line), while semantic results had IDs like auth.go:40:60 (chunk range).
The same function, found by both signals, got two different IDs. RRF never recognized them as the same result. The fusion boost—the entire point of RRF—essentially never fired.
What I actually had was weighted interleaving with keyword results suppressed to 0.3 weight. This was strictly worse than v0's approach of simple concatenation.
```python
# What I thought RRF was doing:
# Function "authenticate" found by both keyword + semantic
# → Boosted to top of results

# What actually happened:
keyword_results = [
    {"id": "auth.go:47", "score": 0.9},      # Line 47
]
semantic_results = [
    {"id": "auth.go:40:60", "score": 0.85},  # Lines 40-60
]
# RRF treats these as DIFFERENT results
# → No fusion boost
# → keyword result gets 0.3 weight (suppressed)
# → Worse than simple concatenation
```
I spent weeks getting the RRF math right, but the ID mismatch meant the fusion never worked as intended.
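The fix is small once you see it: normalize both signals onto the same chunk IDs before fusing. Here's a sketch (the ID scheme and helper names are hypothetical, not codetect's actual code):

```python
def resolve_chunk_id(hit_id: str, chunk_ranges: dict) -> str:
    """Map a 'file:line' keyword hit onto its enclosing 'file:start:end'
    chunk ID so both signals agree on identity (hypothetical scheme)."""
    parts = hit_id.split(":")
    if len(parts) == 3:                  # already a chunk-range ID
        return hit_id
    path, line = parts[0], int(parts[1])
    for start, end in chunk_ranges.get(path, []):
        if start <= line <= end:
            return f"{path}:{start}:{end}"
    return hit_id                        # no enclosing chunk found

def rrf(rankings: list, k: int = 60) -> dict:
    """Standard Reciprocal Rank Fusion: score(d) = sum of 1/(k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores
```

With normalization in place, `auth.go:47` and `auth.go:40:60` collapse to one ID, and a function found by both signals actually gets the fusion boost.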
More Tools Meant More Tokens Spent Deciding
Seven MCP tools (up from six in v0)—several overlapping—meant the agent spent tokens on every search just choosing which tool to use.
Should it call `search_keyword`, `search_semantic`, `hybrid_search`, or `hybrid_search_v2`?
The agent doesn't know, so it reasons about it, picks one, sometimes picks wrong, and tries another. This decision overhead is pure waste that compounds across a session.
| Version | MCP Tools | Tokens per Search Decision |
|---|---|---|
| v0/v1 | 6 | ~80 |
| v2 | 7 | ~120 |
Over a 20-search session, that's 800 extra tokens just deciding which tool to use—tokens that could have been spent on actual reasoning.
Per-Request Initialization Added Invisible Overhead
Every search call opened a database, instantiated caches, created index structures, initialized an embedder, ran the query, and tore everything down.
The MCP server is long-lived—these should have been initialized once at startup.
With the v2 stack's added complexity, this fixed cost per call grew from negligible to noticeable:
| Version | Init Overhead | Impact on 10 Searches |
|---|---|---|
| v0 | ~5ms | 50ms |
| v2 | ~80ms | 800ms |
Nearly a full second wasted across 10 queries just reinitializing the same resources.
The Meta-Lesson
Every change was individually defensible:
- AST chunking is more correct
- bge-m3 scores higher on benchmarks
- PostgreSQL is more scalable
- RRF is theoretically superior
- Content-addressed caching is architecturally cleaner
But the metric that matters is total tokens consumed to go from question to correct answer.
None of the improvements were evaluated against that metric. I optimized for engineering elegance—precision, scalability, correctness—when I should have optimized for the actual consumer: an LLM that benefits from over-fetching, is harmed by latency, and loses tokens to every unnecessary decision point.
The original version (v0) worked because it was fast, dumb, and returned too much context rather than too little.
v1 and v2 made it slow, smart, and precise—exactly wrong for agent consumption.
What I Should Have Measured
The right metric was staring me in the face the whole time:
Tokens-to-answer: Total tokens consumed (including failed searches, retries, and follow-ups) to successfully complete a task.
Here's what my eval framework showed across an 18-task test suite:
| Metric | v0 (baseline) | v2 (without context enrichment) | v2 (with enrichment) |
|---|---|---|---|
| Total tokens | 216,542 | 282,780 | 202,548 |
| vs baseline | — | +31% worse | -6.5% better |
| Accuracy (F1) | 65.6% | — | 67.7% |
| Avg latency | 20.4s | — | 38.3s (+87.5%) |
Without context enrichment (which I had to scramble to add), v2 consumed 31% more tokens than not using codetect at all. Even with enrichment, the latency nearly doubled: the agent spent 87% more time per task, mostly on the overhead I'd built into the system.
v0's sloppy chunks meant agents got their answers with less overhead. v2's precise chunks forced agents to make more queries to reconstruct the context that v0 provided automatically.
Where We Are Now: v3
After v2's failures became clear, I focused v3 on the metric that actually matters: tokens-to-answer.
v3 (shipped): Instead of reverting to v0's stack, I kept v2's infrastructure but ruthlessly cut everything that added token overhead. Reduced tools from 7 to 4. Added server-lifetime initialization (eliminating per-request overhead). Compressed tool descriptions to reduce system prompt tokens. Added detail levels so agents could request less context when they didn't need it.
The results:
| Metric | v2.2.x | v3.0.0 |
|---|---|---|
| Accuracy (F1) | 67.7% | 85.7% |
| Token overhead vs baseline | -6.5% (with enrichment) | -1.5% |
| Latency overhead | +87.5% | +0.3% |
| MCP tools | 7 | 4 |
v3 wins 10 out of 12 head-to-head comparisons against baseline (no MCP tools). The token overhead is essentially zero, and the latency penalty is gone.
The realization: I didn't need to throw away v2's infrastructure. I needed to optimize it for the actual consumer—an LLM that benefits from fast, focused results with minimal decision overhead.
The principles that guided v3 (and will guide future work):
- Fewer tools, clearer choices - 4 tools instead of 7. Less decision overhead per search.
- Server-lifetime initialization - No per-request overhead. Initialize once, serve many.
- Measure tokens-to-answer - The only KPI that actually matters for agent consumption.
- Context enrichment by default - Don't return isolated chunks. Include surrounding context.
- Speed over precision - A fast approximate answer beats a slow precise one for agent workflows.
If it doesn't help agents find answers faster with fewer tokens, it doesn't ship.
Lessons for Building LLM Tools
1. Optimize for the consumer, not the abstraction
LLMs benefit from redundant context. Over-fetching is a feature, not a bug. Precision is overrated when the consumer can filter noise better than you can provide signal.
2. Latency compounds in agent workflows
A human might tolerate 200ms. An agent making 20 queries in a session loses 3 seconds to latency alone—3 seconds where no reasoning happens.
3. Measure tokens-to-answer, not search quality
Traditional IR metrics (precision, recall, MRR) don't capture agent workflows. A precise search that requires 3 follow-up queries is worse than a sloppy search that over-fetches context in one shot.
4. Simplicity scales better than sophistication
Seven overlapping tools create decision paralysis. A few clear tools with obvious use cases keep agents focused on reasoning, not tool selection.
5. Infrastructure for scale you don't have is overhead
PostgreSQL is great at 100K+ chunks. At 10K chunks, it's slower and harder to deploy than SQLite. Build for the scale you have, not the scale you imagine.
6. Every abstraction has a tax
Content-addressed caching, dimension-grouped tables, cross-encoder reranking—each adds initialization cost, decision complexity, and failure modes. The tax is worth it only if the benefit is measurable.
Final Thoughts
This wasn't a story about bad engineering. The v2 implementation was correct, well-tested, and architecturally sound.
It was a story about optimizing for the wrong thing.
I optimized for engineering values—precision, scalability, correctness—when I should have optimized for agent values—context, speed, simplicity.
The hardest part of building tools for LLMs isn't the infrastructure. It's remembering that the consumer isn't human.
Humans want precision. Agents want context.
Humans tolerate latency. Agents compound it.
Humans appreciate choices. Agents waste tokens deciding.
v0 worked because it was built for what agents actually need: fast, dumb, and generous with context.
v1 and v2 were explorations in scaling and precision that taught hard lessons. v3 took those lessons and applied them to what actually matters: helping agents find answers with fewer tokens.
Sometimes the path forward is backward.
Try codetect
codetect is open source and available on GitHub.
If you're building LLM tools and want to talk about optimization tradeoffs, I'm on GitHub and LinkedIn.