
codetect v0: Building an MCP Code Search Tool from Scratch

The origin story and MVP approach

This is part 1 of the codetect series.

The Problem

November 2025. I'm deep into agentic AI-assisted development, and I notice two things happening simultaneously:

  1. My Claude API bills are climbing fast. Every time I ask Claude about my codebase, I'm sending entire files—sometimes multiple files—just to get context. A simple "where is the authentication logic?" question could cost 10-20k tokens.
  2. People keep saying Cursor feels faster than Claude Code. Coworkers, friends on Twitter, developers in Discord—everyone's making the same observation.

So I dig into why Cursor feels different. The answer isn't the underlying model (both can use Claude). It's the codebase indexing.

Cursor indexes your codebase and retrieves only relevant chunks when you ask questions. Instead of sending 5,000 lines of code, it sends 500. Instead of 20k tokens, maybe 2k. Same answer, 90% less cost.

But here's the catch: Cursor's indexing is proprietary and cloud-based. If you want that capability with other tools—Claude Code, Continue, custom LLM workflows—you're out of luck.

That's when I decided to build it myself.

(I originally called it repo-search—a literal, functional name. By v1, I'd rename it to codetect to better reflect its purpose: detecting code patterns through multiple search modes. But that's getting ahead of the story.)


The Goal

I wanted a tool that:

  • Works with any LLM that supports MCP (Model Context Protocol) - not just one vendor
  • Runs entirely locally - no cloud dependencies, no API costs, no data leaving your machine
  • Provides three search modes:
    • Keyword search (fast full-text)
    • Symbol navigation (jump to definitions)
    • Semantic search (find code by meaning, not just keywords)
  • Is fast enough for real development - sub-second search on typical codebases
  • Is open source - MIT license, community-driven

The North Star: Bring Cursor-grade code search to the entire MCP ecosystem.


The MVP Approach: Ship Fast, Learn Fast

I gave myself a constraint: get something working in weeks, not months.

Here's what I chose for v0:

Technology Stack

  • SQLite: Simple, fast, single-file database. Perfect for MVP.
  • ripgrep: Blazing-fast keyword search (already on most dev machines).
  • ctags: Battle-tested symbol indexing for function/class definitions.
  • Ollama + nomic-embed-text: Local embeddings (768 dimensions) without cloud API costs.

Chunking Strategy

This is where I took the simplest possible approach:

  • Line-based chunking: Split files every 512 lines
  • 50-line overlap: Overlap chunks to avoid losing context at boundaries
  • One chunk = one embedding: Simple 1:1 mapping

I knew this was naive. Functions would get split across chunks. Context would be lost. But it was fast to implement and would validate the core idea.
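
For reference, here's a minimal sketch of that chunking approach (illustrative Python, not the actual v0 source; the function name and defaults are mine):

# Illustrative v0-style line-based chunking: 512-line chunks, 50-line overlap.
def chunk_lines(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    lines = text.splitlines()
    step = chunk_size - overlap  # each new chunk starts 462 lines after the last
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + chunk_size]))
        if start + chunk_size >= len(lines):
            break  # the last chunk already reaches the end of the file
    return chunks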

MCP Server Architecture

codetect runs as an MCP server that exposes three tools to any compatible LLM:

// Three search modes exposed via MCP
{
  "keyword_search": {
    "description": "Fast full-text search using ripgrep",
    "parameters": { "query": "string" }
  },
  "symbol_search": {
    "description": "Find function/class definitions using ctags",
    "parameters": { "symbol": "string" }
  },
  "semantic_search": {
    "description": "Find code by meaning using embeddings",
    "parameters": { "query": "string", "limit": "number" }
  }
}

The LLM decides which tool to use based on the user's query. Want to find "authentication logic"? Semantic search. Want to jump to the validateToken function? Symbol search. Want to grep for "TODO"? Keyword search.
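
For a rough sense of the server side, here's a sketch of how those three tools could be exposed. It assumes the official Python MCP SDK's FastMCP helper plus ripgrep and universal-ctags' readtags on the PATH; the semantic index lookup is omitted here, and the real v0 implementation may differ.

# Hedged sketch: expose the three search modes as MCP tools via FastMCP.
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("codetect")

@mcp.tool()
def keyword_search(query: str) -> str:
    """Fast full-text search using ripgrep."""
    out = subprocess.run(["rg", "--line-number", query],
                         capture_output=True, text=True)
    return out.stdout or "no matches"

@mcp.tool()
def symbol_search(symbol: str) -> str:
    """Find function/class definitions from a ctags-generated tags file."""
    out = subprocess.run(["readtags", "-t", "tags", symbol],
                         capture_output=True, text=True)
    return out.stdout or "symbol not found"

@mcp.tool()
def semantic_search(query: str, limit: int = 10) -> str:
    """Find code by meaning using embeddings (index lookup not shown here)."""
    return "embedding index lookup goes here"

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio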


Implementation: The First Two Weeks

Week 1: Basic indexing

  • File traversal (ignore .git, node_modules, etc.)
  • Line-based chunking
  • SQLite schema for chunks and embeddings
  • Integration with Ollama for local embedding generation (a rough sketch of the schema and embedding call follows this list)
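
Here's that indexing step in outline (illustrative; the table and column names are mine, not the actual v0 schema, and it assumes Ollama running locally on its default port):

# Hedged sketch of v0-style indexing: store each chunk plus its embedding in
# SQLite, with embeddings generated locally via Ollama's /api/embeddings
# endpoint (nomic-embed-text, 768 dimensions).
import json
import sqlite3
import urllib.request

db = sqlite3.connect("codetect.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id INTEGER PRIMARY KEY,
        path TEXT NOT NULL,
        start_line INTEGER,
        end_line INTEGER,
        content TEXT NOT NULL,
        embedding TEXT NOT NULL   -- JSON-encoded float vector (768 dims)
    )
""")

def embed(text: str) -> list[float]:
    """Request an embedding from a local Ollama instance."""
    payload = json.dumps({"model": "nomic-embed-text", "prompt": text}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def index_chunk(path: str, start: int, end: int, content: str) -> None:
    """Store one chunk and its embedding."""
    db.execute(
        "INSERT INTO chunks (path, start_line, end_line, content, embedding) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, start, end, content, json.dumps(embed(content))),
    )
    db.commit()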

Week 2: MCP server + search

  • MCP server implementation
  • Vector similarity search (cosine similarity in SQLite; see the sketch after this list)
  • ripgrep and ctags integration
  • Basic CLI (codetect init, codetect index, codetect search)
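
SQLite has no native vector index in this setup, so semantic search is brute force: pull every stored embedding, score it against the query embedding with cosine similarity, and return the top results. Roughly, reusing the illustrative schema from the Week 1 sketch:

# Hedged sketch of brute-force cosine-similarity search over stored embeddings.
import json
import math
import sqlite3

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(db: sqlite3.Connection, query_embedding: list[float], limit: int = 10):
    """Return the top-k chunks ranked by similarity to the query embedding."""
    rows = db.execute(
        "SELECT path, start_line, end_line, embedding FROM chunks"
    ).fetchall()
    scored = [
        (cosine(query_embedding, json.loads(emb)), path, start, end)
        for path, start, end, emb in rows
    ]
    scored.sort(reverse=True)
    return scored[:limit]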

By mid-November, I had a working prototype. It wasn't pretty, but it worked.


What Worked

1. MCP integration was seamless. The Model Context Protocol made it trivial to expose codetect's capabilities to Claude Code. Once the server was running, Claude could call search functions naturally.

2. Local embeddings were viable. Ollama's nomic-embed-text was fast enough (10-50 chunks/second) and accurate enough for code search. No cloud API costs, no rate limits.

3. Keyword + symbol search were immediately useful. Even without semantic search, ripgrep and ctags provided 80% of the value. Developers use keyword search constantly.

4. Shipping fast enabled real feedback. Within days of launching v0, I was using it daily. That usage revealed what needed to improve.


What Didn't Work

1. Line-based chunking split functions awkwardly.

Example: a 600-line file with a function starting at line 480. The function gets split across two chunks—signature in one, body in another. When semantic search retrieves the second chunk, you get a function body with no context.

2. Semantic search quality was inconsistent.

Sometimes it was brilliant: "find rate limiting logic" → correct results. Sometimes it was terrible: "find database connection setup" → random chunks mentioning "database" but not the actual connection code.

Why? Because chunks weren't semantic units. They were arbitrary line ranges.

3. No incremental updates.

Every code change meant a full re-index. For small projects (100-500 files), this was fine (30-60 seconds). For larger codebases? Painful.

4. Single-repo limitation.

v0 assumed one repo per database. If you worked on multiple projects, you needed multiple databases. Not ideal for organizations.


Key Learnings

  1. Ship fast, learn fast. v0's line-based chunking was naive, but getting it out quickly validated the core idea and revealed what to fix.
  2. Local embeddings are viable. You don't need cloud APIs for code search. Local models are fast enough and good enough.
  3. Code has structure. Line-based chunking ignores that structure. This became the key insight for v1 and v2.
  4. MCP is powerful. Exposing search as MCP tools made integration trivial. Any MCP-compatible LLM could use codetect without modification.

What's Next

v0 (still called repo-search at this point) proved the concept. But it also revealed the path forward:

  • Scale to larger codebases - PostgreSQL + pgvector for production performance
  • Better embedding models - Try bge-m3, mxbai-embed-large (higher dimensions)
  • Multi-repo support - Centralized database for organizations
  • Incremental indexing - Don't re-index everything on every change
  • Eval framework - Measure quality improvements objectively
  • Better name - "repo-search" was too generic; the tool deserved a name that reflected its multi-modal search capabilities

But the biggest question: Can we fix chunking?

That's the story of v1—where repo-search became codetect.


Try codetect v0

codetect is open source and available on GitHub.

If you're working with Claude Code or other MCP-compatible tools and want local-first code search, give it a try.

Next: Part 2 - When Better Models Aren't Enough (v1)