
Local Semantic Search: A Deep Dive into Speed vs Privacy

5 min read
Tags: ai, semantic-search, privacy, local-first, developer-tools, qmd

I spent an afternoon investigating local semantic search for AI agent memory. I tested QMD, a local-first semantic search engine created by Tobi Lütke (Shopify founder), and benchmarked it against cloud embeddings. Here's everything I learned — including why it's slow, and how it could be fixed.

What is QMD?

QMD (Query Markdown Documents) is designed for searching your markdown notes, meeting transcripts, and knowledge bases — entirely on your machine with no API calls.

From the README:

"An on-device search engine for everything you need to remember... Ideal for your agentic flows."

The Tech Stack

QMD combines multiple search strategies:

| Component | Technology | Purpose |
|---|---|---|
| Full-text search | SQLite FTS5 (BM25) | Fast keyword matching |
| Vector search | sqlite-vec + local embeddings | Semantic similarity |
| Query expansion | Fine-tuned 1.7B LLM | Rewrites queries into variants |
| Reranking | Qwen3 0.6B LLM | Scores and orders candidates |

All models run locally via node-llama-cpp with GGUF quantized weights.

The Three Search Modes

Here's the first key finding: QMD offers different speed/quality tradeoffs:

| Mode | Command | What It Does | Typical Speed |
|---|---|---|---|
| Keyword Only | search | BM25 full-text search | 0.2s |
| Semantic | vsearch | Vector search + query expansion | 2-6s 🔶 |
| Full Pipeline | query | Everything + LLM reranking | 13-16s 🐢 |

That's a 65-80x difference between the fastest and slowest modes. Same tool, wildly different performance characteristics.

For AI frameworks integrating QMD, the default is typically query — the slowest, highest-quality option.

Where Does the Time Go?

Benchmarked on a Mac Mini with 16 GB RAM:

  • Cold-load query expansion model (1.2 GB): ~5s
  • Expand query into 3 variants: ~1s
  • Run 6 parallel searches (3 queries × 2 modes): ~1s
  • RRF fusion + candidate selection: ~0.1s
  • Cold-load reranker model (610 MB): ~4s
  • Rerank top 30 candidates: ~2s
  • Position-aware blending + output: ~0.1s

The core search itself is fast (~0.17s). The slowness comes almost entirely from the LLM steps: roughly 9s of cold-loading the two models, plus ~3s of running them.
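For context, the RRF (Reciprocal Rank Fusion) step is cheap because it only looks at each document's rank in each result list, never at raw scores. A minimal TypeScript sketch of the idea (k = 60 is the conventional constant from the RRF literature; whether QMD uses that exact value is an assumption):

// Fuse multiple ranked lists: score(doc) = sum over lists of 1 / (k + rank).
// Documents ranking decently in several lists beat single-list outliers.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}

With six parallel searches (3 query variants × 2 modes), fusing is just six small arrays through this function, which is why it clocks in at ~0.1s.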

Models Downloaded (~2.1 GB total)

| Model | Size | Purpose |
|---|---|---|
| qmd-query-expansion-1.7B-q4_k_m.gguf | 1.2 GB | Rewrites queries |
| qwen3-reranker-0.6b-q8_0.gguf | 610 MB | Scores candidates |
| embeddinggemma-300M-Q8_0.gguf | 313 MB | Vector embeddings |

Models auto-download from HuggingFace on first use.

The Root Cause: Cold-Loading Models Per Query

Here's what I discovered: many AI frameworks spawn a new subprocess for every search:

Every search:

  1. Spawn new qmd query process
  2. Process loads 1.2 GB expansion model from disk
  3. Run query expansion
  4. Load 610 MB reranker model from disk
  5. Run reranking
  6. Output results
  7. Process exits — models discarded

Next search? Start over. Load 2 GB of models again.

This is the architectural issue. QMD itself isn't slow — the integration pattern is wrong.
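In code, the anti-pattern is roughly this (a hypothetical integration sketched in TypeScript; the exact wrapper varies by framework, but the fresh process per call is the point):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Anti-pattern: every call spawns a fresh qmd process, which must
// cold-load ~2 GB of GGUF models from disk before doing any searching.
async function searchMemory(query: string): Promise<string> {
  const { stdout } = await run("qmd", ["query", query]);
  return stdout; // arrives ~13-16s later; models are discarded on exit
}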

The Fix: MCP Server Mode

QMD already has an MCP (Model Context Protocol) server mode that keeps models loaded:

qmd mcp  # Starts long-lived server with models in memory

The fix is for AI frameworks to:

  1. Start qmd mcp as a long-lived child process on boot
  2. Route search calls through MCP instead of spawning subprocesses
  3. Models stay warm → queries drop from ~15s to just the inference time
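Here's a sketch of steps 1 and 2 using the official MCP TypeScript SDK. The tool name "query" is my guess, not something I verified against QMD; list the tools the server actually exposes before relying on it:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Start qmd mcp once, at boot, as a long-lived child process.
const transport = new StdioClientTransport({ command: "qmd", args: ["mcp"] });
const client = new Client({ name: "agent-memory", version: "0.1.0" });
await client.connect(transport);

// Discover what the server actually exposes; names are not guaranteed.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Every search reuses the warm process instead of spawning a new one.
const result = await client.callTool({
  name: "query", // assumption; substitute a name from listTools()
  arguments: { query: "meeting notes about Q3" },
});
console.log(result);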

From the QMD docs:

"If you need repeated semantic searches, consider keeping the process/model warm (e.g., a long-lived qmd/MCP server mode) rather than invoking a cold-start LLM each time."

Real-World Comparison

|  | QMD (Full) | QMD (Keyword) | Cloud Embeddings |
|---|---|---|---|
| Latency | ~15s | ~0.2s | ~200ms |
| API tokens/search | 0 | 0 | ~100-500 |
| Cost per 1000 searches | $0 | $0 | ~$0.01-0.05 |
| Data leaves machine | No | No | Yes |
| Quality | Highest | Good | High |
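Quick sanity check on the cost row: assuming a price in the ballpark of $0.02 per million tokens (typical for small cloud embedding models as of this writing; re-check your provider), 1,000 searches at ~300 tokens each is ~300K tokens, or about $0.006. That lands at the low end of the table's range.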

Setup Gotchas (Lessons Learned)

If you're integrating QMD with an AI framework, here are the gotchas I hit:

1. Isolated Index Directories Per Agent

QMD often runs with isolated config/cache directories per agent — not your personal ~/.cache/qmd/. To debug the actual index:

export XDG_CONFIG_HOME="/path/to/agent/qmd/xdg-config"
export XDG_CACHE_HOME="/path/to/agent/qmd/xdg-cache"
qmd collection list

2. Auto-Created Collections May Point to Wrong Paths

Collections sometimes reference incorrect directories. If searches return nothing, check your index.yml and verify the collection roots actually exist.

3. Embeddings Aren't Auto-Generated

Files get indexed, but embeddings might not be generated automatically. If vector search returns empty results, generate them manually:

qmd embed

4. Default Timeout Too Short

Many frameworks default to 4-second timeouts. QMD's full pipeline takes 13-16 seconds. Result: always times out, falls back to cloud embeddings.

The fix: Increase timeout to 60+ seconds, or use a faster search mode.
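If your framework shells out to qmd directly, Node's child_process already supports this. A minimal sketch, assuming the full query mode and the 60s headroom suggested above:

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function fullPipelineSearch(query: string): Promise<string> {
  const { stdout } = await run("qmd", ["query", query], {
    timeout: 60_000, // kill the child only after 60s, not a 4s default
    maxBuffer: 10 * 1024 * 1024, // room for large result sets
  });
  return stdout;
}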

5. Config Changes May Require Full Restart

Some config changes don't hot-reload. You might need to restart the entire application, not just send a reload signal.

When to Use What

Use Local Search (QMD) When:

  • Privacy is non-negotiable (legal, medical, corporate)
  • You're doing batch processing (latency doesn't matter)
  • API costs are a concern at scale
  • You want maximum search quality
  • You need offline capability

Use Cloud Embeddings When:

  • Interactive chat (sub-second responses needed)
  • Occasional memory lookups
  • Privacy requirements are moderate
  • You want zero local setup
  • Hardware resources are limited

The Hybrid Approach:

  • Cloud embeddings for interactive queries
  • Local search for background processing, nightly indexing, deep analysis
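Wiring that up is mostly a routing decision. A hedged sketch, where both search functions are placeholders for whatever your stack actually provides:

declare function cloudEmbeddingSearch(query: string): Promise<string[]>;
declare function qmdFullPipelineSearch(query: string): Promise<string[]>;

type SearchContext = { interactive: boolean };

async function search(query: string, ctx: SearchContext): Promise<string[]> {
  if (ctx.interactive) {
    // Chat path: ~200ms cloud embeddings keep the conversation snappy.
    return cloudEmbeddingSearch(query);
  }
  // Background path: nightly indexing and deep analysis can afford ~15s
  // per query in exchange for the highest-quality, fully local results.
  return qmdFullPipelineSearch(query);
}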

Feature Requests

After this investigation, here's what I think would help:

For AI Frameworks:

  • Configurable search mode: a setting to choose between search, vsearch, and query
  • Automatic fallback: try the fast mode first and escalate when results look thin (sketched below)
  • Timeout handling that matches the actual query time
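The fallback idea could look something like this (the three-hit threshold and the one-result-per-line output format are assumptions for illustration, not QMD's documented behavior):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function searchWithEscalation(query: string): Promise<string[]> {
  // Try the ~0.2s BM25 mode first.
  const fast = await run("qmd", ["search", query]);
  const hits = fast.stdout.trim().split("\n").filter(Boolean);
  if (hits.length >= 3) return hits; // "good enough": skip the LLMs

  // Too few hits: escalate to the full pipeline with a generous timeout.
  const full = await run("qmd", ["query", query], { timeout: 60_000 });
  return full.stdout.trim().split("\n").filter(Boolean);
}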

The Bottom Line

QMD is impressive engineering — fully local semantic search that rivals cloud APIs in quality. The problem isn't QMD itself; it's the integration pattern (subprocess per query instead of persistent server).

Current state:

  • QMD works correctly but takes 13-16s per query (cold-loading models)
  • Default short timeouts mean it often falls back to cloud anyway
  • The fix exists (MCP server mode) but isn't widely implemented yet

With MCP server mode: the ~9s of model loading disappears, so latency would drop from ~15s to the inference cost alone: a few seconds for the full pipeline, and well under a second for the lighter modes.

Until that's common, cloud embeddings are the pragmatic choice for interactive chat. QMD shines for batch processing, privacy-sensitive use cases, or when you can wait.