
I spent an afternoon investigating local semantic search for AI agent memory. I tested QMD, a local-first semantic search engine created by Tobi Lütke (Shopify founder), and benchmarked it against cloud embeddings. Here's everything I learned — including why it's slow, and how it could be fixed.
What is QMD?
QMD (Query Markdown Documents) is designed for searching your markdown notes, meeting transcripts, and knowledge bases — entirely on your machine with no API calls.
From the README:
"An on-device search engine for everything you need to remember... Ideal for your agentic flows."
The Tech Stack
QMD combines multiple search strategies:
| Component | Technology | Purpose |
|---|---|---|
| Full-text search | SQLite FTS5 (BM25) | Fast keyword matching |
| Vector search | sqlite-vec + local embeddings | Semantic similarity |
| Query expansion | Fine-tuned 1.7B LLM | Rewrites queries into variants |
| Reranking | Qwen3 0.6B LLM | Scores and orders candidates |
All models run locally via node-llama-cpp with GGUF quantized weights.
The Three Search Modes
Here's the first key finding: QMD offers three search modes with very different speed/quality tradeoffs:
| Mode | Command | What It Does | Typical Speed |
|---|---|---|---|
| Keyword Only | search | BM25 full-text search | 0.2s ⚡ |
| Semantic | vsearch | Vector search + query expansion | 2-6s 🔶 |
| Full Pipeline | query | Everything + LLM reranking | 13-16s 🐢 |
That's a 65x difference between fastest and slowest. Same tool, wildly different performance characteristics.
For AI frameworks integrating QMD, the default is typically query — the slowest, highest-quality option.
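To make the tradeoff concrete, here's a minimal sketch of how an integration could expose the mode choice from TypeScript by shelling out to the CLI. It assumes each subcommand takes the query text as a positional argument; check the real CLI's flags before copying.

```typescript
// Minimal sketch: let the caller pick the speed/quality tradeoff per search.
// Assumes `qmd <mode> "<text>"` is a valid invocation; adjust to the real CLI.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

type QmdMode = "search" | "vsearch" | "query"; // keyword, semantic, full pipeline

async function qmdSearch(mode: QmdMode, text: string): Promise<string> {
  const { stdout } = await run("qmd", [mode, text]);
  return stdout;
}

// Keyword mode (~0.2s) for interactive lookups, full pipeline (~13-16s) for batch work.
qmdSearch("search", "Q3 roadmap decisions").then(console.log);
```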
Where Does the Time Go?

Benchmarked on Mac Mini with 16GB RAM:
- Cold-load query expansion model (1.2 GB): ~5s
- Expand query into 3 variants: ~1s
- Run 6 parallel searches (3 queries × 2 modes): ~1s
- RRF fusion + candidate selection: ~0.1s
- Cold-load reranker model (610 MB): ~4s
- Rerank top 30 candidates: ~2s
- Position-aware blending + output: ~0.1s
The core search itself is fast (0.17s). Most of the slowness comes from cold-loading the two LLM models, which accounts for roughly 9 of the ~13 seconds.
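You can watch the cold-load tax repeat by timing two back-to-back invocations of the full pipeline; with the subprocess-per-query pattern, the second run is no faster than the first. A rough sketch (query text and CLI shape are illustrative):

```typescript
// Time two consecutive cold invocations of the full pipeline. Because each
// subprocess reloads ~2 GB of models from disk, both runs land in the same
// 13-16s range on the hardware described above.
import { execFileSync } from "node:child_process";

for (const label of ["first run", "second run"]) {
  const start = performance.now();
  execFileSync("qmd", ["query", "what did we decide about the launch date?"]);
  console.log(`${label}: ${((performance.now() - start) / 1000).toFixed(1)}s`);
}
```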
Models Downloaded (~2.1 GB total)
| Model | Size | Purpose |
|---|---|---|
| qmd-query-expansion-1.7B-q4_k_m.gguf | 1.2 GB | Rewrites queries |
| qwen3-reranker-0.6b-q8_0.gguf | 610 MB | Scores candidates |
| embeddinggemma-300M-Q8_0.gguf | 313 MB | Vector embeddings |
Models auto-download from HuggingFace on first use.
The Root Cause: Cold-Loading Models Per Query
Here's what I discovered: many AI frameworks spawn a new subprocess for every search:
Every search:
- Spawn a new qmd query process
- Process loads the 1.2 GB expansion model from disk
- Run query expansion
- Load 610 MB reranker model from disk
- Run reranking
- Output results
- Process exits — models discarded
Next search? Start over. Load 2 GB of models again.
This is the architectural issue. QMD itself isn't slow — the integration pattern is wrong.
The Fix: MCP Server Mode
QMD already has an MCP (Model Context Protocol) server mode that keeps models loaded:
qmd mcp  # Starts a long-lived server with models in memory
The fix is for AI frameworks to:
- Start qmd mcp as a long-lived child process on boot
- Route search calls through MCP instead of spawning subprocesses
- Models stay warm → sub-second queries
From the QMD docs:
"If you need repeated semantic searches, consider keeping the process/model warm (e.g., a long-lived qmd/MCP server mode) rather than invoking a cold-start LLM each time."
Real-World Comparison
| | QMD (Full) | QMD (Keyword) | Cloud Embeddings |
|---|---|---|---|
| Latency | ~15s | ~0.2s | ~200ms |
| API tokens/search | 0 | 0 | ~100-500 |
| Cost per 1000 searches | $0 | $0 | ~$0.01-0.05 |
| Data leaves machine | No | No | Yes |
| Quality | Highest | Good | High |
Setup Gotchas (Lessons Learned)
If you're integrating QMD with an AI framework, here are the gotchas I hit:
1. Isolated Index Directories Per Agent
QMD often runs with isolated config/cache directories per agent — not your personal ~/.cache/qmd/. To debug the actual index:
export XDG_CONFIG_HOME="/path/to/agent/qmd/xdg-config"
export XDG_CACHE_HOME="/path/to/agent/qmd/xdg-cache"
qmd collection list
2. Auto-Created Collections May Point to Wrong Paths
Collections sometimes reference incorrect directories. If searches return nothing, check your index.yml and verify the collection roots actually exist.
3. Embeddings Aren't Auto-Generated
Files get indexed, but embeddings might not be generated automatically. If vector search returns empty results:
qmd embed
4. Default Timeout Too Short
Many frameworks default to 4-second timeouts. QMD's full pipeline takes 13-16 seconds. Result: always times out, falls back to cloud embeddings.
The fix: Increase timeout to 60+ seconds, or use a faster search mode.
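If you control the subprocess call yourself, the timeout is a one-line change. Here's a sketch that also degrades to keyword mode when the budget is still blown (CLI argument shape assumed, as before):

```typescript
// Give the full pipeline a budget that matches reality (13-16s observed),
// and degrade to the 0.2s keyword mode instead of silently failing.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function searchWithBudget(text: string): Promise<string> {
  try {
    // 60s leaves headroom over the observed 13-16s full-pipeline latency.
    return (await run("qmd", ["query", text], { timeout: 60_000 })).stdout;
  } catch {
    // Timed out or errored: fall back to fast keyword search.
    return (await run("qmd", ["search", text], { timeout: 5_000 })).stdout;
  }
}
```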
5. Config Changes May Require Full Restart
Some config changes don't hot-reload. You might need to restart the entire application, not just send a reload signal.
When to Use What
Use Local Search (QMD) When:
- Privacy is non-negotiable (legal, medical, corporate)
- You're doing batch processing (latency doesn't matter)
- API costs are a concern at scale
- You want maximum search quality
- You need offline capability
Use Cloud Embeddings When:
- Interactive chat (sub-second responses needed)
- Occasional memory lookups
- Privacy requirements are moderate
- You want zero local setup
- Hardware resources are limited
The Hybrid Approach:
- Cloud embeddings for interactive queries
- Local search for background processing, nightly indexing, and deep analysis (see the sketch below)
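Sketched as a simple router, with both backends passed in as plain functions (the backends in the usage example are placeholders, not real APIs):

```typescript
// Hybrid routing: interactive traffic takes the sub-second cloud path,
// background/batch traffic takes the slow-but-deep local QMD path.
type SearchFn = (text: string) => Promise<string[]>;

function makeHybridSearch(fastCloud: SearchFn, deepLocal: SearchFn) {
  return (text: string, opts: { interactive: boolean }): Promise<string[]> =>
    opts.interactive ? fastCloud(text) : deepLocal(text);
}

// Usage with placeholder backends:
const search = makeHybridSearch(
  async (q) => [`cloud embedding hit for "${q}"`],
  async (q) => [`qmd full-pipeline hit for "${q}"`],
);
search("last week's decisions", { interactive: true }).then(console.log);
```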
Feature Requests
After this investigation, here's what I think would help:
For AI Frameworks:
- Configurable search mode selection: a config option to choose the search / vsearch / query mode
- Automatic fallback (try fast, escalate if results are poor); see the sketch after this list
- Proper timeout handling that matches actual query time
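The fallback could be as simple as running the cheap keyword mode first and escalating to the full pipeline only when it comes back empty. A sketch, again assuming the query is a positional CLI argument:

```typescript
// Try-fast-then-escalate: keyword search (~0.2s) first, full pipeline (~13-16s)
// only when the cheap pass finds nothing.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function searchWithEscalation(text: string): Promise<string> {
  const fast = (await run("qmd", ["search", text])).stdout;
  if (fast.trim().length > 0) return fast;      // good enough, stay fast
  // Poor/empty results: escalate to query expansion + reranking.
  return (await run("qmd", ["query", text], { timeout: 60_000 })).stdout;
}
```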
The Bottom Line

QMD is impressive engineering — fully local semantic search that rivals cloud APIs in quality. The problem isn't QMD itself; it's the integration pattern (subprocess per query instead of persistent server).
Current state:
- QMD works correctly but takes 13-16s per query (cold-loading models)
- Default short timeouts mean it often falls back to cloud anyway
- The fix exists (MCP server mode) but isn't widely implemented yet
With MCP server mode: Latency would drop from 15s to sub-second.
Until that's common, cloud embeddings are the pragmatic choice for interactive chat. QMD shines for batch processing, privacy-sensitive use cases, or when you can wait.