
Local Semantic Search: A Deep Dive into Speed vs Privacy

5 min read
Tags: ai, semantic-search, privacy, local-first, developer-tools, qmd

I spent an afternoon investigating local semantic search for AI agent memory. I tested QMD, a local-first semantic search engine created by Tobi Lütke (Shopify founder), and benchmarked it against cloud embeddings. Here's everything I learned — including why it's slow, and how it could be fixed.

What is QMD?

QMD (Query Markdown Documents) is designed for searching your markdown notes, meeting transcripts, and knowledge bases — entirely on your machine with no API calls.

From the README:

"An on-device search engine for everything you need to remember... Ideal for your agentic flows."

The Tech Stack

QMD combines multiple search strategies:

| Component | Technology | Purpose |
|---|---|---|
| Full-text search | SQLite FTS5 (BM25) | Fast keyword matching |
| Vector search | sqlite-vec + local embeddings | Semantic similarity |
| Query expansion | Fine-tuned 1.7B LLM | Rewrites queries into variants |
| Reranking | Qwen3 0.6B LLM | Scores and orders candidates |

All models run locally via node-llama-cpp with GGUF quantized weights.

The Three Search Modes

Here's the first key finding: QMD offers different speed/quality tradeoffs:

| Mode | Command | What It Does | Typical Speed |
|---|---|---|---|
| Keyword Only | search | BM25 full-text search | 0.2s |
| Semantic | vsearch | Vector search + query expansion | 2-6s 🔶 |
| Full Pipeline | query | Everything + LLM reranking | 13-16s 🐢 |

That's a 65-80x difference between the fastest and slowest modes. Same tool, wildly different performance characteristics.

For AI frameworks integrating QMD, the default is typically query — the slowest, highest-quality option.

Where Does the Time Go?

Benchmarked on a Mac Mini with 16 GB RAM:

  • Cold-load query expansion model (1.2 GB): ~5s
  • Expand query into 3 variants: ~1s
  • Run 6 parallel searches (3 queries × 2 modes): ~1s
  • RRF fusion + candidate selection: ~0.1s
  • Cold-load reranker model (610 MB): ~4s
  • Rerank top 30 candidates: ~2s
  • Position-aware blending + output: ~0.1s

The core search itself is fast (~0.17s). The slowness comes almost entirely from the LLM steps: roughly 9s of cold-loading the two models, plus ~3s of running them.
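For context, the RRF (Reciprocal Rank Fusion) step is cheap because it only looks at each document's rank in each result list, never at raw scores. A minimal TypeScript sketch of the idea (k = 60 is the conventional constant from the RRF literature; whether QMD uses that exact value is an assumption):

// Fuse multiple ranked lists: score(doc) = sum over lists of 1 / (k + rank).
// Documents ranking decently in several lists beat single-list outliers.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}

With six parallel searches (3 query variants × 2 modes), fusing is just six small arrays through this function, which is why it clocks in at ~0.1s.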

Models Downloaded (~2.1 GB total)

| Model | Size | Purpose |
|---|---|---|
| qmd-query-expansion-1.7B-q4_k_m.gguf | 1.2 GB | Rewrites queries |
| qwen3-reranker-0.6b-q8_0.gguf | 610 MB | Scores candidates |
| embeddinggemma-300M-Q8_0.gguf | 313 MB | Vector embeddings |

Models auto-download from HuggingFace on first use.

The Root Cause: Cold-Loading Models Per Query

Here's what I discovered: many AI frameworks spawn a new subprocess for every search:

Every search:

  1. Spawn new qmd query process
  2. Process loads 1.2 GB expansion model from disk
  3. Run query expansion
  4. Load 610 MB reranker model from disk
  5. Run reranking
  6. Output results
  7. Process exits — models discarded

Next search? Start over. Load 2 GB of models again.

This is the architectural issue. QMD itself isn't slow — the integration pattern is wrong.
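In code, the anti-pattern is roughly this (a hypothetical integration sketched in TypeScript; the exact wrapper varies by framework, but the fresh process per call is the point):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Anti-pattern: every call spawns a fresh qmd process, which must
// cold-load ~2 GB of GGUF models from disk before doing any searching.
async function searchMemory(query: string): Promise<string> {
  const { stdout } = await run("qmd", ["query", query]);
  return stdout; // arrives ~13-16s later; models are discarded on exit
}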

The Fix: MCP Server Mode

QMD already has an MCP (Model Context Protocol) server mode that keeps models loaded:

qmd mcp  # Starts long-lived server with models in memory

The fix is for AI frameworks to:

  1. Start qmd mcp as a long-lived child process on boot
  2. Route search calls through MCP instead of spawning subprocesses
  3. Models stay warm → queries drop from ~15s to just the inference time
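Here's a sketch of steps 1 and 2 using the official MCP TypeScript SDK. The tool name "query" is my guess, not something I verified against QMD; list the tools the server actually exposes before relying on it:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Start qmd mcp once, at boot, as a long-lived child process.
const transport = new StdioClientTransport({ command: "qmd", args: ["mcp"] });
const client = new Client({ name: "agent-memory", version: "0.1.0" });
await client.connect(transport);

// Discover what the server actually exposes; names are not guaranteed.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Every search reuses the warm process instead of spawning a new one.
const result = await client.callTool({
  name: "query", // assumption; substitute a name from listTools()
  arguments: { query: "meeting notes about Q3" },
});
console.log(result);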

From the QMD docs:

"If you need repeated semantic searches, consider keeping the process/model warm (e.g., a long-lived qmd/MCP server mode) rather than invoking a cold-start LLM each time."

Real-World Comparison

|  | QMD (Full) | QMD (Keyword) | Cloud Embeddings |
|---|---|---|---|
| Latency | ~15s | ~0.2s | ~200ms |
| API tokens/search | 0 | 0 | ~100-500 |
| Cost per 1000 searches | $0 | $0 | ~$0.01-0.05 |
| Data leaves machine | No | No | Yes |
| Quality | Highest | Good | High |
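Quick sanity check on the cost row: assuming a price in the ballpark of $0.02 per million tokens (typical for small cloud embedding models as of this writing; re-check your provider), 1,000 searches at ~300 tokens each is ~300K tokens, or about $0.006. That lands at the low end of the table's range.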

Setup Gotchas (Lessons Learned)

If you're integrating QMD with an AI framework, here are the gotchas I hit:

1. Isolated Index Directories Per Agent

QMD often runs with isolated config/cache directories per agent — not your personal ~/.cache/qmd/. To debug the actual index:

export XDG_CONFIG_HOME="/path/to/agent/qmd/xdg-config"
export XDG_CACHE_HOME="/path/to/agent/qmd/xdg-cache"
qmd collection list

2. Auto-Created Collections May Point to Wrong Paths

Collections sometimes reference incorrect directories. If searches return nothing, check your index.yml and verify the collection roots actually exist.

3. Embeddings Aren't Auto-Generated

Files get indexed, but embeddings might not be generated automatically. If vector search returns empty results, generate them manually:

qmd embed

4. Default Timeout Too Short

Many frameworks default to 4-second timeouts. QMD's full pipeline takes 13-16 seconds. Result: always times out, falls back to cloud embeddings.

The fix: Increase timeout to 60+ seconds, or use a faster search mode.
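If your framework shells out to qmd directly, Node's child_process already supports this. A minimal sketch, assuming the full query mode and the 60s headroom suggested above:

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function fullPipelineSearch(query: string): Promise<string> {
  const { stdout } = await run("qmd", ["query", query], {
    timeout: 60_000, // kill the child only after 60s, not a 4s default
    maxBuffer: 10 * 1024 * 1024, // room for large result sets
  });
  return stdout;
}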

5. Config Changes May Require Full Restart

Some config changes don't hot-reload. You might need to restart the entire application, not just send a reload signal.

When to Use What

Use Local Search (QMD) When:

  • Privacy is non-negotiable (legal, medical, corporate)
  • You're doing batch processing (latency doesn't matter)
  • API costs are a concern at scale
  • You want maximum search quality
  • You need offline capability

Use Cloud Embeddings When:

  • Interactive chat (sub-second responses needed)
  • Occasional memory lookups
  • Privacy requirements are moderate
  • You want zero local setup
  • Hardware resources are limited

The Hybrid Approach:

  • Cloud embeddings for interactive queries
  • Local search for background processing, nightly indexing, deep analysis
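Wiring that up is mostly a routing decision. A hedged sketch, where both search functions are placeholders for whatever your stack actually provides:

declare function cloudEmbeddingSearch(query: string): Promise<string[]>;
declare function qmdFullPipelineSearch(query: string): Promise<string[]>;

type SearchContext = { interactive: boolean };

async function search(query: string, ctx: SearchContext): Promise<string[]> {
  if (ctx.interactive) {
    // Chat path: ~200ms cloud embeddings keep the conversation snappy.
    return cloudEmbeddingSearch(query);
  }
  // Background path: nightly indexing and deep analysis can afford ~15s
  // per query in exchange for the highest-quality, fully local results.
  return qmdFullPipelineSearch(query);
}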

Feature Requests

After this investigation, here's what I think would help:

For AI Frameworks:

  • Configurable search mode: a setting to choose between search, vsearch, and query
  • Automatic fallback: try the fast mode first and escalate when results look thin (sketched below)
  • Timeout handling that matches the actual query time
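The fallback idea could look something like this (the three-hit threshold and the one-result-per-line output format are assumptions for illustration, not QMD's documented behavior):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function searchWithEscalation(query: string): Promise<string[]> {
  // Try the ~0.2s BM25 mode first.
  const fast = await run("qmd", ["search", query]);
  const hits = fast.stdout.trim().split("\n").filter(Boolean);
  if (hits.length >= 3) return hits; // "good enough": skip the LLMs

  // Too few hits: escalate to the full pipeline with a generous timeout.
  const full = await run("qmd", ["query", query], { timeout: 60_000 });
  return full.stdout.trim().split("\n").filter(Boolean);
}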

The Bottom Line

QMD is impressive engineering — fully local semantic search that rivals cloud APIs in quality. The problem isn't QMD itself; it's the integration pattern (subprocess per query instead of persistent server).

Current state:

  • QMD works correctly but takes 13-16s per query (cold-loading models)
  • Default short timeouts mean it often falls back to cloud anyway
  • The fix exists (MCP server mode) but isn't widely implemented yet

With MCP server mode: the ~9s of model loading disappears, so latency would drop from ~15s to the inference cost alone: a few seconds for the full pipeline, and well under a second for the lighter modes.

Until that's common, cloud embeddings are the pragmatic choice for interactive chat. QMD shines for batch processing, privacy-sensitive use cases, or when you can wait.