Semantic Similarity Caching in Go: A Toy Implementation
Instead of keying cache entries on exact string equality, this cache matches requests by embedding similarity. Queries that are semantically equivalent (but syntactically different) can reuse cached results.
Check out this project here!
Motivation
Why Go? I chose Go deliberately for this experiment. The goal was not to build a feature-rich system, but a cache whose performance characteristics are easy to reason about: predictable memory usage, low-latency primitives, and simple concurrency. Go makes it straightforward to inspect where time and allocations go, which matters when evaluating whether a microsecond-scale optimization is actually worth it.
Traditional caches rely on exact string equality. This works well for deterministic inputs, but it breaks down for natural‑language queries. Two questions like “What’s the weather like?” and “How’s the weather today?” are semantically equivalent, yet an exact‑match cache treats them as unrelated keys. In an LLM-backed system, that translates directly into redundant and expensive model calls.
Recent work such as GPT Semantic Cache (arXiv:2411.05276) formalizes an alternative: cache not just keys, but meanings. Queries are embedded into a vector space, indexed with approximate nearest‑neighbor (ANN) search, and matched by cosine similarity rather than exact equality. In their experiments, the authors report cache hit rates in the 60–70% range while preserving high response accuracy.
This project is a deliberately small, self‑contained implementation of that idea in Go. It focuses only on the caching layer (no LLM integration and no real embedding model) so that the performance and trade‑offs of semantic caching itself are easier to reason about.
Design Overview
Stack
- Language: Go
- Vector index:
github.com/coder/hnsw(pure‑Go HNSW implementation) - API:
net/httpwith a minimal JSON interface
Data Model
At its core, the cache maintains two structures:
SimilarityCache
├── index: hnsw.Graph[string]
└── items: map[string]CacheItem
The HNSW graph stores normalized embedding vectors and is keyed directly by the cache key (string). Metadata such as the cached value and TTL expiry is stored separately in a Go map. Using Graph[string] avoids a secondary ID‑to‑key mapping layer and keeps the implementation straightforward.
Lookup Path
A lookup follows a simple decision tree:
- Exact hit check – If the key exists and is unexpired, return immediately
(O(1)). - Similarity search – Normalize the query vector and perform a 1‑NN ANN search in the HNSW graph.
- Threshold test – If cosine similarity exceeds a configurable threshold and the candidate entry is still valid, return it as a SIMILAR hit.
- Miss – Otherwise, treat the request as a cache miss.
This preserves the semantics of a traditional cache while adding a controlled fallback to approximate matching.
Insert and Expiry
On insert, vectors are normalized once and added to both the item map and the HNSW graph. TTL expiration is handled lazily during lookups and eagerly by a background goroutine that periodically evicts expired entries from both data structures.
API Surface
The service exposes three endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
/insert |
POST | Insert a key, vector, value, and TTL |
/lookup |
POST | Exact + similarity lookup |
/batch_lookup |
POST | Vectorized version of lookup |
Each lookup returns a status (HIT, SIMILAR, or MISS), the matched key (if any), the similarity score, and the cached value.
Experimental Setup
To evaluate the cache in isolation, without an LLM or real embedding model, the experiment uses synthetic vectors that approximate the behavior of semantic embeddings.
- Topics: 20 fixed “topic” vectors in ℝ⁶⁴, each representing a canonical question.
- Paraphrases: Generated by adding Gaussian noise to a topic vector and re‑normalizing, simulating how paraphrased text clusters in embedding space.
- Workload: 2,000 queries total.
- 70% paraphrases of known topics (should be cacheable).
- 30% genuinely new topics (true misses).
- Baseline: A plain exact‑match cache implemented as
map[string][]byte.
The exact cache is intentionally disadvantaged here: every paraphrase has a unique key, so it can only hit on literal repeats.
Results
Across multiple noise and threshold settings, the pattern is consistent:
- The exact‑match cache records essentially zero hits.
- The similarity cache recovers ~69–70% hit rates, closely matching the fraction of semantically known queries.
- Lookup latency increases from tens of nanoseconds (exact map lookup) to ~8–9 microseconds for ANN search.
In absolute terms, this is roughly a 200–300× slowdown per cache lookup. In practice, when amortized against a 500 ms–5 s LLM call, the additional cost is negligible.
Two observations stand out:
- Threshold tuning matters. Lowering the similarity threshold improves recall but increases the risk of false positives.
- Noise tolerance is high. Even with relatively loose paraphrases, cosine similarity remains comfortably above common thresholds (0.8–0.9).
Trade‑offs
Strengths
- Recovers cache hits for semantically equivalent queries that exact caches miss entirely.
- Precision–recall trade‑off is explicit and tunable via a single parameter.
- TTL semantics remain simple and familiar.
- ANN lookup overhead is small relative to downstream model costs.
Limitations
- False positives are possible, especially with aggressive thresholds.
- Cache quality depends entirely on the embedding model used upstream.
- HNSW is approximate; nearest neighbors are not guaranteed.
- Memory usage grows with the number of cached entries.
- Not a good fit for strictly structured or safety‑critical queries.
Takeaways
This project is intentionally modest, but it demonstrates a useful idea: once embeddings are already part of your stack, semantic caching is a low-friction way to trade a few microseconds of CPU time for substantial reductions in expensive model calls.
The implementation here is not production-ready, but it clarifies the mechanics and the cost model. In real systems, the harder problems lie upstream—choosing embeddings, defining safe similarity thresholds, and deciding which queries are allowed to match approximately. The cache itself is the easy part.
Seen this way, semantic caching is less a clever optimization and more a necessary abstraction once language models sit on the critical request path.
Future Work
These are extensions that would make the cache more realistic without fundamentally changing its design:
- Pluggable embedding interface: Integrate a real embedding provider (e.g., OpenAI, local ONNX models) behind a clean interface to separate cache mechanics from representation quality.
- Adaptive thresholds: Dynamically adjust similarity thresholds based on observed false-positive rates or query categories.
- Per-entry confidence metadata: Store similarity distributions or insertion-time statistics to make match decisions more conservative for ambiguous entries.
- Sharded or hierarchical indexes: Partition the HNSW graph by coarse topic or namespace to reduce false matches and improve locality.
- Persistent index snapshots: Periodically serialize the HNSW graph to disk to reduce warm-up costs after restarts.
- Evaluation with real embeddings: Re-run the experiment using embeddings from an actual model to quantify semantic drift and edge cases.