v0.1.6: now on PyPI

ThriftLM

Semantic cache layer for LLM apps. Exact-match in microseconds. Near-match in milliseconds. PII scrubbed before anything touches your database.

Get started GitHub
73.5%
Hit rate @ 0.82
1ms
Semantic lookup
1200×
Faster than LLM
0 PII
Stored in plain text
Architecture

Three tiers.
One call.

Every query cascades through three layers before ever hitting your LLM provider. Most never make it past the second.

📥
Query
Raw input
→
⚡
Redis
Exact hash ~0.5ms
→
🧮
LocalIndex
Numpy cosine ~1ms
→
🤖
LLM
Miss only
→
🔒
PII Scrub
Then store
✓ HIT
~1ms
Redis or LocalIndex match. No LLM billed. No wait.
↻ MISS
Full latency
LLM called once. Response PII-scrubbed and stored. All future hits are free.
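
For a feel of what each tier does, here is a minimal sketch of the cascade in plain Python. It assumes a local Redis instance, an in-memory float32 matrix of cached embeddings, and an embed() helper; the names and details are illustrative, not ThriftLM's internals.

python
import hashlib
import numpy as np
import redis

r = redis.Redis()                               # tier 1: exact-match store
index = np.zeros((0, 384), dtype=np.float32)    # tier 2: cached embeddings (384-d MiniLM)
responses: list[str] = []                       # responses aligned with the index rows

def embed(text: str) -> np.ndarray:
    """Placeholder for the embedding model (e.g. sentence-transformers)."""
    raise NotImplementedError

def lookup(query: str, threshold: float = 0.82) -> str | None:
    # Tier 1: exact hash lookup in Redis (sub-millisecond)
    key = "exact:" + hashlib.sha256(query.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()

    # Tier 2: cosine similarity against the local matrix via one matmul (~1 ms)
    if len(responses):
        q = embed(query).astype(np.float32)
        q /= np.linalg.norm(q)
        sims = (index @ q) / np.linalg.norm(index, axis=1)
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return responses[best]

    # Tier 3: a miss; the caller invokes the LLM once, then scrubs and stores the response
    return None

On a miss, the MISS path above stores the scrubbed response so that future lookups land in one of the first two tiers.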
Quickstart

Up in
sixty seconds.

Your existing LLM function stays exactly as-is. One wrapper replaces every direct call.

bash
pip install thriftlm
# For the local dashboard + API server:
pip install "thriftlm[api]"
python
from thriftlm import SemanticCache
import openai

# Initialize once per process; bulk-loads embeddings into the local numpy index
cache = SemanticCache(threshold=0.82, api_key="your-key")

def call_llm(query: str) -> str:
    res = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    )
    return res.choices[0].message.content

# Drop-in. Handles cache check + LLM fallback automatically.
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate query → instant hit, no LLM called
response2 = cache.get_or_call("What is semantic caching?", call_llm)
.env
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...
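
If you prefer to load these values in Python yourself, python-dotenv is one option. This is only a convenience sketch; the variable names simply mirror the .env above.

python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

supabase_url = os.environ["SUPABASE_URL"]
supabase_key = os.environ["SUPABASE_KEY"]
redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379")
openai_api_key = os.environ["OPENAI_API_KEY"]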
01

Supabase + pgvector

Create a project at supabase.com and run supabase/setup.sql to create the HNSW-indexed cache table.

02

Redis

Run docker compose up -d for local Redis, or point REDIS_URL at Upstash for a managed option.

03

Wrap your LLM call

Replace your direct LLM call with cache.get_or_call(). That is the entire integration.

04

View metrics

Run thriftlm serve --api-key your-key. Opens a live dashboard at localhost:8000, bundled in the package, nothing extra to deploy.

bash
# See live hit rate, tokens saved, and top cached queries
thriftlm serve --api-key your-key
# → ThriftLM dashboard → http://localhost:8000
# → Opens in browser automatically
What's included

Everything.
Nothing extra.

Open-source, self-hostable, no vendor lock-in. Your data stays on your infrastructure.

⚡
Redis Exact Cache
Embedding hash stored in Redis. Sub-millisecond exact matches with zero Supabase round-trips.
🧮
Local Numpy Index
All embeddings loaded into a float32 matrix at startup. Cosine similarity via matmul, with no network latency ever.
🔒
PII Scrubbing
Presidio + spaCy strips names, emails, and phone numbers from every LLM response before it touches the database (a minimal scrubbing sketch follows this list).
🗄️
Supabase HNSW
pgvector with HNSW index for accurate approximate nearest-neighbor search at any scale.
📊
thriftlm serve
Built-in metrics dashboard. Run thriftlm serve --api-key sc_xxx for live hit rates, tokens saved, and top queries; bundled with the pip package, nothing to deploy.
🔑
Multi-tenant
Each API key is an isolated cache namespace with its own LocalIndex. Ship it as a service.
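
The scrubbing step relies on Presidio's analyzer and anonymizer engines. Here is a minimal sketch of that flow, assuming the presidio-analyzer and presidio-anonymizer packages plus a spaCy English model are installed; this is the standard Presidio API, not necessarily ThriftLM's exact wrapper.

python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # spaCy-backed NER under the hood
anonymizer = AnonymizerEngine()

def scrub(text: str) -> str:
    """Replace detected PII spans with placeholders before the response is stored."""
    findings = analyzer.analyze(
        text=text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

# Detected spans come back as placeholders such as <PERSON> and <EMAIL_ADDRESS>
print(scrub("Contact Jane Doe at jane.doe@example.com."))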
QQP Benchmark

Real numbers.
No cherry-picking.

200 duplicate question pairs from the Quora Question Pairs dataset. question1 stored, question2 used for lookup. Threshold controls the precision/recall tradeoff; a reproduction sketch follows the table.

Threshold         Hit Rate   Hits / 200
0.70              92.5%      185
0.75              86.0%      172
0.80              78.0%      156
0.82 (default)    73.5%      147
0.85              62.5%      125
0.90              40.0%      80

all-MiniLM-L6-v2 · mean sim=0.859 · HNSW index (Supabase pgvector)
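
The numbers above are straightforward to sanity-check. A hedged reproduction sketch, assuming sentence-transformers and the Hugging Face glue/qqp split; the in-memory cosine search here stands in for the Supabase HNSW lookup, so exact figures may differ slightly.

python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 200 duplicate pairs: question1 plays the cached entry, question2 the incoming query
rows = [r for r in load_dataset("glue", "qqp", split="validation") if r["label"] == 1][:200]
stored = model.encode([r["question1"] for r in rows], normalize_embeddings=True)
probes = model.encode([r["question2"] for r in rows], normalize_embeddings=True)

sims = probes @ stored.T          # cosine similarity (embeddings are unit-normalized)
best = sims.max(axis=1)           # best cached match for each incoming query

for threshold in (0.70, 0.75, 0.80, 0.82, 0.85, 0.90):
    hits = int((best >= threshold).sum())
    print(f"{threshold:.2f}  hit rate {hits / len(rows):5.1%}  ({hits}/{len(rows)})")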

Roadmap

Shipped.
What's next.

V1 is the cache primitive. V2 is where it gets interesting.

shipped

V1: Semantic Cache Core

Redis → LocalIndex → HNSW three-tier pipeline. Presidio PII scrubbing. Multi-tenant FastAPI. thriftlm serve bundled dashboard CLI. pip install thriftlm.

next

V2: Context caching

Coming soon.

Get started

Your LLM calls are
expensive and slow.

Three lines of Python. Open-source. Your data never leaves your own infrastructure.