v0.1.6: now on PyPI

ThriftLM

Semantic cache layer for LLM apps. Exact-match in microseconds. Near-match in milliseconds. PII scrubbed before anything touches your database.

Get started GitHub
73.5%
Hit rate @ 0.82
1ms
Semantic lookup
1200×
Faster than LLM
0 PII
Stored in plain text
Architecture

Three tiers.
One call.

Every query cascades through three layers before ever hitting your LLM provider. Most never make it past the second.

📥
Query
Raw input
→
⚡
Redis
Exact hash ~0.5ms
→
🧮
LocalIndex
Numpy cosine ~1ms
→
🤖
LLM
Miss only
→
🔒
PII Scrub
Then store
✓ HIT
~1ms
Redis or LocalIndex match. No LLM billed. No wait.
↻ MISS
Full latency
LLM called once. Response PII-scrubbed and stored. All future hits are free.
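
For a feel of what each tier does, here is a minimal sketch of the cascade in plain Python. It assumes a local Redis instance, an in-memory float32 matrix of cached embeddings, and an embed() helper; the names and details are illustrative, not ThriftLM's internals.

python
import hashlib
import numpy as np
import redis

r = redis.Redis()                               # tier 1: exact-match store
index = np.zeros((0, 384), dtype=np.float32)    # tier 2: cached embeddings (384-d MiniLM)
responses: list[str] = []                       # responses aligned with the index rows

def embed(text: str) -> np.ndarray:
    """Placeholder for the embedding model (e.g. sentence-transformers)."""
    raise NotImplementedError

def lookup(query: str, threshold: float = 0.82) -> str | None:
    # Tier 1: exact hash lookup in Redis (sub-millisecond)
    key = "exact:" + hashlib.sha256(query.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()

    # Tier 2: cosine similarity against the local matrix via one matmul (~1 ms)
    if len(responses):
        q = embed(query).astype(np.float32)
        q /= np.linalg.norm(q)
        sims = (index @ q) / np.linalg.norm(index, axis=1)
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return responses[best]

    # Tier 3: a miss; the caller invokes the LLM once, then scrubs and stores the response
    return None

On a miss, the MISS path above stores the scrubbed response so that future lookups land in one of the first two tiers.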
Quickstart

Up in
sixty seconds.

Your existing LLM function stays exactly as-is. One wrapper replaces every direct call.

bash
pip install thriftlm
# For the local dashboard + API server:
pip install "thriftlm[api]"
python
from thriftlm import SemanticCache
import openai

# Initialize once per process; bulk-loads embeddings into the local numpy index
cache = SemanticCache(threshold=0.82, api_key="your-key")

def call_llm(query: str) -> str:
    res = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    )
    return res.choices[0].message.content

# Drop-in. Handles cache check + LLM fallback automatically.
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate query → instant hit, no LLM called
response2 = cache.get_or_call("What is semantic caching?", call_llm)
.env
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...
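
If you prefer to load these values in Python yourself, python-dotenv is one option. This is only a convenience sketch; the variable names simply mirror the .env above.

python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

supabase_url = os.environ["SUPABASE_URL"]
supabase_key = os.environ["SUPABASE_KEY"]
redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379")
openai_api_key = os.environ["OPENAI_API_KEY"]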
01

Supabase + pgvector

Create a project at supabase.com and run supabase/setup.sql to create the HNSW-indexed cache table.

02

Redis

Run docker compose up -d for local Redis, or point REDIS_URL at Upstash for a managed option.

03

Wrap your LLM call

Replace your direct LLM call with cache.get_or_call(). That is the entire integration.

04

View metrics

Run thriftlm serve --api-key your-key. Opens a live dashboard at localhost:8000, bundled in the package, nothing extra to deploy.

bash
# See live hit rate, tokens saved, and top cached queries
thriftlm serve --api-key your-key
# → ThriftLM dashboard → http://localhost:8000
# → Opens in browser automatically
What's included

Everything.
Nothing extra.

Open-source, self-hostable, no vendor lock-in. Your data stays on your infrastructure.

⚡
Redis Exact Cache
Embedding hash stored in Redis. Sub-millisecond exact matches with zero Supabase round-trips.
🧮
Local Numpy Index
All embeddings loaded into a float32 matrix at startup. Cosine similarity via matmul, with no network latency ever.
🔒
PII Scrubbing
Presidio + spaCy strips names, emails, and phone numbers from every LLM response before it touches the database (a minimal scrubbing sketch follows this list).
🗄️
Supabase HNSW
pgvector with HNSW index for accurate approximate nearest-neighbor search at any scale.
📊
thriftlm serve
Built-in metrics dashboard. Run thriftlm serve --api-key sc_xxx for live hit rates, tokens saved, and top queries; bundled with the pip package, nothing to deploy.
🔑
Multi-tenant
Each API key is an isolated cache namespace with its own LocalIndex. Ship it as a service.
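
The scrubbing step relies on Presidio's analyzer and anonymizer engines. Here is a minimal sketch of that flow, assuming the presidio-analyzer and presidio-anonymizer packages plus a spaCy English model are installed; this is the standard Presidio API, not necessarily ThriftLM's exact wrapper.

python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # spaCy-backed NER under the hood
anonymizer = AnonymizerEngine()

def scrub(text: str) -> str:
    """Replace detected PII spans with placeholders before the response is stored."""
    findings = analyzer.analyze(
        text=text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

# Detected spans come back as placeholders such as <PERSON> and <EMAIL_ADDRESS>
print(scrub("Contact Jane Doe at jane.doe@example.com."))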
QQP Benchmark

Real numbers.
No cherry-picking.

200 duplicate question pairs from the Quora Question Pairs dataset. question1 stored, question2 used for lookup. Threshold controls the precision/recall tradeoff; a reproduction sketch follows the table.

Threshold         Hit Rate   Hits / 200
0.70              92.5%      185
0.75              86.0%      172
0.80              78.0%      156
0.82 (default)    73.5%      147
0.85              62.5%      125
0.90              40.0%      80

all-MiniLM-L6-v2 · mean sim=0.859 · HNSW index (Supabase pgvector)
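
The numbers above are straightforward to sanity-check. A hedged reproduction sketch, assuming sentence-transformers and the Hugging Face glue/qqp split; the in-memory cosine search here stands in for the Supabase HNSW lookup, so exact figures may differ slightly.

python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 200 duplicate pairs: question1 plays the cached entry, question2 the incoming query
rows = [r for r in load_dataset("glue", "qqp", split="validation") if r["label"] == 1][:200]
stored = model.encode([r["question1"] for r in rows], normalize_embeddings=True)
probes = model.encode([r["question2"] for r in rows], normalize_embeddings=True)

sims = probes @ stored.T          # cosine similarity (embeddings are unit-normalized)
best = sims.max(axis=1)           # best cached match for each incoming query

for threshold in (0.70, 0.75, 0.80, 0.82, 0.85, 0.90):
    hits = int((best >= threshold).sum())
    print(f"{threshold:.2f}  hit rate {hits / len(rows):5.1%}  ({hits}/{len(rows)})")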

Roadmap

Shipped.
What's next.

V1 is the cache primitive. V2 is where it gets interesting.

shipped

V1: Semantic Cache Core

Redis → LocalIndex → HNSW three-tier pipeline. Presidio PII scrubbing. Multi-tenant FastAPI. thriftlm serve bundled dashboard CLI. pip install thriftlm.

next

V2: Context caching

Coming soon.

Get started

Your LLM calls are
expensive and slow.

Three lines of Python. Open-source. Your data never leaves your own infrastructure.