Self-Hosted Embeddings — Pi + Cloudflare Tunnel + Ollama
How the chat's RAG embeddings are generated on a home-hosted Raspberry Pi and reached securely from Vercel.
Self-Hosted Embeddings
The AI chat on shreyashg.com uses retrieval-augmented generation (RAG). The embeddings that power retrieval — both the pre-computed corpus vectors and the live query vectors — are generated on a self-hosted Raspberry Pi in the home network, reached by Vercel through a Cloudflare Tunnel. This document explains the architecture, where everything lives, and how to operate it.
Overview
- Chat model (Groq): `groq/compound`, called from the Vercel function in `src/app/api/chat/route.ts`. Not related to this document.
- Embedding model (self-hosted): `bge-m3` running on Ollama on a Raspberry Pi 5 (8GB, Debian Trixie). 1024-dim vectors, 8192-token context, open-source, top-tier retrieval quality.
- Access path: Vercel function → `https://embed.blockscopes.com/api/embed` → Cloudflare Tunnel → Pi auth proxy → Ollama.
Architecture
┌────────────────────────┐
│ Vercel (shreyashg.com) │
│ /api/chat route        │
└──────────┬─────────────┘
           │ HTTPS + Bearer token
           ▼
┌────────────────────────┐
│ Cloudflare Edge        │
│ embed.blockscopes.com  │
└──────────┬─────────────┘
           │ Cloudflare Tunnel
           │ (outbound from Pi)
           ▼
┌────────────────────────────────┐
│ Raspberry Pi 5 (pi-5-1)        │
│                                │
│ cloudflared ──┐                │
│               ▼                │
│ Python auth proxy :11435       │
│ (bearer check, rate limit)     │
│               │                │
│               ▼                │
│ Ollama :11434                  │
│ model: bge-m3                  │
└────────────────────────────────┘
- Nothing inbound on the home router. The tunnel is an outbound connection from the Pi to Cloudflare.
- Only the auth proxy's `/api/embed` endpoint (plus its `GET /healthz` liveness check) is reachable from the internet. Ollama itself is bound to `127.0.0.1`.
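The access path above comes down to a single authenticated POST. The following is a minimal Python sketch of that request, for illustration only: the real caller is the TypeScript route on Vercel, and the names `build_embed_request` and `EMBED_URL` are assumptions, not identifiers from the codebase. The JSON body mirrors the payload used by the smoke test in the Troubleshooting section.

```python
import json
import urllib.request

# Public URL of the embed endpoint, as given in the overview.
EMBED_URL = "https://embed.blockscopes.com/api/embed"


def build_embed_request(text: str, secret: str, model: str = "bge-m3") -> urllib.request.Request:
    """Build (without sending) the POST that the auth proxy expects."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        EMBED_URL,
        data=body,
        method="POST",
        headers={
            # The proxy rejects requests without a matching bearer token.
            "Authorization": f"Bearer {secret}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would then be `urllib.request.urlopen(build_embed_request("hello", secret), timeout=30)`; the timeout value is arbitrary.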
Components on the Pi
| Component | Purpose | Bind address | Systemd unit |
|---|---|---|---|
| `ollama` | Serves the bge-m3 embedding model | `127.0.0.1:11434` | `ollama.service` |
| `embed-proxy` | Bearer auth, path allowlist, rate limiting | `127.0.0.1:11435` | `embed-proxy.service` |
| `cloudflared` | Outbound tunnel to Cloudflare | n/a | `cloudflared.service` |
All three are systemd units and auto-start on boot.
File map on the Pi
| Path | What it is |
|---|---|
| `/usr/local/bin/ollama` | Ollama binary (downloaded from GitHub releases) |
| `/usr/share/ollama/` | Ollama's home dir (models cached here) |
| `/opt/embed-proxy/proxy.py` | The Python auth+rate-limit proxy |
| `/etc/embed-proxy.env` | Bearer secret (mode 600, root-readable only) |
| `/etc/systemd/system/ollama.service` | Ollama service definition |
| `/etc/systemd/system/embed-proxy.service` | Proxy service definition (DynamicUser, PrivateTmp, ProtectSystem=strict) |
| `/etc/cloudflared/config.yml` | Tunnel ingress rules |
| `/etc/cloudflared/<tunnel-uuid>.json` | Tunnel credentials |
Secrets inventory
The bearer secret (EMBED_SECRET) is the only secret in this setup. It lives in three places and must match in all three:
| Location | How to read / edit |
|---|---|
| Pi: `/etc/embed-proxy.env` | `sudo vi /etc/embed-proxy.env`, then `sudo systemctl restart embed-proxy` |
| Local dev: `.env` at repo root | Edit the `EMBED_SECRET=` line. `.env` is gitignored; never commit it. |
| Vercel: Project Settings → Environment Variables | `npx vercel env ls` / `add` / `rm` |
Do not store the secret in documentation/, in git history, or in chat logs.
Rotating the secret
- Generate a new value: `openssl rand -hex 32`
- Update `/etc/embed-proxy.env` on the Pi: `sudo vi /etc/embed-proxy.env`, then `sudo systemctl restart embed-proxy`.
- Update `.env` at the repo root.
- Update Vercel for each scope (shown for `production`; repeat for `development` and `preview`): `npx vercel env rm EMBED_SECRET production`, then `npx vercel env add EMBED_SECRET production --value '<new-secret>' --yes`.
- Redeploy so running functions pick up the new value: `npx vercel --prod`.
Expected downtime during rotation: ~10 seconds while the proxy restarts. Chat requests in flight during that window will fail once and can be retried.
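Since the secret has to match in three places, it can help to compare a short fingerprint of each copy rather than the value itself. The sketch below is a hypothetical helper, not part of the repo or the proxy: `new_embed_secret` produces a value of the same shape as `openssl rand -hex 32`, and `fingerprint` returns a SHA-256 prefix that can be compared across the three copies without displaying the secret.

```python
import hashlib
import secrets


def new_embed_secret() -> str:
    """Return 32 random bytes as 64 hex characters (same shape as `openssl rand -hex 32`)."""
    return secrets.token_hex(32)


def fingerprint(secret: str) -> str:
    """Short SHA-256 prefix for comparing copies without displaying the secret."""
    # strip() ignores the trailing newline that env files and copy-paste often add.
    digest = hashlib.sha256(secret.strip().encode("utf-8")).hexdigest()
    return digest[:12]
```

If the fingerprints computed on the Pi, locally, and from the Vercel value differ, one of the copies is stale.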
Service management
# Check status of all three
ssh shreyash@pi-5-1 'systemctl is-active ollama embed-proxy cloudflared'
# Restart a service
ssh shreyash@pi-5-1 'sudo systemctl restart embed-proxy'
# Tail logs
ssh shreyash@pi-5-1 'sudo journalctl -u embed-proxy -f'
ssh shreyash@pi-5-1 'sudo journalctl -u ollama -f'
ssh shreyash@pi-5-1 'sudo journalctl -u cloudflared -f'
Rate-limit configuration
Rate limiting happens in the proxy (/opt/embed-proxy/proxy.py) via a global token bucket.
| Constant | Current value | Meaning |
|---|---|---|
| `BUCKET_CAPACITY` | 30 | Max burst of authenticated requests |
| `REFILL_PER_SEC` | 0.1 | Sustained rate: 1 token per 10s = 6/min |
Properties:
- Only authenticated (`Bearer <secret>`) requests count against the bucket. 401s are free, so brute-force attempts don't deplete your quota.
- When exhausted, the proxy returns `429 rate_limited` with a `Retry-After: <seconds>` header.
- Events are logged: `sudo journalctl -u embed-proxy | grep rate_limited`.
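For readers unfamiliar with the technique, here is a minimal token-bucket sketch using the two constants from the table. It is not the code in `proxy.py` (which lives only on the Pi); the class name, the injectable `clock`, and the single-threaded design are assumptions made for illustration.

```python
import math
import time

BUCKET_CAPACITY = 30  # max burst, from the table above
REFILL_PER_SEC = 0.1  # one token every 10 seconds


class TokenBucket:
    def __init__(self, capacity=BUCKET_CAPACITY, refill_per_sec=REFILL_PER_SEC, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated = clock()

    def try_acquire(self):
        """Return (allowed, retry_after_seconds) for one request."""
        now = self.clock()
        # Add the tokens earned since the last call, never exceeding capacity.
        earned = (now - self.updated) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0
        # Seconds until one whole token is available, rounded up for Retry-After.
        wait = (1.0 - self.tokens) / self.refill_per_sec
        return False, math.ceil(wait)
```

In this sketch the caller would invoke `try_acquire()` only after the bearer check succeeds, which is what keeps 401s from consuming tokens. A threaded server would additionally need a lock around `try_acquire`.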
To tune: edit the constants in proxy.py, then sudo systemctl restart embed-proxy. No code outside the Pi needs to know.
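One reason callers do not need to know the constants is that a 429 response carries its own `Retry-After` hint. The sketch below shows one way a client could honour it. It is an assumption about client behaviour, not a description of what the Vercel route does; `send_with_retry`, the `(status, headers, body)` tuple shape, and the default wait are all hypothetical.

```python
import time


def send_with_retry(send, max_attempts=3, sleep=time.sleep, default_wait=10.0):
    """Call send() until it returns a non-429 status or attempts run out.

    send() must return (status, headers, body), with headers as a dict
    keyed by the exact name "Retry-After".
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            # Header present but not a number of seconds: fall back.
            wait = default_wait
        sleep(wait)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.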
Sizing guidance:
- A single chat message = 1 embed request.
- A full Vercel deploy = as many embed requests as the index has chunks (currently ~80–120, batched at 8 per call so ~10–15 requests).
- The default config handles both comfortably: even a 15-request rebuild uses only half of the 30-token burst.
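The sizing numbers can be checked with a few lines of arithmetic. This sketch uses the batch size and bucket constants quoted in this document; the function names are made up for the example, and it assumes the bucket is full when the rebuild starts.

```python
import math

BATCH_SIZE = 8          # chunks per embed call during an index rebuild
BUCKET_CAPACITY = 30    # max burst of authenticated requests
SECONDS_PER_TOKEN = 10  # 1 / REFILL_PER_SEC, with REFILL_PER_SEC = 0.1


def rebuild_requests(chunks: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of embed calls needed to index `chunks` chunks."""
    return math.ceil(chunks / batch_size)


def refill_seconds(tokens_spent: int) -> int:
    """Time for the bucket to earn back `tokens_spent` tokens."""
    return tokens_spent * SECONDS_PER_TOKEN


for chunks in (80, 120):
    used = rebuild_requests(chunks)
    print(chunks, used, BUCKET_CAPACITY - used, refill_seconds(used))
```

For the 80 to 120 chunk range this gives 10 to 15 requests, leaving 15 to 20 tokens of burst, with the bucket full again 100 to 150 seconds later.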
Swapping the embedding model
The embedding model is named in one place: PROMPT_CONFIG.embeddings.model in src/lib/prompts.ts.
To switch to a different model (e.g., nomic-embed-text, mxbai-embed-large):
- Pull the model on the Pi: `ssh shreyash@pi-5-1 'sudo -u ollama /usr/local/bin/ollama pull <model-name>'`.
- Update `src/lib/prompts.ts` with the new model name.
- Important: vectors from different models are not comparable. You must rebuild the whole index.
- Locally: `npm run build:index`.
- Do not commit the rebuilt `src/data/vector-index.json`: the index file is gitignored and is regenerated by Vercel's `prebuild`.
- Deploy: `npx vercel --prod`. Vercel's prebuild will hit the Pi and regenerate the index fresh.
If the new model has a different vector dimensionality (e.g., 768 vs. 1024), nothing else needs to change — the retrieval code is dimension-agnostic.
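To illustrate what "dimension-agnostic" can mean in practice, here is a small similarity function that works for any vector length but refuses to compare vectors of different lengths. It assumes cosine similarity as the scoring function, which this document does not state; the actual retrieval code is TypeScript in the repo and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity for two equal-length vectors of any dimensionality."""
    if len(a) != len(b):
        # Typical cause: a query embedded with a new model, index built with the old one.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)
```

A length check like this only catches mismatched dimensions. Two different models with the same dimensionality still produce incomparable vectors, which is why the full index rebuild is required either way.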
Rebuilding the vector index
The index lives at src/data/vector-index.json and is gitignored.
| When it's rebuilt | What runs |
|---|---|
| `npm run build` (local or Vercel) | `prebuild` hook → `npm run build:index` → hits Pi |
| Manually: `npm run build:index` | Same, without running `next build` after |
A rebuild takes ~60 seconds on a warm Pi (bge-m3 already in RAM). First build after a reboot adds ~3s to load the model.
Troubleshooting
| Symptom | Likely cause | Where to look |
|---|---|---|
| Chat returns `503 server_configuration` | Env vars missing on Vercel | `npx vercel env ls`; confirm `EMBED_URL`, `EMBED_SECRET`, `GROQ_API_KEY` |
| Chat returns `502 embed upstream` | Pi or proxy is down | `ssh pi-5-1 systemctl is-active ollama embed-proxy cloudflared` |
| Chat returns `401 unauthorized` from upstream | Secret mismatch | Compare `/etc/embed-proxy.env` on Pi to Vercel env var value |
| Chat returns `429 rate_limited` | Burst or DoS | `journalctl -u embed-proxy -f`; count `rate_limited` events |
| Pi up but `embed.blockscopes.com` unreachable | Cloudflared tunnel down | `systemctl status cloudflared` on Pi; check Cloudflare dashboard for tunnel status |
| Build fails at `npm run build:index` | Pi offline during build | Either bring Pi online, or skip the build-index step (the route will 503 on chat until the index is regenerated) |
Quick end-to-end smoke test from anywhere:
curl -s https://embed.blockscopes.com/healthz
# → "ok"
curl -s -X POST https://embed.blockscopes.com/api/embed \
-H "authorization: Bearer $EMBED_SECRET" \
-H "content-type: application/json" \
-d '{"model":"bge-m3","input":"smoke test"}' \
| python3 -c "import sys,json; print('dims:', len(json.load(sys.stdin)['embeddings'][0]))"
# → "dims: 1024"
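The inline `python3 -c` check above can also be written as a small reusable function, which is handy after a model swap when the expected dimensionality changes. This is a sketch with hypothetical names (`embedding_dims`, `check_smoke_response`); it assumes only the response shape the one-liner already relies on, a JSON object with an `embeddings` list of vectors.

```python
import json


def embedding_dims(raw: bytes) -> int:
    """Length of the first vector in a response shaped like {"embeddings": [[...]]}."""
    payload = json.loads(raw)
    return len(payload["embeddings"][0])


def check_smoke_response(raw: bytes, expected_dims: int = 1024) -> int:
    """Return the dimensionality, or raise if it is not what the index expects."""
    dims = embedding_dims(raw)
    if dims != expected_dims:
        raise ValueError(f"expected {expected_dims} dims, got {dims}")
    return dims
```

With `bge-m3` the default of 1024 applies; pass `expected_dims=768` (for example) when testing a model with a different vector size.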
Security threat model
What's actually reachable from the internet: one working endpoint, `POST https://embed.blockscopes.com/api/embed`, and only with a valid bearer token, plus the `GET /healthz` liveness check used by the smoke test. Everything else returns 401 or 404.
Layers of defense (outside in):
- No inbound ports on the home router. Cloudflare Tunnel is outbound-only.
- Cloudflare edge. TLS termination, DDoS protection, IP reputation filtering, all free.
- Tunnel ingress allowlist. Exactly one hostname routes to exactly one internal service. Everything else gets `http_status:404`.
- Python auth proxy. Bearer check (256-bit random), path allowlist (only `POST /api/embed` and `GET /healthz`), body size cap (256KB), token-bucket rate limit.
- Ollama bound to `127.0.0.1`. Unreachable from the tunnel; only the proxy can talk to it.
- Systemd hardening on the proxy: `DynamicUser`, `NoNewPrivileges`, `PrivateTmp`, `ProtectSystem=strict`, `ProtectHome`.
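The proxy layer's checks can be pictured as a single admission function. The sketch below is not the contents of `proxy.py`: the function name, the order of the checks, and the 413 status for oversized bodies are assumptions. Only the allowlisted routes, the 256KB cap, and the 401/404 behaviour come from this document. Rate limiting is left out here and would run after the bearer check.

```python
import hmac

ALLOWED_ROUTES = {("POST", "/api/embed"), ("GET", "/healthz")}
MAX_BODY_BYTES = 256 * 1024  # body size cap from the list above


def admit(method, path, auth_header, body_len, secret):
    """Return the status the proxy would answer with; 200 means 'forward or serve'."""
    if (method, path) not in ALLOWED_ROUTES:
        return 404  # other Ollama endpoints (chat, generate, pull) are never exposed
    if path == "/healthz":
        return 200  # liveness check, no auth needed
    expected = f"Bearer {secret}".encode("utf-8")
    supplied = (auth_header or "").encode("utf-8")
    # Constant-time comparison avoids leaking how much of the token matched.
    if not hmac.compare_digest(supplied, expected):
        return 401
    if body_len > MAX_BODY_BYTES:
        return 413  # assumed status; the document only states that a cap exists
    return 200
```

For example, `admit("POST", "/api/generate", "Bearer " + secret, 10, secret)` returns 404 even with a valid token, which is the property the leak scenario below relies on.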
Worst case if the bearer secret leaks: an attacker can generate embeddings (CPU DoS, capped by the rate limiter). They cannot reach SSH, LAN, other services, the filesystem, or other Ollama endpoints (chat, generate, pull). Mitigation: rotate the secret (see above).
Setup history
This infrastructure was stood up in one session on 2026-04-19. The commits that introduced it are on main:
- `feat(chat): self-host bge-m3 embeddings, drop OpenAI`
- Rate-limit proxy upgrades applied on the Pi only (not in git; proxy source is on the Pi).
The proxy's Python source is intentionally kept on the Pi rather than checked in, because it references nothing from the repo and belongs with the server, not the site.