OnlyAllowAI sits inline between your AI agents and your LLM providers. Each request must pass a structured competency test, called a Riddle, before bytes ever reach the model. This document explains the architecture, latency model, threat coverage, and audit surface at the level of detail your security and platform teams need.
A drop-in firewall for AI traffic. Replace your existing LLM endpoint URL with an OnlyAllowAI URL — nothing else changes in your application code. Behind the scenes, every request is tested against a competency contract you defined; passing requests stream through with negligible latency, failing ones are blocked with structured per-field feedback.
1. You stop bad AI calls before they cost you money. Every upstream request to OpenAI / Anthropic / Groq / Google is billed by token. OnlyAllowAI denies the request before the upstream call is opened — you don't pay for blocked traffic.
2. You meet compliance without re-architecting. Every decision is persisted to Postgres with a stable event ID, the API key used, the riddle attempted, and the model targeted. SOC 2, ISO 27001, and internal audit teams can query the firewall_events table directly.
3. You keep one kill-switch for every AI agent. A single
PATCH /v1/keys/<id> { "disabled": true } stops
all traffic from that agent — instantly, across every Cloud
Run replica, with no deploy.
The hot path runs the gate decision and the upstream stream on a single coroutine — no queues, no thread hops. The dashboard reads a separate event bus and can never stall a customer request.
The gate decision runs on the same async coroutine that opens the upstream connection. No IPC, no queue, no second hop.
Once the gate passes, the response is streamed byte-for-byte via httpx.aiter_raw() + FastAPI StreamingResponse.
The Looking Glass dashboard reads a separate EventBus. Close every browser tab and the proxy doesn't notice.
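One way to picture the isolation between the proxy and the dashboard is a fan-out bus whose publish call never awaits. The sketch below is illustrative only: the `EventBus` name comes from the text, while the `subscribe` and `publish` methods, the bounded queue size, and the drop-on-full policy are assumptions, not the product's actual implementation.

```python
import asyncio
import uuid


class EventBus:
    """Fan-out bus: publish() never awaits, so a slow subscriber cannot stall the caller."""

    def __init__(self, max_queue: int = 100) -> None:
        self._subscribers = []        # one bounded queue per dashboard connection
        self._max_queue = max_queue
        self.dropped = 0              # events discarded because a queue was full

    def subscribe(self) -> asyncio.Queue:
        queue = asyncio.Queue(maxsize=self._max_queue)
        self._subscribers.append(queue)
        return queue

    def publish(self, event_type: str, payload: dict) -> str:
        event_id = str(uuid.uuid4())  # one stable ID per real request
        event = {"event_id": event_id, "event_type": event_type, **payload}
        for queue in self._subscribers:
            try:
                queue.put_nowait(event)       # non-blocking by construction
            except asyncio.QueueFull:
                self.dropped += 1             # assumed policy: drop for the slow reader
        return event_id


def demo():
    bus = EventBus(max_queue=2)
    slow_tab = bus.subscribe()                # a dashboard tab that never reads
    for _ in range(5):
        bus.publish("firewall.allow", {"outcome": "speed_pass"})
    return slow_tab.qsize(), bus.dropped
```

In this sketch a stalled subscriber only loses its own events; the publisher returns immediately either way, which is the property the hot path relies on.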
Three of these allow the request through with zero buffering. The fourth blocks the request before the upstream LLM is contacted — so denied traffic costs you nothing in upstream tokens.
SPEED_PASS: The agent already cleared this domain; cert in cache.
NO_RIDDLE: The asset has no riddle defined, so the request is allowed by default.
PASSED: 100% of expected fields correct; cert issued (TTL 1h).
DENIED: Any field wrong → 403 with per-field feedback. Upstream never called.
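The four outcomes can be summarised as a small decision function. This is a minimal sketch: the outcome names (SPEED_PASS, NO_RIDDLE, PASSED, DENIED) appear in this document, but the `decide` function, its parameters, and the order of checks as written here are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SPEED_PASS = "speed_pass"
    NO_RIDDLE = "no_riddle"
    PASSED = "passed"
    DENIED = "denied"


def decide(cert_cached: bool, riddle_defined: bool, score: Optional[float]) -> tuple[Outcome, bool]:
    """Return (outcome, forward_upstream). Only DENIED keeps the upstream closed."""
    if cert_cached:
        return Outcome.SPEED_PASS, True      # cached certificate: no riddle, no grading
    if not riddle_defined:
        return Outcome.NO_RIDDLE, True       # allow by default when no riddle is attached
    if score == 1.0:
        return Outcome.PASSED, True          # every expected field correct
    return Outcome.DENIED, False             # any wrong field: 403, upstream never called
```

For example, `decide(False, True, 0.66)` yields `(Outcome.DENIED, False)`, so no upstream connection would be opened and no upstream tokens billed.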
Every numbered step below runs on the same coroutine. Between step ⑤ (decision) and step ⑥ (forward) the proxy holds no lock, no database connection, and no Vault handle.
Auth dependency snapshots the API-key binding tuple (org_id, department_id, asset_id, riddle_id, provider, disabled) onto request.state.api_key_binding.
If api_keys.disabled = true, immediately return 403 agent_disabled and emit proxy.blocked. No riddle is even pulled.
Sliding-window ZSET in Redis: global 60 req/min + per-agent 30 req/min. In-memory fallback if Redis is unavailable.
Key: oaai:cert:{agent_id}:{gate_domain}. Cache hit → SPEED_PASS, riddle is never selected, grading never runs.
Resolution: bound_riddle_id → asset_id → riddle_id Redis cache (TTL 60s) → in-memory RiddleStore.select_for_challenge(domain, difficulty). No SQL on the repeat path.
AutoFormatter normalises raw output: strips markdown code fences, tries JSON, falls back to line-by-line key: value extraction. Returns a clean dict.
GateHandshake.evaluate() runs the OutputValidator across every expected_output using its declared match_type (exact / contains / regex). Score = correct ÷ total. Verdict = PASS only when score == 1.0.
On PASS: a signed GateToken (JWT, TTL 5 min) is issued for scope access:<domain>, and a CapabilityCertificate is cached in Redis for 1 hour.
One stable event_id per real request → in-memory subscribers (SSE) and Postgres firewall_events table. queue.put_nowait() never blocks the proxy.
httpx.AsyncClient.stream("POST", upstream, …) + async for chunk in resp.aiter_raw(): yield chunk. nginx is configured with proxy_buffering off and Cloud Run with --timeout 900s for long generations.
Grading is pure CPU. There is no AI judging another AI — that would be slow, expensive, and non-deterministic. Instead each riddle ships with an answer key and three match strategies.
String-normalised equality. str(submitted).strip() == str(expected).strip(). Perfect for IDs, project names, version numbers.
Substring check: expected in submitted. Use for bucket names, partial paths, or "must mention X".
Pattern match: re.search(pattern, submitted). Use for IPs, semver ranges, free-form constraints.
// Riddle: GCP project config extraction
{
  "gate_domain": "cloud_infrastructure",
  "difficulty": "standard",
  "prompt": "PROJECT_ID=acme-prod\nREGION=us-central1\nBUCKET=acme-prod-data",
  "expected_outputs": [
    { "field_name": "project_id", "expected_value": "acme-prod",   "match_type": "exact" },
    { "field_name": "region",     "expected_value": "us-central1", "match_type": "exact" },
    { "field_name": "bucket",     "expected_value": "acme-prod",   "match_type": "contains" }
  ]
}
A score of less than 1.0 is a failure.
Partial credit is recorded for training feedback but the gate does not open.
The 403 response carries a feedback object with one entry per field
— so the calling agent can self-correct deterministically.
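The grading rules described above can be sketched in a few lines. The three match expressions are taken from the text; the function names `field_matches` and `grade`, the shape of the feedback entries, and treating an empty answer key as a failure are assumptions made for this illustration.

```python
import re
from typing import Any


def field_matches(match_type: str, expected: str, submitted: Any) -> bool:
    """Apply one of the three declared match strategies to a single field."""
    submitted_text = str(submitted).strip()
    if match_type == "exact":
        return submitted_text == str(expected).strip()
    if match_type == "contains":
        return expected in submitted_text
    if match_type == "regex":
        return re.search(expected, submitted_text) is not None
    raise ValueError(f"unknown match_type: {match_type}")


def grade(expected_outputs: list, submitted: dict) -> dict:
    """Score = correct / total; the verdict is PASS only when score == 1.0."""
    feedback = []
    for spec in expected_outputs:
        name = spec["field_name"]
        ok = name in submitted and field_matches(
            spec["match_type"], spec["expected_value"], submitted[name]
        )
        feedback.append({"field_name": name, "correct": ok})
    correct = sum(item["correct"] for item in feedback)
    score = correct / len(feedback) if feedback else 0.0  # assumed: no key means no pass
    verdict = "PASS" if score == 1.0 else "FAIL"
    return {"score": score, "verdict": verdict, "feedback": feedback}
```

Because every step is a string comparison or a regular-expression search, the same submission always produces the same verdict, which is what lets a calling agent correct itself from the per-field feedback.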
The first time an agent passes a riddle for a domain, a
Capability Certificate is issued and cached. Every subsequent
request for the same (agent_id, domain) pair is allowed
through with a single Redis GET — no riddle pulled,
no grading run.
stateDiagram-v2
[*] --> Pending : Agent first request
Pending --> Challenged : Riddle selected
Challenged --> Active : Score == 1.0 / cert issued
Challenged --> Pending : Score < 1.0 / 403 + feedback
Active --> SPEED_PASS : Subsequent request hits cache
Active --> Expired : TTL elapsed (default 1h)
Active --> Revoked : Admin revokes / riddle edited
Expired --> Challenged : Re-challenge on next call
Revoked --> Challenged : Re-challenge on next call
SPEED_PASS --> Active : Cache still warm
Production uses RedisTokenManager — certs are shared across every Cloud Run replica. No cold-cert penalty on autoscale.
Default 1-hour TTL keeps the privilege window small. Adjust per-org or per-domain if your risk model requires shorter intervals.
Edit a riddle → every cert that solved it is revoked. Disable an API key → the next request is denied before grading.
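The certificate lifecycle (issue, speed-pass lookup, expiry, revocation on riddle edit) can be illustrated with an in-memory stand-in for the Redis-backed cache. Only the key format `oaai:cert:{agent_id}:{gate_domain}` and the 1-hour default come from this document; the `CertCache` class, its method names, and the injectable clock are hypothetical.

```python
import time
from typing import Callable


class CertCache:
    """In-memory sketch of a capability-certificate cache with TTL and revocation."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._certs = {}             # key -> (expires_at, riddle_id)

    @staticmethod
    def key(agent_id: str, gate_domain: str) -> str:
        return f"oaai:cert:{agent_id}:{gate_domain}"

    def issue(self, agent_id: str, gate_domain: str, riddle_id: str) -> None:
        expires_at = self._clock() + self._ttl
        self._certs[self.key(agent_id, gate_domain)] = (expires_at, riddle_id)

    def is_active(self, agent_id: str, gate_domain: str) -> bool:
        k = self.key(agent_id, gate_domain)
        entry = self._certs.get(k)
        if entry is None:
            return False
        if self._clock() >= entry[0]:
            del self._certs[k]       # expired: the next call is re-challenged
            return False
        return True

    def revoke_for_riddle(self, riddle_id: str) -> int:
        """Drop every cert earned by solving the given riddle; return how many."""
        stale = [k for k, (_, rid) in self._certs.items() if rid == riddle_id]
        for k in stale:
            del self._certs[k]
        return len(stale)
```

In a Redis-backed version the expiry would more naturally be delegated to the key's own TTL, so a lookup stays a single GET.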
The Riddle Firewall is a contract enforcer. It does not try to out-think a malicious prompt — it requires the AI to prove it can extract the right values from a known input.
project_id correctly suddenly hallucinates one.
oaai-sk-… key being used by an unauthorised process.
PATCH /v1/keys/<id> { "disabled": true }: instant, global, no deploy. proxy.blocked event for forensics.
403 provider_mismatch.
firewall_events with the riddle, score, feedback, and elapsed time.
< 1 ms on the speed-pass path. 5–50 ms on the cold-grade path. Streaming response start time is dominated by the upstream LLM, not by us.
Async coroutines on every request. Cloud Run autoscales horizontally; Redis-shared cert cache means a SPEED_PASS earned on one replica is honoured by every other.
Sliding-window ZSET in Redis. Global 60 req/min + per-agent 30 req/min by default; both tunable per-org.
httpx.aiter_raw() + StreamingResponse. nginx proxy_buffering off, Cloud Run --timeout 900s. Long generations stream uninterrupted.
Auth binding snapshotted by the auth dependency. The only residual SQL is BYOK provider-key lookup — one short-lived session, closed before the upstream stream starts.
Anthropic Messages SSE is rewritten on-the-wire as OpenAI chat.completion.chunk SSE, so existing OpenAI/LiteLLM SDKs work unchanged through the firewall.
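The sliding-window rate limit mentioned above can be illustrated with a small in-memory version, similar in spirit to the fallback used when Redis is unavailable. This is a sketch under assumptions: the class name, the `allow` method, and the use of a deque per key are illustrative, and the Redis ZSET variant is not shown.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any rolling window."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = {}              # key -> deque of request timestamps

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits.setdefault(key, deque())
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False             # caller would answer with rate_limited
        hits.append(now)
        return True
```

To mirror the defaults described here, a caller could keep two instances, one with a limit of 60 keyed by organisation and one with a limit of 30 keyed by agent, and admit a request only when both return True.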
Every SPEED_PASS, PASSED, DENIED,
and proxy.blocked event is persisted to a Postgres table.
Your SOC 2 / ISO 27001 / internal audit team gets the same view as
your operators.
CREATE TABLE firewall_events (
  event_id    UUID PRIMARY KEY,
  org_id      UUID NOT NULL REFERENCES organizations,
  event_type  VARCHAR,      -- riddle.passed / proxy.blocked / firewall.allow ...
  agent_id    VARCHAR,
  api_key_id  UUID,
  gate_domain VARCHAR,
  riddle_id   UUID,
  outcome     VARCHAR,      -- passed / failed / speed_pass / no_riddle / blocked
  model       VARCHAR,      -- gpt-4o, claude-3-5-sonnet, ...
  provider    VARCHAR,      -- openai / anthropic / groq / google / xai
  elapsed_ms  INTEGER,
  payload     JSONB,        -- full enriched event (agent, dept, asset, feedback)
  created_at  TIMESTAMPTZ DEFAULT NOW()
);
Every human action (create / edit / delete riddle, toggle API key, change settings) is logged to user_audit_log with IP address and target.
Every AI attempt is logged to attempts with submitted_outputs, score, and feedback. Per-agent and per-riddle indexes for fast queries.
BYOK provider keys are AES-encrypted at rest (core/key_crypto.py). Inspector ring buffer redacts Authorization / api_key patterns before storage.
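Redacting credentials before storage, as described for the Inspector ring buffer, might look like the sketch below. The two regular expressions and the `redact` function are assumptions for illustration; the patterns the product actually uses are not documented here.

```python
import re

_SECRET_PATTERNS = [
    # "Authorization: Bearer <token>" as it appears in header dumps
    re.compile(r"(?i)(authorization[\"']?\s*[:=]\s*[\"']?(?:bearer\s+)?)([^\s\"',}]+)"),
    # api_key=<value> or "api_key": "<value>" in bodies and query strings
    re.compile(r"(?i)(api_key[\"']?\s*[:=]\s*[\"']?)([^\s\"',}&]+)"),
]


def redact(text: str) -> str:
    """Replace credential values with a marker, keeping the field name visible."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\g<1>[REDACTED]", text)
    return text
```

Keeping the field name while masking only the value leaves the stored record useful for debugging without retaining the secret.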
Change the base URL of your OpenAI / Anthropic client to point at
api.onlyallow.ai — that's it. Your existing SDKs,
retry logic, streaming code paths, and observability all continue to
work.
from openai import OpenAI

# BEFORE: going direct to OpenAI
# client = OpenAI(api_key="sk-...")

# AFTER: same SDK, firewalled traffic
client = OpenAI(
    api_key="oaai-sk-...",                    # your OnlyAllowAI key
    base_url="https://api.onlyallow.ai/v1",   # <-- one line changed
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "…"}],
    stream=True,
    extra_body={"oaai_answer": {"project_id": "acme-prod"}},  # riddle answer
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Same API surface, three transparent guarantees:
(1) provider lock on the key forces requests to the correct upstream;
(2) Speed Pass keeps repeat-request overhead negligible;
(3) every call is recorded with a stable event_id for audit.
Endpoints, authentication, error codes, streaming, SSE event payloads, rate limits, BYOK setup, and SDK snippets across four languages. Bookmark this section.
https://api.onlyallow.ai — same-origin reverse proxy through nginx to Cloud Run. Use this for SDK base_url and all production traffic.
https://onlyallow-api-47084672302.us-central1.run.app — useful for staging tests when bypassing nginx. Skips the SSE-aware buffer-off proxy block.
Authorization: Bearer oaai-sk-… on every request. JWTs are used internally for the gate-handshake flow; you don't see them.
POST /v1/chat/completions
  stream: true. Anthropic / Groq / Google / xAI / Ollama all transparently dispatched by model name. Carries the riddle answer in extra_body.oaai_answer.
POST /v1/messages
GET /v1/events/stream
GET /v1/events/stats
GET /v1/keys/oaai/
POST /v1/keys/oaai/
  { name, department_id?, asset_id?, riddle_id?, provider? }.
PATCH /v1/keys/<id>
  disabled: true. Disabling takes effect on the next request, globally, no deploy.
POST /v1/riddles
  version and auto-revokes prior certs.
GET /health · /auth/health
  {"status":"ok"}. Safe for Kubernetes / Cloud Run liveness checks.

error field
invalid_api_key
  Authorization header. Check the prefix is oaai-sk-.
agent_disabled
  PATCH /v1/keys/<id> { "disabled": false }.
riddle_failed
  feedback object in the response body shows the mismatch per expected_output.
provider_mismatch
  provider_bound. Either remove the lock or use a matching model.
require_riddle
rate_limited
  retry_after_ms. Default 60 req/min/org & 30 req/min/agent.
upstream_error
  upstream.
quota_exhausted
  /v1/billing/topup.

event: firewall.allow
data: {
"event_id": "emo-1063",
"org_id": "a1b2c3…",
"user_id": "u-456",
"user_email": "ops@acme.com",
"agent_id": "agent-llama-70b",
"api_key_id": "k-789",
"api_key_name": "prod-llama-key",
"api_key_prefix": "oaai-sk-aBc1",
"domain": "analytics",
"provider": "ollama",
"provider_bound": "ollama",
"model": "llama-3.1-70b",
"department": "dept-001",
"department_name": "Analytics",
"asset": "asset-022",
"asset_name": "looker-dashboards",
"riddle_id": "r-555",
"module_type": "human", // human-bound vs ai-assigned
"outcome": "speed_pass", // passed / failed / speed_pass / no_riddle / blocked
"elapsed_ms": 1,
"created_at": "2026-05-17T12:39:48Z"
}
Event types emitted: riddle.passed, riddle.failed,
proxy.forward, proxy.blocked,
firewall.allow, firewall.deny. Every event
lands both on the SSE bus and in the firewall_events table.
curl -N https://api.onlyallow.ai/v1/chat/completions \
  -H "Authorization: Bearer oaai-sk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "messages": [{"role":"user","content":"Summarise the project."}],
    "oaai_answer": {"project_id":"acme-prod","region":"us-central1"}
  }'
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OAAI_KEY,
  baseURL: "https://api.onlyallow.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "claude-3-5-sonnet",
  stream: true,
  messages: [{ role: "user", content: "…" }],
  oaai_answer: { project_id: "acme-prod" }, // riddle answer
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0].delta.content ?? "");
}
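import asyncio, json, os
import httpx

OAAI_KEY = os.environ["OAAI_KEY"]  # your OnlyAllowAI key

def alert_siem(evt: dict) -> None:
    print("BLOCKED", evt.get("event_id"))  # placeholder: replace with your SIEM hook

async def watch_events() -> None:
    # timeout=None keeps the long-lived SSE connection open
    async with httpx.AsyncClient(timeout=None) as c:
        async with c.stream(
            "GET", "https://api.onlyallow.ai/v1/events/stream",
            headers={"Authorization": f"Bearer {OAAI_KEY}"},
        ) as r:
            async for line in r.aiter_lines():
                if line.startswith("data: "):
                    evt = json.loads(line[6:])
                    if evt.get("outcome") == "blocked":
                        alert_siem(evt)

asyncio.run(watch_events())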
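package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func killKey(adminToken, keyID string) error {
	body := []byte(`{"disabled": true}`)
	req, err := http.NewRequest(
		"PATCH", "https://api.onlyallow.ai/v1/keys/"+keyID,
		bytes.NewReader(body),
	)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+adminToken)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close() // always release the connection
	if resp.StatusCode >= 300 {
		return fmt.Errorf("disable key %s: unexpected status %s", keyID, resp.Status)
	}
	return nil // effective on the very next request to that key, globally
}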
provider_keys table, AES-encrypted at rest via core/key_crypto.py. Decryption is just-in-time for the upstream request and never logged.
openai · anthropic · groq · google · xai · ollama. Detected by model name; lock per-key with provider.
POST /v1/keys/byok with { provider, api_key, label? }. The key is encrypted before the SQL INSERT — plain bytes never touch the DB.
# 1. Clone
git clone https://github.com/onlyallowai/onlyallowai.git
cd onlyallowai

# 2. Boot Postgres + Redis + API
docker compose up -d

# 3. Apply migrations
docker compose exec api alembic upgrade head

# 4. Health check
curl http://localhost:8000/health
# → {"status":"ok"}

# 5. Issue your first OAAI key (admin token from .env)
curl -X POST http://localhost:8000/v1/keys/oaai/ \
  -H "Authorization: Bearer $ADMIN" \
  -H "Content-Type: application/json" \
  -d '{"name":"prod-key","provider":"openai"}'
Reference deployment: Cloud Run + Cloud SQL +
Memorystore Redis on GCP, fronted by nginx with
proxy_buffering off for SSE. Terraform IaC ships in
infra/terraform/ (7 modules) — see the
deployment guide
for the production wiring.
organizations.rate_limit_per_min.
OAAI_CERT_TTL_SECONDS.
OAAI_GATE_TOKEN_TTL.
--timeout + nginx proxy_read_timeout.
firewall_events is the durable record.
client_max_body_size. Streaming responses are unbounded.
python -m pytest tests/ -q --ignore=tests/test_v2 -m "not db": same gate deploy.ps1 runs.
pytest tests/ -v with Cloud SQL reachable (or Docker Compose Postgres). 209 tests, ~6s.
pytest --cov=api --cov=gate_layer --cov=riddle_matrix --cov-report=html. Open htmlcov/index.html.
/v1. Breaking changes go to /v2; the two versions run side-by-side for at least one quarter.
/v1/events/stream; clients can assume keys remain stable.
oaai-sk- is stable. Do not parse the suffix; treat the whole string as opaque.
On the speed-pass path: under 1 ms before the first upstream byte is
requested. On the cold-grade path: 5–50 ms (pure CPU). Once the gate
passes, response chunks stream byte-for-byte via
httpx.aiter_raw() with no buffering — we cannot
slow down the upstream stream because we don't decode it.
/health, /auth/health) report no false positives — they don't touch DB / Redis.
The request returns NO_RIDDLE and is forwarded
unmodified — with the asset showing as EXPOSED in the
dashboard. The default is allow for backwards
compatibility; you opt-in to enforcement by attaching a riddle.
If your policy requires deny-by-default, flip the org-level
require_riddle flag.
Riddles live in Postgres (riddles table) and are
mirrored into an in-memory RiddleStore at boot and on
every CRUD operation. Edits are picked up by the firewall on the
next request — no redeploy required. The version
field is auto-bumped on update, which auto-revokes every
certificate that was earned by solving the previous version.
In-flight streams complete normally — we never interrupt the
byte channel. The next request from that agent for that
domain pays the full grading cost. This preserves the
no-buffering guarantee while still giving operators a kill switch.
The fastest control available is
PATCH /v1/keys/<id> { "disabled": true },
but understand that it blocks new requests only; bytes already
mid-flight are not interrupted.
Each org can register provider keys for OpenAI / Anthropic /
Groq / Google / xAI / Ollama. Keys are AES-encrypted at rest using
core/key_crypto.py. The auth-bound OAAI key can be
locked to one provider — mismatched model requests are denied
with 403 provider_mismatch.
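A provider lock check could be sketched as below. The `403 provider_mismatch` response and detection by model name come from this document; the prefix table, the function names, and the returned error shape are hypothetical, and real detection rules are likely richer (several providers can serve the same model family).

```python
from typing import Optional

# Hypothetical prefix table, for illustration only.
MODEL_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
    "grok-": "xai",
}


def detect_provider(model: str) -> Optional[str]:
    """Guess the upstream provider from the model name, or None if unknown."""
    lowered = model.lower()
    for prefix, provider in MODEL_PREFIXES.items():
        if lowered.startswith(prefix):
            return provider
    return None


def check_provider_lock(model: str, provider_bound: Optional[str]) -> Optional[dict]:
    """Return None when the request may proceed, else an error description."""
    if provider_bound is None:
        return None                  # key is not locked to a provider
    if detect_provider(model) != provider_bound:
        return {"status": 403, "error": "provider_mismatch"}
    return None
```

With this sketch, a key locked to `openai` that requests `claude-3-5-sonnet` is refused before any upstream call, while an unlocked key passes through unchanged.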
No. The dashboard reads a separate event bus and can never stall a request. Close every browser tab and the gate keeps working, riddles still enforce, Speed Pass still fires, audit rows still land in Postgres.
Yes. Terraform IaC ships in infra/terraform/ (7
modules). Cloud Run + Cloud SQL + Memorystore Redis on GCP is the
reference deployment. Docker Compose ships for local dev. All
state lives in Postgres + Redis — no proprietary backing
store.
Read the in-depth Markdown document RiddleUserguide.md — it covers riddle anatomy, the four outcomes, evaluation pipeline, grading model, speed-pass mechanics, admin controls, and a full worked example with request/response samples.
Spin up an org in 60 seconds, attach your first riddle, point your
OpenAI SDK at api.onlyallow.ai — and watch every
request light up the Looking Glass.