Designing Tenant Isolation in a High-Throughput Messaging Platform
How I designed stream-per-domain isolation to prevent tenant cross-contamination in a production AI messaging platform.
March 1, 2026
Overview
I led the design and implementation of core messaging infrastructure for a multi-tenant AI platform. The central challenge: ensuring complete isolation between tenants in a NATS JetStream-based event pipeline, while maintaining high throughput and predictable latency.
The Problem
Early in the platform's lifecycle, all tenant events flowed through a single shared NATS stream. This worked at small scale. As tenant count grew, shared backpressure became a systemic risk, with one noisy tenant able to degrade event delivery for all others.
Key constraints:
- At-least-once delivery guarantees
- Sub-second event processing for real-time AI workflows
- Zero cross-tenant data leakage (hard requirement for compliance)
- Dynamic tenant provisioning (new tenants could be onboarded at any time)
Architecture Decision: Stream-Per-Domain
The core decision was to assign each tenant their own NATS JetStream stream, namespaced by domain:
- Stream name: `TENANT_<ID>`
- Subject prefix: `tenant.<id>.events.*`
- Per-tenant retention, limits, and consumer configuration
This gave us full isolation at the NATS layer. A tenant experiencing high throughput couldn't affect another tenant's consumer lag.
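Provisioning a per-tenant stream under these conventions can be sketched as follows. The helper names and the concrete limit values are illustrative assumptions, not the production settings, and the client call in the trailing comment refers to a JetStream client such as nats-py.

```python
# Sketch of per-tenant stream provisioning under the naming conventions
# above. The config dict mirrors JetStream's stream configuration fields;
# the limit values are illustrative placeholders.

def stream_name(tenant_id: str) -> str:
    """Stream name convention: TENANT_<ID>."""
    return f"TENANT_{tenant_id.upper()}"

def subject_prefix(tenant_id: str) -> str:
    """Subject prefix convention: tenant.<id>.events.*"""
    return f"tenant.{tenant_id.lower()}.events.*"

def tenant_stream_config(tenant_id: str) -> dict:
    """Build a per-tenant JetStream stream configuration."""
    return {
        "name": stream_name(tenant_id),
        "subjects": [subject_prefix(tenant_id)],
        "retention": "limits",     # limits-based retention, tuned per tenant
        "max_bytes": 1 << 30,      # illustrative 1 GiB cap per tenant
        "max_age": 7 * 24 * 3600,  # illustrative 7-day retention, in seconds
    }

# With a JetStream client (e.g. nats-py), provisioning would then be roughly:
#   js = nc.jetstream()
#   await js.add_stream(**tenant_stream_config(tenant_id))
```

Keeping the naming and config logic in pure functions like this also makes it easy to call from the onboarding webhook handler described under Results.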
Transactional Outbox Pattern
Publishing to NATS from within a database transaction is risky: a temporary NATS outage would fail the entire transaction, while publishing after commit risks silently dropping events if the process crashes between the commit and the publish. We solved this with a transactional outbox:
- Events are written to an `outbox` table within the same database transaction as the business event
- A background outbox worker polls for `pending` messages and publishes to the appropriate tenant stream
- Successfully published messages are marked `published`; failures increment a retry counter with exponential backoff
The outbox worker used `SELECT ... FOR UPDATE SKIP LOCKED` for safe concurrent processing across multiple worker replicas.
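A minimal sketch of the worker's claim-and-retry logic, assuming a Postgres `outbox` table with `status`, `attempts`, and `next_attempt_at` columns (an illustrative schema, not the production one):

```python
# Claim query for one worker replica: pending rows whose retry time has
# arrived, locked with SKIP LOCKED so concurrent replicas never claim the
# same row. Table and column names are illustrative.
CLAIM_BATCH_SQL = """
SELECT id, tenant_id, subject, payload
FROM outbox
WHERE status = 'pending'
  AND next_attempt_at <= now()
ORDER BY id
LIMIT %(batch_size)s
FOR UPDATE SKIP LOCKED
"""

def backoff_seconds(attempts: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff for failed publishes: 1s, 2s, 4s, ... capped.

    On a publish failure the worker increments `attempts` and sets
    `next_attempt_at = now() + backoff_seconds(attempts)`.
    """
    return min(cap, base * (2 ** attempts))
```

Running the claim inside a transaction means a crashed worker's locks are released automatically, so its batch simply becomes claimable again.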
Observability
We instrumented the full pipeline with OpenTelemetry:
- Trace spans from event creation → outbox write → NATS publish → consumer processing
- Custom metrics: consumer lag per tenant, outbox queue depth, publish error rate
- Per-tenant SLO tracking via distributed tracing dashboards
This observability proved essential when debugging a scenario where one tenant's consumer had stalled due to a malformed event, letting us trace the exact message and consumer state without grepping logs.
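Per-tenant consumer lag, the most useful of these metrics, can be derived from JetStream sequence numbers. The sketch below is illustrative: it assumes the stream's last sequence and the consumer's acknowledged floor have been read from stream/consumer info, and the OpenTelemetry wiring in the comment is indicative only.

```python
# Sketch: per-tenant consumer lag from JetStream sequence numbers.
def consumer_lag(stream_last_seq: int, ack_floor_seq: int) -> int:
    """Messages published to the tenant's stream but not yet acknowledged."""
    return max(0, stream_last_seq - ack_floor_seq)

# With the OpenTelemetry SDK, this would feed an observable gauge labeled
# by tenant, roughly:
#   meter = metrics.get_meter("messaging")
#   meter.create_observable_gauge("consumer_lag", callbacks=[observe_lag])
```

Tracking this gauge per tenant is what makes a single stalled consumer, like the malformed-event incident above, stand out immediately on a dashboard.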
Results
- Eliminated cross-tenant interference entirely
- Significant reduction in p95 event processing latency
- Outbox pattern achieved zero message loss across an extended production run
- Tenant onboarding fully automated, with stream creation triggered on account provisioning webhook
Trade-offs
The stream-per-tenant model increases NATS server resource usage linearly with tenant count. At the scale we operated, this was acceptable; at significantly higher tenant counts we'd evaluate NATS subject-based namespacing within fewer streams, with isolation enforced at the application level instead.