NATS JetStream · Multi-Tenancy · Go · Distributed Systems

Designing Tenant Isolation in a High-Throughput Messaging Platform

How I designed stream-per-domain isolation to prevent tenant cross-contamination in a production AI messaging platform.

March 1, 2026

Overview

I led the design and implementation of core messaging infrastructure for a multi-tenant AI platform. The central challenge: ensuring complete isolation between tenants in a NATS JetStream-based event pipeline, while maintaining high throughput and predictable latency.

The Problem

Early in the platform's lifecycle, all tenant events flowed through a single shared NATS stream. This worked at small scale. As tenant count grew, shared backpressure became a systemic risk, with one noisy tenant able to degrade event delivery for all others.

Key constraints:

  • At-least-once delivery guarantees
  • Sub-second event processing for real-time AI workflows
  • Zero cross-tenant data leakage (hard requirement for compliance)
  • Dynamic tenant provisioning (new tenants could be onboarded at any time)

Architecture Decision: Stream-Per-Domain

The core decision was to assign each tenant their own NATS JetStream stream, namespaced by domain:

  • Stream name: TENANT_<ID>
  • Subject prefix: tenant.<id>.events.*
  • Per-tenant retention, limits, and consumer configuration

This gave us full isolation at the NATS layer. A tenant experiencing high throughput couldn't affect another tenant's consumer lag.
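The naming convention above can be sketched as a small provisioning helper. This is an illustrative stand-in, not the production code: the StreamConfig type mirrors only the fields discussed here (the real type in nats.go is jetstream.StreamConfig, passed to CreateStream on tenant provisioning), and the retention values are placeholder assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// StreamConfig mirrors the handful of per-tenant fields we set.
// Illustrative stand-in for nats.go's jetstream.StreamConfig.
type StreamConfig struct {
	Name     string
	Subjects []string
	MaxBytes int64
	MaxAge   time.Duration
}

// tenantStreamConfig derives a tenant's stream definition following the
// TENANT_<ID> / tenant.<id>.events.* convention.
func tenantStreamConfig(tenantID string) StreamConfig {
	return StreamConfig{
		Name:     "TENANT_" + strings.ToUpper(tenantID),
		Subjects: []string{fmt.Sprintf("tenant.%s.events.*", strings.ToLower(tenantID))},
		MaxBytes: 1 << 30,            // per-tenant size cap (value illustrative)
		MaxAge:   7 * 24 * time.Hour, // per-tenant retention (value illustrative)
	}
}

func main() {
	cfg := tenantStreamConfig("acme")
	fmt.Printf("%s %s\n", cfg.Name, cfg.Subjects[0])
}
```

Because limits live on the stream config itself, a tenant hitting MaxBytes is capped by NATS server-side rather than by application logic.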

Transactional Outbox Pattern

Publishing to NATS from within a database transaction is risky in both directions: a temporary NATS outage can fail the entire transaction, and a publish that succeeds just before a rollback emits an event for state that was never committed. We solved this with a transactional outbox:

  1. Events are written to an outbox table within the same database transaction as the business event
  2. A background outbox worker polls for pending messages and publishes to the appropriate tenant stream
  3. Successfully published messages are marked published; failures increment a retry counter with exponential backoff

The outbox worker used SELECT FOR UPDATE SKIP LOCKED for safe concurrent processing across multiple worker replicas.

Observability

We instrumented the full pipeline with OpenTelemetry:

  • Trace spans from event creation → outbox write → NATS publish → consumer processing
  • Custom metrics: consumer lag per tenant, outbox queue depth, publish error rate
  • Per-tenant SLO tracking via distributed tracing dashboards

This observability proved essential when debugging a scenario where one tenant's consumer had stalled due to a malformed event, letting us trace the exact message and consumer state without grepping logs.
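Stitching the producer and consumer spans into one trace depends on carrying trace context inside the message itself. The sketch below is a hand-rolled simplification of what OpenTelemetry's TextMapPropagator does for us in the real pipeline; the Header type stands in for nats.Header (both are map[string][]string), and the helper names are mine.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Header stands in for nats.Header; both have underlying type
// map[string][]string, so the real code injects into message headers
// the same way.
type Header map[string][]string

// injectTraceContext attaches a W3C traceparent header to an outgoing
// message so the consumer can continue the producer's trace instead of
// starting a fresh one. Format: version-traceID-spanID-flags.
func injectTraceContext(h Header, traceID, spanID string) {
	h["traceparent"] = []string{fmt.Sprintf("00-%s-%s-01", traceID, spanID)}
}

// newID generates a random hex identifier of the given byte length
// (16 bytes for a trace ID, 8 for a span ID).
func newID(bytes int) string {
	b := make([]byte, bytes)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	h := Header{}
	injectTraceContext(h, newID(16), newID(8))
	fmt.Println(h["traceparent"][0][:3]) // W3C version prefix
}
```

The consumer side does the inverse: it extracts traceparent from the message headers and starts its processing span as a child, which is what made the stalled-consumer incident traceable message by message.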

Results

  • Eliminated cross-tenant interference entirely
  • Significant reduction in p95 event processing latency
  • Outbox pattern achieved zero message loss across an extended production run
  • Tenant onboarding fully automated, with stream creation triggered on account provisioning webhook

Trade-offs

The stream-per-tenant model increases NATS server resource usage linearly with tenant count. At the scale we operated this was fine; at significantly higher tenant counts we'd evaluate NATS subject-based namespacing within fewer streams, with application-level isolation.