How We Isolated Tenants on NATS JetStream Without Losing Our Minds
Stream-per-domain isolation sounds simple until you're managing 50 tenants, dynamic routing rules, and the occasional tenant that decides to publish 10,000 messages in a minute.
March 15, 2026
Multi-tenancy in messaging systems is deceptively hard. The happy path (one stream, many consumers) breaks down the moment you need different retention policies per tenant, or when one tenant's burst traffic saturates the consumer group for everyone else.
Here's how we approached it in production.
The Problem
We had a single shared NATS JetStream stream for all tenants. It worked fine with three tenants. With fifteen, we started seeing consumer lag bleed across tenants during busy periods. One tenant's aggressive publishing would push messages through faster than other tenants' consumers could keep up.
The root issue: a shared stream means shared backpressure.
Stream-Per-Domain Isolation
The solution we landed on was creating a separate NATS stream per tenant domain. Each stream gets:
- Its own retention policy (some domains need 7-day replay, others only 24 hours)
- Independent consumer groups that can't interfere with other tenants
- Per-tenant subject prefixes: `tenant.<id>.events.*`
```go
// createTenantStream provisions an isolated stream for one tenant,
// capturing every subject under the tenant's prefix.
func createTenantStream(ctx context.Context, js nats.JetStreamContext, tenantID string) error {
	streamName := fmt.Sprintf("TENANT_%s", strings.ToUpper(tenantID))
	_, err := js.AddStream(&nats.StreamConfig{
		Name:      streamName,
		Subjects:  []string{fmt.Sprintf("tenant.%s.>", tenantID)},
		Retention: nats.LimitsPolicy,
		MaxAge:    7 * 24 * time.Hour, // per-domain retention; tune per tenant
		MaxMsgs:   1_000_000,
		Replicas:  3,
		Discard:   nats.DiscardOld, // drop the oldest messages at the limit
	})
	return err
}
```
The Transactional Outbox
Per-tenant streams solved the noisy-neighbor problem. But they introduced a new one: how do you guarantee a message gets published to the tenant's stream when the publish is part of a larger database transaction?
The answer is the transactional outbox pattern:
- Write your domain event to an `outbox` table inside the same database transaction as your business logic
- A background worker reads from `outbox`, publishes to NATS, and marks messages as `published`
- If the NATS publish fails, the message stays in `outbox` and retries
```go
type OutboxMessage struct {
	ID          uuid.UUID
	TenantID    string
	Subject     string
	Payload     []byte
	Status      string // "pending" | "published" | "failed"
	CreatedAt   time.Time
	PublishedAt *time.Time
}
```
Use a SELECT FOR UPDATE SKIP LOCKED query to let multiple outbox workers process messages in parallel without stepping on each other. Postgres handles the locking efficiently.
Concurrency Controls
With tenant isolation, we also got per-tenant concurrency controls for free. Each tenant's consumer caps its own in-flight messages, so a burst from Tenant A doesn't degrade Tenant B's processing latency.
```go
func createConsumer(js nats.JetStreamContext, tenantID string) (*nats.Subscription, error) {
	return js.Subscribe(
		fmt.Sprintf("tenant.%s.>", tenantID),
		handleMessage,
		nats.ManualAck(),
		nats.MaxAckPending(500),      // bound in-flight per consumer
		nats.AckWait(30*time.Second), // redeliver if not acked in time
	)
}
```
Lessons Learned
- Dynamic stream creation is fine. NATS handles hundreds of streams without issue. Don't pre-create everything.
- Monitor consumer lag per stream, not aggregate. An aggregate metric masks per-tenant degradation.
- The outbox worker needs its own dead-letter strategy. Eventually-failing messages should move to a separate table and alert, not retry forever.
The total infrastructure cost of this approach was higher than a single shared stream, but the operational clarity was worth it. When a tenant has a problem, we know exactly which stream to look at.