How We Isolated Tenants on NATS JetStream Without Losing Our Minds
Stream-per-domain isolation sounds simple until you're managing 50 tenants, dynamic routing rules, and the occasional tenant that decides to publish 10,000 messages in a minute.
March 15, 2026
Multi-tenancy in messaging systems is deceptively hard. The happy path (one stream, many consumers) breaks down the moment you need different retention policies per tenant, or when one tenant's burst traffic saturates the consumer group for everyone else.
Here's how we approached it in production.
The Problem
We had a single shared NATS JetStream stream for all tenants. It worked fine with three tenants. With fifteen, we started seeing consumer lag bleed across tenants during busy periods. One tenant's aggressive publishing would push messages through faster than other tenants' consumers could keep up.
The root issue: a shared stream means shared backpressure.
Stream-Per-Domain Isolation
The solution we landed on was creating a separate NATS stream per tenant domain. Each stream gets:
- Its own retention policy (some domains need 7-day replay, others only 24 hours)
- Independent consumer groups that can't interfere with other tenants
- Per-tenant subject prefixes: `tenant.<id>.events.*`
```go
// createTenantStream provisions an isolated stream for one tenant,
// capturing every subject under the tenant's prefix.
func createTenantStream(ctx context.Context, js nats.JetStreamContext, tenantID string) error {
	streamName := fmt.Sprintf("TENANT_%s", strings.ToUpper(tenantID))
	_, err := js.AddStream(&nats.StreamConfig{
		Name:      streamName,
		Subjects:  []string{fmt.Sprintf("tenant.%s.>", tenantID)},
		Retention: nats.LimitsPolicy,
		MaxAge:    7 * 24 * time.Hour, // per-domain retention; tune per tenant
		MaxMsgs:   1_000_000,
		Replicas:  3,
		Discard:   nats.DiscardOld, // drop the oldest messages at the limit
	})
	return err
}
```
The Transactional Outbox
Per-tenant streams solved the noisy-neighbor problem. But they introduced a new one: how do you guarantee a message gets published to the tenant's stream when the publish is part of a larger database transaction?
The answer is the transactional outbox pattern:
- Write your domain event to an `outbox` table inside the same database transaction as your business logic
- A background worker reads from `outbox`, publishes to NATS, and marks messages as `published`
- If the NATS publish fails, the message stays in `outbox` and retries
```go
type OutboxMessage struct {
	ID          uuid.UUID
	TenantID    string
	Subject     string
	Payload     []byte
	Status      string // "pending" | "published" | "failed"
	CreatedAt   time.Time
	PublishedAt *time.Time
}
```
Use a SELECT FOR UPDATE SKIP LOCKED query to let multiple outbox workers process messages in parallel without stepping on each other. Postgres handles the locking efficiently.
Concurrency Controls
With tenant isolation, we also got per-tenant concurrency controls for free. Each tenant's consumer caps its own in-flight messages, so a burst from Tenant A doesn't degrade Tenant B's processing latency.
```go
func createConsumer(js nats.JetStreamContext, tenantID string) (*nats.Subscription, error) {
	return js.Subscribe(
		fmt.Sprintf("tenant.%s.>", tenantID),
		handleMessage,
		nats.ManualAck(),
		nats.MaxAckPending(500),      // bound in-flight per consumer
		nats.AckWait(30*time.Second), // redeliver if not acked in time
	)
}
```
Lessons Learned
- Dynamic stream creation is fine. NATS handles hundreds of streams without issue. Don't pre-create everything.
- Monitor consumer lag per stream, not aggregate. An aggregate metric masks per-tenant degradation.
- The outbox worker needs its own dead-letter strategy. Eventually-failing messages should move to a separate table and alert, not retry forever.
The total infrastructure cost of this approach was higher than a single shared stream, but the operational clarity was worth it. When a tenant has a problem, we know exactly which stream to look at.