0051 — Scale Targets and Performance Commitments (Batch 4)
Status
Accepted
Context
Nine scale and performance items in the SK series (SK25–SK34, minus SK33 already closed by ADR-0043) sat unresolved as "we'll measure it later." That's not a decision; it's a deferral. Without explicit budgets, profile commitments, and optimization paths declared up front, the launch infrastructure has no way to verify it's hitting design intent, and post-launch tuning becomes a free-for-all.
The scope spans:
- Capacity — galaxy concurrent-region target, ARIA per-player storage and cost ceilings.
- Generator throughput — Phase 9 SCC re-check cost, formation budgets at edge region sizes.
- Live-state hot paths — sector-presence write amplification, realtime-bus SLO, planetary production tick batching, market-price update rate limiting.
- Index strategy — faction reputation and sector-faction-influence query patterns.
These aren't design pivots; they're operational commitments. The pattern across all of them is the same: declare a target, commit to a profile, document the optimization path if the target is missed. This ADR locks the targets and commitments so the launch team has explicit success criteria.
WR13 (Batch 6 in DECISIONS.md) — market-price tick rate-limiting — is also closed by this ADR via SK30. SK33 (region-attachment surge bottleneck at Nexus) is already resolved by ADR-0043's per-region Gateway Plaza sector hashing and is not re-addressed here.
Decision
SK25 — Concurrent-region capacity target
The Galaxy has no hard sector cap (per ADR-0050 SK23). The launch capacity target is a soft observability target, not a hard limit:
- Target: 50 concurrent active player-owned regions (~50,000–60,000 sectors total) on launch infrastructure.
- Operator dashboards alert at 75% of target (37–38 regions) for capacity-planning lead time.
- Post-launch scaling path:
  - Vertical: bigger DB instance, more RAM, more compute. First lever.
  - Horizontal: region sharding across multiple gameserver clusters via the existing multi-regional architecture per ADR-0001. Second lever.
  - Archival: idle regions auto-tombstone after 6 months of zero player activity (zero logins from any resident, zero traversals through the region's Nexus warp). Tombstoned regions are unloaded from active memory but retained on disk for fast resurrection if a player returns. Third lever.
The target is tunable; the commitment is to alert before capacity becomes a launch-blocking issue.
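The alert threshold and the idle-region rule above can be sketched as two small checks. This is a minimal illustration, not the operator tooling itself: the function names are hypothetical, and "6 months" is approximated as 183 days purely for the example.

```python
from datetime import datetime, timedelta, timezone

REGION_TARGET = 50        # soft launch target from SK25
ALERT_FRACTION = 0.75     # dashboards alert at 75% of target
# Assumption: "6 months" approximated as 183 days for this sketch.
IDLE_TOMBSTONE_AFTER = timedelta(days=183)


def capacity_alert(active_regions: int) -> bool:
    """True once the active-region count reaches 75% of the target (37.5)."""
    return active_regions >= REGION_TARGET * ALERT_FRACTION


def is_tombstone_eligible(last_login: datetime, last_traversal: datetime,
                          now: datetime) -> bool:
    """A region is idle only if both activity signals are older than the window."""
    last_activity = max(last_login, last_traversal)
    return now - last_activity >= IDLE_TOMBSTONE_AFTER
```

With a target of 50, the alert first fires at 38 active regions, which matches the 37–38 range quoted above.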
SK26 — Phase 9 SCC re-check budget
Galaxy-generator Phase 9 (formation stamping with strongly-connected-component reachability validation) does ~96k graph ops per Standard region.
Budget: Phase 9 completes in < 30 seconds on launch infrastructure (single-region regen).
Profile commitment: a representative regen runs at staging during launch validation. If Phase 9 exceeds budget:
- Optimization path: batch SCC re-checks — instead of running the validation after each formation stamp, accumulate N formations and re-check once per batch. Tuning value of N is data-driven (likely 3–5 formations per batch). The graph topology rule is preserved because batch validation still runs against the post-batch state.
The budget is the commitment; the optimization path is contingent on profile data.
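A minimal sketch of the batched re-check, under stated assumptions: the region is modeled as a directed adjacency map in which every node appears as a key, a formation is modeled as a list of directed edges that the stamp removes, and a failed batch falls back to per-formation checks. None of these representations come from the generator itself; they exist only to show how one validation per batch still runs against the post-batch state.

```python
from collections import deque

Graph = dict[int, set[int]]
Formation = list[tuple[int, int]]  # hypothetical: directed edges the stamp removes


def _reaches_all(graph: Graph, start: int) -> bool:
    """Breadth-first search; True if every node is reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(graph)


def is_strongly_connected(graph: Graph) -> bool:
    """Strongly connected iff one node reaches all nodes forward and in reverse."""
    if not graph:
        return True
    start = next(iter(graph))
    reverse: Graph = {node: set() for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].add(src)
    return _reaches_all(graph, start) and _reaches_all(reverse, start)


def stamp(graph: Graph, formation: Formation) -> Graph:
    """Return a copy of the graph with the formation's edges removed."""
    stamped = {node: set(dsts) for node, dsts in graph.items()}
    for src, dst in formation:
        stamped[src].discard(dst)
    return stamped


def stamp_in_batches(graph: Graph, formations: list[Formation],
                     batch_size: int = 4) -> tuple[Graph, int]:
    """Stamp formations, validating once per batch. Returns (graph, checks run)."""
    checks = 0
    for i in range(0, len(formations), batch_size):
        batch = formations[i:i + batch_size]
        candidate = graph
        for formation in batch:
            candidate = stamp(candidate, formation)
        checks += 1
        if is_strongly_connected(candidate):
            graph = candidate
            continue
        # Assumed fallback: retry this batch one formation at a time,
        # skipping any formation that breaks reachability.
        for formation in batch:
            candidate = stamp(graph, formation)
            checks += 1
            if is_strongly_connected(candidate):
                graph = candidate
    return graph, checks
```

When every batch validates, the number of reachability checks drops from one per formation to one per batch; a failing batch costs one extra check on top of the per-formation path, which is one reason N would need tuning against profile data.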
SK27 — Sector.players_present write amplification
Postgres rewrites the full JSONB value on each update. At 10k concurrent players with frequent sector transitions, the write rate hits ~666 writes/sec (roughly one presence write per player every 15 seconds), so large JSONB blobs are rewritten constantly.
Migration trigger: P99 write latency > 100ms sustained over a 5-minute window. Until the trigger fires, the JSONB-only path is acceptable.
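One way to read "sustained over a 5-minute window" is sketched below. It assumes the monitoring stack emits one P99 reading per minute, so five consecutive readings above the threshold cover the window; the sampling interval and the function name are assumptions for illustration only.

```python
P99_THRESHOLD_MS = 100.0
# Assumption: one P99 reading per minute, so 5 readings span the 5-minute window.
SUSTAINED_SAMPLES = 5


def migration_trigger_fired(p99_readings_ms: list[float]) -> bool:
    """True if the 5 most recent one-minute P99 readings all exceed 100 ms."""
    recent = p99_readings_ms[-SUSTAINED_SAMPLES:]
    return (len(recent) == SUSTAINED_SAMPLES
            and all(reading > P99_THRESHOLD_MS for reading in recent))
```

A single reading at or below 100 ms inside the window resets the condition, so a brief spike does not trigger the migration.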
When the trigger fires, the write path migrates to a normalized sector_player_presence join table:
| Column | Type | Notes |
|---|---|---|
| `sector_id` | UUID FK Sector.id | indexed |
| `player_id` | UUID FK Player.id | indexed |
| `entered_at` | DateTime | for "how long has X been here" queries |
Composite UNIQUE on (sector_id, player_id). The JSONB column on Sector stays as a denormalized read cache updated alongside the join table inserts/deletes via a service-layer write helper (one transaction, both surfaces written together). UI queries hit the JSONB; live presence/scanning queries hit the join table.
The migration is forward-only Alembic when the trigger fires. The two surfaces stay in sync via the service-layer write helper indefinitely.
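The dual-write helper could look roughly like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for Postgres so the example is self-contained, with a TEXT column holding JSON in place of JSONB. The helper names, the `sector` table name, and the cache shape (a sorted list of player ids) are assumptions, not the service's actual code.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Stand-in schema: players_present is TEXT here, JSONB in Postgres."""
    conn.executescript("""
        CREATE TABLE sector (
            id TEXT PRIMARY KEY,
            players_present TEXT NOT NULL DEFAULT '[]'
        );
        CREATE TABLE sector_player_presence (
            sector_id  TEXT NOT NULL REFERENCES sector(id),
            player_id  TEXT NOT NULL,
            entered_at TEXT NOT NULL,
            UNIQUE (sector_id, player_id)
        );
        CREATE INDEX ix_presence_player ON sector_player_presence (player_id);
    """)


def _refresh_cache(conn: sqlite3.Connection, sector_id: str) -> None:
    """Rebuild the denormalized read cache from the join table."""
    rows = conn.execute(
        "SELECT player_id FROM sector_player_presence "
        "WHERE sector_id = ? ORDER BY player_id",
        (sector_id,),
    ).fetchall()
    conn.execute(
        "UPDATE sector SET players_present = ? WHERE id = ?",
        (json.dumps([row[0] for row in rows]), sector_id),
    )


def enter_sector(conn, sector_id: str, player_id: str, entered_at: str) -> None:
    """Insert the presence row and update the cache in one transaction."""
    with conn:  # commits on success, rolls back both writes on an exception
        conn.execute(
            "INSERT INTO sector_player_presence (sector_id, player_id, entered_at) "
            "VALUES (?, ?, ?)",
            (sector_id, player_id, entered_at),
        )
        _refresh_cache(conn, sector_id)


def leave_sector(conn, sector_id: str, player_id: str) -> None:
    """Delete the presence row and update the cache in one transaction."""
    with conn:
        conn.execute(
            "DELETE FROM sector_player_presence "
            "WHERE sector_id = ? AND player_id = ?",
            (sector_id, player_id),
        )
        _refresh_cache(conn, sector_id)
```

Because both writes share a transaction, a failed insert (for example, a duplicate that violates the composite UNIQUE constraint) leaves the cache untouched, so the two surfaces cannot drift apart through this path.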
SK28 — Realtime-bus SLOs
Pre-launch load test commitment with explicit SLO:
- 5,000 concurrent WebSocket connections per gameserver instance
- P99 message delivery latency < 200ms
- Test runs at staging on production-class hardware, simulating peak event volume (combat resolutions, sector-presence updates, market-tick fanout, ARIA notifications).
If the SLO is unmet at the load test, the decision is between two paths:
- (a) Horizontal scale-out — multiple gameserver instances behind a connection-router. The router shards by user_id hash. Per-instance load drops; total capacity scales linearly with instance count.
- (b) Connection cap with queue — hard cap at the SLO threshold; new connections queue until a slot frees. Acceptable for non-critical players (read-only browsers); not acceptable for actively-playing subscribers.
The choice is deferred to test results; (a) is the strong default.
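For path (a), sharding by user_id hash could be sketched as follows. The function name and the choice of SHA-256 are assumptions; the point is only that the router needs a hash that is stable across processes, which Python's built-in `hash()` is not for strings (it is salted per process by default).

```python
import hashlib


def instance_for_user(user_id: str, instance_count: int) -> int:
    """Map a user_id to a gameserver instance index using a stable hash."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the pool size.
    return int.from_bytes(digest[:8], "big") % instance_count
```

Plain modulo sharding remaps most users whenever the instance count changes, so reconnect behavior during scale-out would need to be considered; a consistent-hashing scheme is one alternative if that turns out to matter.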
SK29 — Planetary production tick batching
Per-tick JSONB diff/persistence overhead becomes prohibitive at scale.
Per-region batched tick: 5 ticks/min (every 12 seconds) processing all planets in the region in a single bulk UPDATE per region per tick. Reduces per-tick overhead ~10× vs per-planet writes.
Budget: P99 region-tick latency < 500ms; alert at 80% of budget. Profile commitment same as SK26 — staging regen at launch validation.
If budget is missed: shard the per-region batch by planet count (e.g., regions with > 100 planets split into two parallel batches), or move to per-region async workers (one worker per region drains a planet-tick queue continuously).
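The per-region bulk UPDATE can be illustrated with a small sketch. It uses `sqlite3` as a self-contained stand-in for Postgres, and the `planet` table with `stockpile` and `rate_per_second` columns is a hypothetical simplification; the real production state may live in JSONB and need a different update expression.

```python
import sqlite3

TICK_SECONDS = 12  # 5 ticks per minute


def tick_region(conn: sqlite3.Connection, region_id: str) -> int:
    """Advance every planet in one region with a single UPDATE statement.

    Returns the number of planets updated.
    """
    with conn:  # one transaction per region per tick
        cursor = conn.execute(
            "UPDATE planet "
            "SET stockpile = stockpile + rate_per_second * ? "
            "WHERE region_id = ?",
            (TICK_SECONDS, region_id),
        )
    return cursor.rowcount
```

One statement per region replaces one statement per planet, which is where the per-tick overhead reduction described above would come from.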
SK30 — Market-price tick rate limiting (closes WR13)
Per-transaction price updates have no rate limiting; bursty traffic at popular stations could cascade-update prices many times per second per station.
Per-station 1-second min-interval rate limit on consecutive price updates. All transactions within the 1-second window batch into a single end-of-window price recomputation:
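```python
from datetime import datetime, timezone


def maybe_recompute_price(station, transaction):
    # recompute_price_now is the pricing routine, defined elsewhere.
    # Timestamps are assumed to be stored timezone-aware (UTC).
    now = datetime.now(timezone.utc)
    last_recomputed_at = station.last_price_recomputed_at
    # total_seconds() measures the whole interval; .seconds ignores whole days.
    if last_recomputed_at and (now - last_recomputed_at).total_seconds() < 1:
        # Within window: flag the station, no recomputation yet
        station.pending_price_recomputation = True
        return
    # Window elapsed: recompute, including any pending batch
    recompute_price_now(station)
    station.last_price_recomputed_at = now
    station.pending_price_recomputation = False
```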
A per-station scheduler tick every 1 second flushes any station with pending_price_recomputation = True whose 1-second window just elapsed. This composes naturally with the planetary production tick (SK29) — both are per-region scheduled work.
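The scheduler-side flush might look like the sketch below. The `Station` dataclass, the `flush_pending` name, and passing the recomputation routine in as a callable are all assumptions made to keep the example self-contained; only the two station fields come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Optional

MIN_INTERVAL = timedelta(seconds=1)


@dataclass
class Station:
    name: str
    last_price_recomputed_at: Optional[datetime] = None
    pending_price_recomputation: bool = False


def flush_pending(stations: Iterable[Station],
                  recompute: Callable[[Station], None],
                  now: datetime) -> list[str]:
    """Recompute prices for stations with pending work whose window has elapsed."""
    flushed = []
    for station in stations:
        if not station.pending_price_recomputation:
            continue
        last = station.last_price_recomputed_at
        if last is not None and now - last < MIN_INTERVAL:
            continue  # window still open; a later tick picks this station up
        recompute(station)
        station.last_price_recomputed_at = now
        station.pending_price_recomputation = False
        flushed.append(station.name)
    return flushed
```

Without this flush, a burst of transactions followed by silence would leave the last trades of the burst unpriced until the next transaction arrived.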
Resolves WR13 (Batch 6) simultaneously: WR13 was the explicit "market-price tick rate-limiting" decision; SK30 closes it.
SK31 — ARIA storage and cost budgets
ARIA's per-player observation-log storage scales with player engagement; LLM costs scale with request volume.
Per-player storage budget:
- Target: 100 KB baseline (most players, casual engagement).
- Hard cap: 10 MB. On overage, the prune service runs (per the existing observation-log model in ADR-0038) — oldest observations are dropped first until the player is back under cap. Player gets a one-line ARIA notification that older observations have been archived.
Cost ceilings:
- Per-player monthly LLM token cost: $0.50 target (soft; based on canonical request volume in `OPERATIONS/aria.md`).
- Server-side daily ceiling: $50 per 1,000 active players (soft circuit-breaker). On overage, ARIA degrades to manual fallback responses (deterministic templates in lieu of LLM-generated dialogue). The fallback path is documented in `OPERATIONS/aria.md`.
These compose with the existing server-binding rate/cost caps. Operator dashboards alert at 75% of all three ceilings.
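The oldest-first prune and the daily circuit-breaker can be sketched as below. The observation shape (a `(created_at, size_bytes)` tuple), the function names, and treating 10 MB as binary megabytes are assumptions for illustration; the actual prune semantics are defined in ADR-0038.

```python
# Assumption: 10 MB interpreted as 10 * 1024 * 1024 bytes.
HARD_CAP_BYTES = 10 * 1024 * 1024
DAILY_CEILING_USD_PER_1K = 50.0  # USD per 1,000 active players


def prune_oldest_first(observations: list[tuple[int, int]],
                       cap_bytes: int = HARD_CAP_BYTES):
    """Drop the oldest observations until the total size is at or under the cap.

    Each observation is (created_at, size_bytes). Returns (kept, dropped_count).
    """
    ordered = sorted(observations, key=lambda obs: obs[0])  # oldest first
    total = sum(size for _, size in ordered)
    dropped = 0
    while dropped < len(ordered) and total > cap_bytes:
        total -= ordered[dropped][1]
        dropped += 1
    return ordered[dropped:], dropped


def use_fallback_templates(spend_today_usd: float, active_players: int) -> bool:
    """True once today's spend exceeds the ceiling scaled to the active population."""
    ceiling = DAILY_CEILING_USD_PER_1K * active_players / 1000
    return spend_today_usd > ceiling
```

A non-zero `dropped_count` is the natural point to send the one-line ARIA notification mentioned above.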
SK32 — Reputation index strategy
Reputation (per-player per-faction standing) and SectorFactionInfluence (per-region per-sector per-faction influence) both have high cardinality. Without a documented index strategy, ad-hoc query patterns can drag the database.
Reputation indexes (target schema in services/gameserver/src/models/reputation.py):
- `(player_id, faction_id)` UNIQUE composite: primary access pattern, "what's my rep with faction X."
- `(faction_id, score DESC)`: "top-reputation leaderboard with faction X" queries.
- `(player_id)`: "all my reputation rows" full scan.
SectorFactionInfluence indexes are already locked in ADR-0021 and the DATA_MODELS/gameplay.md spec:
- `(region_id, sector_number)`: sector-side reads.
- `(faction_code, region_id)`: per-faction queries.
- Composite UNIQUE on `(region_id, sector_number, faction_code)`.
Ratified here.
The query patterns are documented in both data-model docs so that future authors don't add new patterns unaware of the existing indexes. New query patterns require an index review.
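As a concrete illustration of the three Reputation indexes and the leaderboard pattern they serve, here is a sketch using `sqlite3` as a self-contained stand-in for Postgres. The index names, the column types, and the `top_players` helper are hypothetical; the target schema itself lives in the model file named above.

```python
import sqlite3

REPUTATION_DDL = """
CREATE TABLE reputation (
    player_id  TEXT    NOT NULL,
    faction_id TEXT    NOT NULL,
    score      INTEGER NOT NULL
);
CREATE UNIQUE INDEX uq_reputation_player_faction
    ON reputation (player_id, faction_id);
CREATE INDEX ix_reputation_faction_score
    ON reputation (faction_id, score DESC);
CREATE INDEX ix_reputation_player
    ON reputation (player_id);
"""


def index_names(conn: sqlite3.Connection, table: str) -> set[str]:
    """List index names on a table (column 1 of PRAGMA index_list is the name)."""
    return {row[1] for row in conn.execute(f"PRAGMA index_list({table})")}


def top_players(conn: sqlite3.Connection, faction_id: str, limit: int) -> list[str]:
    """Leaderboard pattern: highest scores for one faction."""
    rows = conn.execute(
        "SELECT player_id FROM reputation "
        "WHERE faction_id = ? ORDER BY score DESC LIMIT ?",
        (faction_id, limit),
    )
    return [row[0] for row in rows]
```

An index review for a new query pattern would amount to checking whether the query's filter and sort columns line up with one of these index prefixes.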
SK34 — Formation budget edge bands
The existing formation budget table in SYSTEMS/special-formations-generation.md covers 100–1,200 sectors. Two edge bands need explicit handling:
Tiny regions (100–300 sectors): 5-hop minimum spacing between formations is hard to satisfy; the worldgen often fails to stamp any formation at all. Spec:
- 0–1 formation total (any type) for the entire region.
- 5-hop spacing relaxed to 3-hop for tiny regions only.
- Result: some tiny regions have a single notable formation; some have none. Acceptable — a 100-sector region is a niche operator deployment, not a Standard player region.
Extra-large regions (1,201–1,500 sectors) — these become possible at higher subscription tiers (📐 future). Spec extends the existing curve:
| Sectors | Bubble | Tunnel | Dead-End | Blister | Escape Hatch | Warp Sink | Backdoor (per Bubble) |
|---|---|---|---|---|---|---|---|
| 1,201–1,500 | 4–5 | 3–4 | 5–6 | 2–3 | 1–2 | 2–3 | 1–2 |
This continues the existing 100–300, 301–600, 601–800, 801–1,200 ceiling curve at roughly 20% increments per band.
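The two edge-band rules can be expressed as a small lookup. This sketch encodes only the values stated in this section; the function name and return shape are hypothetical, and the existing bands (301–1,200 sectors) are deliberately left to the existing table rather than guessed at. The 5-hop spacing for extra-large regions follows from the relaxation applying to tiny regions only.

```python
# (min, max) formation counts for 1,201–1,500 sectors, copied from the table above.
EXTRA_LARGE_BUDGET = {
    "Bubble": (4, 5),
    "Tunnel": (3, 4),
    "Dead-End": (5, 6),
    "Blister": (2, 3),
    "Escape Hatch": (1, 2),
    "Warp Sink": (2, 3),
    "Backdoor (per Bubble)": (1, 2),
}


def edge_band_rules(sector_count: int):
    """Return the SK34 rules for an edge band, or None for other sizes."""
    if 100 <= sector_count <= 300:
        # Tiny band: 0–1 formation of any type, spacing relaxed to 3 hops.
        return {"band": "tiny", "min_spacing_hops": 3, "total_formations": (0, 1)}
    if 1201 <= sector_count <= 1500:
        return {"band": "extra-large", "min_spacing_hops": 5,
                "budget": EXTRA_LARGE_BUDGET}
    return None  # defer to the existing formation budget table
```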
Consequences
Positive:
- Every major scale-and-performance question has an explicit budget, profile commitment, or optimization path. Launch validation has measurable success criteria.
- The "declare-and-defer" pattern (target now, optimize on data) keeps premature optimization out of the launch path while not leaving the question open.
- WR13 closes as a ride-along with SK30; one Batch 6 item is already done.
- ARIA cost ceilings make the per-player AI feature financially predictable for the operator. Server-side daily circuit-breaker prevents runaway LLM bills.
- Formation budgets at edge bands close two corner cases that would have surfaced as "weird tiny region has no formations" or "1,300-sector region budgets are made up at runtime."
Neutral:
- Two new columns on `Station` (`last_price_recomputed_at` and `pending_price_recomputation`) for SK30's rate limit. Minor.
- One new table (`sector_player_presence`) is conditionally migrated when SK27's trigger fires; not at launch unless profile data demands.
- ARIA prune service and cost-ceiling circuit-breaker are new runtime concerns documented in `OPERATIONS/aria.md`.
Negative:
- The targets are guesses calibrated against design intent; first contact with real player traffic will adjust them. Acceptable — every target carries a "data-driven tuning" commitment.
- The conditional migration model for SK27 (JSONB-only until P99 > 100ms) creates an implicit "two valid states" period where some deployments use one path and some another. Documentation must be clear about which is canonical at any given operational moment.
- Per-region batching for SK29 means a small amount of staleness (up to 12 seconds) in player-facing planet production reads. Acceptable — production tick output is updated on the next read regardless.
Alternatives considered
Set hard caps instead of soft targets (rejected). Easier to enforce but harder to grow. The user's framing in Batch 3 was clear: the universe should grow with paying subscribers, not be artificially capped. Soft targets with alerting preserve growth flexibility.
Defer all profile-based commitments to post-launch (rejected). "We'll measure when it breaks" leaves the launch team with no success criteria and the operator with no early-warning signals. Pre-launch staging profiles for Phase 9 (SK26), region tick (SK29), and realtime-bus (SK28) are the right time to measure.
Migrate sector_player_presence at launch unconditionally (rejected for SK27). The normalized join table is more code and more state to maintain. If JSONB-only meets P99 < 100ms in real load, the migration is unnecessary work. Trigger-based migration keeps the launch surface simple.
Per-player ARIA cost cap with hard rejection rather than fallback (rejected for SK31). A hard "you've hit your monthly token budget, no more ARIA dialogue" surface is hostile to players. The manual-fallback degradation preserves the player-facing feature with reduced richness, which is a better failure mode.
Profile-driven only, no declared targets (rejected). "Run a profile and see" without a target gives no anchor for "is this acceptable?" Each item declares a number; profiles validate against the number; tuning happens against the number.
Related docs
- `OPERATIONS/multi-regional.md`: capacity targets and scaling path (SK25).
- `SYSTEMS/galaxy-generator-design.md`: Phase 9 SCC budget (SK26).
- `SYSTEMS/sector-presence.md`: write-path normalization plan (SK27).
- `SYSTEMS/realtime-bus.md`: SLO + load test commitment (SK28).
- `SYSTEMS/planetary-production-tick.md`: batched bulk-update spec (SK29).
- `SYSTEMS/market-pricing.md`: 1-second min-interval rate limit (SK30; closes WR13).
- `OPERATIONS/aria.md`: storage + cost ceilings (SK31).
- `DATA_MODELS/player.md`, `DATA_MODELS/gameplay.md`: composite-index callouts (SK32).
- `SYSTEMS/special-formations-generation.md`: formation budget edge bands (SK34).
- `ADR/0001-use-multi-regional-architecture.md`: multi-regional foundation (SK25 scaling path).
- `ADR/0021-f5-territory-taxonomy-influence-math.md`: `SectorFactionInfluence` indexes already locked (SK32).
- `ADR/0038-aria-observation-log-learning-model.md`: observation-log prune semantics referenced by SK31.
- `ADR/0043-sk4-nexus-natural-warp-frontier.md`: already closes SK33 (region-attachment surge).
- `ADR/0050-batch3-provisioning-lifecycle-hardening.md`: SK23 (no Galaxy hard cap) that SK25's soft target depends on.