Runbooks

Step-by-step procedures for common operational tasks and incidents.

Deploy a hotfix

  1. Branch from main, name it hotfix/<slug>.
  2. Make the change. Run npm run test:smoke locally.
  3. Open a PR and request 1 reviewer. The smoke + e2e gate must pass before merge.
  4. Merge. pages:deploy runs from CI on main.
  5. Verify via /internal/test-dashboard — the next smoke run should be green within 5 minutes.
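
The naming rule in step 1 can be sketched as a small helper. This is illustrative only: the runbook fixes the hotfix/ prefix, while the function name hotfixBranchName and the slug rules (lowercase, hyphen-separated) are assumptions, not a documented convention.

```typescript
// Hypothetical helper: only the "hotfix/" prefix comes from the runbook.
// The slug rules below (lowercase, hyphen-separated) are an assumption.
function hotfixBranchName(description: string): string {
  const slug = description
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of other characters into one hyphen
    .replace(/^-+|-+$/g, "");    // trim leading and trailing hyphens
  if (slug.length === 0) {
    throw new Error("description must contain at least one letter or digit");
  }
  return `hotfix/${slug}`;
}

// Example: hotfixBranchName("Fix Stripe webhook 500!") returns "hotfix/fix-stripe-webhook-500"
```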

Roll back

npm run pages:deploy -- --branch <previous-deploy-id>

Alternatively, in the Cloudflare Pages UI, go to Deployments → Rollback. The rollback is confirmed in about 30 seconds.

Database schema changes do not roll back automatically. If the bad deploy ran a migration, restore the database from the backup at gs://xcity-backups/<date> before rolling back the deploy.
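
The ordering above (restore the database first if a migration ran, then roll back the deploy) can be sketched as a plan builder. The function name rollbackPlan and the step strings are hypothetical labels for a human to follow, not real commands; the backup date is passed through unchanged because the runbook does not specify its format.

```typescript
// Hypothetical sketch: the returned strings are descriptive labels, not commands.
function rollbackPlan(ranMigration: boolean, backupDate: string): string[] {
  const steps: string[] = [];
  if (ranMigration) {
    // A Pages rollback does not revert schema changes, so the restore comes first.
    steps.push(`restore database from gs://xcity-backups/${backupDate}`);
  }
  steps.push("roll back the Pages deployment (CLI or Cloudflare Pages UI)");
  return steps;
}

// Usage: print the plan for a bad deploy that ran a migration.
for (const step of rollbackPlan(true, "<date>")) {
  console.log(step);
}
```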

Stripe webhook is failing

  1. Stripe Dashboard → Developers → Webhooks → check error rate.
  2. If the error is a signature mismatch, verify that STRIPE_WEBHOOK_SECRET matches the endpoint’s signing secret in Stripe.
  3. If the endpoint returns 500, check Cloudflare Pages logs for /api/billing/webhook. Most common cause: the GoTrue admin token has expired; see Secrets for rotation.
  4. Replay failed events: Stripe Dashboard → Webhooks → Failed → Resend.
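
To diagnose a signature mismatch, it can help to recompute the signature locally. The sketch below follows Stripe's documented scheme: the Stripe-Signature header carries t=<timestamp> and one or more v1=<hex> values, and the expected value is HMAC-SHA256 over "<timestamp>.<raw body>" keyed with the endpoint's signing secret. The function names are hypothetical, it assumes a Node runtime (node:crypto), and it omits the timestamp tolerance check, so treat it as a debugging aid rather than the handler's actual implementation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Split "t=...,v1=...,v1=..." into the timestamp and every v1 signature.
function parseStripeSignature(header: string): { timestamp: string; signatures: string[] } {
  let timestamp = "";
  const signatures: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") signatures.push(value);
  }
  return { timestamp, signatures };
}

// Returns true if any v1 signature matches the HMAC of "<timestamp>.<rawBody>".
// rawBody must be the exact bytes Stripe sent, not a re-serialized JSON object.
function signatureMatches(rawBody: string, header: string, secret: string): boolean {
  const { timestamp, signatures } = parseStripeSignature(header);
  if (timestamp === "" || signatures.length === 0) return false;
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest("hex");
  const expectedBuf = Buffer.from(expected, "utf8");
  return signatures.some((sig) => {
    const sigBuf = Buffer.from(sig, "utf8");
    // timingSafeEqual throws on unequal lengths, so compare lengths first.
    return sigBuf.length === expectedBuf.length && timingSafeEqual(sigBuf, expectedBuf);
  });
}
```

If this returns false with the secret currently configured but true with the secret shown in the Stripe Dashboard, the configured STRIPE_WEBHOOK_SECRET is stale. If it fails with both, a likely cause is that the body was parsed and re-serialized before verification.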

A user can’t log in

  1. /dashboard/keys — confirm user exists in GoTrue admin.
  2. Check email_confirmed_at — unconfirmed users can’t sign in.
  3. Resend confirmation: GoTrue admin → user → Resend.
  4. If the email never arrived, check the Resend dashboard. The cause is most often a domain reputation issue, not GoTrue.
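
The first two checks above amount to a small decision tree, sketched below. Only the email_confirmed_at field comes from the runbook; the AuthUser shape, the diagnoseLogin name, and the result labels are assumptions for illustration and do not describe the actual GoTrue admin response.

```typescript
// Hypothetical record shape: only email_confirmed_at is named in the runbook.
interface AuthUser {
  email: string;
  email_confirmed_at: string | null;
}

type LoginDiagnosis = "user-not-found" | "email-unconfirmed" | "confirmed-look-elsewhere";

// Mirrors steps 1 and 2: existence first, then confirmation status.
function diagnoseLogin(user: AuthUser | undefined): LoginDiagnosis {
  if (user === undefined) return "user-not-found";
  if (user.email_confirmed_at === null) return "email-unconfirmed"; // next: resend confirmation
  return "confirmed-look-elsewhere";
}
```

An "email-unconfirmed" result leads to steps 3 and 4; "confirmed-look-elsewhere" means the cause lies outside this checklist.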

Gateway latency spike

  1. Check TokenHub status page.
  2. If a single model is slow, LiteLLM should auto-failover; verify routing in admin.
  3. If the whole gateway is slow, escalate to infra — Solar Compute capacity or network path.
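
The split between steps 2 and 3 (some models slow versus the whole gateway slow) can be sketched as a classifier over per-model latencies. Everything here is hypothetical: the input shape, the threshold, and the verdict labels are assumptions, and the runbook does not say how latency is measured.

```typescript
type LatencyVerdict = "healthy" | "model-specific-slow" | "gateway-wide-slow";

// p95ByModel maps a model name to its p95 latency in milliseconds (assumed input).
// thresholdMs is chosen by the operator; the runbook does not define one.
function classifyLatency(
  p95ByModel: Record<string, number>,
  thresholdMs: number
): { verdict: LatencyVerdict; slowModels: string[] } {
  const entries = Object.entries(p95ByModel);
  const slowModels = entries.filter(([, ms]) => ms > thresholdMs).map(([name]) => name);
  if (slowModels.length === 0) return { verdict: "healthy", slowModels };
  // Every model over the threshold points at the gateway rather than at one model.
  if (slowModels.length === entries.length) return { verdict: "gateway-wide-slow", slowModels };
  return { verdict: "model-specific-slow", slowModels };
}
```

With a single model configured, a slow reading is reported as gateway-wide because the two cases cannot be told apart from latency alone.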

Where alerts go

Pages (paging alerts): PagerDuty rotation. Slack: #ops-alerts. Email: ops@xcity.one.
