Runbooks
Step-by-step procedures for common operational tasks and incidents.
Deploy a hotfix
- Branch from main, name it hotfix/<slug>.
- Make the change. Run npm run test:smoke locally.
- PR → request 1 reviewer → smoke + e2e gate must pass.
- Merge. pages:deploy runs from CI on main.
- Verify via /internal/test-dashboard: the next smoke run should be green within 5 minutes.
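The naming and verification steps above can be sketched as two small helpers. This is only an illustration: the SmokeRun shape, the function names, and the idea of feeding run data in by hand are assumptions, since the runbook does not describe how /internal/test-dashboard exposes its results.

```typescript
// Hypothetical shape of one smoke run as read off /internal/test-dashboard.
interface SmokeRun {
  startedAt: Date;
  passed: boolean;
}

type Verdict = "green" | "failed" | "pending" | "overdue";

// The runbook expects a green smoke run within 5 minutes of the deploy.
const VERIFY_WINDOW_MS = 5 * 60 * 1000;

// Build a branch name of the form hotfix/<slug> from a free-text description.
function hotfixBranchName(description: string): string {
  const slug = description
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumeric runs into "-"
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  if (slug === "") throw new Error("description produced an empty slug");
  return `hotfix/${slug}`;
}

// Judge the first smoke run that started at or after the merge.
// If no such run exists yet, report "pending" inside the window, "overdue" after it.
function verifyDeploy(mergedAt: Date, runs: SmokeRun[], now: Date): Verdict {
  const next = runs
    .filter((r) => r.startedAt.getTime() >= mergedAt.getTime())
    .sort((a, b) => a.startedAt.getTime() - b.startedAt.getTime())[0];
  if (next) return next.passed ? "green" : "failed";
  return now.getTime() - mergedAt.getTime() > VERIFY_WINDOW_MS ? "overdue" : "pending";
}
```

An "overdue" or "failed" verdict is the signal to move on to the rollback procedure.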
Roll back
npm run pages:deploy -- --branch <previous-deploy-id>
Or roll back from the Cloudflare Pages UI: Deployments → Rollback. The rollback is confirmed in about 30 seconds.
Database schema changes do not roll back automatically. If the bad deploy migrated, restore from the backup at gs://xcity-backups/<date> first.
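The ordering rule above (restore the database before redeploying when the bad deploy migrated) can be captured in a small planning helper. This is a sketch under assumptions: rollbackPlan and its parameters are hypothetical names, and the restore step is returned as a description only, because the runbook does not specify a restore command.

```typescript
// Return the rollback steps in the order they should be carried out.
// backupDate fills the <date> part of gs://xcity-backups/<date>.
function rollbackPlan(
  previousDeployId: string,
  badDeployMigrated: boolean,
  backupDate?: string
): string[] {
  const steps: string[] = [];
  if (badDeployMigrated) {
    // Schema changes do not roll back automatically, so the restore comes first.
    if (!backupDate) {
      throw new Error("a backup date is required when the bad deploy migrated");
    }
    steps.push(`restore database from gs://xcity-backups/${backupDate}`);
  }
  steps.push(`npm run pages:deploy -- --branch ${previousDeployId}`);
  return steps;
}
```

Refusing to produce a plan when a migration ran but no backup date was given keeps the redeploy from being run against a schema the old code does not expect.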
Stripe webhook is failing
- Stripe Dashboard → Developers → Webhooks → check error rate.
- If signature mismatch, verify STRIPE_WEBHOOK_SECRET matches the endpoint’s signing secret in Stripe.
- If 500, check Cloudflare Pages logs for /api/billing/webhook. Most common cause: GoTrue admin token expired (see Secrets for rotation).
- Replay failed events: Stripe Dashboard → Webhooks → Failed → Resend.
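To see why a wrong STRIPE_WEBHOOK_SECRET produces a signature mismatch, it can help to recompute the signature by hand. The sketch below follows Stripe's documented scheme (the Stripe-Signature header carries t=<timestamp> and one or more v1=<hex> values, where v1 is an HMAC-SHA256 over "<timestamp>.<raw body>" keyed with the endpoint's signing secret). The function names are hypothetical, it uses Node's crypto module for local diagnosis rather than whatever the handler at /api/billing/webhook actually does, and it omits the timestamp tolerance check that Stripe's own libraries apply.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Parse "t=...,v1=...,v1=..." into the timestamp and the list of v1 signatures.
function parseSignatureHeader(header: string): { timestamp: string; signatures: string[] } {
  let timestamp = "";
  const signatures: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") signatures.push(value);
  }
  return { timestamp, signatures };
}

// Expected v1 value: hex HMAC-SHA256 of "<timestamp>.<raw body>".
function expectedSignature(secret: string, timestamp: string, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`, "utf8").digest("hex");
}

// True if any v1 signature in the header matches the one computed with this secret.
function signatureMatches(secret: string, header: string, rawBody: string): boolean {
  const { timestamp, signatures } = parseSignatureHeader(header);
  if (timestamp === "") return false;
  const expected = Buffer.from(expectedSignature(secret, timestamp, rawBody), "hex");
  return signatures.some((sig) => {
    const candidate = Buffer.from(sig, "hex");
    // timingSafeEqual requires equal lengths, so check that first.
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

The raw body must be the exact bytes Stripe sent; a body that has been parsed and re-serialized will fail the check even with the correct secret.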
A user can’t log in
- /dashboard/keys: confirm user exists in GoTrue admin.
- Check email_confirmed_at: unconfirmed users can’t sign in.
- Resend confirmation: GoTrue admin → user → Resend.
- If email never arrived, check Resend dashboard — most often a domain reputation issue, not GoTrue.
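The checks above form a short decision chain, sketched here as a pure function. The AuthUser interface is a hypothetical subset of the user record (only email_confirmed_at is named in this runbook), and the diagnosis labels and function names are illustrative.

```typescript
// Hypothetical subset of a GoTrue user record.
interface AuthUser {
  email: string;
  email_confirmed_at: string | null;
}

type LoginDiagnosis = "user-not-found" | "email-unconfirmed" | "account-ok";

// Walk the runbook checks in order: does the user exist, then is the email confirmed.
function diagnoseLogin(user: AuthUser | null): LoginDiagnosis {
  if (user === null) return "user-not-found";
  if (!user.email_confirmed_at) return "email-unconfirmed";
  return "account-ok";
}

// Map each diagnosis to the next action from the runbook.
function nextAction(diagnosis: LoginDiagnosis): string {
  switch (diagnosis) {
    case "user-not-found":
      return "Confirm the account exists in GoTrue admin.";
    case "email-unconfirmed":
      return "Resend confirmation; if it never arrives, check the Resend dashboard.";
    case "account-ok":
      return "Account looks fine; investigate elsewhere.";
  }
}
```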
Gateway latency spike
- Check TokenHub status page.
- If a single model is slow, LiteLLM should auto-failover; verify routing in admin.
- If the whole gateway is slow, escalate to infra — Solar Compute capacity or network path.
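The distinction between a single slow model and a slow gateway can be made mechanical. The sketch below assumes per-model p95 latencies have been collected by hand into a plain object; the function name, the input shape, and the threshold are all hypothetical and not something TokenHub or LiteLLM is known to expose in this form.

```typescript
type LatencyDiagnosis = "healthy" | "model-specific" | "whole-gateway";

interface LatencyReport {
  diagnosis: LatencyDiagnosis;
  slowModels: string[];
}

// Classify a latency spike from per-model p95 latencies (milliseconds).
// Every model slow means whole-gateway (escalate to infra);
// only some slow means model-specific (failover should handle it).
function classifyLatency(p95ByModelMs: Record<string, number>, thresholdMs: number): LatencyReport {
  const models = Object.keys(p95ByModelMs);
  const slowModels = models.filter((m) => p95ByModelMs[m] > thresholdMs).sort();
  if (slowModels.length === 0) return { diagnosis: "healthy", slowModels };
  if (slowModels.length === models.length) return { diagnosis: "whole-gateway", slowModels };
  return { diagnosis: "model-specific", slowModels };
}
```

With only one model configured, a slow reading is reported as whole-gateway, since there is nothing to compare it against.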
Where alerts go
Pages: PagerDuty rotation. Slack: #ops-alerts. Email: ops@xcity.one.
Last updated: