Operations · Advanced · 1 hr

Operate CodeCourier in production

RBAC, budgets, alerts, audit logging, and the runbook your on-call needs. The full operations handbook for running CodeCourier across a real engineering org.

By Tomás Rivera
Head of Platform
Updated May 10, 2026

Prerequisites

  • A CodeCourier workspace with multiple repos connected
  • An on-call rotation and an existing alerting destination (PagerDuty, Opsgenie, Slack)
  • Authority to set spending limits

Hobby usage and production usage of CodeCourier diverge sharply. A single engineer running a few sessions a day needs almost no operational discipline. An engineering org with dozens of users, unattended workflows, and Sprint Chains running overnight needs all of it. This guide is the handbook for that second world.

1. Map your roles before users get them

Open Workspace → Roles and define four roles before you invite a single teammate. The default permissive setup is wrong for production.

  • Viewer - can see sessions and PRs, cannot launch. Use for stakeholders, PMs, and curious cross-team observers.
  • Operator - can launch Issue Sessions against repos they have GitHub access to. Cannot modify personas or workflows. Use for most engineers.
  • Builder - can edit personas and workflows. Use for senior engineers who own the agent infrastructure.
  • Admin - full workspace control. Limit to two or three people, at least one of whom belongs to the security org.
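To make the role boundaries concrete, here is a minimal sketch of the four roles as a permission table. Everything in it is illustrative: the `Action` names, `ROLE_ACTIONS`, and `is_allowed` are hypothetical and are not part of any CodeCourier API. It also assumes that roles are cumulative (a Builder can do everything an Operator can) and that the GitHub access check applies to every launch, which the role descriptions above do not spell out.

```python
from enum import Enum, auto


class Action(Enum):
    VIEW = auto()
    LAUNCH_SESSION = auto()
    EDIT_PERSONA = auto()
    EDIT_WORKFLOW = auto()
    MANAGE_WORKSPACE = auto()


# Hypothetical permission table mirroring the four roles described above.
# Assumption: each role includes the permissions of the role before it.
ROLE_ACTIONS = {
    "viewer": {Action.VIEW},
    "operator": {Action.VIEW, Action.LAUNCH_SESSION},
    "builder": {
        Action.VIEW,
        Action.LAUNCH_SESSION,
        Action.EDIT_PERSONA,
        Action.EDIT_WORKFLOW,
    },
    "admin": set(Action),
}


def is_allowed(role: str, action: Action, has_github_access: bool = False) -> bool:
    """Return True if the role may perform the action.

    Launching a session additionally requires GitHub access to the target
    repo (assumed here to apply to every role, not only Operator).
    """
    if action not in ROLE_ACTIONS.get(role, set()):
        return False
    if action is Action.LAUNCH_SESSION:
        return has_github_access
    return True
```

Writing the table down before inviting anyone forces the awkward questions early, such as whether Builders should launch sessions at all.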

2. Set budgets at every level

Money is the first thing that goes wrong in unsupervised AI systems. CodeCourier enforces budgets at three levels - workspace, repo, and persona. Set all three.

budgets:
  workspace_monthly_usd: 5000
  per_repo_daily_usd:    200
  per_persona_run_usd:   25

actions_on_exceed:
  - "notify slack channel #codecourier-budget"
  - "pause new sessions on the affected scope"
  - "require admin acknowledge to resume"

The pause-on-exceed behaviour is the important one. Soft notifications get ignored; hard pauses force a conversation.
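The pause-on-exceed logic can be sketched in a few lines. This is not how CodeCourier implements it; `BudgetGuard`, its methods, and the scope keys are hypothetical, and the sketch leaves out the daily and monthly window resets a real system needs. The caps are copied from the config above.

```python
from dataclasses import dataclass, field

# Caps copied from the example config; the key names are illustrative.
LIMITS_USD = {
    "workspace_monthly": 5000.0,
    "per_repo_daily": 200.0,
    "per_persona_run": 25.0,
}


@dataclass
class BudgetGuard:
    spend: dict = field(default_factory=dict)  # scope key -> USD spent so far
    paused: set = field(default_factory=set)   # scope keys awaiting an admin

    def record(self, level: str, scope: str, usd: float) -> list[str]:
        """Add spend and return the actions triggered if the cap is exceeded."""
        key = f"{level}:{scope}"
        self.spend[key] = self.spend.get(key, 0.0) + usd
        if self.spend[key] > LIMITS_USD[level] and key not in self.paused:
            self.paused.add(key)
            return [
                "notify #codecourier-budget",
                f"pause new sessions on {key}",
            ]
        return []

    def can_launch(self, level: str, scope: str) -> bool:
        """New sessions are blocked while the scope is paused."""
        return f"{level}:{scope}" not in self.paused

    def acknowledge(self, level: str, scope: str, actor_role: str) -> bool:
        """Only an admin acknowledgement resumes a paused scope."""
        if actor_role != "admin":
            return False
        self.paused.discard(f"{level}:{scope}")
        return True
```

In this sketch an acknowledgement does not raise the cap, so the next recorded spend on an over-budget scope pauses it again. That is deliberate: resuming and raising the limit are separate decisions.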

3. Wire alerts to your on-call

In Workspace → Alerts, configure three alerts: budget overage, session failure rate above 10% in any 15-minute window, and any session running longer than 2 hours. Route the first two to your on-call rotation and the third to a low-priority channel.

The 2-hour threshold is opinionated for a reason. Sessions that run longer almost always indicate a misconfigured workflow, not genuinely long work. Catch them early.
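If you want to sanity-check the thresholds against your own session history before wiring them up, the two time-based rules can be approximated as follows. This is a sketch under stated assumptions: `FailureRateMonitor` and `long_running` are hypothetical helpers, events are assumed to arrive in chronological order, and "above 10%" is read as strictly greater than 10% of the sessions that finished inside the window.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)
FAILURE_RATE_THRESHOLD = 0.10
MAX_SESSION_AGE = timedelta(hours=2)


class FailureRateMonitor:
    """Sliding-window failure rate over finished sessions."""

    def __init__(self) -> None:
        self._events = deque()  # (finished_at, failed), oldest first

    def record(self, finished_at: datetime, failed: bool) -> None:
        self._events.append((finished_at, failed))

    def should_alert(self, now: datetime) -> bool:
        # Drop events that fell out of the 15-minute window.
        while self._events and self._events[0][0] < now - WINDOW:
            self._events.popleft()
        if not self._events:
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events) > FAILURE_RATE_THRESHOLD


def long_running(started_at: dict[str, datetime], now: datetime) -> list[str]:
    """Return IDs of sessions that have been running for more than 2 hours."""
    return sorted(
        sid for sid, started in started_at.items() if now - started > MAX_SESSION_AGE
    )
```

A rate computed over very few sessions is noisy; a real rule would probably also require a minimum number of sessions in the window before paging anyone.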

4. Enable full audit logging

Turn on the audit log export to your SIEM (Datadog, Splunk, Sumo, whatever you use). Every session launch, persona edit, budget override, and PR merge is logged with actor, timestamp, and full request body. Your security team will want this before signing off on production usage; give it to them before they ask.
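It helps to agree on the record shape with your security team before the export goes live. The sketch below shows one plausible shape as a JSON line carrying the three fields named above (actor, timestamp, request body). The event names, field names, and the `audit_record` helper are assumptions for illustration, not CodeCourier's actual export schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical event names for the four audited actions listed above.
AUDITED_EVENTS = {"session.launch", "persona.edit", "budget.override", "pr.merge"}


def audit_record(
    event: str,
    actor: str,
    request_body: dict,
    at: Optional[datetime] = None,
) -> str:
    """Serialize one audit event as a single JSON line for SIEM ingestion."""
    if event not in AUDITED_EVENTS:
        raise ValueError(f"unknown audit event: {event}")
    at = at or datetime.now(timezone.utc)
    return json.dumps(
        {
            "event": event,
            "actor": actor,
            "timestamp": at.isoformat(),  # timezone-aware ISO 8601
            "request": request_body,      # full request body, unredacted
        },
        sort_keys=True,
    )
```

Because the full request body is logged, check whether it can contain secrets or customer data, and agree on retention with the SIEM owners.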

5. Write a one-page runbook

Document, in one page, the five most likely production incidents and how to respond. For each, the answer should be three lines max: diagnosis, immediate action, escalation.

  • Budget pause - who can acknowledge, how to raise the cap.
  • Stuck session - how to kill, how to capture partial output.
  • Persona regression - how to roll back to the previous version.
  • Bad PR opened - how to revert and re-run with a revised plan.
  • GitHub integration token revoked - re-install procedure.

Pin this page in your on-call channel. The runbook is the difference between a 10-minute incident and a 2-hour one.

6. Schedule the boring maintenance

Two recurring tasks make production usage durable. Run them weekly and put a recurring reminder on the team owner's calendar.

  • Persona review - sample five recent sessions per persona, confirm output still meets the bar, retire any persona unused for 30 days.
  • Budget retrospective - read the previous week's spend by repo and persona, raise or lower caps based on actual usage.
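The persona review is easy to script once you can export session history. The sketch below assumes a hypothetical export of `(persona, session_id, started_at)` tuples and reads "sample five recent sessions" as the five most recent; `review_plan` is an illustrative helper, not a CodeCourier feature.

```python
from datetime import datetime, timedelta

SAMPLE_SIZE = 5
RETIRE_AFTER = timedelta(days=30)


def review_plan(sessions, personas, now):
    """Build the weekly persona review.

    sessions: iterable of (persona, session_id, started_at) tuples.
    personas: every persona defined in the workspace, used or not.
    Returns (samples, retire): sessions to read per persona, and personas
    with no session in the last 30 days.
    """
    runs_by_persona = {persona: [] for persona in personas}
    for persona, session_id, started_at in sessions:
        runs_by_persona.setdefault(persona, []).append((started_at, session_id))

    samples, retire = {}, []
    for persona, runs in runs_by_persona.items():
        runs.sort(reverse=True)  # most recent first
        if not runs or now - runs[0][0] > RETIRE_AFTER:
            retire.append(persona)
        else:
            samples[persona] = [sid for _, sid in runs[:SAMPLE_SIZE]]
    return samples, sorted(retire)
```

The script only picks what to read and what to consider retiring. Judging whether the output still meets the bar remains a human task.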

7. Next steps

With operations in place you have the foundation for everything else. Pointing the integration at your full backlog, fanning Sprint Chains across the fleet, and trusting unattended workflows all become safer once the operational discipline is there. The order matters: do not invert it.

