Dark Software Factory: Groupon's Lights-Out CMS on GitHub Actions

Q: How do I prevent the agent from deleting production data?

Three layers: First, the `allowed_tools` pattern blocks dangerous Bash commands. Second, branch protection rules on `main` prevent direct pushes. Third, the agent never runs on `main`; it always commits to a feature branch. Even if the agent went rogue, it could only delete files on a branch that gets reviewed before merge.

Q: What happens when the agent produces broken code?

The quality gates catch it before PR creation. If `pnpm lint` or `pnpm exec tsc --noEmit` fails, the PR creation step fails and the workflow posts an error comment on the issue. The user can reply with corrections, which triggers the clarification loop again. In practice, most agent errors are syntax-level and fixable with a one-sentence correction in the issue thread.

Q: What happens if fal.ai goes down?

Both `generate-image.ts` and `generate-headshot.ts` have fallback models. `flux/dev` falls back to `flux/schnell`. `gpt-image-2` falls back to `flux-pro/kontext`. If both fail, the agent enters the clarification loop and asks the user to provide an image manually. The factory degrades gracefully rather than crashing.

Q: What is the CI Auto-Healer?

`cms-ci-healer.yml` is a workflow that triggers when `ci.yml` or `cms.yml` fails. It checks out the exact failing SHA, downloads the run logs, and invokes Claude Code with a diagnostic prompt. The agent reads the error logs, identifies the root cause, opens a fix PR on a `fix/ci-healer-{run_id}-{sha}` branch, and posts a comment on the original issue. This closes the loop between failure and fix without human intervention.

Q: How does the PR Code Review workflow work?

`pr-code-review.yml` runs on every PR against `main`. It fetches the full diff, invokes Claude Code for a two-phase review: first `/code-review` to find issues, then `/fix` to auto-fix them on the same branch. It skips branches opened by the auto-healer (`fix/ci-healer-*`) or the CMS bot (`claude/*`) to prevent infinite loops. The review comments are posted to the PR, and fixes are pushed as additional commits.

Q: Why does the preview workflow use deployment_status instead of pull_request?

`vercel-preview.yml` listens on `deployment_status` because Vercel previews are triggered by pushes, not PR events. The workflow receives the deployment SHA, looks up the associated PR via `gh api /repos/{repo}/commits/{sha}/pulls`, extracts the `Closes #N` reference from the PR body, and posts a state-specific comment on the source issue. This handles pending, in-progress, success, and failure states with different messages.

Dark Software Factory: How Groupon Runs a Lights-Out CMS on GitHub Issues and PRs

A deep teardown of the production CMS pipeline that turns GitHub Issues into merged PRs while you sleep. 10 workflows, 6 issue types, AI media generation, and the exact DAGs that make it work.

EditorialadvancedMay 12, 202612 min read

By Viktor Bezdek · VP Engineering, Groupon

Here is how I learned that a dark factory is not a thought experiment. It is a weekend project that accidentally becomes production infrastructure.

Groupon had a careers site. It was the kind of site you have seen a thousand times: blue gradients, stock photos of people shaking hands, centered text saying We are passionate about excellence. It said nothing true. It attracted no one worth hiring. The old site lived at grouponcareers.com and it was embarrassing every time a candidate opened it.

I decided to fix it. Not in a quarter. Not with a roadmap. Over a single weekend. The goal was not to ship a perfect site. The goal was to ship a site that told the truth about working at Groupon during an AI transformation, and to build it using the same agentic patterns we were already preaching internally.

The preview lives at groupon-careers.vercel.app. Go look at it. The difference is not just visual. The old site was a declaration of conditions we wished were true. The new one is a declaration of conditions that actually exist: live marketplace pressure, legacy systems, AI workflows in production, real customer weight. It does not say join our journey. It says build while running.

The Old Site: Corporate Slop

Generic blue gradients. Stock photos of handshakes. Text like We are passionate about excellence and join our journey. It said nothing specific enough to be false. It attracted no one who wanted real problems. The site was not broken. It was worse than broken: it was invisible.

The New Site: The Arena

Black background. Amber and copper accents. Sections called The Arena, What AI-Native Means Here, Proof, Not Claims, Builder Profile. It names the exact conditions: legacy systems, real revenue pressure, six AI workflows in production, no clean room. It attracts builders who want leverage, agency, and hard problems. The site was built with Next.js 16, Tailwind 4, MDX content, and a dark factory CMS pipeline that generates spotlights, insights, and location pages from GitHub Issues while the team sleeps.

The build process itself was the experiment. I did not write the content files by hand. I opened GitHub Issues. cms-spotlight-add for employee profiles. cms-insight for leadership articles. cms-location for office pages. The agent read each issue, loaded the correct skills from .claude/skills/, generated headshots via fal-ai/gpt-image-2, wrote the MDX, ran pnpm lint, and opened a PR. I typed /publish on the issue. The PR merged. Vercel deployed.

By Monday morning, the preview site had eight employee spotlights, three insight articles, five location pages, and a full FAQ. None of the content was written by a human sitting at a keyboard. All of it was reviewed by a human before merge. That is the distinction that matters: the factory does not replace judgment. It compresses execution.

What I learned over that weekend became the pipeline this article describes. The BOT_PAT trap, the WAITING/READY signal protocol, the branch detection bug, the recursive comment filter, the media generation fallback chain — every trap in this article was discovered in real time while building the careers preview. The factory taught itself by failing, and I watched the logs.

A dark software factory is a repository where the control plane is not a dashboard or a scheduler. It is the native GitHub surface: Issues for intake, Pull Requests for implementation, Comments for dialogue, and Labels for state machine transitions. AI agents run inside GitHub Actions, reading the same interfaces humans use, producing the same artifacts humans produce, and following the same quality gates humans follow.

The term comes from manufacturing: a dark factory runs without lights because there are no humans on the floor. In software, it means the repository keeps shipping while the team sleeps, attends meetings, or works on harder problems. The humans write specs and design harnesses; the agents write code, generate images, and manage state.

This article is a deep teardown of a production CMS pipeline at Groupon that processes content requests through GitHub Issues. It covers the exact workflow DAGs, the AI configuration surface, the media generation pipeline, the issue type taxonomy, and the specific traps that will break your factory in production. Everything here is drawn from running systems and committed code.

Key Takeaways

✓
GITHUB_TOKEN cannot fire pull_request events on PRs it creates. You need a BOT_PAT with repo scope, or your CI gates and auto-merge will never trigger.
✓
The WAITING/READY signal protocol lets an agent in a sandboxed action communicate state to the workflow orchestrator without file system access or side channels.
✓
Clarification loops require careful actor filtering: exclude bot comments, exclude /publish commands, and never remove the cms-bot-waiting label before PR creation succeeds.
✓
Quality gates must run on the agent's branch (claude/issue-N-*), not the workflow's checkout of main, or you will silently validate the wrong code.
✓
Media generation uses fal-ai/flux/dev for insight headers and fal-ai/gpt-image-2 for spotlight headshots, with monthly cap enforcement via .cms-fal-log.json.
✓
The allowed_tools pattern in anthropics/claude-code-action is a security boundary, not a convenience list. It blocks gh pr create, rm -rf, and any unapproved command.

The Issue Taxonomy: Six Types, Six Skill Loadouts

How the factory classifies work before an agent ever sees it

The factory's job queue is not a free-form inbox. It is a typed taxonomy of six issue labels, each with a strict schema, required fields, and a specific set of skills the agent must load before acting. The label is the routing key. The schema is the contract.

When an issue is opened with a cms-* label, the cms.yml workflow fires. The first thing the agent does is read AGENTS.md and load the skills mapped to that label. This is not optional configuration. It is a mandatory skill-loading protocol that determines whether the agent generates a headshot, expands an article draft, or edits a location page.

Label	Content Type	Required Fields	Primary Skills	Media Pipeline
`cms-spotlight-add`	Employee profile	`name`, `role`, `team`, `quote`, photo, 3 Q&A pairs	`content-engine`, `cms-headshot-gen`, `cms-brand-voice-check`, `ai-slop-cleaner`	`generate-headshot.ts` → `public/images/spotlight/{slug}.png`
`cms-spotlight-edit`	Profile update	Fields to change (partial update)	`content-engine`, `cms-brand-voice-check` (+ `cms-headshot-gen` if photo)	Headshot if new photo attached
`cms-insight`	Leadership article	`title`, `author`, `authorRole`, `body` >= 100 words with factual anchors	`cms-article-expand`, `cms-image-gen`, `cms-brand-voice-check`, `ai-slop-cleaner`	`generate-image.ts` → `public/images/insights/{slug}.png`
`cms-faq`	FAQ entry	`question`, `answer` (2–4 sentences, direct voice)	`content-engine`, `cms-brand-voice-check`	None
`cms-location`	Office page	`city`, `region`, `timezone`, `functions`, `intro`, `highlights`	`content-engine` (download photo if provided)	`download-attachment.ts` only; no generation
`cms-content-edit`	Copy or component change	Target file and desired change	`frontend-design` for components; `cms-brand-voice-check` for copy	None

The taxonomy is rigid for a reason. The agent cannot invent a new issue template. It cannot generate a logo because cms-logo is not a label. It cannot write a component change without the frontend-design skill. The harness is the design: human-authored constraints that prevent the agent from wandering outside its lane.

For spotlights, the quote field has a specific quality gate: it must be specific enough to be false. A generic quote like I love working here fails the audit because it could apply to anyone. The agent must ask for a replacement via the clarification loop. For insights, the body must contain factual anchors — names, dates, tools, outcomes — or the cms-article-expand skill refuses to expand it.

The 11-Workflow DAG: Every Transition, Every Guard

The complete control flow from issue opened to PR merged, including auto-healing and code review

A dark factory is not one massive workflow. It is 11 small workflows that compose into a closed loop. Each workflow owns one state transition, reads one event type, and writes one artifact type. This decomposition is what makes the system debuggable when an agent hallucinates a file path or a label filter misfires.

The 11 workflows are: cms.yml (Issue Intake), issue-clarify.yml (Clarification Loop), cms-ci-healer.yml (CI Auto-Healer), pr-code-review.yml (Code Review + Auto-Fix), cms-publish.yml (Publish/Merge), vercel-preview.yml (Preview Notifications), cms-spend-alert.yml (Cost Monitoring), cms-stale-prs.yml (Stale PR Cleanup), cms-canary.yml (Skill Loading Test), ci.yml (Quality Gates), pr-checks.yml (Coverage Gate), and the implicit build job that Vercel runs on every push.

Factory Control Flow: From Issue to Merge

The complete DAG showing all 11 workflows, their triggers, and the label-based state transitions between them.

Workflow 1: cms.yml — Issue Intake and Agent Execution

The orchestrator that decides whether to implement or wait

cms.yml is the heart of the factory. It triggers on issues.opened and issues.reopened when the issue carries a cms-* label and does not carry cms-bot-waiting. The first gate is the kill switch: vars.CMS_BOT_ENABLED == 'true'. If the repository variable is anything other than 'true', the workflow exits immediately. This is how you halt the entire factory without touching code.

The workflow uses pnpm/action-setup@v4 with Node 20 caching, checks out the repo with fetch-depth: 0, installs the genmedia CLI for image generation, and and then invokes anthropics/claude-code-action@beta. The action is the agent container. Everything that follows depends on three configuration surfaces: model, allowed_tools, and direct_prompt.

cms.yml — Agent Configuration Surface

uses: anthropics/claude-code-action@beta
with:
  anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
  model: claude-sonnet-4-6
  github_token: ${{ secrets.GITHUB_TOKEN }}
  allowed_tools: >
    Edit,MultiEdit,Glob,Grep,LS,Read,Write,
    Bash(git add:*),Bash(git commit:*),Bash(git push:*),
    Bash(git status:*),Bash(git diff:*),
    Bash(pnpm exec tsx scripts/cms/*),
    Bash(pnpm exec tsx scripts/cms/generate-image.ts:*),
    Bash(pnpm exec tsx scripts/cms/generate-headshot.ts:*),
    Bash(node scripts/cms/*),
    Bash(pnpm lint:*),Bash(pnpm exec tsc:*)
  direct_prompt: |
    Read AGENTS.md. Identify the content type from the issue labels.
    Load the matching skill(s) from .claude/skills/ per the skill-selection table.
    If all required fields are present:
      - Generate content, run brand-voice-check, commit.
      - Output the word READY on its own line at the end.
    If fields are missing:
      - Post a comment listing what's needed.
      - Output the word WAITING on its own line at the end.
      - Stop. Do not commit anything.
    DO NOT run gh pr create, gh issue edit, or any label/PR command.

The model field pins the agent to claude-sonnet-4-6. This is not the latest model. It is the calibrated model. The factory was tested against this specific version, and changing it without re-calibrating the allowed_tools patterns and prompt instructions risks breaking the agent's behavior.

The allowed_tools field is the security boundary. It uses a comma-separated list where each Bash command is prefixed with Bash(...) and supports wildcard arguments. Bash(git add:*) allows git add content/spotlights/jana-novotna.mdx but Bash(git add:*) does not allow git add . && rm -rf .git because the action treats the entire string as a single command and rejects &&. This is critical: the action's Bash tool does not support command chaining, pipes, or semicolons. Every Bash call must be a single executable with literal arguments.

The allowed_tools list includes Bash(pnpm exec tsx scripts/cms/*) for cap-checking (cap-check.ts) and image postprocessing (postprocess-image.ts), and Bash(genmedia run:*) for media generation via the genmedia CLI. The direct_prompt is the control surface. It is not a suggestion. It is a script that the agent executes line by line. The prompt ends with an explicit signal instruction: READY or WAITING on its own line. The workflow reads this signal from the action's output JSON file.

cms.yml — Signal Extraction

- name: Add cms-bot-waiting label if bot is waiting
  run: |
    RESULT=$(tail -1 /home/runner/work/_temp/claude-execution-output.json \
      | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('result',''))")
    if echo "$RESULT" | grep -qi 'WAITING'; then
      gh issue edit "${{ github.event.issue.number }}" --add-label "cms-bot-waiting"
    fi
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Workflow 2: issue-clarify.yml — The Clarification Loop

How the factory handles incomplete specs without human intervention

The clarification loop is where most dark factory implementations break. A user opens an issue, the agent asks for a photo, the user uploads one in a comment, and now a second workflow must re-run the agent with the accumulated context. The trigger is issue_comment.created, but the filter is the hardest part of the entire system. The pipeline now also filters on github.event.actor != 'github-actions[bot]' as a second guard against recursive triggering.

You must exclude: bot comments (to prevent recursive self-triggering), /publish commands (handled by cms-publish.yml), /cancel commands, comments on pull requests (not issues), and issues that do not have both cms-bot-waiting and a cms-* label. The Groupon pipeline uses a join-based prefix match for label filtering because GitHub Actions' contains() on arrays does exact matching only. contains(join(labels.*.name, ' '), 'cms-') checks whether any label starts with cms-, which is the correct behavior for a prefix-based taxonomy.

issue-clarify.yml — Actor and Label Filter

if: |
  vars.CMS_BOT_ENABLED == 'true'
  && !github.event.issue.pull_request
  && github.event.comment.user.login != 'github-actions[bot]'
  && contains(join(github.event.issue.labels.*.name, ' '), 'cms-bot-waiting')
  && contains(join(github.event.issue.labels.*.name, ' '), 'cms-')
  && !startsWith(github.event.comment.body, '/publish')
  && !startsWith(github.event.comment.body, '/cancel')

The clarification workflow has a pre-download step that runs before the agent. It extracts photo URLs from the comment thread using gh issue view --comments --json comments, downloads them with curl and GITHUB_TOKEN auth, and writes them to public/images/spotlight/{slug}.jpg. This solves two sandbox problems at once: the agent cannot access /tmp, and GitHub attachment URLs expire after a short time. By downloading images in a workflow step, the photos are available in the repo workspace when the agent starts.

A critical bug that appeared during calibration: removing the cms-bot-waiting label before PR creation. If the label is removed early and gh pr create fails, the issue no longer has the waiting label, so issue-clarify.yml will not re-trigger on the user's next reply. The issue is stranded. The fix is to remove the label only after successful PR creation, in the success path of the same step that creates the PR.

Clarification Loop State Machine

The exact state transitions for the clarification loop. The label is the state. The comment is the input. The agent's signal is the output.

Workflow 3: cms-ci-healer.yml — The Auto-Healer

When CI fails, the factory heals itself

The CI Auto-Healer is the most recent addition to the factory. It triggers on workflow_run.completed when the conclusion is failure, watching CI and CMS — Process content request workflows. It downloads the run log archive via gh api repos/{repo}/actions/runs/{id}/logs, extracts the last 200 error lines, and invokes Claude Code on the exact failing SHA (not main).

The agent receives the error context and a diagnostic prompt. It creates a fix/ci-healer-{run_id}-{sha} branch, pushes it, applies the fix, and opens a PR. The pr-code-review.yml workflow explicitly skips fix/ci-healer-* branches to prevent the review loop from re-triggering on its own healing PRs.

cms-ci-healer.yml — Trigger and Log Fetch

on:
  workflow_run:
    workflows:
      - CI
      - CMS — Process content request
    types: [completed]

jobs:
  heal:
    if: github.event.workflow_run.conclusion == 'failure'
    steps:
      - uses: actions/checkout@v4.2.2
        with:
          ref: ${{ github.event.workflow_run.head_sha }}
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4.4.0
        with:
          node-version: '20'
          cache: 'pnpm'
      - name: Fetch failing run logs
        run: |
          gh api "repos/${REPO}/actions/runs/${RUN_ID}/logs" > /tmp/logs.zip
          unzip /tmp/logs.zip -d /tmp/run-logs
          grep -h "error\|Error\|FAIL" /tmp/run-logs/*.txt | tail -200 > /tmp/errors.txt
      - name: Run Claude Code to diagnose and fix
        uses: anthropics/claude-code-action@beta
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          model: claude-sonnet-4-6
          allowed_tools: >-
            Edit,MultiEdit,Glob,Grep,LS,Read,Write,
            Bash(git add:*),Bash(git commit:*),Bash(git push:*),
            Bash(pnpm lint:*),Bash(pnpm exec tsc:*)
          direct_prompt: |
            Analyze the CI failure logs at /tmp/errors.txt.
            Identify the root cause. Create a fix on branch fix/ci-healer-{run_id}-{sha}.
            Run pnpm lint and pnpm exec tsc --noEmit before committing.

Workflow 4: pr-code-review.yml — Review and Auto-Fix

Every PR gets a Claude review before human eyes see it

The PR Code Review workflow runs on every pull request against main (opened, synchronize, reopened). It fetches the full diff with git diff base...head, pipes it to Claude Code for a structured code review, and then — if the review finds must_fix or should_fix issues — invokes a second pass to auto-fix them on the same branch.

The workflow has a critical guard: it skips branches starting with fix/ci-healer- or claude/. Without this, a bot-fix PR would trigger another review, which might trigger another fix, creating an infinite loop. The allowed_tools list is narrower than the CMS workflows: it excludes media generation and headshot scripts, focusing only on lint, typecheck, and test tools.

pr-code-review.yml — Guard and Diff

on:
  pull_request:
    branches: [main]
    types: [opened, synchronize, reopened]

jobs:
  review-and-fix:
    if: |
      !startsWith(github.head_ref, 'fix/ci-healer-') &&
      !startsWith(github.head_ref, 'claude/')
    steps:
      - uses: actions/checkout@v4.2.2
        with:
          fetch-depth: 0
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4.4.0
        with:
          node-version: '20'
          cache: 'pnpm'
      - name: Fetch diff for review
        run: |
          git diff "${BASE_SHA}...${HEAD_SHA}" > /tmp/pr.diff
          echo "DIFF_SIZE=$(wc -c < /tmp/pr.diff)" >> "$GITHUB_ENV"
      - name: Code review + auto-fix
        uses: anthropics/claude-code-action@beta
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          model: claude-sonnet-4-6
          allowed_tools: >-
            Edit,MultiEdit,Glob,Grep,LS,Read,Write,
            Bash(git add:*),Bash(git commit:*),Bash(git push:*),
            Bash(pnpm lint:*),Bash(pnpm exec tsc:*),Bash(pnpm test:*)
          direct_prompt: |
            Review the diff at /tmp/pr.diff.
            Phase 1: /code-review — find must_fix and should_fix issues.
            Phase 2: If issues found, /fix them on this branch.
            Run pnpm lint and pnpm exec tsc --noEmit after fixes.

Workflow 5: PR Creation — The BOT_PAT Problem

Why GITHUB_TOKEN is not enough, and how to work around it

Here is the single most expensive trap in dark factory engineering. When a GitHub Actions workflow creates a PR using the default GITHUB_TOKEN, that PR does not fire pull_request events. This means your ci.yml, your pr-checks.yml, your required status checks, and your auto-merge rules will never trigger. The PR sits open, green in the sense that no checks failed, but unmergeable because no checks ran.

The fix is a Personal Access Token with repo scope, stored as BOT_PAT. PRs created with BOT_PAT fire events normally because GitHub treats them as user-created, not action-created. The tradeoff is that you must rotate this token and scope it to the minimum repositories.

In the Groupon pipeline, BOT_PAT is used only for gh pr create. All subsequent operations — labeling, commenting, merging — use GITHUB_TOKEN so the comments are attributed to github-actions[bot] and do not re-trigger issue workflows.

GITHUB_TOKEN-created PR

Does not fire pull_request events
CI workflows and status checks never trigger
Required checks show as Expected but never run
Auto-merge rules cannot evaluate
PR remains unmergeable indefinitely

BOT_PAT-created PR

Fires pull_request events normally
CI workflows, status checks, and required checks trigger automatically
Branch protection rules evaluate correctly
Auto-merge rules can proceed on green CI
Human-like event semantics with machine execution

The PR creation step uses gh pr merge --squash --admin for same-day merges when the cms-approved label is applied. This bypasses the 24-hour branch protection delay for bot PRs that have passed all quality gates. The step also runs quality gates: pnpm lint and pnpm exec tsc --noEmit. These must run on the agent's branch, not on the workflow's checkout of main. A common latent bug is checking out main, then running lint against main while the agent's commits sit on an undetected branch. The fix is to git checkout the action branch after detecting it with git branch -r | grep origin/claude/issue-N-....

The PR body includes a cost report generated by scripts/cms/cost-report.ts. This script reads .cms-fal-log.json for fal.ai spend and estimates Anthropic cost from token counts or issue body length. Every automated PR carries its own price tag in the description.

Workflow 6: cms-publish.yml — Human Approval as a Label

How /publish becomes cms-approved becomes squash-merged

The publish workflow has two jobs that communicate through labels, not through shared state or message queues. Job 1 listens for /publish comments on issues, finds the associated bot PR by searching for Closes #N in PR bodies, verifies the PR has the bot-generated label, and adds cms-approved. Job 2 listens for pull_request.labeled events where the label is cms-approved and the PR has the bot-generated label, and immediately squash-merges with gh pr merge --squash --admin.

This label-based coordination is robust against failure. If the associated PR is not bot-generated (missing the bot-generated label), Job 1 replies with a warning and exits. If the user types /publish before the bot has opened a PR, Job 1 replies with a helpful message and exits cleanly. If gh pr create failed but the waiting label was already removed, the label-based trigger would not fire because the PR does not exist. The state machine degrades gracefully: the user replies again, the clarification loop re-triggers, and the system recovers.

cms-publish.yml — Merge Job

auto-merge:
  if: |
    github.event_name == 'pull_request'
    && github.event.action == 'labeled'
    && github.event.label.name == 'cms-approved'
    && contains(join(github.event.pull_request.labels.*.name, ' '), 'bot-generated')
  steps:
    - name: Merge PR (squash)
      env:
        GH_TOKEN: ${{ secrets.BOT_PAT }}
      run: gh pr merge "${{ github.event.pull_request.number }}" --squash --admin

Workflow 7: Media Generation Pipeline

How fal.ai, flux, and gpt-image-2 generate assets inside a CI job

The factory does not just write code. It generates images. Two distinct pipelines handle media: insight article headers and spotlight headshots. Both use fal.ai, both enforce a monthly spend cap, and both exit with structured JSON that the workflow parses.

The insight header pipeline lives in scripts/cms/generate-image.ts. It uses fal-ai/flux/dev as the primary model and fal-ai/flux/schnell as the fallback. The prompt is appended with . No text, no logos. to prevent the model from rendering watermarks or typography. The output is resized to 1200×630 via sharp, EXIF-stripped, and written atomically to public/images/insights/{slug}.png.

The headshot pipeline lives in scripts/cms/generate-headshot.ts. It uses fal-ai/gpt-image-2 as the primary model (image-to-image conditioning) and fal-ai/flux-pro/kontext as the fallback. The input photo is base64-encoded and passed as image_url. The canonical STYLE_PROMPT is professional headshot, light neutral background, soft fill lighting, square crop 1:1, no text, photographic style. The output is 1024×1024, EXIF-stripped, and written to public/images/spotlight/{slug}.png.

generate-image.ts — Model Selection and Cap Enforcement

const PRIMARY_MODEL = 'fal-ai/flux/dev';
const FALLBACK_MODEL = 'fal-ai/flux/schnell';

export async function checkCap(currentSpendUsd: number, capUsd: number): Promise<void> {
  if (currentSpendUsd >= capUsd * 0.9) {
    throw new Error(
      `Monthly fal.ai spend ($${currentSpendUsd.toFixed(2)}) is at or above 90% of the cap ($${capUsd}). Image generation paused.`
    );
  }
}

export async function generateInsightImage(slug: string, prompt: string): Promise<string> {
  const cap = parseFloat(process.env.FAL_MONTHLY_CAP_USD ?? '50');
  const currentSpend = readCurrentMonthSpend();
  await checkCap(currentSpend, cap);

  fal.config({ credentials: process.env.FAL_KEY });
  const fullPrompt = `${prompt}. No text, no logos.`;

  try {
    const result = await fal.run(PRIMARY_MODEL, {
      input: {
        prompt: fullPrompt,
        image_size: 'landscape_16_9',
        num_inference_steps: 28,
        guidance_scale: 3.5,
        num_images: 1,
      },
    });
    imageUrl = result.images[0].url;
  } catch (primaryErr) {
    const fallback = await fal.run(FALLBACK_MODEL, {
      input: { prompt: fullPrompt, image_size: 'landscape_16_9', num_inference_steps: 4 },
    });
    imageUrl = fallback.images[0].url;
  }

  const processedBuffer = await sharp(imageBuffer).resize(1200, 630).png().toBuffer();
  // atomic write: tmp → rename
  logGeneration(0.09);
}

Both pipelines share a cost logging mechanism: .cms-fal-log.json. It tracks requests and costUsd per month. The checkCap function throws when spend reaches 90% of FAL_MONTHLY_CAP_USD. This is not a soft warning. It is a hard stop that prevents the agent from burning the monthly budget on a runaway generation loop.

The download-attachment.ts script handles location photos and user-uploaded spotlight photos. It validates MIME type (image/*), enforces a 10 MB streaming byte cap, guards against path traversal in the slug, strips EXIF via sharp.rotate().png(), and writes atomically using a temp file and fs.renameSync. It also checks that the output path is not a symlink or special file before overwriting.

Workflow 8: Cost Governance

How the factory tracks and controls inference spend

Cost governance in a dark factory is not an afterthought. It is a first-class concern because the dominant cost is LLM inference, and runaway agents can burn through a daily budget before anyone notices. The Groupon pipeline has three cost control layers: per-PR cost reporting, daily spend alerts, and monthly media caps.

scripts/cms/cost-report.ts generates a markdown cost block for every PR. It reads .cms-fal-log.json for fal.ai spend and estimates Anthropic cost using Claude Opus pricing: $15/MTok input, $75/MTok output. If the action does not expose per-run token counts (which claude-code-action@beta does not as of the latest calibration), the script falls back to estimating from ISSUE_BODY_LENGTH: ~4 characters per token, plus a 2,000 token base for context, with output assumed at 30% of input.

scripts/cms/check-daily-spend.ts runs on a cron at 07:00 UTC (after stale PR cleanup at 06:00 UTC). It estimates yesterday's total spend from .cms-fal-log.json and .cms-token-log.json, compares it against DAILY_SPEND_THRESHOLD_USD (default 50), and opens a GitHub issue with label ops-cost-alert if the threshold is exceeded. The issue body includes a JSON breakdown and instructions to set CMS_BOT_ENABLED=false to pause the factory.

cost-report.ts — Pricing Constants

const ANTHROPIC_INPUT_COST_PER_TOKEN = 15 / 1_000_000; // $15/MTok
const ANTHROPIC_OUTPUT_COST_PER_TOKEN = 75 / 1_000_000; // $75/MTok
const FAL_COST_PER_REQUEST = 0.09;

export function estimateCost(params: { inputTokens: number; outputTokens: number }): number {
  return params.inputTokens * ANTHROPIC_INPUT_COST_PER_TOKEN +
         params.outputTokens * ANTHROPIC_OUTPUT_COST_PER_TOKEN;
}

export function formatCostReport(data: CostData): string {
  const total = data.anthropicCostUsd + data.falCostUsd;
  return `### Cost\n\n- Anthropic: ~$${data.anthropicCostUsd.toFixed(2)}` +
         ` (${data.anthropicInputTokens.toLocaleString()} input, ${data.anthropicOutputTokens.toLocaleString()} output)\n` +
         `- fal.ai: $${data.falCostUsd.toFixed(2)} (${data.falRequests} request${data.falRequests === 1 ? '' : 's'})\n` +
         `- Total: $${total.toFixed(2)}`;
}

Workflow 9: Production Traps

What breaks when you run this at scale, and the exact fix

Sandbox Escape

Agents with unrestricted Bash access can push to main, delete branches, or leak secrets. Use explicit allowed_tools patterns with no wildcards on dangerous commands. The action rejects &&, |, and ; in Bash calls.

Branch Detection

The action commits to claude/issue-N-* but the workflow checks out main. Always detect the action branch with git branch -r and git checkout it before running quality gates.

Cost Visibility

Every agent run costs tokens. Generate a cost report from the execution log and append it to the PR body so approvers see the price before merging. Use .cms-fal-log.json for media and .cms-token-log.json for inference.

Recursive Triggering

Bot comments can trigger the clarification loop if the actor filter is wrong. Exclude github-actions[bot], claude-code-action[bot], and any other app actors explicitly. Use github.event.comment.user.login not github.event.actor.

Stale PRs

Bot PRs that fail CI and never get fixed accumulate. cms-stale-prs.yml uses actions/stale@v9 with only-pr-labels: bot-generated and days-before-pr-stale: 7. It closes the PR and re-opens the source issue.

Workflow files for a full factory

Intake, clarification, publish, preview, spend, stale, canary, CI, PR checks, CI healer, and code review

30+

Commits in a single day of calibration

The Groupon CMS pipeline required 30+ iterative commits to stabilize the clarification loop and BOT_PAT behavior

$0.02–$0.50

Cost per issue processed

Claude Code agent cost per content request, measured via cost-report.ts and appended to each PR body

$0.09

Cost per image generated

fal.ai cost per request for gpt-image-2 or flux/dev, logged in .cms-fal-log.json; genmedia CLI abstracts provider switching

Dark Factory Production Readiness Checklist

What to verify before calling it lights-out

Production Readiness Checklist

BOT_PAT secret configured with repo scope and correct repository access
GITHUB_TOKEN-created PRs confirmed to NOT fire pull_request events (tested)
BOT_PAT-created PRs confirmed to fire CI workflows and required checks
Agent allowed_tools list excludes gh pr create, gh issue edit, and rm -rf
WAITING/READY signal protocol tested with both missing and complete issues
Clarification loop filters exclude bot comments, /publish, /cancel, and PR comments
cms-bot-waiting label is removed ONLY after successful PR creation, never before
Quality gates run on the agent's branch, not the workflow's checkout of main
Cost report generated and appended to every bot PR body
Kill switch repository variable (CMS_BOT_ENABLED) tested and documented
cms-stale-prs.yml closes unmerged bot PRs after 7 days and re-opens source issues
cms-spend-alert.yml warns when daily agent costs exceed threshold
cms-ci-healer.yml triggers on workflow_run.completed with conclusion=failure, checks out the failing SHA, and opens a fix PR
pr-code-review.yml skips bot branches (fix/ci-healer-*, claude/*) to prevent infinite review-fix loops
vercel-preview.yml uses deployment_status event and looks up the source issue by PR SHA, not by label
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 is set in all workflow env blocks to force the action runtime
generate-image.ts and generate-headshot.ts enforce 90% monthly cap via .cms-fal-log.json
download-attachment.ts validates MIME type, enforces 10MB cap, and strips EXIF
vercel-preview.yml posts preview URLs to source issues on deployment success

Can I use this without Claude Code?

Yes. The architecture is independent of the agent implementation. You could use OpenAI's Codex CLI, GitHub Copilot's agent mode, or a custom LLM client inside the action. The critical interfaces are the same: the agent reads a prompt, emits files, runs git commands, and signals READY or WAITING. What changes is the action wrapper and the allowed_tools pattern.

How do I prevent the agent from deleting production data?

Three layers: First, the allowed_tools pattern blocks dangerous Bash commands. Second, branch protection rules on main prevent direct pushes. Third, the agent never runs on main; it always commits to a feature branch. Even if the agent went rogue, it could only delete files on a branch that gets reviewed before merge.

What happens when the agent produces broken code?

The quality gates catch it before PR creation. If pnpm lint or pnpm exec tsc --noEmit fails, the PR creation step fails and the workflow posts an error comment on the issue. The user can reply with corrections, which triggers the clarification loop again. In practice, most agent errors are syntax-level and fixable with a one-sentence correction in the issue thread.

How much does this cost to run?

The dominant cost is LLM inference. A typical content request costs $0.02–$0.50 in Claude API tokens depending on complexity. Image generation via fal.ai adds $0.09 per request. GitHub Actions minutes are negligible for private repos (2000 minutes/month free). The cost-report.ts script tracks per-PR spend so you can set monthly caps via FAL_MONTHLY_CAP_USD and daily alerts via DAILY_SPEND_THRESHOLD_USD.

Can multiple agents run in parallel?

Yes, but they need isolation. The Groupon pipeline uses Docker sandboxing via the Claude Code action. For broader multi-agent patterns — where one agent implements, another reviews, and a third merges — see the Dark Factory CLI by Peter Stratton, which orchestrates three independent Claude Code instances with separate permissions.

What happens if fal.ai goes down?

Both generate-image.ts and generate-headshot.ts have fallback models. flux/dev falls back to flux/schnell. gpt-image-2 falls back to flux-pro/kontext. If both fail, the agent enters the clarification loop and asks the user to provide an image manually. The factory degrades gracefully rather than crashing.

What is the CI Auto-Healer?

cms-ci-healer.yml is a workflow that triggers when ci.yml or cms.yml fails. It checks out the exact failing SHA, downloads the run logs, and invokes Claude Code with a diagnostic prompt. The agent reads the error logs, identifies the root cause, opens a fix PR on a fix/ci-healer-{run_id}-{sha} branch, and posts a comment on the original issue. This closes the loop between failure and fix without human intervention.

How does the PR Code Review workflow work?

pr-code-review.yml runs on every PR against main. It fetches the full diff, invokes Claude Code for a two-phase review: first /code-review to find issues, then /fix to auto-fix them on the same branch. It skips branches opened by the auto-healer (fix/ci-healer-*) or the CMS bot (claude/*) to prevent infinite loops. The review comments are posted to the PR, and fixes are pushed as additional commits.

Why does the preview workflow use deploymentstatus instead of pullrequest?

vercel-preview.yml listens on deployment_status because Vercel previews are triggered by pushes, not PR events. The workflow receives the deployment SHA, looks up the associated PR via gh api /repos/{repo}/commits/{sha}/pulls, extracts the Closes #N reference from the PR body, and posts a state-specific comment on the source issue. This handles pending, in-progress, success, and failure states with different messages.

A dark software factory is not about removing humans from software development. It is about removing humans from the parts that do not need them: the translation from spec to branch, from branch to PR, from PR to merge, and from merge to deployed preview. The human still writes the spec, still approves the merge with /publish, and still designs the harness that constrains the agent via AGENTS.md and allowed_tools.

What changes is the latency. A content request that once sat in a backlog for two days now ships in twenty minutes. A bug report that once required finding an available engineer now generates a fix while the reporter is still writing the reproduction steps. The factory does not replace judgment; it compresses execution.

The 11 workflow files in this article are under 700 lines combined. The complexity is not in the code; it is in the edge cases. The BOT_PAT behavior, the recursive trigger prevention, the branch detection timing, the WAITING/READY signal protocol, and the media generation fallback chains are all learned from failed runs, not from documentation. Build the simple version first, run it for a week, and let the failures teach you what to guard against. The factory that cannot fail gracefully is not a factory. It is a bomb.

Key terms in this piece

dark software factory github actionsagent team github issues prsautomated code review github actionsclaude code github actions workflowlights out software developmentgithub bot pr merge automationfal ai image generation pipelinecms bot github actions

Sources

[1]Dark Factory — Go CLI for autonomous agent orchestration (Peter Stratton, 2026)(github.com)↩
[2]Claude Software Factory — 6 GitHub Actions workflows for self-running repos (Grey Newell, 2026)(github.com)↩
[3]Dark Factory Documentation(godarkfactory.com)↩
[4]Dark Factory GitHub Copilot CLI skill with sealed testing (DUBSOpenHub, 2026)(github.com)↩
[5]Software Factory — FastAPI-based issue/PR autonomous development system (Sun Praise, 2026)(github.com)↩

Share this article

X LinkedIn Hacker News

Dark Software Factory: How Groupon Runs a Lights-Out CMS on GitHub Issues and PRs

A deep teardown of the production CMS pipeline that turns GitHub Issues into merged PRs while you sleep. 10 workflows, 6 issue types, AI media generation, and the exact DAGs that make it work.

EditorialadvancedMay 12, 202612 min read

By Viktor Bezdek · VP Engineering, Groupon

Label

Content Type

Required Fields

Primary Skills

Media Pipeline

cms-spotlight-add

Employee profile

name, role, team, quote, photo, 3 Q&A pairs

content-engine, cms-headshot-gen, cms-brand-voice-check, ai-slop-cleaner

generate-headshot.ts → public/images/spotlight/{slug}.png

cms-spotlight-edit

Profile update

Fields to change (partial update)

content-engine, cms-brand-voice-check (+ cms-headshot-gen if photo)

Headshot if new photo attached

cms-insight

Leadership article

title, author, authorRole, body >= 100 words with factual anchors

cms-article-expand, cms-image-gen, cms-brand-voice-check, ai-slop-cleaner

generate-image.ts → public/images/insights/{slug}.png

cms-faq

FAQ entry

question, answer (2–4 sentences, direct voice)

content-engine, cms-brand-voice-check

None

cms-location

Office page

city, region, timezone, functions, intro, highlights

content-engine (download photo if provided)

download-attachment.ts only; no generation

cms-content-edit

Copy or component change

Target file and desired change

frontend-design for components; cms-brand-voice-check for copy

None

uses: anthropics/claude-code-action@beta with: anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} model: claude-sonnet-4-6 github_token: ${{ secrets.GITHUB_TOKEN }} allowed_tools: > Edit,MultiEdit,Glob,Grep,LS,Read,Write, Bash(git add:*),Bash(git commit:*),Bash(git push:*), Bash(git status:*),Bash(git diff:*), Bash(pnpm exec tsx scripts/cms/*), Bash(pnpm exec tsx scripts/cms/generate-image.ts:*), Bash(pnpm exec tsx scripts/cms/generate-headshot.ts:*), Bash(node scripts/cms/*), Bash(pnpm lint:*),Bash(pnpm exec tsc:*) direct_prompt: | Read AGENTS.md. Identify the content type from the issue labels. Load the matching skill(s) from .claude/skills/ per the skill-selection table. If all required fields are present: - Generate content, run brand-voice-check, commit. - Output the word READY on its own line at the end. If fields are missing: - Post a comment listing what's needed. - Output the word WAITING on its own line at the end. - Stop. Do not commit anything. DO NOT run gh pr create, gh issue edit, or any label/PR command.

- name: Add cms-bot-waiting label if bot is waiting run: | RESULT=$(tail -1 /home/runner/work/_temp/claude-execution-output.json \ | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('result',''))") if echo "$RESULT" | grep -qi 'WAITING'; then gh issue edit "${{ github.event.issue.number }}" --add-label "cms-bot-waiting" fi env: GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

if: | vars.CMS_BOT_ENABLED == 'true' && !github.event.issue.pull_request && github.event.comment.user.login != 'github-actions[bot]' && contains(join(github.event.issue.labels.*.name, ' '), 'cms-bot-waiting') && contains(join(github.event.issue.labels.*.name, ' '), 'cms-') && !startsWith(github.event.comment.body, '/publish') && !startsWith(github.event.comment.body, '/cancel')

on: workflow_run: workflows: - CI - CMS — Process content request types: [completed] jobs: heal: if: github.event.workflow_run.conclusion == 'failure' steps: - uses: actions/checkout@v4.2.2 with: ref: ${{ github.event.workflow_run.head_sha }} - uses: pnpm/action-setup@v4 - uses: actions/setup-node@v4.4.0 with: node-version: '20' cache: 'pnpm' - name: Fetch failing run logs run: | gh api "repos/${REPO}/actions/runs/${RUN_ID}/logs" > /tmp/logs.zip unzip /tmp/logs.zip -d /tmp/run-logs grep -h "error\|Error\|FAIL" /tmp/run-logs/*.txt | tail -200 > /tmp/errors.txt - name: Run Claude Code to diagnose and fix uses: anthropics/claude-code-action@beta with: anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} model: claude-sonnet-4-6 allowed_tools: >- Edit,MultiEdit,Glob,Grep,LS,Read,Write, Bash(git add:*),Bash(git commit:*),Bash(git push:*), Bash(pnpm lint:*),Bash(pnpm exec tsc:*) direct_prompt: | Analyze the CI failure logs at /tmp/errors.txt. Identify the root cause. Create a fix on branch fix/ci-healer-{run_id}-{sha}. Run pnpm lint and pnpm exec tsc --noEmit before committing.

on: pull_request: branches: [main] types: [opened, synchronize, reopened] jobs: review-and-fix: if: | !startsWith(github.head_ref, 'fix/ci-healer-') && !startsWith(github.head_ref, 'claude/') steps: - uses: actions/checkout@v4.2.2 with: fetch-depth: 0 ref: ${{ github.event.pull_request.head.sha }} - uses: pnpm/action-setup@v4 - uses: actions/setup-node@v4.4.0 with: node-version: '20' cache: 'pnpm' - name: Fetch diff for review run: | git diff "${BASE_SHA}...${HEAD_SHA}" > /tmp/pr.diff echo "DIFF_SIZE=$(wc -c < /tmp/pr.diff)" >> "$GITHUB_ENV" - name: Code review + auto-fix uses: anthropics/claude-code-action@beta with: anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }} model: claude-sonnet-4-6 allowed_tools: >- Edit,MultiEdit,Glob,Grep,LS,Read,Write, Bash(git add:*),Bash(git commit:*),Bash(git push:*), Bash(pnpm lint:*),Bash(pnpm exec tsc:*),Bash(pnpm test:*) direct_prompt: | Review the diff at /tmp/pr.diff. Phase 1: /code-review — find must_fix and should_fix issues. Phase 2: If issues found, /fix them on this branch. Run pnpm lint and pnpm exec tsc --noEmit after fixes.

auto-merge: if: | github.event_name == 'pull_request' && github.event.action == 'labeled' && github.event.label.name == 'cms-approved' && contains(join(github.event.pull_request.labels.*.name, ' '), 'bot-generated') steps: - name: Merge PR (squash) env: GH_TOKEN: ${{ secrets.BOT_PAT }} run: gh pr merge "${{ github.event.pull_request.number }}" --squash --admin