Engineering directors burn 45 minutes every morning reconstructing a picture five tools could have assembled. Replace the loop: five parallel collectors, one orchestrator, a confidence score, a 90-second RED/AMBER/GREEN brief. Triage out of working memory, into code.
Forty-five minutes. That is what most engineering directors pay every morning before standup, walking the same loop: Jira, then GitHub, then Slack, then Asana, then a spreadsheet someone pinned in a chat room three weeks ago. The output is a mental model of what needs intervention. Half of what gets flagged is fine.
That loop is a manual triage pipeline running on the most expensive infrastructure in the org. It is also the wrong shape. Reading dashboards trains you to pattern-match against yesterday's picture. The signals that matter are the ones that changed overnight, and the human eye is bad at deltas.
The replacement is structural. Five collector subagents fan out across five systems in parallel. An orchestrator pulls the payloads, cross-references them, scores confidence, and emits a 90-second brief tagged RED, AMBER, GREEN. Total wall-clock: under a minute. The pipeline finishes before the laptop opens.
For a director scanning five or more tools, the time recovered lands somewhere between 30 and 60 minutes a day.[3] The actual gain depends on how many surfaces you currently check and how disciplined the team is about keeping them current — claim the time savings only after you measure them. The bigger payoff is harder to count: signals that crossed system boundaries overnight no longer require a human to notice them.
Why six dashboards make triage worse, not better — and the structural reason
The five-collector / one-orchestrator architecture with exact API methods per source
Confidence scoring mechanics: normalization, cross-reference correlation, RED/AMBER/GREEN thresholds
A working run.sh that fans out all collectors in parallel and invokes the orchestrator once
Weekly calibration loop to keep false-positive rate in the 10–15% band
Production failure modes: API timeouts, token budgets, rate limits, and how to handle each
Pre-launch checklist and FAQ for the questions engineering leaders actually ask
What a leader needs is a decision, not a screen. Every additional surface widens the gap.
Dashboards were designed for SOC analysts and on-call engineers — people whose job is to stare at the screen. Engineering leaders are not those people. They context-switch between hiring, architecture review, stakeholder management, and the next quarter's plan. A dashboard demands continuous attention. A leader needs a processed conclusion.
The 2025 SANS Detection & Response Survey clocks 46% of all alerts as false positives.[1] Engineering operations track close to that number. Stale PRs blocked on a design call. "Blocked" Jira tickets nobody bothered to close. Asana tasks marked overdue that the team intentionally deprioritized. Signal-to-noise is bad on every surface, and it compounds across them.
The average org runs eight observability tools.[2] For a director overseeing 40 to 80 engineers across squads, that is eight tabs, eight notification channels, eight mental models — all kept warm in the background while real work happens. The cost is not the time per tool. The cost is the cognitive overhead of holding the union of all of them in working memory before the first 1:1 of the day.
Context-switching between tools is the number-three productivity killer for developers, according to Atlassian's survey of 3,500 engineers.[7] For engineering leaders the number is structural: every manual triage loop is a pre-meeting scramble plus the cognitive cost of rebuilding a mental model from scratch. The morning dashboard ritual does not make better decisions. It delays them.
Open Jira, filter by team, scan for blocked tickets
Switch to GitHub, hunt PR age and review queues
Open Asana, audit 5/15 report compliance by hand
Scroll Google Chat for unanswered escalations
Pull up the business metrics sheet someone pinned
Stitch a mental picture across five surfaces
Miss the signal that crossed two systems overnight
45–60 minutes, inconsistent coverage, fatigue compounding
Five collector subagents query all sources in parallel
Orchestrator normalizes payloads and cross-references signals
Confidence scoring separates noise from a real fight
RED/AMBER/GREEN with source attribution and a suggested move
Brief lands before standup; reading it takes 90 seconds
Weekly calibration loop pulls false positives back into 10–15%
Catches the multi-source correlations a single tab cannot show
Same coverage every morning, regardless of how rough yesterday was
Collection fans out. Judgment lives in one place. Coupling is the enemy.
Three layers, sharply separated:
Layer 1 — Collectors run as parallel Claude Code subagents. Each one talks to exactly one system and returns a normalized signal payload. They run simultaneously, so total collection time equals the slowest API response (usually 3–8 seconds), not the sum of all five.[5]
Layer 2 — Orchestrator receives the five payloads, cross-references them (a blocked Jira ticket and a stale PR on the same feature is worse than either alone), runs a confidence scoring pass, and assigns RED/AMBER/GREEN.
Layer 3 — Output formats the brief as a concise summary delivered to a chosen channel — email, Slack, or a pinned Google Doc that updates daily.
The coupling rule is non-negotiable: collectors know nothing about each other. Adding a sixth source means writing one new collector file, not refactoring the orchestrator. The system that resists growth is not a radar — it is a project liability.
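The fan-out claim (wall-clock equals the slowest collector, not the sum) can be sketched with asyncio. This is a minimal illustration, not the article's run.sh: the `collect` function and the latency numbers are hypothetical stand-ins that sleep instead of calling a real API.

```python
import asyncio
import time

# Hypothetical stand-ins for the five collectors: each sleeps to simulate
# API latency, then returns a payload for its one source.
SIMULATED_LATENCY_S = {
    "jira": 0.05, "github": 0.08, "asana": 0.03, "gchat": 0.04, "sheets": 0.06,
}

async def collect(source: str) -> dict:
    await asyncio.sleep(SIMULATED_LATENCY_S[source])
    return {"source": source, "source_status": "ok", "signals": []}

async def fan_out(sources: list[str]) -> list[dict]:
    # gather() runs every collector concurrently and preserves input order.
    return await asyncio.gather(*(collect(s) for s in sources))

def run_once() -> tuple[list[dict], float]:
    start = time.perf_counter()
    payloads = asyncio.run(fan_out(list(SIMULATED_LATENCY_S)))
    return payloads, time.perf_counter() - start
```

Adding a sixth source in this sketch means adding one entry to the registry. Nothing in `fan_out` changes, which is the coupling rule in miniature.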
signal-radar/
├── collectors/
│ ├── jira-collector.md
│ ├── github-collector.md
│ ├── asana-collector.md
│ ├── gchat-collector.md
│ └── sheets-collector.md
├── orchestrator/
│ ├── scoring-rules.md
│ └── orchestrate.md
├── config/
│ ├── thresholds.json
│ ├── sources.json
│ └── output-template.md
├── logs/
│ ├── false-positives.jsonl
│ └── calibration-history.json
├── run.sh
└── CLAUDE.md
Isolation is the leverage point. It is what lets the system grow without the orchestrator carrying the cost.
Each collector subagent is a markdown prompt file Claude Code loads as a task. The design constraint is strict: each collector knows nothing about the others. It queries one API, extracts the signals that matter, and returns a standardized JSON payload. That isolation is what makes the system extend cleanly — adding a sixth source is one new collector file, not an orchestrator rewrite.
What each collector pulls:
| Collector | System | Key Signals | API Method |
|---|---|---|---|
| Jira Collector | Jira Cloud | Blocked ticket count, P1/P0 incidents, sprint burndown deviation, tickets stuck >3 days | REST API v3 with JQL |
| GitHub Collector | GitHub | PR age >48h, review bottlenecks (PRs with 0 reviews), failed CI runs on main, deploy frequency delta | GraphQL API + REST checks |
| Asana Collector | Asana | 5/15 report submission rate, overdue milestones, tasks without assignees, project health drift | Asana REST API |
| Chat Collector | Google Chat | Unanswered threads >4h in key spaces, escalation keywords, unresolved questions from direct reports | Google Chat API |
| Metrics Collector | Google Sheets | Business KPIs vs. targets (revenue, churn, NPS), week-over-week deltas, threshold breaches | Sheets API v4 |
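One possible shape for the standardized payload, as a sketch. The top-level fields `source`, `source_status`, and `signals` match the degraded-payload example used elsewhere in this piece. The fields inside `Signal` (`kind`, `feature`, `raw_value`, `detail`) are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Signal:
    # Field names here are illustrative assumptions, not a published schema.
    kind: str          # e.g. "pr_stale", "ticket_blocked"
    feature: str       # key the orchestrator can correlate on
    raw_value: float   # source-native measurement (hours, count, percent)
    detail: str = ""

@dataclass
class CollectorPayload:
    source: str
    source_status: str = "ok"      # "ok" or "degraded"
    signals: list[Signal] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested Signal dataclasses.
        return json.dumps(asdict(self))

payload = CollectorPayload(
    source="github",
    signals=[Signal(kind="pr_stale", feature="checkout-v2", raw_value=72.0,
                    detail="PR open 72h, 0 reviews")],
)
```

Every collector emitting the same envelope is what lets the orchestrator treat a sixth source like the first five.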
Where raw signals turn into a decision. The piece most teams get wrong on the first attempt.
The orchestrator is the most important piece, and the one most teams get wrong on the first attempt. Its job is not to summarize. Its job is to correlate and judge.
A blocked Jira ticket is AMBER on its own. The same feature with a PR open five days and zero reviews is RED. The blocker is not the ticket status — it is a review bottleneck stalling an entire feature. The orchestrator catches this because it sees both signals at once.
The scoring pass runs in three stages:
Stage 1: Normalization. Each collector returns signals in a standard schema, but severity thresholds differ per source. The orchestrator compresses every signal into a unified 0–100 severity scale where 0 is noise and 100 is drop-everything.
Stage 2: Cross-reference correlation. The orchestrator looks for signals from different systems referencing the same feature, team, or person. Correlated signals get a confidence boost: multiple systems agreeing on a problem is stronger evidence than any one source claiming it alone.
Stage 3: Classification. After normalization and correlation, each signal cluster receives a final confidence score. Clusters above 70 are RED, 40–70 are AMBER, below 40 are GREEN. Only RED and AMBER items appear in the brief in detail. Everything else rolls up into the GREEN summary.
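The three stages can be sketched in a few functions. The RED/AMBER/GREEN cut-offs come from the text. The per-kind ceilings, the linear mapping, and the choice to correlate on a `feature` key are assumptions for illustration, since in the article the orchestrator is an LLM prompt rather than code.

```python
from collections import defaultdict

# Hypothetical per-kind ceilings: the raw value that maps to severity 100.
SEVERITY_CEILING = {"pr_age_hours": 120.0, "ticket_blocked_days": 5.0}

def normalize(kind: str, raw_value: float) -> float:
    """Stage 1: linear map of a source-native value onto 0-100, clamped."""
    ceiling = SEVERITY_CEILING[kind]
    return max(0.0, min(100.0, 100.0 * raw_value / ceiling))

def cluster_by_feature(signals: list[dict]) -> dict[str, list[dict]]:
    """Stage 2: group signals from any source that name the same feature."""
    clusters = defaultdict(list)
    for s in signals:
        clusters[s["feature"]].append(s)
    return dict(clusters)

def classify(score: float) -> str:
    """Stage 3: thresholds from the text (above 70 RED, 40-70 AMBER)."""
    if score > 70:
        return "RED"
    if score >= 40:
        return "AMBER"
    return "GREEN"

signals = [
    {"source": "jira", "feature": "checkout-v2", "kind": "ticket_blocked_days", "raw_value": 3},
    {"source": "github", "feature": "checkout-v2", "kind": "pr_age_hours", "raw_value": 120},
    {"source": "github", "feature": "search", "kind": "pr_age_hours", "raw_value": 30},
]
for s in signals:
    s["severity"] = normalize(s["kind"], s["raw_value"])
clusters = cluster_by_feature(signals)
```

Here the Jira and GitHub signals land in the same `checkout-v2` cluster, which is the cross-system case a single tab cannot show. How a cluster's severities combine into one score is left open in this sketch.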
| Signal Pattern | Source Count | Confidence Multiplier | Default Classification | Example |
|---|---|---|---|---|
| Single source, fresh data | 1 | 0.7× | AMBER at most | GitHub: 1 PR open 72h, no review |
| Single source, stale API response | 1 | 0.5× | GREEN (flag degraded) | Sheets API returning cached values |
| Two correlated sources | 2 | 1.0× | RED or AMBER per severity | Jira blocked + GitHub stale PR, same feature |
| Three or more correlated sources | 3+ | 1.3× | RED — escalate immediately | Jira P0 + CI failed on main + Sheets revenue dip |
| High false-positive history for signal type | any | −0.2× dampener | Downgrade one tier | PR age AMBER dampened after 6 logged FPs |
| Business metric delta + engineering signal | 2+ | 1.4× | RED — business impact confirmed | Latency spike (Jira INC) + revenue −8% (Sheets) |
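The multiplier table translates into a small lookup. This is one reading of the table, under stated assumptions: the multiplier scales the worst severity in the cluster, the result is capped at 100, and the dampener row is interpreted as subtracting 0.2 from the multiplier. The function and parameter names are hypothetical.

```python
def confidence_multiplier(source_count: int, *, stale: bool = False,
                          has_business_metric: bool = False,
                          high_fp_history: bool = False) -> float:
    # Rows of the table, most specific first.
    if source_count >= 2 and has_business_metric:
        m = 1.4
    elif source_count >= 3:
        m = 1.3
    elif source_count == 2:
        m = 1.0
    elif stale:
        m = 0.5
    else:
        m = 0.7
    if high_fp_history:
        m -= 0.2   # assumed reading of the "-0.2x dampener" row
    return round(m, 2)

def cluster_confidence(severities: list[float], multiplier: float) -> float:
    # Assumption: worst severity in the cluster, scaled and capped at 100.
    return min(100.0, max(severities) * multiplier)
```

Under this reading, a single fresh source at maximum severity scores about 70, which lines up with the table's "AMBER at most" row up to floating-point rounding.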
Start too sensitive. Track false positives. Tune weekly. Without this loop, the radar becomes another notification nobody reads.
Here is the failure mode that kills most alert systems: thresholds get set on what feels reasonable, the system ships, nothing gets adjusted. Within two weeks the brief either cries wolf so often that leaders ignore it, or it misses a real incident because the thresholds were too relaxed.
The calibration loop is not optional. It is the feature that separates a useful radar from another notification channel.
Start intentionally too sensitive. Set every threshold at the aggressive end. PR open longer than 24 hours? AMBER. Two blocked tickets? RED. Business metric down 5% week-over-week? RED. The system should over-report in the first week. Tightening down is easy. Discovering missed signals after the fact is not.
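A week-one threshold file might look like the sketch below. The three values are the aggressive starting points named above. The key names and the `above` / `at_least` rule shape are assumptions, since the article does not show the contents of thresholds.json.

```python
import json

# Hypothetical thresholds.json contents for week one; key names are
# assumptions, values are the aggressive starting points from the text.
WEEK_ONE_THRESHOLDS = json.loads("""
{
  "pr_open_hours":       {"above": 24, "tier": "AMBER"},
  "blocked_tickets":     {"at_least": 2, "tier": "RED"},
  "metric_wow_drop_pct": {"at_least": 5, "tier": "RED"}
}
""")

def evaluate(signal: str, value: float, thresholds: dict = WEEK_ONE_THRESHOLDS) -> str:
    rule = thresholds[signal]
    if "above" in rule:
        tripped = value > rule["above"]       # strictly longer than the limit
    else:
        tripped = value >= rule["at_least"]   # limit itself already trips
    return rule["tier"] if tripped else "GREEN"
```

Keeping these values in data rather than in the collector prompts means the weekly tuning pass is a one-line diff.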
Structured for fast scanning. RED first. Source attribution. Suggested move. GREEN summary at the bottom proves the radar ran.
Each RED and AMBER item carries the same four fields: source attribution, confidence percentage, temporal context, and a concrete suggested action. The leader reads the brief in 90 seconds and walks out knowing what needs attention and what the first move is.
The GREEN summary compresses to bullets on purpose. If everything is healthy, the leader does not need details — they need confirmation that the radar actually checked. A brief with no GREEN section is ambiguous: did the system find nothing, or did it fail silently?
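A sketch of the brief layout described above: RED first, then AMBER, the four fields on every flagged item, and a GREEN section that always prints. The item keys (`sources`, `confidence`, `since`, `action`, `summary`) are hypothetical names for the four fields, and the one-line format is only an illustration.

```python
TIER_ORDER = {"RED": 0, "AMBER": 1}

def render_brief(items: list[dict], green_checks: list[str],
                 degraded_sources: list[str]) -> str:
    lines = []
    flagged = [i for i in items if i["tier"] in TIER_ORDER]
    # RED before AMBER; within a tier, highest confidence first.
    for item in sorted(flagged, key=lambda i: (TIER_ORDER[i["tier"]], -i["confidence"])):
        lines.append(
            f'{item["tier"]} ({item["confidence"]}%) [{", ".join(item["sources"])}] '
            f'{item["summary"]} | {item["since"]} | Next: {item["action"]}'
        )
    # The GREEN section always prints, so an empty brief is never ambiguous.
    lines.append("GREEN")
    lines.extend(f"- {check}" for check in green_checks)
    lines.extend(f"- {src}: degraded, not checked" for src in degraded_sources)
    return "\n".join(lines)
```

With no flagged items and no checks, the output is still the single line `GREEN`. That is the difference between a clean morning and a silent failure.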
Six failure modes that surface only after the system has been live for a week. Know them before you hit them.
The architecture looks clean in the diagram. Production finds the edges.
API timeouts are the most common. Jira Cloud and Google Sheets both have rate limits and occasional 5xx flakiness. A collector that hangs indefinitely blocks the orchestrator, which means the brief never lands. Fix: set a hard 15-second timeout per collector. If the timeout fires, return {"source": "jira", "source_status": "degraded", "signals": []} and continue. The orchestrator notes the degraded source in the GREEN section. Silent failure is worse than partial data.
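The timeout rule can be sketched as a wrapper around any collector coroutine. The 15-second limit and the degraded payload shape come from the text. `with_timeout`, `slow_collector`, and `fast_collector` are hypothetical names, and the two collectors only simulate a hung and a healthy source.

```python
import asyncio

COLLECTOR_TIMEOUT_S = 15.0  # hard per-collector limit from the text

async def with_timeout(source: str, collector,
                       timeout_s: float = COLLECTOR_TIMEOUT_S) -> dict:
    try:
        # wait_for cancels the collector if it exceeds the limit.
        return await asyncio.wait_for(collector(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Partial data beats silent failure: report the source as degraded.
        return {"source": source, "source_status": "degraded", "signals": []}

async def slow_collector() -> dict:
    await asyncio.sleep(1)  # simulates a hung API call
    return {"source": "jira", "source_status": "ok", "signals": []}

async def fast_collector() -> dict:
    return {"source": "github", "source_status": "ok", "signals": []}
```

This sketch only handles timeouts. A production wrapper would probably map 5xx responses and connection errors to the same degraded payload.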
Token budgets compound. Each collector call consumes tokens. At 2–4K input tokens per collector and five collectors running in parallel, you are spending 10–20K input tokens before the orchestrator even starts. The orchestrator prompt itself — loaded with five collector payloads — can run 8–15K tokens. Budget for 25–40K tokens per pipeline run. At Anthropic's current Sonnet pricing, that lands around $0.15–0.30 per run.[9] If you are running twice daily, that is roughly $9–18/month. Still orders of magnitude cheaper than the engineering-leader salary time it displaces, but worth tracking rather than discovering on your cloud bill.
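The arithmetic behind those figures, as a small worked sketch. The token ranges and per-run costs are the ones quoted above. The function names are hypothetical, and a 30-day month is assumed.

```python
def token_budget(collectors: int, per_collector: tuple[int, int],
                 orchestrator: tuple[int, int]) -> tuple[int, int]:
    # Low and high input-token estimates for one pipeline run.
    low = collectors * per_collector[0] + orchestrator[0]
    high = collectors * per_collector[1] + orchestrator[1]
    return low, high

def monthly_cost(per_run_usd: float, runs_per_day: int, days: int = 30) -> float:
    # Assumes a 30-day month and a flat per-run cost.
    return round(per_run_usd * runs_per_day * days, 2)

input_range = token_budget(5, (2_000, 4_000), (8_000, 15_000))
twice_daily = (monthly_cost(0.15, 2), monthly_cost(0.30, 2))
```

Input tokens alone sum to roughly 18–35K in this sketch. The 25–40K budget in the text presumably leaves headroom for output tokens. Re-run the numbers against your own logged usage before quoting a monthly figure.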
Rate limits hit collectors at scale. The Jira Cloud REST API enforces per-user rate limits (currently ~100 requests/10 seconds for most Cloud plans). If your collector is running complex JQL across large projects, it can hit this. Fix: paginate with maxResults=50 and add 200ms back-off between paginated requests. GitHub's GraphQL API has a separate cost-points budget (5,000 points/hour), and a complex PR query can cost 5–10 points. Watch the X-RateLimit-Remaining response header and log it in the collector payload.
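The pagination fix can be sketched independently of any HTTP client. The page size of 50 and the 200 ms pause come from the text. `fetch_page` is a hypothetical callable that wraps the actual Jira search request and returns one page of issues for a given offset.

```python
import time
from typing import Callable, Iterator

PAGE_SIZE = 50    # maxResults from the text
BACKOFF_S = 0.2   # 200 ms between paginated requests

def paginate(fetch_page: Callable[[int, int], list[dict]],
             page_size: int = PAGE_SIZE, backoff_s: float = BACKOFF_S,
             sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    start_at = 0
    while True:
        page = fetch_page(start_at, page_size)
        yield from page
        if len(page) < page_size:   # a short page means no more results
            return
        start_at += page_size
        sleep(backoff_s)            # pause only when another request follows
```

Injecting `sleep` keeps the function testable without real delays. A real `fetch_page` could also record the remaining rate-limit budget from the response headers into the collector payload.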
Orchestrator hallucination on low-quality collector payloads. If a collector returns inconsistent JSON (missing fields, wrong types), the orchestrator can misclassify or fabricate context. Add a schema validation step inside the orchestrator prompt: "If any collector payload is missing required fields, treat it as source_status: degraded and exclude it from correlation." This is a prompt-level guard, not a code-level one, because the orchestrator is an LLM.
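If the payloads pass through a script before they reach the orchestrator, the same rule can also be checked in code as a complement to the prompt-level guard. This is a sketch under that assumption: the required fields are the three from the degraded-payload example, and `validate_payload` and `correlatable` are hypothetical helper names.

```python
REQUIRED_FIELDS = {"source": str, "source_status": str, "signals": list}

def validate_payload(payload: object, fallback_source: str = "unknown") -> dict:
    # A payload passes only if every required field exists with the right type.
    if isinstance(payload, dict) and all(
        isinstance(payload.get(name), kind) for name, kind in REQUIRED_FIELDS.items()
    ):
        return payload
    source = payload.get("source") if isinstance(payload, dict) else None
    return {
        "source": source if isinstance(source, str) else fallback_source,
        "source_status": "degraded",
        "signals": [],
    }

def correlatable(payloads: list) -> list[dict]:
    # Degraded sources stay out of correlation but can still be listed in GREEN.
    checked = [validate_payload(p) for p in payloads]
    return [p for p in checked if p["source_status"] != "degraded"]
```

The check does not replace the instruction in the orchestrator prompt. It just narrows what malformed input the model ever sees.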
Confidence score drift over time. False-positive dampening only works if you log every FP. Teams that skip the false-positives.jsonl file for two weeks watch their calibration go stale. The weekly review meeting needs to be on the calendar with an owner, not left to "someone will do it when it's annoying enough."
Delivering the brief to a public channel. The brief contains operational detail about your engineering teams — what is blocked, who is not reviewing, which metrics are down. Never deliver to a public Slack channel or a shared inbox. Use a private DM, a permission-locked Google Doc, or an encrypted email. Treat the brief as production output, not a newsletter.
Concrete rollout. Day-by-day. v1 before Friday.
The radar is not the right tool for every situation. Here is how to decide before you spend a week building it.
Team under 15 engineers — morning scan takes 10 minutes
Two or fewer tool surfaces to check
Signals are homogeneous (all in Jira, all in one Slack channel)
Team disciplines are strong — Jira hygiene is tight, PRs close in 24h
You want real-time monitoring, not a daily brief
Compliance requires a certified SIEM or observability platform
Managing 2+ squads across separate codebases and tool stacks
Spending 30+ minutes per morning on manual triage
Signals live in four or more systems that do not talk to each other
False positive rate on existing alerts is above 30%
Cross-system correlations are the signals that matter most
Existing dashboards require a human to assemble the picture
Without weekly threshold tuning, false positive rates climb above 25% within a month. Leaders stop reading the brief. You have built an expensive notification nobody checks.
Each collector queries its source fresh on every run. Shared caches create phantom signals — the Jira collector reports a blocked ticket that was resolved an hour ago because it read stale data from a shared store.
Start with three to five high-value signals per source. A radar that surfaces 40 items per morning is not a radar — it is a dashboard wearing a trench coat. Add signals later, once the system is stable.
Thresholds live in thresholds.json, not in the collector markdown files. That separation lets you tune sensitivity without editing agent prompts and risking accidental behavior changes.
The GREEN section proves the radar ran and checked everything. Without it, a brief with zero RED/AMBER items is ambiguous — silent failure looks identical to a clean morning.
Once the radar works for one director, the question is whether every engineering manager can run their own. Yes — with one architectural rule.
Collectors can be shared. A single Jira collector pulling all blocked tickets is more efficient than one per manager. The orchestrator must be personalized. Each leader cares about different teams, different projects, and runs different threshold tolerances.
The cleanest cut is a profiles/ directory where each leader has a config file specifying their teams, projects, and custom thresholds. The orchestrator loads the relevant profile and filters collector output accordingly.
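A sketch of the profile filter, assuming each signal carries `team` and `project` keys (hypothetical field names) and each profile lists the teams, projects, and threshold overrides for one leader.

```python
# Hypothetical profile, e.g. the parsed contents of one file in profiles/.
EXAMPLE_PROFILE = {
    "leader": "director-payments",
    "teams": ["payments", "checkout"],
    "projects": ["PAY"],
    "thresholds": {"pr_open_hours": 48},
}

def apply_profile(signals: list[dict], profile: dict) -> list[dict]:
    teams = set(profile.get("teams", []))
    projects = set(profile.get("projects", []))
    # Keep a signal if it matches either the leader's teams or projects.
    return [s for s in signals
            if s.get("team") in teams or s.get("project") in projects]

shared_output = [
    {"source": "jira", "team": "payments", "project": "PAY", "kind": "ticket_blocked"},
    {"source": "github", "team": "search", "project": "SRCH", "kind": "pr_stale"},
    {"source": "jira", "team": "platform", "project": "PAY", "kind": "incident"},
]
mine = apply_profile(shared_output, EXAMPLE_PROFILE)
```

The collectors run once and stay shared. Only the filter and the thresholds differ per leader.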
That structure also opens an organizational view: if every leader's radar data is logged, a CPO or CTO can run a meta-analysis across all briefs. Which teams consistently show RED? Which systems generate the most false positives? Which cross-team dependencies surface as correlated signals?
One uncomfortable finding from teams that have rolled this out at scale: some managers actively resist the brief format. Not because the data is wrong. Because scanning dashboards manually was giving them a reason to open conversations with their teams. The morning Jira ritual was also an excuse to ping a teammate, notice something off in passing, stay close to the work. Automating the scan removes that ambient contact. Worth knowing before you push it to fifteen managers at once. The fix is structural, not motivational: replace the lost contact surface with a deliberate one — a 15-minute pulse with each direct, on a schedule, owned by the leader.
How much does this cost to run daily?
Each run invokes five parallel Claude Code subagent calls plus one orchestrator call. At typical prompt sizes (2–4K tokens input, 1–2K output per collector), that lands around $0.15–0.30 per run. Once daily costs $4.50–9.00 per month — orders of magnitude under the engineering-leader salary time it claws back.
What if one collector API is down?
Build timeout handling into each collector. If a source is unreachable after 15 seconds, the collector returns a payload with zero signals and a source_status: degraded flag. The orchestrator surfaces this in the brief so the leader knows the source was not checked. Silent failure is the worst outcome — design against it.
Can this run with something other than Claude Code?
The architecture is model-agnostic. The collector/orchestrator pattern works with any LLM that can make API calls and return structured JSON. Claude Code's subagent model makes parallelism particularly clean. You can implement the same shape with LangChain agents, CrewAI, or scripts hitting the Anthropic API directly. One practical note: Claude Code handles parallelism and subagent spawning natively, which saves roughly 50–100 lines of orchestration boilerplate compared to a hand-rolled LangChain version. For a production deployment running twice daily, that scaffolding cost is worth paying once. For a quick proof-of-concept, any approach works.
How do I handle sensitive data in the brief?
Collector subagents run inside your security boundary — they call APIs with your credentials and process data locally. The brief itself ships through an authenticated channel: private Slack DM, encrypted email, or a permission-locked Google Doc. Never deliver the brief to a public channel. Treat it as production output.
What is the right false positive rate to target?
10–15%. Below 10% the thresholds are too relaxed and real signals slip through. Above 20% the noise erodes trust and the brief stops getting read. Track FP rate weekly. Adjust thresholds to stay in band. The calibration loop is the radar.
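The weekly check reduces to one ratio. A minimal sketch, assuming false-positives.jsonl holds one JSON object per logged false positive and that the count of flagged RED/AMBER items for the week is tracked separately. The function names and hint wording are hypothetical.

```python
import json

TARGET_BAND = (0.10, 0.15)  # 10-15% band from the text

def fp_rate(fp_log_lines: list[str], flagged_total: int) -> float:
    if flagged_total == 0:
        return 0.0
    # One JSON object per line; blank lines are ignored.
    entries = [json.loads(line) for line in fp_log_lines if line.strip()]
    return len(entries) / flagged_total

def calibration_hint(rate: float, band: tuple[float, float] = TARGET_BAND) -> str:
    low, high = band
    if rate < low:
        return "raise sensitivity: real signals may be slipping through"
    if rate > high:
        return "lower sensitivity: noise is eroding trust"
    return "in band: leave thresholds alone"
```

Two logged false positives against sixteen flagged items is 12.5%, which sits inside the band.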
Does this work with Slack instead of Google Chat?
Yes. Swap the Chat collector for a Slack collector that queries the Conversations API for unanswered threads in key channels. The signal schema is identical; what changes is the API endpoint and auth method. Slack's rate limit is 50+ requests/minute for most plans, which is well above what a single collector needs. Call conversations.history with the oldest timestamp set to 24 hours ago, keep messages whose reply_count is 0, and drop bot messages (subtype equal to bot_message).
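The filtering step can be sketched as pure functions over the message dicts that conversations.history returns. No HTTP call is made here. Treating a missing reply_count as zero is an assumption about how messages without a thread are represented, and both function names are hypothetical.

```python
def oldest_param(now: float, hours: int = 24) -> str:
    # `oldest` is passed as a Unix timestamp; `now` is seconds since epoch.
    return f"{now - hours * 3600:.6f}"

def unanswered(messages: list[dict]) -> list[dict]:
    return [
        m for m in messages
        if m.get("subtype") != "bot_message"   # drop bot posts
        and m.get("reply_count", 0) == 0       # assumption: missing means no replies
    ]

sample = [
    {"ts": "1700000100.000100", "text": "Is the deploy blocked?"},
    {"ts": "1700000200.000200", "text": "Build passed", "subtype": "bot_message"},
    {"ts": "1700000300.000300", "text": "Who owns this?", "reply_count": 2},
]
open_questions = unanswered(sample)
```

The output feeds the same signal schema as the Chat collector, so nothing in the orchestrator has to change.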
When should you NOT build this?
When your team is below 15 engineers, the signal-to-noise problem is usually not acute enough to justify the build time. The daily morning scan takes 10 minutes, not 45, and the false positive rate is manageable with a single Slack channel. Build the radar when you are managing multiple squads across different codebases, your tool count exceeds four, and you are spending more than 30 minutes daily on triage. Below that threshold, a shared Notion daily status doc updated by each team lead does the same job for less maintenance cost.
How do I handle GitHub Apps vs. personal access tokens?
Use a GitHub App, not a PAT, for production deployments. GitHub App installations can be scoped per-repository, do not break when a team member leaves, and can get higher rate limits than a PAT's 5,000 requests/hour (up to 15,000 requests/hour for installations on GitHub Enterprise Cloud organizations). Generate an installation access token at runtime using the App's private key. The token lasts one hour and can be refreshed without human intervention. Store the App private key in your secrets manager; never put it in the collector prompt file.
The signal radar is not a complicated system. Five focused collectors, one orchestrator, a delivery channel, and a calibration loop that tightens over time. The hard part is not building it — that week is behind you by Friday. The hard part is the discipline to log false positives and adjust thresholds for the first month, even when the brief is already useful enough that skipping the review feels acceptable.
Once calibrated, the radar restructures the morning. A 90-second brief lands before standup, names what needs intervention, shows the source attribution, and offers a first move.[4] Signals that crossed two systems overnight surface automatically. The ones that do not make the RED or AMBER threshold stay GREEN — proof the radar ran, not dead air.
Six dashboards were not observability. They were a coordination tax you paid every morning, in working memory, before the first 1:1 of the day. Pay it once, in code, and put the 45 minutes somewhere it compounds.