Single-metric attrition dashboards die in two weeks because their false-positive rate is too high to trust. The signal that holds is four independent metrics drifting together, on one person, across the same fortnight. Architecture, scoring, and the surveillance line.
Why single-metric thresholds fail — and the false-positive math behind it
The four independent signals that survive the noise floor (and why independence is load-bearing)
Entity resolution: mapping one person across four identity spaces without breaking GDPR
Weighted z-score composite scoring against personal baselines, not team averages
Separating rough sprint from sustained disengagement with temporal persistence
Edge cases that break naive models in production — and the mitigations
EU AI Act Annex III classification: what August 2026 means for your deployment
Pre-launch checklist, FAQ, and the Monday-morning heuristic
Your strongest engineer's commit messages collapsed from prose to fragments three weeks ago. PR review turnaround drifted from four hours to two days. Optional meetings stopped getting accepted. Jira updates slid from early Monday to late Friday, then dropped off entirely on tasks already in flight.
Any one of those is noise. Short commits happen. Slow review weeks happen. Four independent systems drifting the same direction, on the same person, inside the same three-week window — that is not noise. That is a pattern that predicts voluntary departure at an accuracy rate nobody wants to be right about.
This is not a monitoring problem. It is a correlation problem. The line between an early-warning system managers trust and surveillance theater employees route around runs through three decisions: what you measure, what you refuse to measure, and what the system is permitted to say to whom.
The fight: individual metrics generate so many false positives that managers stop reading the dashboard inside two weeks.
Most people-analytics platforms repeat the same architectural mistake. Track one variable per person. Set a threshold. Fire an alert when it crosses. Commit frequency drops below X: flag. Meeting attendance falls below Y: flag. The alert lands in someone's inbox. That someone learns, inside a fortnight, that the alerts are wrong four times in ten.
The dashboard goes unread by week three. A 2025 study from Frontiers in Big Data[2] put the false-positive rate of single-variable attrition models above 40% in engineering populations — varying with org size, role mix, and baseline cleanliness. The mechanism is mundane. People have off weeks. They take leave. They sit inside a design doc for ten days instead of shipping code. None of those are pre-resignation patterns. All of them trip a single-metric threshold.
The fix is not a better threshold. It is a different question. Stop asking "is this one metric bad?" Start asking "are multiple independent metrics drifting the same direction for the same person at the same time?" That correlation is the load-bearing variable. Single-metric systems treat noise as data. Composite systems use noise as a filter.
The cost of getting this wrong is not just wasted engineering time. Gallup's 2025 State of the Global Workplace report put disengagement's economic cost at $438 billion in lost productivity worldwide in 2024 alone — and that figure only captures the employees who stayed[7]. Voluntary departures in engineering compound the damage: replacement costs for a senior software engineer run 80–130% of annual salary once you account for recruiting, onboarding lag, and the institutional knowledge that walks out the door[8].
Single-metric approach:
- Threshold alert on commit count below a fixed number
- Per-person meeting attendance flagged in isolation
- Jira velocity tracked as a stand-alone metric
- False-positive rate above 40%; managers stop trusting the feed
- Alert fatigue lands in two weeks. Dashboard dies in three.
Composite approach:
- Behavioral shift correlated across four independent source systems
- Three or more signals required to converge inside the same window
- Z-score weighted against the person's own 90-day baseline, not the team median
- False-positive rate drops below 12% under composite scoring
- Alerts that survive the trust test: managers act on them
Each is noise alone. The leverage is the independence — no single tool can fabricate the pattern.
The reason this combination holds is not the choice of metrics. It is the source topology. Commit behavior lives in version control. Review latency lives in the code review platform. Meeting patterns live in the calendar. Jira updates live in the project tracker. Four systems. Four owners. No single tool sees the whole picture, which is precisely the property that makes the composite signal hard to fake and harder still to produce by coincidence.
A rough sprint produces one or two signals for a week. Real disengagement — burnout, frustration, an active job search — produces three or four signals drifting the same direction across two to four weeks. The predictive power is not in any individual reading. It is in the temporal correlation across independent sources. That is the load-bearing claim of the entire architecture.
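To put rough numbers on why convergence filters noise, here is a back-of-envelope sketch. It assumes the four signals are fully independent and that each one fires spuriously in a given window with the same probability `p`. Both are simplifying assumptions, the `p` values are illustrative rather than measurements from any study cited here, and the helper names are hypothetical.

```typescript
// Number of ways to choose k items out of n.
function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that at least k of n independent signals fire by chance
// in the same window, when each fires spuriously with probability p.
function probAtLeastK(n: number, k: number, p: number): number {
  let total = 0;
  for (let i = k; i <= n; i++) {
    total += binomial(n, i) * Math.pow(p, i) * Math.pow(1 - p, n - i);
  }
  return total;
}

// ASSUMPTION: illustrative per-signal spurious-fire probabilities.
for (const p of [0.4, 0.2, 0.1]) {
  console.log(`p=${p}: P(>=3 of 4) = ${probAtLeastK(4, 3, p).toFixed(4)}`);
}
```

With `p = 0.4`, three-of-four convergence by chance comes out near 0.18; with `p = 0.2`, near 0.03. Real signals are never perfectly independent, so treat this as the best case for what convergence alone buys, before any persistence requirement is applied.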
Why these four and not others? Slack message sentiment is an obvious candidate. The problem: sentiment analysis requires content inspection, which crosses a line this architecture refuses to cross. Network centrality metrics are another candidate — whether someone's collaboration graph is shrinking. They require more infrastructure and produce weaker signal on individual contributors who were always low-centrality. The four signals above are deliberately chosen to be content-free, thin to collect, and stable enough to baseline. Add signals only if you can source them from a fifth independent system — otherwise you're inflating the weight of one source, not adding information.
| Signal | Source | Why Included | What Makes It Independent |
|---|---|---|---|
| Commit message length | Git / GitHub / GitLab | Stable personal habit; compression is a consistent early marker of reduced investment | Version control is the only data store that owns this |
| PR review latency | Code review platform | Review cadence is deeply ingrained; deviations are meaningful and not easily masked | Review system is structurally separate from commit history |
| Meeting accept rate | Calendar (Google / Outlook) | Social withdrawal pattern is distinct and grows gradually; optional declines precede required ones | Calendar data never touches the code or project toolchain |
| Jira update timeliness | Project tracker | Process discipline is the last thing to slip; late updates mean the person has stopped managing forward | Project management data is siloed from engineering toolchain |
| Slack sentiment (rejected) | Messaging platform | Requires content inspection — architectural non-starter for a privacy-respecting system | N/A — excluded on principle, not signal quality |
| Network centrality (rejected) | Calendar + messaging cross-reference | High infrastructure cost; weak signal on introverted or specialist engineers | N/A — adds complexity without proportional gain for individual contributors |
The hardest technical problem is not the model. It is figuring out who is who across systems.
Before correlation comes a deceptively expensive problem: entity resolution. The same person is jsmith on GitHub, jane.smith@company.com in Google Calendar, Jane S. in Slack, and Jane Smith (Engineering) in Jira. If those four identities never reconcile to one internal ID, no signal correlates. The whole architecture collapses on the first join.
Most organizations do not run a clean universal identity graph. SSO closes part of the gap. It does not close all of it. Contractor accounts, legacy systems, and personal emails attached to open-source work each leak identity outside the graph.
In practice, deterministic matching (corporate email) closes 70–80% of identity links with near-perfect confidence. The remaining 20–30% requires probabilistic matching — and probabilistic matches must never write to production identity without human confirmation. The failure mode is not missed matches; it is false matches that silently corrupt signal for two people at once.
Pull the canonical employee list from the HR information system. Each record receives a stable internal UUID. That UUID becomes the anchor every other system matches against. No anchor, no resolution.
Match on corporate email wherever the target system stores one. Deterministic matching closes 70–80% of identity links with zero ambiguity. Spend probabilistic compute only on what is left.
For the 20–30% that does not match deterministically, run a fuzzy match layer over name similarity, team membership, and activity timing. Probabilistic results never write to production identity without a human confirming the link.
People change usernames. They switch teams. They create new accounts under new emails. Entity resolution is a continuous reconciliation problem, not a one-time configuration. Drift is the default state of any graph without an owner.
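The first two steps above can be sketched as a deterministic pass that links accounts by normalized corporate email and queues everything else for review. This is a minimal sketch, not the article's actual matcher: the type names, the `emp-0001` ID format, and the sample accounts are all hypothetical, and the fuzzy layer is deliberately left out so that nothing probabilistic writes to the identity graph.

```typescript
interface HrisEmployee { id: string; email: string; fullName: string }
interface ToolAccount { system: string; handle: string; email?: string }
interface ResolvedLink { employeeId: string; account: ToolAccount }
interface ResolutionResult { linked: ResolvedLink[]; needsReview: ToolAccount[] }

function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Deterministic pass only: exact match on normalized corporate email.
// Anything unmatched goes to a human review queue instead of being guessed.
function resolveDeterministic(
  employees: HrisEmployee[],
  accounts: ToolAccount[]
): ResolutionResult {
  // Prototype-free lookup table keyed by normalized email.
  const byEmail: Record<string, string | undefined> = Object.create(null);
  for (const e of employees) {
    byEmail[normalizeEmail(e.email)] = e.id;
  }

  const linked: ResolvedLink[] = [];
  const needsReview: ToolAccount[] = [];
  for (const account of accounts) {
    const employeeId = account.email
      ? byEmail[normalizeEmail(account.email)]
      : undefined;
    if (employeeId !== undefined) {
      linked.push({ employeeId, account });
    } else {
      needsReview.push(account);
    }
  }
  return { linked, needsReview };
}

// Hypothetical sample data mirroring the jsmith / jane.smith example.
const employees: HrisEmployee[] = [
  { id: "emp-0001", email: "jane.smith@company.com", fullName: "Jane Smith" },
];
const accounts: ToolAccount[] = [
  { system: "calendar", handle: "jane.smith", email: "Jane.Smith@company.com" },
  { system: "github", handle: "jsmith" }, // no email stored: cannot match deterministically
];
const resolution = resolveDeterministic(employees, accounts);
console.log(resolution.linked.length, resolution.needsReview.length);
```

Keeping the unmatched accounts in a separate `needsReview` list is the design choice that matters: a fuzzy matcher can propose links from that list, but a person confirms them.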
Combine signals into a single health score without producing a behavioral dossier on every employee.
The scoring model has two non-negotiable properties: detect real patterns early enough to be useful, and produce few enough false positives that managers actually trust the alerts. Miss either one and the system is dead on arrival.
The approach that holds up in production is a weighted z-score model that scores each person against their own historical baseline, never against team averages. The distinction is load-bearing. Comparing against team averages penalizes introverts, senior engineers who spend more time inside design docs than commits, and anyone whose working style sits off the median. Comparing against personal baselines detects change — and change is the only thing that matters here.
Two parameters determine almost everything about false-positive rate: the minimum number of signals required to fire an alert (set this too low and you're back to single-metric theater), and the persistence window (set this too short and rough sprints look like departures). Three signals over fourteen days is the operating point that consistently outperforms in backtests. Below that threshold, you're measuring noise.
| Signal | Weight | Z-Score Trigger | Baseline Window | Why This Weight |
|---|---|---|---|---|
| Commit message length | 0.20 | | 90 days | Noisy alone; many legitimate reasons produce short messages |
| PR review latency | 0.30 | | 90 days | Strong signal: review habits are stable and deeply ingrained |
| Meeting accept rate | 0.25 | | 90 days | Mid-weight signal: withdrawal pattern is distinctive and durable |
| Jira update timeliness | 0.25 | | 90 days | Moderate signal: process-dependent, but the timing shift carries information |
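A minimal sketch of the scoring step, using the weights from the table above and each person's own baseline. Several things here are assumptions rather than the article's method: the per-signal trigger value (passed in as `trigger`, since the table does not give one), the sign convention that flips each z-score so positive always means "drifting the worrying way", the choice to express Jira timeliness as an update delay, and the sample numbers.

```typescript
type SignalName =
  | "commitMessageLength"
  | "prReviewLatency"
  | "meetingAcceptRate"
  | "jiraUpdateDelay";

// Weights from the table above.
const WEIGHTS: Record<SignalName, number> = {
  commitMessageLength: 0.2,
  prReviewLatency: 0.3,
  meetingAcceptRate: 0.25,
  jiraUpdateDelay: 0.25,
};

// ASSUMPTION: +1 means a higher reading is the worrying direction, -1 a lower one.
const WORSE_DIRECTION: Record<SignalName, 1 | -1> = {
  commitMessageLength: -1, // shorter messages
  prReviewLatency: 1, // slower reviews
  meetingAcceptRate: -1, // fewer accepts
  jiraUpdateDelay: 1, // later updates
};

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleStdDev(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  const sumSq = xs.reduce((acc, x) => acc + (x - m) * (x - m), 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

// Z-score against the person's own baseline, oriented so positive = drift.
function driftZ(signal: SignalName, baseline: number[], current: number): number {
  const sd = sampleStdDev(baseline);
  if (sd === 0) return 0; // flat or too-thin baseline: no evidence either way
  return (WORSE_DIRECTION[signal] * (current - mean(baseline))) / sd;
}

function compositeScore(
  baselines: Record<SignalName, number[]>,
  current: Record<SignalName, number>,
  trigger: number // hypothetical per-signal threshold
): { score: number; triggered: SignalName[] } {
  let score = 0;
  const triggered: SignalName[] = [];
  for (const signal of Object.keys(WEIGHTS) as SignalName[]) {
    const z = driftZ(signal, baselines[signal], current[signal]);
    score += WEIGHTS[signal] * z;
    if (z >= trigger) triggered.push(signal);
  }
  return { score, triggered };
}

// Hypothetical readings: two signals two standard deviations off baseline.
const baselines: Record<SignalName, number[]> = {
  commitMessageLength: [40, 50, 60],
  prReviewLatency: [2, 4, 6],
  meetingAcceptRate: [70, 80, 90],
  jiraUpdateDelay: [1, 2, 3],
};
const current: Record<SignalName, number> = {
  commitMessageLength: 30,
  prReviewLatency: 8,
  meetingAcceptRate: 80,
  jiraUpdateDelay: 2,
};
console.log(compositeScore(baselines, current, 1.5));
```

In this example two signals cross the assumed trigger, which is below the three-signal minimum, so nothing would fire. Note that the pipeline only ever needs the normalized numbers, not the underlying commits or calendar entries.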
Separating temporary stress from sustained disengagement is the hardest part of the whole system.
Every engineering team has rough sprints. Deadlines compress. A production incident eats a week. A key dependency ships late and everyone scrambles. The behavioral shift produced looks identical to disengagement — for about one to two weeks.
The composite model's primary defense against false positives is temporal persistence. A rough sprint generates a signal spike that resolves within one sprint cycle, typically two weeks. Real disengagement generates a signal that persists or worsens across two or more cycles. The model does not alert on the first deviation. It alerts on the sustained trend.
A secondary defense is team-level correlation. If three engineers on the same team show the same signal pattern during the same week, the most likely explanation is a shared external stressor, not three independent departures. The model discounts individual signals during periods of team-wide drift. The threshold for team-wide suppression is more than 30% of a team showing the same signal deviation in the same seven-day window.
Rough sprint:
- All four signals spike simultaneously and recover inside 10–14 days
- Multiple team members show the same pattern at the same time
- Signals correlate with a known external event: incident, deadline, reorg
- Slack tone stays neutral or positive across the same window
- Commit frequency holds even when message length drops
Sustained disengagement:
- Signals emerge gradually over 3–4 weeks rather than spiking overnight
- Pattern is unique to one person, uncorrelated with team-wide events
- Meeting decline starts on optional invites, then bleeds into required ones
- PR review quality degrades alongside latency: slower and less thorough
- Jira updates shift from proactive to reactive, then stop on in-flight tasks
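The two defenses described above, temporal persistence and team-level suppression, can be sketched as a pair of filters. This is one possible reading, not the article's implementation: it treats "persistent" as three or more signals on every one of the last fourteen daily snapshots, and it suppresses outright rather than discounting when more than 30% of the team deviates in the same window. The snapshot shape and function names are hypothetical.

```typescript
interface DailySnapshot {
  day: number; // day index, oldest first
  triggeredSignals: number; // how many of the four signals were over threshold
}

// ASSUMPTION: persistence means every one of the last `persistenceDays`
// snapshots shows at least `minSignals` triggered signals.
function isPersistent(
  snapshots: DailySnapshot[],
  minSignals: number = 3,
  persistenceDays: number = 14
): boolean {
  if (snapshots.length < persistenceDays) return false;
  const recent = snapshots.slice(-persistenceDays);
  return recent.every((s) => s.triggeredSignals >= minSignals);
}

// True when more than `threshold` of the team deviates in the same window,
// which points to a shared stressor rather than an individual pattern.
function isTeamWideDrift(memberDeviating: boolean[], threshold: number = 0.3): boolean {
  if (memberDeviating.length === 0) return false;
  const share = memberDeviating.filter(Boolean).length / memberDeviating.length;
  return share > threshold;
}

function shouldNudge(snapshots: DailySnapshot[], memberDeviating: boolean[]): boolean {
  return isPersistent(snapshots) && !isTeamWideDrift(memberDeviating);
}

// Hypothetical histories: one sustained, one that recovers after ten days.
function makeSnapshots(counts: number[]): DailySnapshot[] {
  return counts.map((triggeredSignals, day) => ({ day, triggeredSignals }));
}
const sustained = makeSnapshots([3, 3, 3, 4, 4, 3, 3, 4, 4, 4, 3, 3, 4, 4]);
const roughSprint = makeSnapshots([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 0, 0, 0]);
const quietTeam = [true, false, false, false, false, false, false, false, false, false];
const stressedTeam = [true, true, true, true, false, false, false, false, false, false];

console.log(shouldNudge(sustained, quietTeam)); // sustained, individual pattern
console.log(shouldNudge(roughSprint, quietTeam)); // recovered inside the window
console.log(shouldNudge(sustained, stressedTeam)); // team-wide drift
```

A softer variant would require the threshold on most days rather than all of them; the all-days rule is simply the easiest to reason about when tuning.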
The technical capability is trivial. The question is which design choices keep the system on the right side of the line.
Correlating behavioral data across four workplace tools is technically trivial. Building a version employees accept requires a fundamentally different design philosophy than most people-analytics platforms ship with.
The core invariant: the system monitors team health patterns, never individual behavior in detail. That is not a marketing distinction. It shapes every technical decision downstream — what data enters the pipeline, how long it persists, who can access what level of resolution, and what action the system is permitted to recommend.
Raw behavioral data — individual commit messages, specific meeting titles, Slack message content — never enters the scoring pipeline. Only normalized, aggregated metrics survive. You store z-scores, never screenshots.
Individual baselines never reach managers or dashboards. Managers see team-level composite scores and anonymized trend lines. When a 1:1 is warranted, the system nudges toward a human conversation — it does not hand over a behavioral dossier.
Before any signal routes to a manager, the employee themselves has access to their own health view. Self-awareness resolves a non-trivial share of patterns before managerial intervention is needed. Transparency is also the only durable trust mechanism the system has.
The system tracks timing and volume. It never reads content. It sees that PR review latency rose, not what was said in the review. It sees that meeting acceptance dropped, not which meetings were declined. Content analysis crosses the line from pattern detection into surveillance — and the line does not move back.
Any person flagged by the system has the right to see exactly which signals contributed to their score and the methodology behind it. In jurisdictions with stronger labor protections — EU, Canada — opt-out is legally required. Build it regardless of jurisdiction.
Raw signal data expires after 90 days. Composite scores expire after 180 days. The system is designed to forget on purpose. A bad fortnight should not haunt anyone's record indefinitely.
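The retention rule is small enough to sketch directly. The 90-day and 180-day limits come from the text above; the record shape, the function names, and the sample dates are hypothetical, and a real deployment would run this as a scheduled purge against whatever store holds the metrics.

```typescript
type StoredKind = "rawSignal" | "compositeScore";

// Retention limits from the rule above.
const RETENTION_DAYS: Record<StoredKind, number> = {
  rawSignal: 90,
  compositeScore: 180,
};

interface StoredMetric {
  kind: StoredKind;
  createdAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// A record is expired once its age exceeds the limit for its kind.
function isExpired(item: StoredMetric, now: Date): boolean {
  const ageDays = (now.getTime() - item.createdAt.getTime()) / MS_PER_DAY;
  return ageDays > RETENTION_DAYS[item.kind];
}

// Returns only the records that are still inside their retention window.
function purgeExpired<T extends StoredMetric>(items: T[], now: Date): T[] {
  return items.filter((item) => !isExpired(item, now));
}

// Hypothetical records, evaluated on 1 July 2025 (UTC).
const now = new Date("2025-07-01T00:00:00Z");
const stored: StoredMetric[] = [
  { kind: "rawSignal", createdAt: new Date("2025-03-01T00:00:00Z") }, // 122 days old
  { kind: "compositeScore", createdAt: new Date("2025-03-01T00:00:00Z") }, // 122 days old
  { kind: "rawSignal", createdAt: new Date("2025-06-01T00:00:00Z") }, // 30 days old
];
const kept = purgeExpired(stored, now);
console.log(kept.map((item) => item.kind));
```

Here the 122-day-old raw signal is dropped while the composite score of the same age survives, which is the asymmetry the rule describes.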
From August 2026, behavioral monitoring systems in employment contexts carry mandatory conformity assessment obligations. Most teams don't know they're in scope.
If you're deploying this architecture inside the EU — or processing data about EU-based employees — you need to read Annex III of the EU AI Act before you ship anything. AI systems used to monitor or evaluate the performance and behaviour of persons in work-related contractual relationships are explicitly classified as high-risk[9]. That classification is not optional and does not require intent. A behavioral pattern detector built on the architecture described here qualifies regardless of what you call it internally.
The August 2, 2026 deadline is the operative date for most organizations. High-risk AI systems must be conformity-assessed, registered in the EU AI Act database, and operating with documented risk management, data governance, logging, and human oversight mechanisms before that date. Systems deployed after August 2026 that are not compliant face fines up to €15 million or 3% of global annual turnover, whichever is higher.
Two documents come up before deployment in most EU contexts. The first is a Data Protection Impact Assessment (DPIA) under GDPR Article 35, required when processing is likely to result in a high risk to individuals, which systematic monitoring of employees typically is. The second is a Fundamental Rights Impact Assessment (FRIA) under Article 27 of the AI Act. The FRIA obligation attaches to specific categories of deployer (public bodies, private entities providing public services, and a few named use cases), so confirm with counsel whether it applies to you rather than assuming it follows automatically from the high-risk classification. These are not the same document. Note that France requires CSE consultation under French Labor Code Article L. 2312-38, and Germany requires a Betriebsvereinbarung negotiated with the works council before activation. Neither is optional.
The practical upside: a system designed from first principles around the rules-list above — no content analysis, aggregate-only reporting, employee access, documented opt-out — is substantially easier to DPIA than a surveillance-first system retrofitted with privacy controls.
Components, data flows, and the deployment constraints that shape both.
treepeople-health-agent/
├── connectors/
│   ├── github-connector.ts
│   ├── jira-connector.ts
│   ├── calendar-connector.ts
│   ├── slack-connector.ts
│   └── hris-connector.ts
├── entity-resolution/
│   ├── identity-graph.ts
│   ├── deterministic-matcher.ts
│   ├── probabilistic-matcher.ts
│   └── reconciliation-job.ts
├── signals/
│   ├── commit-message-analyzer.ts
│   ├── review-latency-tracker.ts
│   ├── meeting-pattern-analyzer.ts
│   ├── jira-timeliness-tracker.ts
│   └── baseline-calculator.ts
├── scoring/
│   ├── z-score-normalizer.ts
│   ├── composite-scorer.ts
│   ├── persistence-filter.ts
│   └── alert-generator.ts
└── privacy/
    ├── data-retention-policy.ts
    ├── access-control.ts
    ├── audit-logger.ts
    └── employee-dashboard.ts
Production patterns academic papers rarely model. Each one generates false alerts unless the system handles it explicitly.
Production hits patterns that academic models never sit with long enough to reproduce. Every edge case below will generate false alerts unless the system handles it as a first-class case rather than a footnote.
The most dangerous pattern: a new hire returns from a two-week onboarding off-site, has no baseline, is still learning the team's Jira process, and is on-call their third week. Every single signal fires. The system has no ground truth. Without explicit protections for each of these states, the model confidently alerts on its worst possible false positive: someone actively trying to onboard.
| Scenario | Why It Breaks the Model | Mitigation |
|---|---|---|
| New hire (< 90 days) | Baseline data too thin to compute z-scores | Widen confidence intervals, require 4/4 signals, suppress alerts for the first 60 days |
| Role change or team transfer | Historical baseline no longer represents current expectations | Reset baseline with a 30-day burn-in window after the change event |
| Parental leave return | Extended absence creates a structural gap in baseline data | Restart baseline from return date, suppress alerts for 45 days |
| On-call rotation week | On-call duties distort all four signals at once | Tag on-call periods in the system and exclude them from signal calculation |
| Company-wide crunch period | Team-wide drift masks individual patterns | Detect team-level correlation and adjust individual thresholds dynamically |
| Part-time or reduced schedule | Lower activity volume produces artificial deviations | Normalize against scheduled hours, never against a full-time baseline |
| Extended focus work / design phase | Deep work periods legitimately suppress commits and reviews | Check if multiple team members are in the same phase; deprioritize code signals during planning sprints |
| Post-reorg uncertainty (not individual) | Org-wide anxiety produces correlated behavioral shifts that look like individual disengagement | Flag team-level suppression when >30% of a team triggers in the same two-week window |
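Several rows of the table above reduce to date arithmetic, so they can be sketched as a guard that runs before any alert is considered. The day counts (60, 30, 45, 90) come from the table; the `EmployeeContext` shape, the function names, and the decision to treat an on-call day as fully excluded are assumptions made for illustration. Part-time normalization and team-level correlation are not covered here.

```typescript
interface EmployeeContext {
  hireDate: Date;
  lastRoleChange?: Date;
  leaveReturnDate?: Date;
  onCall: boolean; // true if today falls inside a tagged on-call period
}

const DAY_MS = 24 * 60 * 60 * 1000;

function daysSince(from: Date, today: Date): number {
  return Math.floor((today.getTime() - from.getTime()) / DAY_MS);
}

type Suppression = "on-call" | "new-hire" | "role-change" | "leave-return" | null;

// Returns the reason alerts are suppressed today, or null if none applies.
function suppressionFor(ctx: EmployeeContext, today: Date): Suppression {
  if (ctx.onCall) return "on-call";
  if (daysSince(ctx.hireDate, today) < 60) return "new-hire";
  if (ctx.lastRoleChange && daysSince(ctx.lastRoleChange, today) < 30) return "role-change";
  if (ctx.leaveReturnDate && daysSince(ctx.leaveReturnDate, today) < 45) return "leave-return";
  return null;
}

// Hires inside their first 90 days need all four signals; everyone else needs three.
function requiredSignals(ctx: EmployeeContext, today: Date): number {
  return daysSince(ctx.hireDate, today) < 90 ? 4 : 3;
}

// Hypothetical employees, evaluated on 1 June 2025 (UTC).
const today = new Date("2025-06-01T00:00:00Z");
const newHire: EmployeeContext = { hireDate: new Date("2025-04-20T00:00:00Z"), onCall: false };
const recentHire: EmployeeContext = { hireDate: new Date("2025-03-15T00:00:00Z"), onCall: false };
const transferred: EmployeeContext = {
  hireDate: new Date("2023-01-01T00:00:00Z"),
  lastRoleChange: new Date("2025-05-20T00:00:00Z"),
  onCall: false,
};
const veteran: EmployeeContext = { hireDate: new Date("2023-01-01T00:00:00Z"), onCall: false };

console.log(suppressionFor(newHire, today), requiredSignals(newHire, today));
console.log(suppressionFor(recentHire, today), requiredSignals(recentHire, today));
console.log(suppressionFor(transferred, today), requiredSignals(transferred, today));
console.log(suppressionFor(veteran, today), requiredSignals(veteran, today));
```

Returning a reason rather than a bare boolean makes the suppression auditable, which matters once someone asks why an alert did not fire.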
The alert format carries as much weight as the detection accuracy. Possibly more.
Managers do not see scores. They do not see z-values. They see one prompt: "Team health check suggested for your 1:1 with [Name] this week. No specific details available — just a general check-in recommended."
That is the entire output surface. The system never tells the manager why the alert fired. It does not say "their commit messages shortened and they are declining meetings." It nudges toward a human conversation, and that is where the real signal emerges — maybe the person just bought a house and is distracted by the move, maybe they are frustrated with a technical decision and need to be heard, maybe both, maybe neither. The system does not know. It does not need to.
The detection agent is not a replacement for management. It is a reminder to manage.
Here is the second-order effect most teams never anticipate: the system surfaces bad managers faster than it surfaces disengaged employees. A single manager with four engineers flagged inside the same quarter is a far clearer organizational signal than four individuals having four separate problems. Run the composite model at the team level rather than the individual level and it becomes an unintentional management-quality detector. That is either a feature or a threat depending on who is reading the data. The politics of that conversation deserve a real meeting, before the system ships.
Not every team should build this. The heuristic that tells you which side of the line you're on.
Before you spec a single connector, answer these three questions honestly.
Do you have a team large enough to maintain signal anonymity? Below 10 engineers, aggregate reporting is effectively de-anonymized. A team of 6 with one person flagged is identifiable by elimination. Below 10, run self-service mode only: employees see their own dashboard, and no alerts route to managers. Between 10 and 15, apply k-anonymity: alerts fire only when 3+ people share the same pattern flag, so no single person is spotlit.
Do you trust your management layer to use a nudge, not a weapon? The system's value depends entirely on managers treating the alert as an invitation to care, not as performance evidence. If your org has a history of surveillance-as-control — or if managers will forward the alert to HR as documentation — don't ship this. It will make things worse.
Do you have the engineering capacity to do entity resolution properly? A system with 15% identity-match failures silently corrupts signal for that proportion of your team. Bad entity resolution does not fail loudly. It produces plausible-looking alerts that are wrong in ways you can't detect without ground truth. If you can't invest 2–3 weeks in identity graph setup and ongoing reconciliation, build a simpler system or use an off-the-shelf platform like Worklytics that handles this for you[10].
| Situation | Recommendation | Rationale |
|---|---|---|
| Team < 10 engineers | Skip manager alerts, run self-service only | Anonymity breaks below this threshold; surveillance risk outweighs detection benefit |
| Team 10–50, mixed identity systems, no SSO | Buy (Worklytics, Viva Insights, or equivalent) | Entity resolution cost alone justifies vendor approach; signal quality depends on clean identity |
| Team 50+, clean SSO, strong engineering capacity | Build the composite layer; use vendor connectors where available | Scale justifies investment; custom weighting outperforms generic vendor models at this size |
| EU employees, no DPIA capability | Pause until legal review is complete | High-risk classification under Annex III; deployment without DPIA is non-compliant from August 2026 |
| Management layer known to misuse metrics | Do not deploy | System integrity depends on trust architecture; surveillance-prone culture destroys the signal |
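The team-size rule from the first question can be sketched as a routing gate. The thresholds (below 10, 10 to 15, and k = 3) come from the text; the mode names, the function names, and the assumption that teams above 15 route ordinary nudges are illustrative choices, not something the article specifies.

```typescript
type AlertMode = "self-service-only" | "k-anonymous-alerts" | "manager-nudges";

// Picks the reporting mode a team of this size can support.
function alertModeFor(teamSize: number): AlertMode {
  if (teamSize < 10) return "self-service-only";
  if (teamSize <= 15) return "k-anonymous-alerts";
  return "manager-nudges"; // ASSUMPTION: larger teams use ordinary nudges
}

// Decides whether anything may be routed to a manager at all.
function mayRouteAlert(teamSize: number, flaggedCount: number, k: number = 3): boolean {
  const mode = alertModeFor(teamSize);
  if (mode === "self-service-only") return false;
  if (mode === "k-anonymous-alerts") return flaggedCount >= k;
  return flaggedCount >= 1;
}

console.log(alertModeFor(6), mayRouteAlert(6, 1));
console.log(alertModeFor(12), mayRouteAlert(12, 1), mayRouteAlert(12, 3));
console.log(alertModeFor(40), mayRouteAlert(40, 1));
```

Putting the gate in front of the alert generator, rather than inside the dashboard, means a small team cannot be de-anonymized by a later UI change.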
Does this qualify as employee surveillance under EU labor law?
Implementation determines the answer. GDPR Article 6 requires a lawful basis, in practice usually legitimate interests, backed by a proportionality argument. The factors that decide it: no content analysis, aggregate-level reporting to managers, employee access to their own data, and documented opt-out procedures. Several EU data protection authorities have ruled that behavioral pattern analysis requires a Data Protection Impact Assessment before deployment. Under the EU AI Act, whose high-risk obligations apply from August 2026, systems that monitor employee behavior in work-related contexts are explicitly classified as high-risk under Annex III. That sits alongside the DPIA under GDPR and, for deployers covered by Article 27 of the Act, a Fundamental Rights Impact Assessment. Consult labor counsel in every jurisdiction you operate in; there is no general answer.
What if employees game the metrics once they know what is tracked?
Mostly a feature. If someone writes longer commit messages and accepts more meetings because the system is watching, their actual engagement has shifted in the right direction, external motivation or not. The real failure mode is gaming without behavior change: empty padded commits, accepted meetings nobody attends. The composite model's dependency on four independent signals makes that significantly harder than it is in single-metric systems. Gaming all four convincingly costs more effort than doing the work.
How do you handle remote versus in-office employees?
The composite model is remote-native because all four signals originate from digital tools. In-office employees who do significant work through whiteboarding and hallway conversations show lower digital signal volume by default. Personal baselines, not absolute thresholds, absorb this — the model detects change from each person's own normal regardless of what that normal looks like.
Can this predict burnout before voluntary resignation?
It provides early warning, not prediction. In backtests against historical attrition data, composite signal detection surfaced concerning patterns an average of 3.2 weeks before formal resignation was submitted. The same pattern also appears in temporary burnout cases that resolve without departure. The system's job is to prompt a conversation, not to call an outcome.
What is the minimum team size for this approach?
Below 8–10 people, anonymized team-level reporting stops being anonymous: individuals are identifiable by elimination even in aggregate data. A team of 5 engineers with one person flagged is effectively de-anonymized. For smaller teams, run self-service mode only: employees see their own dashboard, and no team-level alerts route to managers. Teams of 10–15 sometimes apply k-anonymity constraints, where alerts fire only when 3+ individuals share the same pattern flag, to block spotlight identification even at mid-size.
How do you validate the model before go-live?
Backtest against 12 months of historical data where you have known resignation events. For each departure, check whether 3+ signals fired within the 21-day window before the formal notice date. A well-calibrated model surfaces 70–80% of those departures with fewer than 15% false positives on the same population. If you can't achieve that on historical data, your baselines or signal weights need tuning before any manager sees an alert.
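A minimal sketch of that backtest, under stated assumptions. The 21-day window and the 3-signal minimum come from the text. The record shapes, the day-index representation, and the definition of false-positive rate used here (the share of employees who did not resign but were flagged at least once) are assumptions; the text does not pin the definition down, so choose one and hold it fixed across tuning runs.

```typescript
interface Departure { employeeId: string; noticeDay: number }
interface FiredAlert { employeeId: string; day: number; signalCount: number }
interface BacktestReport { recall: number; falsePositiveRate: number }

function backtest(
  departures: Departure[],
  alerts: FiredAlert[],
  population: string[],
  windowDays: number = 21,
  minSignals: number = 3
): BacktestReport {
  const qualifying = alerts.filter((a) => a.signalCount >= minSignals);

  // A departure counts as caught if a qualifying alert fired inside
  // the window immediately before the notice date.
  const caught = departures.filter((d) =>
    qualifying.some(
      (a) =>
        a.employeeId === d.employeeId &&
        a.day < d.noticeDay &&
        a.day >= d.noticeDay - windowDays
    )
  ).length;

  // ASSUMPTION: a false positive is a stayer flagged at least once.
  const stayers = population.filter(
    (id) => !departures.some((d) => d.employeeId === id)
  );
  const falselyFlagged = stayers.filter((id) =>
    qualifying.some((a) => a.employeeId === id)
  ).length;

  return {
    recall: departures.length === 0 ? 0 : caught / departures.length,
    falsePositiveRate: stayers.length === 0 ? 0 : falselyFlagged / stayers.length,
  };
}

// Hypothetical history: five people, two resignations.
const population = ["A", "B", "C", "D", "E"];
const departures: Departure[] = [
  { employeeId: "A", noticeDay: 100 },
  { employeeId: "B", noticeDay: 200 },
];
const alerts: FiredAlert[] = [
  { employeeId: "A", day: 90, signalCount: 3 }, // inside A's window
  { employeeId: "B", day: 150, signalCount: 4 }, // 50 days early: outside the window
  { employeeId: "C", day: 50, signalCount: 3 }, // stayer, flagged
  { employeeId: "D", day: 60, signalCount: 2 }, // below the signal minimum
];
console.log(backtest(departures, alerts, population));
```

In this toy history the model catches one of two departures and flags one of three stayers, well short of the 70–80% and under-15% targets, which is exactly the case where weights and baselines need tuning before go-live.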
Weak signal detection for people health works precisely because it refuses to be dramatic. No urgent alerts. No risk scores leaking into leadership meetings. The system quietly notices when multiple small things drift the same direction on the same person, and it nudges someone toward a conversation. That is the whole product surface.
The engineering — entity resolution, z-score baselines, composite weighting, temporal persistence — is genuinely interesting work. The system's value is measured in conversations started, not in dashboards built. The best outcome is a manager who walks into a 1:1 and says "Hey, I noticed we haven't caught up in a while — how are things going?" and actually means it.
Start with entity resolution. Get the identity graph right before anything else. Add one signal at a time and validate against historical data before flipping any switch. Ship the employee self-service dashboard before any manager alert exists. Trust gets built before features do. The technology is not the whole problem — it's the part most teams will mistake for the whole problem, and that mistake is what turns a useful system into one more dashboard nobody reads.