Most first AI picks fail because the wrong workflow was chosen, not because of the model. Score risk, value, and signal quality as separate axes. Treat your first three pilots as three different questions about the organization. Pick boring. Pick measurable. Pick diverse.
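One way to keep the three axes separate is to filter on each one independently instead of collapsing them into a single weighted score. The sketch below is illustrative only: the `Candidate` fields, the 1-to-5 scales, the thresholds, the `area` grouping used to enforce diversity, and the sample workflows are all assumptions, not values from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    area: str            # hypothetical grouping used to keep the pilots diverse
    risk: int            # 1 (low) .. 5 (high), assumed scale
    value: int           # 1 (low) .. 5 (high), assumed scale
    signal_quality: int  # 1 (hard to measure) .. 5 (clean metric), assumed scale


def shortlist(candidates, max_risk=2, min_signal=4, picks=3):
    """Filter on risk and signal quality separately, then take one pilot per area."""
    # "Boring" and "measurable": each axis has its own cutoff, no blended score.
    eligible = [
        c for c in candidates
        if c.risk <= max_risk and c.signal_quality >= min_signal
    ]
    # Among eligible candidates, prefer higher value, then lower risk.
    eligible.sort(key=lambda c: (-c.value, c.risk, c.name))

    chosen, seen_areas = [], set()
    for c in eligible:
        if c.area not in seen_areas:  # "diverse": at most one pilot per area
            chosen.append(c)
            seen_areas.add(c.area)
        if len(chosen) == picks:
            break
    return chosen


# Made-up candidates, for illustration only.
candidates = [
    Candidate("invoice triage", "finance", risk=2, value=4, signal_quality=5),
    Candidate("contract drafting", "legal", risk=5, value=5, signal_quality=2),
    Candidate("ticket tagging", "support", risk=1, value=3, signal_quality=5),
    Candidate("expense coding", "finance", risk=1, value=3, signal_quality=4),
    Candidate("release notes", "engineering", risk=1, value=2, signal_quality=4),
]
pilot_names = [c.name for c in shortlist(candidates)]
```

With this sample data, the highest-value candidate ("contract drafting") is excluded because it fails the risk and signal cutoffs, which a single blended score could easily have hidden.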
Roughly 88% of experiments do not produce a clean primary-metric win. The bottleneck is interpreting the experiments that have already concluded, not running more of them. An agent that pulls results, retrieves related history, cross-references releases, and proposes the next three tests closes that interpretation gap.
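The first three agent steps (pull results, retrieve related history, cross-reference releases) can be sketched as plain data lookups. This is a minimal sketch under stated assumptions: the `Experiment` and `Release` records, the `surface` field, the `build_brief` helper, and every sample value are hypothetical, and the final step of proposing the next tests is left to whatever model or reviewer consumes the brief.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Experiment:
    name: str
    surface: str     # hypothetical label for the product area under test
    start: date
    end: date
    clean_win: bool  # True if the primary metric moved clearly


@dataclass(frozen=True)
class Release:
    name: str
    shipped: date


def build_brief(target, history, releases):
    """Collect the context needed to interpret one concluded experiment."""
    # Related history: other experiments on the same surface.
    related = [
        e.name for e in history
        if e.surface == target.surface and e.name != target.name
    ]
    # Releases that shipped while the experiment was running may confound it.
    overlapping = [
        r.name for r in releases
        if target.start <= r.shipped <= target.end
    ]
    return {
        "experiment": target.name,
        "related_history": related,
        "overlapping_releases": overlapping,
    }


# Made-up records, for illustration only.
history = [
    Experiment("checkout-cta-v1", "checkout", date(2024, 1, 8), date(2024, 1, 22), False),
    Experiment("checkout-cta-v2", "checkout", date(2024, 3, 4), date(2024, 3, 18), False),
    Experiment("search-ranking", "search", date(2024, 3, 1), date(2024, 3, 15), True),
]
releases = [
    Release("payments-sdk-upgrade", date(2024, 3, 10)),
    Release("search-index-rebuild", date(2024, 4, 2)),
]

# Pull only the experiments without a clean primary-metric win.
inconclusive = [e for e in history if not e.clean_win]
briefs = [build_brief(e, history, releases) for e in inconclusive]
```

Each brief pairs an inconclusive experiment with its related history and any release that overlapped its run, which is the raw material for deciding what to test next.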