A production playbook for detecting AI agent failures that look successful in the UI but fail the user's actual task.