A reusable teardown rubric for reader-submitted AI apps, focused on visible failures, hidden risks, and what to fix first.
A reader-submitted teardown can go wrong in two ways. It can become a roast, which teaches nothing. Or it can become a generic audit, which sounds responsible while dodging the app's actual user promise.
The fix is to publish the rubric before the first app arrives. Builders should know what evidence to send, what will be tested, what will not be scored, and how findings will be ranked. That also protects the site from fake specificity. Until a real submission exists, the honest article is not 'we tore down your app.' It is 'here is the teardown contract.'
The rubric is framed as a resource, not as a fake case study, and it borrows from three published sources. The MAST paper gives a useful failure taxonomy for multi-agent systems: specification and system design failures, inter-agent misalignment, and task verification or termination failures. LangChain's observability material points to what a reviewer should be able to inspect: traces, tool calls, state transitions, latency, cost, and retrieval steps. OWASP's Top 10 for LLM Applications names LLM-specific risks. Together they give the review a basis that is fair to the builder.
A teardown without runnable evidence becomes commentary.
The minimum packet is small: one sentence describing the core workflow, a test account with non-admin permissions, one successful run, one failed or uncertain run, known limitations, provider list, and consent to publish findings. If the app touches user data, the packet also needs sample data that can be safely tested. Do not ask reviewers to infer the product promise from a screenshot.
The reviewer should refuse submissions that require production customer data, private credentials, or unclear permission. A teardown is not worth becoming a security incident. The packet should include temporary credentials, fake data, and a path to revoke access after review.
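The intake rules above can be checked mechanically before a reviewer opens the app. The following is a minimal sketch, not a reference implementation: the `EvidencePacket` field names and the 14-day credential cutoff are assumptions made for illustration, and known limitations are recorded but not required, since a builder may honestly have none to report.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class EvidencePacket:
    """Hypothetical intake record; every field name here is illustrative."""
    workflow: str                              # one sentence describing the core workflow
    credentials_admin: bool                    # True if the test account has admin rights
    credentials_expire_at: Optional[datetime]  # None means the credentials are long-lived
    uses_production_data: bool
    successful_run: str                        # link or note for one successful run
    failed_run: str                            # link or note for one failed or uncertain run
    known_limitations: List[str] = field(default_factory=list)
    providers: List[str] = field(default_factory=list)
    consent_to_publish: bool = False
    revocation_plan: str = ""


# Assumed cutoff for "temporary" credentials; the rubric does not name a number.
MAX_CREDENTIAL_LIFETIME = timedelta(days=14)


def refusal_reasons(packet: EvidencePacket, now: datetime) -> List[str]:
    """Return reasons to refuse the packet; an empty list means review can begin."""
    reasons = []
    if not packet.workflow.strip():
        reasons.append("missing workflow sentence")
    if packet.credentials_admin:
        reasons.append("test account must be non-admin")
    if packet.credentials_expire_at is None:
        reasons.append("credentials are long-lived")
    elif packet.credentials_expire_at - now > MAX_CREDENTIAL_LIFETIME:
        reasons.append("credentials outlive the review window")
    if packet.uses_production_data:
        reasons.append("production customer data is not accepted")
    if not packet.successful_run.strip():
        reasons.append("missing successful run")
    if not packet.failed_run.strip():
        reasons.append("missing failed or uncertain run")
    if not packet.providers:
        reasons.append("missing provider list")
    if not packet.consent_to_publish:
        reasons.append("no consent to publish findings")
    if not packet.revocation_plan.strip():
        reasons.append("no revocation plan")
    return reasons
```

Returning a list of reasons rather than a single pass/fail flag lets the reviewer send the builder everything that needs fixing in one reply.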
The packet also prevents taste drift. If the stated workflow is 'generate a renewal-risk summary for an account manager,' the teardown tests that path. It does not score the logo, the market, or whether the founder picked the right idea. The review is about production evidence.
A strong packet also names what the builder already suspects. That does not bias the review. It makes it faster. If the builder says the import sometimes stalls, the teardown can test stalled imports directly. If the builder knows the row-level security (RLS) policies were generated in a hurry, the review can inspect second-user access first. Hiding known concerns wastes the review on discovery instead of resolution.
Publication needs a separate standard. Findings should redact credentials, customer data, private URLs, and any detail that would help someone attack the submitted app. The public report should show enough evidence for readers to learn the production pattern without exposing the builder to unnecessary risk.
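A first redaction pass over finding text can be automated, with a human read still required before publication. The sketch below is illustrative only: the three patterns are assumptions, they will miss app-specific identifiers, and the URL rule deliberately over-redacts by masking every URL rather than trying to guess which ones are private.

```python
import re

# Illustrative patterns only; a real review would tune these per submission.
REDACTION_PATTERNS = [
    # key=value or key: value pairs that look like credentials
    (
        re.compile(r"\b(api[_-]?key|token|secret|password)\s*[:=]\s*\S+", re.IGNORECASE),
        r"\1=[REDACTED]",
    ),
    # email addresses, which often identify customers
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED_EMAIL]"),
    # every URL, private or not (over-broad on purpose)
    (re.compile(r"https?://\S+"), "[REDACTED_URL]"),
]


def redact(text: str) -> str:
    """Apply each pattern in order and return the masked text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

For example, `redact("password: hunter2")` returns `password=[REDACTED]`. The name of the field survives so readers can still see what kind of value leaked, while the value itself does not.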
Workflow, credentials, success evidence, failure evidence, known limits, and provider list are enough to begin.
P0 blocks launch. P1 blocks confidence. P2 improves operations or polish after the dangerous issues close.
A submission should never send production customer data or long-lived credentials to a reviewer.
| Area | Evidence required | Fix priority |
|---|---|---|
| Core workflow | Browser run with primary action completed | P0 if blocked |
| Data boundary | Second-user access test and policy review | P0 if exposed |
| Agent behavior | Trace for tool calls, retrieval, and state transitions | P1 if invisible |
| Eval coverage | Golden cases for risky workflow | P1 if absent |
| Cost path | Per-workflow spend, retry count, and provider path | P2 unless runaway |
| Recovery | Failed external call preserves user work | P1 if data loss |
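
The data boundary row asks for a second-user access test. One way to picture it: create a row as one user, then try to read it as another. The sketch below runs that check against an in-memory stand-in, so `RowStore` and its methods are hypothetical; against a submitted app, the same two steps would go through two real test accounts.

```python
class RowStore:
    """In-memory stand-in for the app's data layer; hypothetical, for illustration."""

    def __init__(self, enforce_owner_check: bool = True):
        self._rows = {}
        self._enforce = enforce_owner_check

    def insert(self, owner_id: str, row_id: str, body: str) -> None:
        self._rows[row_id] = {"owner": owner_id, "body": body}

    def read(self, requester_id: str, row_id: str) -> str:
        row = self._rows[row_id]
        # This check plays the role a row-level policy would play in a real database.
        if self._enforce and row["owner"] != requester_id:
            raise PermissionError(f"{requester_id} may not read {row_id}")
        return row["body"]


def second_user_access_exposed(store: RowStore) -> bool:
    """Create a row as user A, then try to read it as user B.

    Returns True when user B can read the row, which the rubric treats as P0.
    """
    store.insert("user-a", "row-1", "fake renewal note")
    try:
        store.read("user-b", "row-1")
    except PermissionError:
        return False
    return True
```

The test uses fake data and two non-admin identities, which matches what the evidence packet is supposed to supply.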
A fair teardown does not flatten every issue into a recommendation.
P0 findings are launch blockers: data exposure, broken primary workflow, irreversible side effects, client-side secrets, or unsafe actions the user did not authorize. These are not 'improvements.' They are reasons to keep the app away from real users.
P1 findings block confidence but may not block a private beta. Missing traces, no eval cases, vague recovery states, no rollback path, and unknown cost paths sit here. They make the app hard to operate. They also tend to become P0 under traffic because the team cannot see what failed.
P2 findings improve the product after the dangerous work is handled. Copy clarity, loading states, dashboard polish, and taxonomy cleanup matter, but they should not compete with exposed data. The teardown report should make that ordering obvious.
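The ordering can be made explicit in the report itself. This is a minimal sketch under stated assumptions: the `Finding` shape and the three decision labels are illustrative, and "private beta at most" is one reading of the statement that P1 findings may not block a private beta.

```python
from dataclasses import dataclass
from typing import List

PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


@dataclass(frozen=True)
class Finding:
    priority: str  # "P0", "P1", or "P2"
    area: str
    summary: str


def rank(findings: List[Finding]) -> List[Finding]:
    """Order findings so blockers come first; ties keep their original order."""
    return sorted(findings, key=lambda f: PRIORITY_ORDER[f.priority])


def launch_decision(findings: List[Finding]) -> str:
    """Collapse the findings into one decision (labels are assumptions)."""
    priorities = {f.priority for f in findings}
    if "P0" in priorities:
        return "do not launch"
    if "P1" in priorities:
        return "private beta at most"
    return "ready to launch"
```

Because the decision depends only on the highest priority present, a long list of P2 polish items can never outweigh a single exposed-data finding.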
A weak teardown:
- Reviewer comments on every visible issue with equal weight
- Findings include taste and market judgments
- No clear evidence packet before review
- Builder receives a list but not a launch decision

A fair teardown:
- Findings are ranked by user harm and operating risk
- Review excludes taste, market size, and idea judgment
- Submission requires runnable workflow and safe credentials
- Report separates blockers, confidence gaps, and follow-up fixes
One sentence describing the user workflow to test.
Temporary non-admin credentials with fake or consented data.
Known limitations the builder already sees.
A screen recording or trace for one successful run.
A screen recording, trace, or description for one failed or uncertain run.
List of model, database, auth, payment, and hosting providers involved.
Permission to publish findings and redact sensitive details.
A revocation plan for review credentials after the teardown.
Confirm the reviewer can run the workflow with fake or consented data and short-lived credentials.
Visible behavior sets the review frame. The code and traces then explain what happened.
Separate P0 blockers, P1 confidence gaps, and P2 improvements so the builder knows what to fix first.
Keep the concept by making the promise honest.
An earlier draft of this piece risked implying that a real submitted app had already been reviewed. That would weaken trust. The published version is explicit: this is the rubric for future reader submissions, and it is published now so the review standard is visible.
That framing also creates a better product surface. Readers can submit apps against a known packet. Future teardown articles can reuse the same scoring model. The site gains a repeatable format without inventing experience it does not yet have.
The strongest version of this article is part editorial policy and part intake form. It tells builders what a serious submission looks like, tells readers why findings are ranked the way they are, and gives future teardown posts a standard they can be judged against. That is more valuable than a premature case study.
It also protects the publication voice. AI Native Builders should be blunt, but it should not be careless. Publishing the rubric before the submissions says the site is willing to inspect real apps under a clear contract, not ambush builders for entertainment.
A good teardown should end with retest evidence, not just criticism. For each P0 or P1 finding, the report should say what would prove closure: a passing browser workflow, a second-user policy test, a trace showing the corrected tool path, or an eval case that now passes.
That closure rule keeps the review useful after publication, because every serious finding carries its own retest.
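The closure rule can be tracked with two small checks: one for findings whose report entry does not yet say what would prove closure, and one for findings that have not passed their retest. The `TrackedFinding` shape below is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackedFinding:
    priority: str            # "P0", "P1", or "P2"
    summary: str
    closure_proof: str = ""  # what would prove the fix, e.g. "second-user policy test"
    retest_passed: bool = False


def needs_closure_proof(findings: List[TrackedFinding]) -> List[TrackedFinding]:
    """P0/P1 findings whose report entry does not yet name a closure proof."""
    return [
        f for f in findings
        if f.priority in ("P0", "P1") and not f.closure_proof.strip()
    ]


def still_open(findings: List[TrackedFinding]) -> List[TrackedFinding]:
    """P0/P1 findings that have not passed their retest."""
    return [
        f for f in findings
        if f.priority in ("P0", "P1") and not f.retest_passed
    ]
```

P2 findings are left out of both checks on purpose, since the closure rule applies to P0 and P1 findings only.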
The article keeps its place as long as its title and body present the reader-submitted promise as a process, not as an event that has already happened.
Can I submit an app that uses real customer data?
No. Use fake data, anonymized data, or a dedicated review workspace. A teardown should not require production customer records or long-lived credentials.
Will the teardown review product-market fit?
No. The review is limited to production readiness: workflow completion, data boundaries, agent behavior, evals, cost, and recovery.
What makes a finding P0?
P0 means the app should not launch to real users until the issue is fixed: data exposure, a broken core workflow, irreversible side effects, client-side secrets, or unsafe actions the user did not authorize.