A security playbook for reviewing AI-generated code before it turns into production exposure.
The uncomfortable part of AI-generated code is not that it sometimes fails. Human code fails too. The uncomfortable part is that generated code can look calm while skipping the boring security checks a senior engineer would expect.
Veracode's 2025 GenAI code security research reported that 45 percent of AI-generated code samples introduced OWASP Top 10 vulnerabilities. The same source says the study covered more than 100 large language models across 80 coding tasks. That number should not be used as panic bait. It should change the release process. Any team accepting AI-written code needs a security gate designed for confident omissions.
This article stays because it anchors the security pillar in a concrete risk. The site should not say 'use AI carefully' and stop there. It should tell builders how to review the code that arrives faster than the old review model can absorb.
The generated code usually fails where the prompt did not name the real boundary.
A model can produce a login form, API handler, database call, and admin page without knowing the organization's permission model. It can infer the shape of code. It cannot infer the rule that contractors may read only assigned projects, that support agents may impersonate users only through an audited path, or that a draft invoice should be invisible until approved. When the prompt omits the boundary, the generated code often picks the easiest path.
That is why the review should start with data flow, not style. Trace user input to server code, database writes, model prompts, rendered output, logs, and third-party calls. Mark every place untrusted input crosses into privileged code. Then inspect whether the generated code validates, authorizes, sanitizes, limits, and logs at that boundary.
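As a minimal sketch of an authorization check at one such boundary, the example below encodes the contractor rule mentioned earlier. The `User` shape, the role names, the assumption that employees may read every project, and the `get_project` helper are all hypothetical; a real permission model would come from the organization, not from this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    # Hypothetical user record; role names are illustrative only.
    user_id: str
    role: str  # "employee" or "contractor"
    assigned_projects: frozenset = field(default_factory=frozenset)


class Forbidden(Exception):
    """Raised when the server-side rule denies access."""


def authorize_project_read(user, project_id):
    """Enforce the rule on the server and deny by default."""
    # Assumption for this sketch: employees may read any project.
    if user.role == "employee":
        return
    # Contractors may read only the projects assigned to them.
    if user.role == "contractor" and project_id in user.assigned_projects:
        return
    # Unknown roles and unassigned projects fall through to a denial.
    raise Forbidden(f"{user.user_id} may not read project {project_id}")


def get_project(user, project_id, projects):
    # project_id is untrusted input, so the check runs before any data access.
    authorize_project_read(user, project_id)
    return projects[project_id]
```

The deny-by-default shape matters more than the specific roles: a role the code has never heard of gets no access rather than the easiest path.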
OWASP's LLM application risks make this broader than ordinary web security. Prompt injection, sensitive information disclosure, improper output handling, excessive agency, and unbounded consumption all appear when model output is treated as if it came from trusted application code. AI-written code can create the classic bug and the LLM-specific bug in the same path.
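One way to picture the output-handling risk is a renderer that treats model output as untrusted text. This is a sketch only: `render_model_reply` and the surrounding markup are hypothetical, and it covers the HTML sink, not SQL, shell, or tool sinks.

```python
import html


def render_model_reply(model_output):
    """Escape model output before placing it inside an HTML template."""
    # html.escape converts <, >, &, and quotes, so markup in the output
    # is displayed as text instead of being interpreted by the browser.
    return f'<div class="reply">{html.escape(model_output)}</div>'
```

A reply containing `<script>alert(1)</script>` would be rendered as visible text rather than executed.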
Veracode reported that 45 percent of AI-generated code samples failed security tests in its 2025 GenAI code research.
Veracode described the study as covering more than 100 large language models across security-sensitive coding tasks.
Secrets, auth, data policies, output handling, and side-effect controls should be checked before merge.
| Review area | Question to answer | Release blocker |
|---|---|---|
| Secrets | Can any secret, service role, or private token reach browser code or logs? | Any secret crosses a client or log boundary |
| Authorization | Does the server enforce the same ownership rule the UI implies? | UI-only access control |
| Data policy | Do database policies match the product's tenant and role model? | Exposed tables without policy coverage |
| Output handling | Is model output validated before rendering, storage, or action? | Raw output drives HTML, SQL, shell, or privileged actions |
| Agency | Can generated code call tools or mutate state beyond the user's intent? | Broad tool permissions without server checks |
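The agency row can be sketched as an allow-list check that runs before any model-proposed tool call is executed. The tool names, the JSON shape with `tool` and `args` keys, and the `parse_tool_call` helper are assumptions made for illustration, not a real framework's API.

```python
import json

# Hypothetical allow-list: tool name -> required argument names and types.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id": str},
    "send_receipt": {"order_id": str, "email": str},
}


class RejectedToolCall(Exception):
    """Raised when model output does not describe an allowed tool call."""


def parse_tool_call(raw_model_output):
    """Parse a model-proposed tool call and reject anything unexpected."""
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise RejectedToolCall("output is not valid JSON") from exc
    if not isinstance(call, dict):
        raise RejectedToolCall("output is not a JSON object")
    name = call.get("tool")
    args = call.get("args")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise RejectedToolCall(f"tool not allowed: {name!r}")
    schema = ALLOWED_TOOLS[name]
    # Require exactly the expected argument names, no extras and no gaps.
    if not isinstance(args, dict) or set(args) != set(schema):
        raise RejectedToolCall("unexpected or missing arguments")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise RejectedToolCall(f"wrong type for argument {key!r}")
    return name, args
```

Passing this check only shows the call is well formed. The server would still need to confirm that the current user is allowed to act on the specific `order_id` before running the tool.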
Security instructions help generation, but they do not replace verification.
Security prompting is still useful. Ask for parameterized queries, least-privilege credentials, output validation, and tests. Ask the model to name assumptions. Ask it to produce a threat model for the changed path. Those prompts may improve the draft. They do not prove the draft is safe.
The release gate should assume the generated code is incomplete until evidence says otherwise. A scanner can catch known vulnerability classes. Unit and integration tests can catch authorization and validation behavior. Browser checks can show that the UI does not complete an unsafe path by accident. Human review can identify product rules the model could not infer. Each layer catches a different kind of omission.
The most important tests are often small. A test showing that a second user is denied when reading another tenant's row is more valuable than a broad claim that 'auth was reviewed.' A test that sends model output containing markup, SQL-looking text, or tool arguments through the renderer is more useful than a note saying 'sanitize output.' Security confidence needs executable evidence.
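A small test of this kind might look like the sketch below, which uses an in-memory SQLite database. The `invoices` table, the `fetch_invoice` helper, and the tenant IDs are hypothetical; the point is the shape of the evidence, not the schema.

```python
import sqlite3


def fetch_invoice(conn, invoice_id, tenant_id):
    """Return the invoice row only if it belongs to the caller's tenant."""
    # Parameterized query: both values are bound, never concatenated into SQL.
    return conn.execute(
        "SELECT id, tenant_id, total FROM invoices WHERE id = ? AND tenant_id = ?",
        (invoice_id, tenant_id),
    ).fetchone()


def test_second_tenant_cannot_read_invoice():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, tenant_id TEXT, total REAL)"
    )
    conn.execute("INSERT INTO invoices VALUES (1, 'tenant-a', 120.0)")

    # The owning tenant can read its own row.
    assert fetch_invoice(conn, 1, "tenant-a") == (1, "tenant-a", 120.0)
    # A second tenant asking for the same id gets nothing back.
    assert fetch_invoice(conn, 1, "tenant-b") is None
    # SQL-looking input is bound as a value and matches no row.
    assert fetch_invoice(conn, "1 OR 1=1", "tenant-b") is None
```

In a real service the `tenant_id` argument would come from the authenticated session on the server, not from a value the client sends.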
Generated code compiles and the screen works
Prompt asked the model to follow secure practices
Reviewer scans for obvious mistakes in the diff
Security is postponed until the feature matters
Changed trust boundaries are traced before merge
Auth, policy, output, and secret checks are executable
Scanner output is mapped to the affected workflow
High-blast-radius paths block launch until fixed
Trace all user input from browser to server, model, database, logs, and third-party calls.
Search the client bundle for secret, service-role, and private provider credentials.
Test server-side authorization with a non-owner user.
Verify database policies for read, write, update, delete, and storage access.
Validate model output before rendering, persistence, tool execution, or external calls.
Run SAST, dependency, and secret scans on generated code.
Add regression tests for every security finding fixed in the review.
Block merge for data exposure, broad tool permissions, and non-idempotent unsafe actions.
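The client-bundle search in this checklist can be roughed out in a few lines. The patterns below are illustrative assumptions and will both miss real secrets and flag harmless strings, so this is a quick first pass and not a substitute for a maintained secret scanner. The function names and the `*.js` glob are hypothetical.

```python
import re
from pathlib import Path

# Illustrative patterns only; tune or replace them for a real codebase.
SECRET_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "service role reference": re.compile(r"service[_-]?role", re.IGNORECASE),
    "hardcoded secret assignment": re.compile(
        r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"
    ),
}


def scan_text(text):
    """Return the labels of every pattern found in the text."""
    return [label for label, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def scan_bundle(directory):
    """Scan built JavaScript files and map each flagged path to its findings."""
    findings = {}
    for path in Path(directory).rglob("*.js"):
        hits = scan_text(path.read_text(errors="ignore"))
        if hits:
            findings[str(path)] = hits
    return findings
```

Running `scan_bundle` against the build output directory before merge gives a reviewer a short list of files to inspect by hand.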
Name whether the generated code touches auth, data, money, file upload, model output, external calls, or privileged tools.
Before hardening, reproduce the risky behavior with a second user, malformed output, missing permission, or unsafe tool argument.
A fixed vulnerability needs a passing test, scanner result, or review note tied to the exact path that failed.
The pivot needs one piece that makes AI-code security impossible to wave away.
The 45 percent number is useful because it stops the conversation from drifting into taste. Whether a team likes AI coding tools is irrelevant. The question is whether its release process can absorb a higher volume of code that may omit security context.
The article should avoid pretending all AI-written code is unsafe or that human-written code is safer by default. The defensible position is narrower: generated code needs a review process calibrated to its failure pattern. That means boundary tracing, tests, scanner output, and explicit ownership.
The review process should also capture why the model made the mistake when that is visible. If the generated route trusted a client-side workspace ID, the missing context may be the tenant model. If it exposed a secret, the missing context may be deployment boundaries. If it rendered raw model output, the missing context may be the downstream sink. Those notes make future prompts and review checklists better without pretending prompts alone are the fix.
This gives the article a useful edge: it is not another warning that AI code can be risky. It is a workflow for converting AI-code risk into release evidence.
This piece is viable for the corpus because it gives builders a security gate they can run this week. It also links naturally to the Lovable teardown, prompt-change article, and production checklist.
Does AI-generated code fail security more than human code?
The safer claim is that AI-generated code can omit security context while appearing complete. Compare it through your own review and test gates rather than assuming either source is safe.
What should block an AI-generated code PR?
Client-side secrets, missing server authorization, database tables exposed without policy coverage, unsafe output handling, broad tool permissions, and untested non-idempotent actions should block merge.
Can better prompting fix the 45 percent problem?
Better prompting can improve drafts, but it is not verification. Keep prompts, tests, scanners, and human review as separate layers.