We Built 35 AI-Generated Apps and Put Every One Through a Security Scanner

By Tom Raef, Founder – We Watch Your Website

There’s a moment in every security researcher’s career when something stops being theoretical and becomes undeniably real. For me, that moment came about halfway through this project, when I realized that every single platform we tested, without exception, was telling its users their app was ready to deploy. Our scanner told a different story.

This is that story.

The Setup

We Watch Your Website has spent 18 years protecting WordPress sites. We know what malicious code looks like, how attackers get in, and what developers miss when they’re moving fast. When AI-generated application platforms started gaining real traction, we got curious: what does our scanner see when it looks at code a non-developer built in an afternoon?

To find out, we built a research series around a simple premise. Take 5 representative web applications, the kind a small business or startup might actually need, and build each one on 7 different AI platforms using identical prompts. Same apps, same requirements, same inputs. Then run every one through our automated security pipeline and see what comes out.

The five applications:

A SaaS dashboard with authentication and an admin panel
An API key manager and secret vault
A file upload and sharing service
An invoice generator with Stripe payment integration
A CRM with role-based access and data export

Thirty-five applications total. Thousands of lines of generated code. All of it analyzed by the same pipeline, under the same conditions, with the same methodology.

The Pipeline

Before getting into findings, it’s worth explaining how we analyze code, because the methodology matters as much as the results.

1. Secrets Scan: Committed keys, tokens, and hardcoded credentials in source and git history.
2. AST Scan: Static pattern matching across the source tree for known vulnerability patterns.
3. Dependency Audit: CVE matching against package.json and lockfiles. Abandoned package detection.
4. LLM Deep Review: Full multi-file context analysis tracking tainted input across file boundaries to dangerous sinks. Not pattern matching. Actual code comprehension.
5. AST + Source Map Analysis: Bundle analysis and source map recovery for deployed applications.
6. (Part of #5)
7. Adversarial Verification: Every finding challenged by a second LLM pass before confirmation. 82% noise reduction. 307 of 375 findings dismissed as false positives.
8. Attack Chain Construction: Verified findings assembled into multi-step exploitation scenarios showing how an attacker moves from zero access to full compromise.
9. Live Penetration Testing: Headless browser and HTTP probes against the deployed application. Unauthenticated route access, JWT tampering, localStorage injection. Screenshots as evidence.
10. Automated Fix Generation: For every confirmed live vulnerability, the pipeline analyzes the actual source and generates a specific code patch, a unified diff against the real files that need to change.

Our pipeline runs 10 stages against every application. Early stages scan for committed secrets, run static analysis rules, check dependencies against vulnerability databases, and audit packages for abandonment. Stage 4 uses a large language model to perform deep code review with full multi-file context, tracking tainted user input across file boundaries to identify where it reaches dangerous sinks, something single-file scanners miss entirely. This isn’t pattern matching. It’s actually understanding what the code is trying to do and where it fails.

Then comes Stage 7.

Stage 7 is what separates our tool from every static scanner on the market. After the first six stages generate findings, Stage 7 subjects each one to adversarial verification. A second LLM pass asks a pointed question: is this finding real and exploitable, or is it noise? Every finding gets challenged before it’s confirmed. The result is a dramatic reduction in false positives. Across this research, Stage 7 dismissed 307 of 375 total findings, an 82% noise reduction rate.

Without Stage 7, we’d have published 375 scary-sounding findings and called it a day. With it, we published 68 confirmed, verified, exploitable vulnerabilities. The difference matters enormously. A tool that cries wolf trains developers to ignore alerts. A tool that only speaks when it has something real to say gets listened to.
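The arithmetic behind those figures is easy to check. A quick sketch that uses only the counts reported above:

```python
# Counts reported in this research.
total_findings = 375
dismissed_as_false_positive = 307

confirmed = total_findings - dismissed_as_false_positive
noise_reduction = dismissed_as_false_positive / total_findings

print(confirmed)                 # 68
print(f"{noise_reduction:.1%}")  # 81.9%, reported above as 82%
```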

Stage 8 takes the verified findings and constructs attack chains — multi-step exploitation scenarios showing how individual findings combine into real-world attacks. Where a single finding might look manageable in isolation, an attack chain shows a CTO exactly how an attacker moves from zero access to full compromise in a sequence of concrete steps.

Stage 9 goes further still. After static analysis identifies and verifies a finding, Stage 9 deploys a headless browser and targeted HTTP probes against the live application. It navigates to protected routes without authenticating. It sends JWTs with the algorithm set to none. It injects fabricated tokens into localStorage and reloads the page. If it can confirm the vulnerability is exploitable on the deployed app right now, not in theory, but in practice, it captures the response, takes a screenshot, and records the exact technique that worked.
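Seen from the defensive side, the "algorithm set to none" probe only succeeds when a server lets the token choose how it gets verified. What follows is a minimal sketch of a verifier that pins HS256 on the server, written in Python with only the standard library. The function names are illustrative and are not taken from any application we tested. The sketch skips expiry and audience checks, so production code should rely on a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Issue an HS256 token (included only so the sketch is self-contained)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = b64url_encode(json.dumps(claims).encode("utf-8"))
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{b64url_encode(mac.digest())}"


def verify_hs256(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims for a validly signed HS256 token, otherwise None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header_b64, payload_b64, signature_b64 = parts
    try:
        header = json.loads(b64url_decode(header_b64))
        # The server pins the algorithm; the token's "alg" is checked, never obeyed.
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(b64url_decode(signature_b64), expected):
            return None
        claims = json.loads(b64url_decode(payload_b64))
    except ValueError:
        # Covers malformed base64, invalid JSON, and non-ASCII input.
        return None
    return claims if isinstance(claims, dict) else None
```

A token whose header declares "alg": "none" and whose signature segment is empty is rejected at the algorithm check, before its claims are ever read.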

The reason this matters: static analysis tells you what the code says. Live testing tells you what the application does. Those are not always the same thing. A finding that looks dangerous in source code might be protected by infrastructure that the source scan can’t see. A finding that seems minor might be catastrophically exploitable because of a deployment decision that was made after the code was written. Stage 9 resolves the ambiguity.

Stage 10 closes the loop. For every vulnerability confirmed live by Stage 9, Stage 10 loads the original source code and uses an LLM to generate a specific, actionable fix, a unified diff against the actual files that need to change, not generic advice about security best practices. It identifies the root cause, generates the patch, and explains how to verify the fix is working. The output is something a developer can open, review, and apply.

Find it. Verify it. Prove it’s exploitable. Fix it. That’s the full pipeline.

What We Found

The number that matters most

Thirty-five applications analyzed. One clean pass.

Every other application had at least one confirmed security finding. Six were serious enough that our pipeline would have blocked deployment entirely. The platforms building those applications all reported the apps were ready to go live.

That gap, between what the platform says and what independent analysis finds, is the core finding of this research.

The finding that appeared everywhere

If there is one vulnerability that defines AI-generated web applications in 2026, it is client-side authorization.

Across every platform we tested, applications consistently enforced access control decisions (who is an admin, who can see what data, who can perform which operations) using JavaScript state in the user’s browser. This is fundamentally broken. Client-side state is entirely under the user’s control. Opening browser developer tools and changing a variable from false to true is not a sophisticated attack. It takes fifteen seconds and requires no technical background.

The pattern appeared in applications built on every platform in our research. It appeared in SaaS dashboards, file sharing apps, CRMs, and invoice generators. It appeared regardless of which backend technology the platform used. It is not a bug in any particular platform’s code generator; it is a systematic failure of AI code generation to understand the difference between “what the user sees” and “what the server trusts.”

A real attacker doesn’t need to find a clever exploit. They just need DevTools.
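For contrast, here is a minimal sketch of what "the server trusts" can look like. It assumes a hypothetical in-memory session store and an admin-only route; the names SESSIONS and handle_admin_request are illustrative, not drawn from any tested application. The role comes from state the server wrote at login, and any flag the browser sends is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user_id: str
    role: str  # written by the server at login, never by the client


# Hypothetical server-side session store, keyed by an opaque session ID.
SESSIONS = {
    "sess-abc": Session(user_id="u1", role="user"),
    "sess-xyz": Session(user_id="u2", role="admin"),
}


def handle_admin_request(session_id, request_body):
    """Return (status_code, body) for a request to an admin-only route."""
    session = SESSIONS.get(session_id)
    if session is None:
        return 401, {"error": "authentication required"}
    # Anything like request_body["isAdmin"] is deliberately ignored:
    # the client controls that value, so it carries no authority.
    if session.role != "admin":
        return 403, {"error": "forbidden"}
    return 200, {"users": ["u1", "u2"]}


# A regular user who flips a client-side flag still gets a 403.
print(handle_admin_request("sess-abc", {"isAdmin": True}))
```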

What live testing looked like in practice

One CRM application in our research is worth describing in concrete detail, because the numbers don’t fully capture what it felt like to watch Stage 9 run against it.

Static analysis flagged seven findings, all related to authentication being enforced only on the client side. Stage 7 confirmed all seven. Stage 9 then deployed a headless browser against the live application. In under two minutes, it had navigated to fourteen distinct protected routes, including the admin panel, the user management page, the contacts list, the leads dashboard, and the orders view, without providing any credentials at all. Every route served its full contents. Every probe returned HTTP 200 with real data.

The screenshots tell the story more clearly than any finding summary. A page titled “admin | PipelineFlow” with a fully rendered admin interface, captured by an unauthenticated browser session. No token. No login. No credentials of any kind. Just a direct URL and a waiting page.

Stage 10 then analyzed the source code and identified the root cause in a single configuration line: requiresAuth: false in the client configuration. One line. It generated four specific file patches, including a new authentication middleware and a diff against the existing configuration file, along with instructions for verifying the fix.

From repository URL to confirmed live exploit to working code fix, the entire process took under ten minutes.

Encryption that isn’t

The API key vault application type was particularly revealing. The whole point of a key vault is to protect sensitive credentials. When we asked multiple platforms to build one, we got three distinct approaches to encryption, all of them wrong.

One platform generated an encryption key that fell back to a hardcoded string literal when an environment variable wasn’t set. In practice, this means that any deployment where that variable is absent, which is most deployments by non-developers, encrypts all secrets with a key that’s visible in the source code. The encryption is theater.

Another platform generated encryption using a single hash function with no salt and no iteration count. This is not a key derivation function. It applies no stretching, no randomness, and no computational cost. A short or predictable encryption key is trivially recoverable.
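To make the difference concrete, the sketch below puts a single unsalted hash next to a real key derivation function, PBKDF2-HMAC-SHA256 from Python's standard library. SHA-256 stands in for the unnamed hash function, and the iteration count is an assumption to be tuned against current guidance, not a value taken from any tested application.

```python
import hashlib
import os

ITERATIONS = 600_000  # assumption: pick a count that matches current guidance and your hardware


def weak_key(passphrase: str) -> bytes:
    """The anti-pattern: one unsalted hash, no stretching."""
    return hashlib.sha256(passphrase.encode("utf-8")).digest()


def derive_key(passphrase: str, salt: bytes, iterations: int = ITERATIONS) -> bytes:
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32)


def new_salt() -> bytes:
    # Random per vault; stored alongside the ciphertext, it does not need to be secret.
    return os.urandom(16)
```

With weak_key, the same passphrase always produces the same key, so an attacker can precompute guesses once and reuse them against every deployment. With derive_key, each guess costs hundreds of thousands of hash operations and the salt makes precomputed tables useless.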

A third platform stored secret values as plaintext text columns with no encryption at all.

Same prompt. Three different ways to fail at the one thing a secret vault must do.

Payment processing failures

Invoice applications with Stripe integration exposed a different class of vulnerability: payment logic failures rather than authentication failures.

One application accepted the payment amount directly from the client request body rather than from server-side invoice state. This means a user who intercepts the checkout request and modifies the amount field pays whatever they want. Not what’s on the invoice, whatever they type.
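A minimal sketch of the safer pattern, assuming a hypothetical server-side invoice store: the amount to charge is looked up from the invoice record, and whatever the request body claims is never read.

```python
# Hypothetical server-side invoice store; amounts are integer cents.
INVOICES = {
    "inv_1001": {"amount_cents": 125_000, "currency": "usd", "status": "open"},
    "inv_1002": {"amount_cents": 40_000, "currency": "usd", "status": "paid"},
}


def checkout_amount(invoice_id: str, request_body: dict) -> int:
    """Return the amount to charge, taken only from server-side invoice state."""
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        raise KeyError(f"unknown invoice: {invoice_id}")
    if invoice["status"] != "open":
        raise ValueError("invoice is not payable")
    # request_body may carry an "amount" field; it is intentionally unused.
    return invoice["amount_cents"]


# A tampered request still pays the invoiced amount.
print(checkout_amount("inv_1001", {"amount": 1}))  # 125000
```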

Another application had no webhook endpoint at all, polling for payment status via client-initiated requests rather than verified server-side events. A third skipped webhook signature verification entirely when the relevant environment variable wasn’t set, a condition that describes most deployments by non-technical users.
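The sketch below shows the fail-closed version of that check. It follows the general shape of Stripe's documented scheme (an HMAC-SHA256 over the timestamp, a dot, and the raw payload, carried in a header of the form t=...,v1=...), but it is simplified: it handles a single v1 signature and is no substitute for the official library. The function name and parameters are illustrative.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(payload: bytes, sig_header: str, secret: Optional[str],
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Return True only if the signature header matches the raw payload."""
    # Fail closed: a missing secret means reject, never "skip verification".
    if not secret:
        return False
    fields = dict(item.split("=", 1) for item in sig_header.split(",") if "=" in item)
    timestamp, candidate = fields.get("t"), fields.get("v1")
    if timestamp is None or candidate is None:
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_s:
        return False  # stale or future-dated event: possible replay
    signed = timestamp.encode("utf-8") + b"." + payload
    expected = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("utf-8"), candidate.encode("utf-8"))
```

The important branch is the first one. When the secret is absent, the handler rejects every event instead of accepting every event.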

Five different platforms built invoice apps with Stripe integration. Five different payment handling failures. None of them the same.

The deployment-time trap

A pattern that repeated across multiple platforms and app types deserves special attention: vulnerabilities that only exist when a configuration value is absent.

Platforms build their apps with environment variable placeholders. The code that handles a missing webhook secret, or a missing admin email list, or a missing encryption key, is often the most dangerous code in the application. Non-developer users deploying these apps frequently skip configuration steps they don’t understand. The result is applications running in their most vulnerable state by default.

One file sharing application granted admin access to every authenticated user when a specific environment variable wasn’t set. A developer reading the code might notice the fallback. A vibe coder copying environment variables from a tutorial probably won’t.
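A fail-closed version of that check is short. The sketch below assumes a hypothetical ADMIN_EMAILS environment variable holding a comma-separated allowlist; the variable name and both functions are illustrative. When the variable is missing or empty, nobody is an admin.

```python
import os
from typing import Mapping


def load_admin_emails(env: Mapping[str, str] = os.environ) -> frozenset:
    """Parse a comma-separated ADMIN_EMAILS value; missing or empty means no admins."""
    raw = env.get("ADMIN_EMAILS", "")
    return frozenset(e.strip().lower() for e in raw.split(",") if e.strip())


def is_admin(email: str, admins: frozenset) -> bool:
    # Fail closed: an empty allowlist grants admin to nobody, not to everybody.
    return email.strip().lower() in admins


# With the variable unset, the check denies access.
print(is_admin("someone@example.com", load_admin_emails({})))  # False
```

The same principle applies to a missing encryption key or webhook secret: refusing to run in the insecure state is safer than silently substituting a default.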

The false positive problem

One more finding that belongs in this report: every major static analysis tool would have told you this research found 375 vulnerabilities. We found 68.

The difference is Stage 7. Adversarial verification doesn’t just filter noise; it builds credibility. Security teams that receive 375-finding reports learn to tune them out. Security teams that receive 68 findings, each one verified and explained, take action.

Stage 9 adds a second layer of credibility that static analysis alone cannot provide: live proof. There is a significant difference between “our analysis suggests this route may be unprotected” and “our scanner navigated to this route without credentials and here is a screenshot of what it returned.” The first is a finding. The second is evidence.

For platforms and hosting providers evaluating security tooling, this distinction is worth understanding. Raw finding counts are a vanity metric. Verified, confirmed, live-proven findings with working fixes attached are what actually get remediated.

Patterns Across the Category

Looking across all 35 applications and 7 platforms, several patterns emerge that transcend any individual tool.

Architecture determines vulnerability class. Platforms that generate a single-tier frontend-plus-database architecture produce client-side authorization failures at high rates. Platforms that generate a separated backend API layer produce fewer client-side auth failures but introduce server-side misconfigurations instead. The failure mode shifts with the architecture, but the failure rate stays high.

Platform infrastructure leaks into generated code. Several platforms inject their own architectural patterns into every app they generate: iframe communication mechanisms, custom SDK patterns, proprietary authentication flows. When those platform-level patterns have security weaknesses, every app inherits them regardless of what was prompted. This is a category of vulnerability that no amount of careful prompting can eliminate.

The deployment gap is where users get hurt. The most dangerous vulnerabilities we found weren’t in complex cryptographic implementations or sophisticated business logic. They were in the fallback code: what happens when an environment variable is missing, when a configuration step was skipped, when the user didn’t know they needed to set something up. Non-developer users deploying apps without completing technical setup steps is not an edge case. It’s the median scenario.

Functional correctness is not security correctness. One platform in our research ran a comprehensive automated test suite against its generated apps (22 to 29 tests, including negative test cases for authentication and authorization) and declared them ready for deployment. Our pipeline found confirmed vulnerabilities in those same apps. Tests that verify an app works are not tests that verify an app is secure. These are different questions that require different tools.

Static analysis without live validation leaves a gap. Across our research, Stage 9 confirmed that some findings flagged by static analysis were mitigated by infrastructure the source scan couldn’t see. Without live testing, those would have been reported as vulnerabilities when they weren’t. Equally, live testing confirmed that some findings the static analyzer flagged conservatively were in fact fully exploitable. The two stages together produce a more accurate picture than either produces alone.

What This Means

The vibe-coding category is real, it’s growing, and it’s putting applications in production that would fail a basic security review. This isn’t a criticism of any particular platform; it’s an observation about where the category is today and where it needs to go.

The platforms building these tools are not malicious. They’re optimizing for the experience that matters most to their users: getting from idea to working app as fast as possible. Security has historically been someone else’s problem: the developer’s, the DevOps team’s, the security auditor’s. In vibe-coding, there is no someone else.

Hosting providers have a role to play here. An application built by a non-developer and deployed to a hosting platform is not a WordPress site with a known vulnerability in a known plugin. The vulnerability surface is unique to that application and invisible to any scan that doesn’t understand what the code is doing. The only way to know if that application is safe is to analyze it, actually analyze it, not just check it against a list of known bad patterns.

What we’ve built goes beyond analysis. The gap between “we found a vulnerability” and “here is the code change that fixes it, verified against the live application” is exactly the gap that causes most security findings to go unaddressed. Developers don’t ignore findings because they don’t care. They ignore findings because the path from “this is a problem” to “this is fixed” is unclear, time-consuming, and often requires expertise they don’t have.

A report that says “authentication is enforced client-side” is a problem statement. A report that says “change line 7 of Client.js from requiresAuth: false to requiresAuth: true, add the attached middleware file, and verify by navigating to /admin without a token, where you should see a 401” is a solution. That’s the difference we’re building toward.

We built this pipeline because we believe that analysis layer, the full loop from static detection through live confirmation to working fix, needs to exist at the hosting layer. The research above is evidence for why.

Methodology note: All applications were built using identical prompts across platforms during May 2026. Repositories were committed to version control at build time and frozen before analysis. All scans were performed using the same pipeline version with consistent configuration. Platform names are omitted from specific findings to focus the research on category-level patterns rather than platform comparisons. The seven platforms tested represent a cross-section of the vibe-coding market as it existed at the time of research. We Watch Your Website has operated as a WordPress security company since 2007. This research represents our first published analysis of AI-generated application security.

Tom Raef is the founder of We Watch Your Website, a WordPress security company protecting over 2 million sites. The security pipeline described in this article is available for evaluation by hosting providers and platform partners.
