We Built 35 AI-Generated Apps and Put Every One Through a Security Scanner
By Tom Raef, Founder — We Watch Your Website
There’s a moment in every security researcher’s career when something stops being theoretical and becomes undeniably real. For me, that moment came about halfway through this project, when I realized that every single platform we tested — without exception — was telling its users their app was ready to deploy. Our scanner told a different story.
This is that story.
The Setup
We Watch Your Website has spent 18 years protecting WordPress sites. We know what malicious code looks like, how attackers get in, and what developers miss when they’re moving fast. When AI-generated application platforms started gaining real traction, we got curious: what does our scanner see when it looks at code a non-developer built in an afternoon?
To find out, we built a research series around a simple premise. Take 5 representative web applications — the kind a small business or startup might actually need — and build each one on 7 different AI platforms using identical prompts. Same apps, same requirements, same inputs. Then run every one through our automated security pipeline and see what comes out.
The five applications:
- A SaaS dashboard with authentication and an admin panel
- An API key manager and secret vault
- A file upload and sharing service
- An invoice generator with Stripe payment integration
- A CRM with role-based access and data export
Thirty-five applications total. Thousands of lines of generated code. All of it analyzed by the same pipeline, under the same conditions, with the same methodology.
The Pipeline
Before getting into findings, it’s worth explaining how we analyze code — because the methodology matters as much as the results.
Our pipeline runs 8 stages against every application. Early stages scan for committed secrets, run static analysis rules, check dependencies against vulnerability databases, and audit packages for abandonment. Stage 4 uses a large language model to perform deep code review with full multi-file context, tracking tainted user input across file boundaries to identify where it reaches dangerous sinks, something single-file scanners miss entirely. This goes beyond pattern matching: the model reasons about what the code is trying to do and where it fails.
Then comes Stage 7.
Stage 7 is what separates our tool from every static scanner on the market. After the first six stages generate findings, Stage 7 subjects each one to adversarial verification. A second LLM pass asks a pointed question: is this finding real and exploitable, or is it noise? Every finding gets challenged before it’s confirmed. The result is a dramatic reduction in false positives — across this research, Stage 7 dismissed 307 of 375 total findings as false positives. That’s an 82% noise reduction rate.
Without Stage 7, we’d have published 375 scary-sounding findings and called it a day. With it, we published 68 confirmed, verified, exploitable vulnerabilities. The difference matters enormously. A tool that cries wolf trains developers to ignore alerts. A tool that only speaks when it has something real to say gets listened to.
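The filtering step and its arithmetic can be sketched in a few lines. This is an illustration only, not our pipeline's code: the `Finding` type and the `is_exploitable` callback are hypothetical stand-ins for the raw findings and for the second verification pass.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Finding:
    rule: str      # hypothetical rule identifier
    location: str  # file and line where the issue was flagged
    summary: str


def adversarial_filter(
    findings: Iterable[Finding],
    is_exploitable: Callable[[Finding], bool],
) -> Tuple[List[Finding], List[Finding]]:
    """Split raw findings into (confirmed, dismissed) using a verifier callback."""
    confirmed: List[Finding] = []
    dismissed: List[Finding] = []
    for finding in findings:
        if is_exploitable(finding):
            confirmed.append(finding)
        else:
            dismissed.append(finding)
    return confirmed, dismissed


def noise_reduction(total: int, dismissed: int) -> float:
    """Fraction of raw findings rejected by verification."""
    return dismissed / total


# The counts reported in this article.
print(375 - 307)                           # 68 confirmed findings
print(f"{noise_reduction(375, 307):.1%}")  # 81.9%, which rounds to 82%
```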
Stage 8 takes the verified findings and constructs attack chains — multi-step exploitation scenarios showing how individual findings combine into real-world attacks. Where a single finding might look manageable in isolation, an attack chain shows a CTO exactly how an attacker moves from zero access to full compromise in a sequence of concrete steps.
What We Found
The number that matters most
Thirty-five applications analyzed. One clean pass.
Every other application had at least one confirmed security finding. Six were serious enough that our pipeline would have blocked deployment entirely. The platforms building those applications all reported the apps were ready to go live.
That gap — between what the platform says and what independent analysis finds — is the core finding of this research.
The finding that appeared everywhere
If there is one vulnerability that defines AI-generated web applications in 2026, it is client-side authorization.
Across every platform we tested, applications consistently enforced access control decisions — who is an admin, who can see what data, who can perform which operations — using JavaScript state in the user’s browser. This is fundamentally broken. Client-side state is entirely under the user’s control. Opening browser developer tools and changing a variable from false to true is not a sophisticated attack. It takes fifteen seconds and requires no technical background.
The pattern appeared in applications built on every platform in our research. It appeared in SaaS dashboards, file sharing apps, CRMs, and invoice generators. It appeared regardless of which backend technology the platform used. It is not a bug in any particular platform’s code generator — it is a systematic failure of AI code generation to understand the difference between “what the user sees” and “what the server trusts.”
A real attacker doesn’t need to find a clever exploit. They just need DevTools.
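The fix is to make the authorization decision on the server, using state the browser cannot edit. Here is a minimal sketch of that idea, assuming a hypothetical in-memory `ROLES` table in place of a real database lookup and a `Session` object that the server itself issued; none of these names come from the applications we tested.

```python
from dataclasses import dataclass
from typing import Dict, List


class Forbidden(Exception):
    """Raised when the server-side role check fails."""


@dataclass(frozen=True)
class Session:
    user_id: str  # established by the server at login, never by the client


# Hypothetical server-side role store; a real app would query its database.
ROLES: Dict[str, str] = {"u1": "admin", "u2": "member"}


def require_admin(session: Session) -> None:
    """Authorize from server-held state only."""
    if ROLES.get(session.user_id) != "admin":
        raise Forbidden(f"user {session.user_id} is not an admin")


def export_customers(session: Session, client_claims: dict) -> List[str]:
    # client_claims is whatever the browser sent, e.g. {"isAdmin": True}.
    # It is deliberately ignored for authorization.
    require_admin(session)
    return ["alice@example.com", "bob@example.com"]


try:
    export_customers(Session("u2"), {"isAdmin": True})
except Forbidden as exc:
    print("blocked:", exc)
```

Flipping `isAdmin` in DevTools changes nothing here, because the flag never feeds into the decision.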
Encryption that isn’t
The API key vault application type was particularly revealing. The whole point of a key vault is to protect sensitive credentials. When we asked multiple platforms to build one, we got three distinct approaches to encryption — all of them wrong.
One platform generated an encryption key that fell back to a hardcoded string literal when an environment variable wasn’t set. In practice, this means that any deployment where that variable is absent — which is most deployments by non-developers — encrypts all secrets with a key that’s visible in the source code. The encryption is theater.
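The safer behavior is to refuse to start when the key is missing rather than fall back to a literal. A minimal sketch, assuming a hypothetical `VAULT_KEY` variable that holds a base64-encoded 32-byte key (both the name and the encoding are illustrative choices, not details from the generated app):

```python
import base64
import os
from typing import Mapping


class MissingKeyError(RuntimeError):
    """Raised when no usable encryption key is configured."""


def load_vault_key(env: Mapping[str, str] = os.environ, name: str = "VAULT_KEY") -> bytes:
    """Return the configured key, or fail loudly. There is no fallback value."""
    raw = env.get(name)
    if not raw:
        raise MissingKeyError(f"{name} is not set; refusing to start without a key")
    key = base64.b64decode(raw, validate=True)
    if len(key) != 32:
        raise MissingKeyError(f"{name} must decode to exactly 32 bytes")
    return key


try:
    load_vault_key(env={})
except MissingKeyError as exc:
    print("startup aborted:", exc)
```

A crash at startup is visible to a non-developer in a way that silently weak encryption is not.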
Another platform generated encryption using a single hash function with no salt and no iteration count. This is not a key derivation function. It applies no stretching, no randomness, and no computational cost. Because each guess costs an attacker almost nothing, a short or predictable passphrase behind that key can be recovered by offline brute force.
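For contrast, here is a sketch of what a real key derivation step looks like using PBKDF2 from Python's standard library. The 16-byte salt and the iteration count of 600,000 are assumptions chosen for illustration; the right values depend on current guidance and on your hardware.

```python
import hashlib
import os
from typing import Optional, Tuple

ITERATIONS = 600_000  # assumed work factor; tune to current guidance


def derive_key(
    passphrase: str,
    salt: Optional[bytes] = None,
    iterations: int = ITERATIONS,
) -> Tuple[bytes, bytes]:
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256. Returns (key, salt)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt, stored alongside the ciphertext
    key = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )
    return key, salt


# The weak pattern: one fast, unsalted hash. Identical input always yields
# the identical key, and each guess costs a single hash computation.
weak_key = hashlib.sha256(b"hunter2").digest()

# The derived key depends on the salt and costs `iterations` rounds per guess.
key, salt = derive_key("hunter2")
same_key, _ = derive_key("hunter2", salt=salt)
print(key == same_key)  # True: same passphrase and salt reproduce the key
print(key == weak_key)  # False
```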
A third platform stored secret values as plaintext in ordinary text columns, with no encryption at all.
Same prompt. Three different ways to fail at the one thing a secret vault must do.
Payment processing failures
Invoice applications with Stripe integration exposed a different class of vulnerability — payment logic failures rather than authentication failures.
One application accepted the payment amount directly from the client request body rather than from server-side invoice state. This means a user who intercepts the checkout request and modifies the amount field pays whatever they want. Not what’s on the invoice — whatever they type.
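The correction is for the server to look up the amount itself and treat the request body as nothing more than a pointer to an invoice. A minimal sketch, assuming a hypothetical in-memory `INVOICES` store and a `build_charge` helper whose output would then be handed to the payment provider:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    amount_cents: int
    currency: str


# Hypothetical server-side invoice store; a real app would query its database.
INVOICES: Dict[str, Invoice] = {
    "inv_1001": Invoice("inv_1001", 125_000, "usd"),
}


def build_charge(request_body: dict) -> dict:
    """Build charge parameters from server state, never from client input."""
    invoice = INVOICES.get(request_body.get("invoice_id"))
    if invoice is None:
        raise KeyError("unknown invoice")
    # Any "amount" field in request_body is ignored on purpose.
    return {
        "amount": invoice.amount_cents,
        "currency": invoice.currency,
        "metadata": {"invoice_id": invoice.invoice_id},
    }


# A tampered request that tries to pay one cent still gets the invoice amount.
print(build_charge({"invoice_id": "inv_1001", "amount": 1})["amount"])  # 125000
```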
Another application had no webhook endpoint at all, polling for payment status via client-initiated requests rather than verified server-side events. A third skipped webhook signature verification entirely when the relevant environment variable wasn’t set — a condition that describes most deployments by non-technical users.
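Signature verification should fail closed: if the signing secret is missing, every event is rejected. The sketch below is modeled on the timestamped HMAC-SHA256 scheme that Stripe documents for its `Stripe-Signature` header (`t=<timestamp>,v1=<hex signature>` over `<timestamp>.<payload>`). It is an illustration using only the standard library; the function name, the 300-second tolerance, and the `now` parameter are assumptions, and in production the official Stripe library's verification helper is the safer choice.

```python
import hashlib
import hmac
import time
from typing import List, Optional


class WebhookRejected(Exception):
    """Raised when an incoming webhook cannot be trusted."""


def verify_webhook(
    payload: bytes,
    header: str,
    secret: Optional[str],
    tolerance_s: int = 300,
    now: Optional[float] = None,
) -> None:
    """Raise WebhookRejected unless the payload carries a valid, recent signature."""
    if not secret:
        # Fail closed: a missing secret must never mean "skip verification".
        raise WebhookRejected("signing secret not configured; refusing event")

    timestamp: Optional[str] = None
    candidates: List[str] = []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        raise WebhookRejected("malformed signature header")

    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        raise WebhookRejected("timestamp outside tolerance (possible replay)")

    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    if not any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates):
        raise WebhookRejected("signature mismatch")
```

Only after `verify_webhook` returns without raising should the handler mark an invoice as paid.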
Five different platforms built invoice apps with Stripe integration. Five different payment handling failures. None of them the same.
The deployment-time trap
A pattern that repeated across multiple platforms and app types deserves special attention: vulnerabilities that only exist when a configuration value is absent.
Platforms build their apps with environment variable placeholders. The code that handles a missing webhook secret, or a missing admin email list, or a missing encryption key, is often the most dangerous code in the application. Non-developer users deploying these apps frequently skip configuration steps they don’t understand. The result is applications running in their most vulnerable state by default.
One file sharing application granted admin access to every authenticated user when a specific environment variable wasn’t set. A developer reading the code might notice the fallback. A vibe coder copying environment variables from a tutorial probably won’t.
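The difference between failing open and failing closed is often one line. A minimal sketch, assuming a hypothetical `ADMIN_EMAILS` variable holding a comma-separated allowlist (the name and format are illustrative, not taken from the application above):

```python
import os
from typing import FrozenSet, Mapping


def load_admin_emails(
    env: Mapping[str, str] = os.environ, name: str = "ADMIN_EMAILS"
) -> FrozenSet[str]:
    """Parse the allowlist. A missing variable yields an empty set."""
    raw = env.get(name, "")
    return frozenset(e.strip().lower() for e in raw.split(",") if e.strip())


def is_admin(email: str, admins: FrozenSet[str]) -> bool:
    # Fail closed: an empty allowlist means nobody is an admin.
    # The dangerous variant adds `if not admins: return True` here,
    # which turns a skipped setup step into admin access for everyone.
    return email.strip().lower() in admins


unconfigured = load_admin_emails(env={})
print(is_admin("anyone@example.com", unconfigured))  # False

configured = load_admin_emails(env={"ADMIN_EMAILS": "Owner@Example.com, ops@example.com"})
print(is_admin("owner@example.com", configured))     # True
```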
The false positive problem
One more finding belongs in this report: a scanner that stops at raw detection would have told you this research found 375 vulnerabilities. We confirmed 68.
The difference is Stage 7. Adversarial verification doesn’t just filter noise — it builds credibility. Security teams that receive 375-finding reports learn to tune them out. Security teams that receive 68 findings, each one verified and explained, take action.
For platforms and hosting providers evaluating security tooling, this distinction is worth understanding. Raw finding counts are a vanity metric. Verified, actionable findings are what actually get fixed.
Patterns Across the Category
Looking across all 35 applications and 7 platforms, several patterns emerge that transcend any individual tool.
Architecture determines vulnerability class. Platforms that generate a single-tier frontend-plus-database architecture produce client-side authorization failures at high rates. Platforms that generate a separated backend API layer produce fewer client-side auth failures but introduce server-side misconfigurations instead. The failure mode shifts with the architecture, but the failure rate stays high.
Platform infrastructure leaks into generated code. Several platforms inject their own architectural patterns into every app they generate — iframe communication mechanisms, custom SDK patterns, proprietary authentication flows. When those platform-level patterns have security weaknesses, every app inherits them regardless of what was prompted. This is a category of vulnerability that no amount of careful prompting can eliminate.
The deployment gap is where users get hurt. The most dangerous vulnerabilities we found weren’t in complex cryptographic implementations or sophisticated business logic. They were in the fallback code — what happens when an environment variable is missing, when a configuration step was skipped, when the user didn’t know they needed to set something up. Non-developer users deploying apps without completing technical setup steps is not an edge case. It’s the median scenario.
Functional correctness is not security correctness. One platform in our research ran a comprehensive automated test suite against its generated apps — 22 to 29 tests including negative test cases for authentication and authorization — and declared them ready for deployment. Our pipeline found confirmed vulnerabilities in those same apps. Tests that verify an app works are not tests that verify an app is secure. These are different questions that require different tools.
What This Means
The vibe-coding category is real, it’s growing, and it’s putting applications in production that would fail a basic security review. This isn’t a criticism of any particular platform — it’s an observation about where the category is today and where it needs to go.
The platforms building these tools are not malicious. They’re optimizing for the experience that matters most to their users: getting from idea to working app as fast as possible. Security has historically been someone else’s problem — the developer’s, the DevOps team’s, the security auditor’s. In vibe-coding, there is no someone else.
Hosting providers have a role to play here. An application built by a non-developer and deployed to a hosting platform is not a WordPress site with a known vulnerability in a known plugin. The vulnerability surface is unique to that application and invisible to any scan that doesn’t understand what the code is doing. The only way to know if that application is safe is to analyze it — actually analyze it, not just check it against a list of known bad patterns.
We built this pipeline because we believe that analysis layer needs to exist. The research above is evidence for why.
Methodology note: All applications were built using identical prompts across platforms during May 2026. Repositories were committed to version control at build time and frozen before analysis. All scans were performed using the same pipeline version with consistent configuration. Platform names are omitted from specific findings to focus the research on category-level patterns rather than platform comparisons. The seven platforms tested represent a cross-section of the vibe-coding market as it existed at the time of research. We Watch Your Website has operated as a WordPress security company since 2007. This research represents our first published analysis of AI-generated application security.
Tom Raef is the founder of We Watch Your Website, a WordPress security company protecting over 2 million sites. The security pipeline described in this article is available for evaluation by hosting providers and platform partners.
