Building with AI Tools: What You Still Need a Human For

AI coding tools are revolutionary. But they consistently fail at the same things. Here's exactly where human judgment still matters — and why it will for years.

Every week, a founder sends me their AI-built application and asks: "Is this ready to launch?"

The answer is almost always no. Not because the tools are bad — they're genuinely impressive. But because they consistently fail at the same things, and those things are exactly what matters when real people use real software.

I've audited more than 50 AI-built applications. Here are the seven areas where human judgment is still non-negotiable.

1. Security Architecture

AI tools can scaffold authentication. They'll add login forms, generate JWT tokens, and create protected routes. But the implementation almost always has one or more of these problems:

Tokens stored in localStorage (accessible to any JavaScript on the page). Sessions that never expire. Password reset flows that don't verify email ownership. Admin routes "protected" by hiding the menu item rather than enforcing server-side checks.
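
To make "enforcing server-side checks" concrete, here's a minimal sketch in Express-style TypeScript. The route, session shape, and getSession helper are illustrative placeholders, not any particular product's code:

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Illustrative session shape. In a real app this would be loaded
// on the server from a verified httpOnly session cookie or token.
type Session = { userId: string; isAdmin: boolean } | undefined;

function getSession(_req: Request): Session {
  // Placeholder: look the session up from the request here.
  return undefined;
}

// Middleware that rejects non-admins before the handler ever runs.
// Hiding the menu item in the UI changes nothing about this check.
function requireAdmin(req: Request, res: Response, next: NextFunction) {
  const session = getSession(req);
  if (!session?.isAdmin) {
    res.status(403).json({ error: "Forbidden" });
    return;
  }
  next();
}

// A direct request to /admin/users still hits requireAdmin,
// whether or not the front end ever renders a link to it.
app.get("/admin/users", requireAdmin, (_req, res) => {
  res.json({ users: [] });
});
```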

The AI generates code that looks secure. It uses the right function names. It follows patterns from training data. But it doesn't understand threat models. It doesn't think about what happens when someone deliberately tries to break in.

Security architecture requires understanding both what legitimate users do and what malicious actors try. AI tools are trained on the former and blind to the latter.

2. Payment Edge Cases

Stripe integration is the most common request. AI tools handle the happy path well: user clicks buy, payment processes, access granted.

But real payment systems need to handle: failed payments with retry logic, subscription downgrades that calculate proration, refunds that revoke the right access, webhook events that arrive out of order, network failures during payment processing, currency conversion edge cases, and tax calculation by jurisdiction.

I recently audited an AI-built SaaS application where the Stripe webhook handler didn't verify webhook signatures. That meant anyone could send fake payment-confirmation events to the server and get free access to paid features. The AI generated the webhook handler. It just didn't generate the security verification.
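
For reference, the verification Stripe expects is only a few lines with its Node SDK. A sketch (the route path and environment variable names are placeholders):

```typescript
import express from "express";
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const app = express();

// Stripe signs every webhook; constructEvent verifies that signature
// against the raw request body. Without this check, anyone who knows
// the URL can POST a fake checkout.session.completed event.
app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }), // signature check needs the raw body
  (req, res) => {
    const signature = req.headers["stripe-signature"] as string;
    let event: Stripe.Event;
    try {
      event = stripe.webhooks.constructEvent(
        req.body,
        signature,
        process.env.STRIPE_WEBHOOK_SECRET! // placeholder: your endpoint secret
      );
    } catch {
      // Bad or missing signature: reject, grant nothing.
      res.status(400).send("Invalid signature");
      return;
    }

    if (event.type === "checkout.session.completed") {
      // Safe to grant access here: the event provably came from Stripe.
    }
    res.sendStatus(200);
  }
);
```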

3. Data Integrity

AI-generated code typically handles data as if everything always goes right. User submits form. Data saves. Success message appears.

Real applications need to handle: concurrent edits to the same record, network failures mid-save, database constraint violations, foreign key relationships during deletions, data migration between schema versions, and backup and recovery procedures.

The boring, invisible work of ensuring data is always consistent is something AI tools almost never get right. They generate the happy path and skip the failure modes.
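
One common human fix for the concurrent-edit case is optimistic locking: a version column that an update must match. A sketch, assuming a SQL database and a hypothetical query helper:

```typescript
// Optimistic locking: every row carries a version number, and an
// update succeeds only if the version is unchanged since the record
// was read. `query` is a hypothetical parameterised query helper.
declare function query(
  sql: string,
  params: unknown[]
): Promise<{ rowCount: number }>;

async function updateDocument(id: string, expectedVersion: number, body: string) {
  const result = await query(
    `UPDATE documents
        SET body = $1, version = version + 1
      WHERE id = $2 AND version = $3`,
    [body, id, expectedVersion]
  );

  if (result.rowCount === 0) {
    // Someone else saved first. Surface a conflict to the user
    // instead of silently overwriting their changes.
    throw new Error("Conflict: document was modified by another user");
  }
}
```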

4. Multi-Tenant Isolation

If your application serves multiple organisations (most B2B SaaS does), each organisation's data must be completely isolated from every other's. A user from Company A must never see data from Company B.

AI tools understand this conceptually. They'll add a tenant_id column to your tables. But they won't implement row-level security policies. They won't add tenant-scoped API middleware. They won't create audit logs that track data access by tenant.

The result is applications where data isolation exists in theory but fails in practice. A creative URL manipulation or API call can expose data across tenants. This isn't a minor bug — it's a business-ending security incident.
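
One layer of defence a human engineer adds is middleware that derives the tenant from the verified session and threads it through every query as a required argument. A sketch, with a hypothetical query helper and session shape:

```typescript
import { Request, Response, NextFunction } from "express";

// Hypothetical data-access helper; wire this to your actual driver.
declare function query(sql: string, params: unknown[]): Promise<unknown[]>;

// Resolve the tenant from the verified session, never from a URL
// parameter or request body, because the caller controls those.
function tenantScope(
  req: Request & { session?: { tenantId?: string } },
  res: Response,
  next: NextFunction
) {
  const tenantId = req.session?.tenantId;
  if (!tenantId) {
    res.status(401).json({ error: "No tenant context" });
    return;
  }
  res.locals.tenantId = tenantId;
  next();
}

// Every query takes tenantId as a required first argument, so a
// missing tenant filter shows up in code review as a type error,
// not in production as a cross-tenant data leak.
async function getInvoice(tenantId: string, invoiceId: string) {
  const rows = await query(
    "SELECT * FROM invoices WHERE tenant_id = $1 AND id = $2",
    [tenantId, invoiceId]
  );
  return rows[0];
}
```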

5. Error Recovery

When AI-generated code encounters an error, it typically does one of two things: fail silently, or show a generic error message. Neither is acceptable for production software.

Production applications need: graceful degradation when services are unavailable, meaningful error messages that help users fix the problem, automatic retry logic for transient failures, error tracking and alerting for the development team, and fallback behaviours that keep the application usable even when parts fail.
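
To pick one item from that list, here's what minimal automatic retry logic looks like: exponential backoff with jitter. The defaults are illustrative, and it's only safe for operations that can be repeated without side effects:

```typescript
// Retry a transient operation with exponential backoff and jitter.
// The attempt count and delays are illustrative defaults.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Back off 200ms, 400ms, 800ms... plus jitter, so a fleet
        // of clients doesn't hammer a recovering service in lockstep.
        const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // retries exhausted: surface the real error
}

// Only wrap operations that are safe to repeat (idempotent reads,
// not "charge the customer"):
// const health = await withRetry(() => fetch("https://api.example.com/health"));
```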

This is the invisible work that makes software feel professional. Users don't notice good error handling — they notice when it's missing.

6. Compliance

Depending on your industry and location, your application may need to comply with GDPR, SOC 2, HIPAA, PCI DSS, or other regulatory frameworks. AI tools don't understand compliance requirements because compliance isn't a technical problem. It's a legal and business problem that manifests in technical decisions.

GDPR alone requires: data subject access requests, right to deletion (including backups), consent management, data processing agreements, cross-border data transfer compliance, and breach notification procedures.

None of this can be prompted into an AI tool. It requires understanding the regulations and translating them into technical requirements.

7. Performance Under Load

AI-generated applications work fine with one user. They work fine with ten. Somewhere between a hundred and a thousand users, performance problems appear.

Database queries that were fast with 100 rows become slow with 100,000. API endpoints that returned instantly start timing out. Memory usage grows until the server crashes.

Performance optimisation requires understanding how databases query data, how servers handle concurrent requests, how caching reduces load, and how to measure and identify bottlenecks. AI tools generate code that works. They don't generate code that works efficiently at scale.
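
As one example of the kind of fix involved, here's a cache-aside sketch that keeps hot reads off the database. It's deliberately simplified, an in-process map with a fixed TTL; real systems usually reach for Redis and need an invalidation strategy:

```typescript
// Cache-aside: check the cache, fall back to the loader, store the
// result with a time-to-live so stale entries eventually expire.
const cache = new Map<string, { value: unknown; expiresAt: number }>();

async function cached<T>(
  key: string,
  ttlMs: number,
  load: () => Promise<T>
): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value as T; // served from memory, no database round trip
  }
  const value = await load();
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage: the expensive query only runs on a cache miss.
// const stats = await cached("dashboard:stats", 60_000, () => loadStats());
```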

Why These Won't Be Fixed by Better AI

These aren't temporary limitations that the next model version will solve. They're fundamental to how LLMs work with code.

LLMs predict what code probably looks like based on training data. They're pattern matchers, not reasoning engines. The failures above all require reasoning about things that aren't in the code: threat models, business rules, regulatory requirements, and real-world failure modes.

The tools will get better. They'll catch more edge cases. They'll generate more secure defaults. But the gap between "code that looks like it works" and "code that actually works in production" will remain, because that gap is about understanding context that doesn't exist in the codebase.

What This Means for Founders

This isn't an argument against AI coding tools. I use them every day. They've made me 3-5x more productive.

It's an argument for using AI tools in combination with human judgment, not as a replacement for it. The best results come from:

AI for speed: generating boilerplate, scaffolding features, iterating on UI.

Humans for decisions: security architecture, payment logic, data integrity, compliance, and the hundred small decisions that determine whether software survives contact with real users.

The founders who understand this build faster and ship better products than those who try to go purely AI or purely human. The combination is the superpower.

---

Related reading

  • The Final 10% That AI Can't Build
  • Agentic Engineering Explained
  • Why Your Methodology Is the One Thing AI Can't Replicate
  • The Vibe Coding Reality Check