One-Shot Shop Challenge: Building a Full Shop with AI Coding Agents

Modern coding agents are very capable and offer features like sub-agents and even teammates. But how does this affect performance, token consumption, and quality? The only way to find out is to build the same project multiple times. I decided to use only a single prompt per build, so the result is not influenced by my steering. Same specification, same technology, and (almost) the same prompt.

I prepared a comprehensive specification of ~50 pages, with acceptance criteria for every feature, but without any code snippets. The spec itself was not part of the one-shot run.

Why a shop? Simply because I have been building e-commerce platforms for years. I know what a proper implementation looks like, so I can actually evaluate the output rather than just checking if it runs.

The specification covers the typical e-commerce feature set: multi-store support, product catalog, checkout flow, payments, fulfillment, and more. You can browse the full feature overview.

Technology-wise, I decided to start with a fresh Laravel project with Livewire for dynamic UIs. Laravel is a batteries-included framework that enables the agent to build the entire system without additional libraries.

Everything is published: the specification, the prompts, the full session logs, analysis, and the resulting code. If you like, you can check out the main branch and re-run the same experiments yourself, maybe with another coding agent or even another technology (you might need to check the specs for references to Laravel).

The Setup

Specification: Full spec with acceptance criteria, no code snippets (tecsteps/shop/.../specs)

Codebase: Fresh Laravel template with Livewire

MCP Servers: Laravel Boost, Playwright

QA Verification: 143 E2E tests, two independent checks per build
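The pass percentages reported for each build below are not plain pass/total ratios: 142 of 143 is shown as 99.7%, not 99.3%. One reading that is consistent with the figures on this page is that a partially passing test counts as half a pass. The page does not state the formula, so the weight of 0.5 and the function name `weighted_pass_rate` are assumptions made for this sketch.

```python
def weighted_pass_rate(passed: int, partial: int, total: int = 143,
                       partial_weight: float = 0.5) -> float:
    """Return the pass rate in percent, counting each partial as a fraction of a pass."""
    if passed + partial > total:
        raise ValueError("passed + partial cannot exceed total")
    return round((passed + partial_weight * partial) / total * 100, 1)


# Build #14: 142 passes, with the one remaining test treated as a partial (assumption).
print(weighted_pass_rate(142, 1))  # 99.7
print(weighted_pass_rate(142, 0))  # 99.3, the plain pass/total ratio
```

If the real scoring uses a different weight, only `partial_weight` needs to change.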

The Builds

June 2026

#14 Claude Code Fable 5 (Claude Code v2.1.170, Fable 5 with high reasoning, orchestration left to the agent)

Claude Code v2.1.170, Fable 5, High Reasoning, Agent’s Choice, Sub-Agents

The first build on Fable 5, the Mythos-class model Anthropic released publicly on the same day this run started. Same one-prompt brief and the same ~50-page spec as the Opus 4.8 run (#13) right before it, with one deliberate change: where #13 was ordered into strict team mode, this run left the orchestration up to the model. The brief said to build the whole shop in one go and test it, but never how to structure the work.

Duration: 6h 58m

API Cost: $422.68

Classes: 195

Feature Tests: 142 of 143 passed (99.7%)

LOC: 10,314

Tech debt: 12.6 h

Duplication: 1.5%

Agent type: sub-agents

#13 Claude Code Team 4.8 xHigh (Claude Code v2.1.158, Opus 4.8 with xHigh reasoning, thinking on, 1M context, strict team mode)

Claude Code v2.1.158, Opus 4.8, xHigh Reasoning, Thinking On, 1M Context, Team Mode

The first Opus 4.8 build on the page, and a direct descendant of the baseline build #1. Same idea - one team-mode agent, one one-shot prompt, the same ~50-page spec - but two model generations on (Opus 4.6 to 4.8) and dialled all the way up: xHigh reasoning, thinking on, 1M context, and a hard instruction to “implement the entire shop in one go without stopping, use team mode, and test everything via Pest.” The question was what those upgrades actually buy over the original team run.

Duration: 2h 54m

API Cost: $497.67

Classes: 205

Feature Tests: 134 of 143 passed (95.5%)

LOC: 11,230

Tech debt: 10.5 h

Duplication: 0.8%

Agent type: teammates

May 2026

#12 Codex GPT-5.5 Goal Mode (OpenAI Codex CLI v0.128.0, GPT-5.5 with xHigh reasoning, persistent goal mode)

Codex CLI v0.128.0, gpt-5.5, xHigh Reasoning, Persistent Goal, Plan Mode, Sub-Agents, Codex Pro

Same model and CLI family as build #11, but driven by Codex 0.128’s new persistent goal mode: a single multi-page brief is wrapped in an <untrusted_objective> tag and the agent loops on its own through PLAN → IMPLEMENT → VERIFY → INDEPENDENT QA → FIX → COMMIT until it calls update_goal("complete"). Reasoning was set to xHigh.
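The loop described above can be pictured as a small state machine. The following is a minimal sketch, not Codex's actual implementation: the phase names come from the description, while `run_goal_loop`, `agent_step`, the `"active"`/`"complete"` return values, and the cycle limit are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

# Phase names as listed in the build description.
PHASES = ["PLAN", "IMPLEMENT", "VERIFY", "INDEPENDENT QA", "FIX", "COMMIT"]


def run_goal_loop(agent_step: Callable[[str, int], str],
                  max_cycles: int = 50) -> List[Tuple[int, str]]:
    """Cycle through PHASES until agent_step reports "complete" or max_cycles is hit.

    agent_step(phase, cycle) is a hypothetical callback standing in for the agent;
    it returns "active" to keep going or "complete" to end the goal.
    """
    log: List[Tuple[int, str]] = []
    for cycle in range(1, max_cycles + 1):
        for phase in PHASES:
            log.append((cycle, phase))
            if agent_step(phase, cycle) == "complete":
                return log
    return log


def demo_agent(phase: str, cycle: int) -> str:
    # Stub agent: declares the goal complete after the COMMIT phase of cycle 2.
    return "complete" if (cycle == 2 and phase == "COMMIT") else "active"


log = run_goal_loop(demo_agent)
print(len(log), log[-1])  # 12 (2, 'COMMIT')
```

The cycle limit is a safety net in the sketch only; the run described here ended when the agent itself declared the goal complete.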

Duration: 23h 49m

API Cost: $530.26

Classes: 268

Feature Tests: 126 of 143 passed (92.0%)

LOC: 13,645

Tech debt: 23.2 h

Duplication: 3.2%

Agent type: sub-agents

April 2026

#11 Codex GPT-5.5 (OpenAI Codex CLI v0.124, GPT-5.5 with high reasoning)

Codex CLI v0.124.0, gpt-5.5, High Reasoning, Sub-Agents, Codex Pro

First run on GPT-5.5 in Codex CLI. The user prompt this time explicitly demanded sub-agent role-play and Pest + Playwright testing in one go.

Duration: 47m 38s

API Cost: $18.85

Feature Tests: 108 of 143 passed (82.9%)

LOC: 1,743

Tech debt: 2.0 h

Duplication: 1.3%

Agent type: sub-agents

#10 Claude Code Opus 4.7 xHigh (Same setup as #09, stricter prompt to enforce team-mode)

Claude Code v2.1.114, Opus 4.7, xHigh Reasoning, Thinking On, 1M Context, Team Mode

Rerun of build #9 with two changes: the prompt was hardened (“have to” replaced “must” on every rule so the instruction to use team-mode could not be rationalised away) and reasoning was raised to xHigh, which the model reportedly handles better than High. All third-party plugins, including the design plugin, were unplugged first so the output reflects Opus on its own.

Duration: 1h 36m

API Cost: $157.31

Classes: 151

Feature Tests: 51 of 143 passed (43.0%)

LOC: 5,043

Tech debt: 3.6 h

Duplication: 1.4%

Agent type: teammates

Console

Lead breaks the spec into a 12-phase plan — The lead reads the nine spec files and plans twelve phases before spawning the team.

Orchestrator task list tracking phase completion — A single task list drives phase-by-phase progress; each phase commits before the next starts.

End-of-run quality summary from the orchestrator — Final in-agent quality summary: 232 passing tests, pint clean, a fresh migrate and seed.

Orchestrator shutting down after the final Playwright review — Shutdown report: Playwright-assisted review, final commits, progress snapshot.

#09 Claude Code Opus 4.7 (Same prompt as #01, latest Opus)

Claude Code v2.1.112, Opus 4.7, High Reasoning, Thinking On, 1M Context

Same prompt as build #1, just swapped to Opus 4.7. The run finished in 32m 27s for $22.63, by far the fastest on this page (build #1 took a full hour).

Duration: 32m 27s

API Cost: $22.63

Classes: 105

Feature Tests: 93 of 143 passed (73.4%)

LOC: 2,667

Tech debt: 2.6 h

Duplication: 0.4%

Agent type: single

March 2026

#07 Claude Code Team v4 (Same Prompt, 1M Context)

Claude Code v2.1.81, Team Mode, Opus 4.6, High Reasoning, 1M Context

Same specification, same technology, and this time the same original prompt as build #1. The idea was to check whether the increased 1M token context limit plus the 42 Claude Code releases since the first run (v2.1.39 to v2.1.81) produce a better result on their own, without any prompt engineering.

Duration: 3h 39m

API Cost: $132.06

Files Created: 389

Feature Tests: 121 of 143 passed (89.9%)

LOC: 4,537

Tech debt: 8.4 h

Duplication: 1.4%

Agent type: teammates

#06 Claude Code Team v3 (Advanced Prompt, 1M Context)

Claude Code v2.1.80, Team Mode, Opus 4.6, High Reasoning, 1M Context

Same specification, same technology, but this time with an advanced prompt combining several techniques: a thorough controller supporting the team lead, BDD and TDD, code reviews, and a dedicated QA teammate that actively tries to break the system. The prompt was designed to leverage the new 1M token context window.

Duration: 10h 59m

API Cost: $284.52

Files Created: 482

Active Agents: 158

Feature Tests: 119 of 143 passed (87.8%)

LOC: 5,708

Tech debt: 14.3 h

Duplication: 8.0%

Agent type: teammates (count: 159)

February 2026

#01 Claude Code with Team Mode

Claude Code v2.1.39, Team Mode, Opus 4.6, Thinking: On

Claude Code took a phased approach. The team lead read the full specification, broke it into 12 implementation phases with explicit dependencies, and then spawned specialized agents for each area: migrations, models, Livewire components, admin panel, seeders, and so on.
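A phased plan with explicit dependencies is essentially a dependency graph, and the phases that can run in parallel fall out of a topological sort. The sketch below uses Python's standard `graphlib` module; the phase names echo the areas mentioned above, but the dependency edges are illustrative assumptions and are not taken from the session logs.

```python
from graphlib import TopologicalSorter

# phase -> set of phases it depends on (illustrative edges only)
phase_deps = {
    "migrations": set(),
    "models": {"migrations"},
    "seeders": {"models"},
    "livewire_components": {"models"},
    "admin_panel": {"models", "livewire_components"},
}

sorter = TopologicalSorter(phase_deps)
sorter.prepare()

# Each batch holds phases whose dependencies are all done,
# i.e. work that could be handed to separate agents at the same time.
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)
# [['migrations'], ['models'], ['livewire_components', 'seeders'], ['admin_panel']]
```

In this toy graph only one batch contains more than one phase, which shows how quickly dependencies limit how much work can actually run side by side.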

Duration: 1h 6m

API Cost: $73.44

Files Created: 388

Feature Tests: 126 of 143 passed (90.6%)

LOC: 6,108

Code smells: 168

Tech debt: 22.3 h

Duplication: 2.9%

Agent type: teammates

Console

Task list showing 12 phases with dependency tracking — The team lead breaks down the specification into 12 phases with dependency tracking.

Team lead orchestrating 12 specialized agents — Specialized agents running in parallel, each responsible for a different part of the application.

Screenshots

Shop homepage with collections and featured products — Storefront homepage with collections, featured products, and newsletter signup.

Product detail page with variant selection — Product page with size variants and add-to-cart.

Multi-step checkout with order summary — Checkout with contact, shipping, and order summary.

Admin dashboard with revenue, orders, and customers — Admin dashboard with KPIs and recent orders.

Admin discount codes management — Discount code management in the admin panel.

#02 Claude Code with Sub-Agents

Claude Code v2.1.41, Sub-Agents, Opus 4.6, Thinking: On

Same specification, same prompt structure, but this time Claude Code ran with sub-agents instead of team mode. No specialized agent instructions were prepared - the prompt simply told it to use sub-agents. Claude spawned 20 sub-agents total, 12 of which actively contributed code.

Duration: 2h 13m

API Cost: $61.97

Files Created: 358

Feature Tests: 73 of 143 passed (60.5%)

LOC: 6,033

Tech debt: 8.6 h

Duplication: 3.6%

Agent type: sub-agents

Console

Claude Code spawning sub-agents for parallel implementation — Claude Code launches sub-agents to handle different parts of the implementation.

Screenshots

Admin dashboard with revenue and orders — Admin dashboard with KPIs and recent activity.

#03 Codex with Sub-Agents

OpenAI Codex v0.99.0, Sub-Agents (experimental), GPT-5.3-codex, Reasoning: xhigh

Codex launched explorer agents to analyze the specification first, synthesized their findings into a phased roadmap, then delegated implementation to worker agents. The process took about 1 hour and 44 minutes with 16 sub-agents total.

Duration: 1h 44m

API Cost: $8.79

Tool Calls: 357

Feature Tests: 89 of 143 passed (65.7%)

LOC: 6,037

Tech debt: 12.7 h

Duplication: 2.8%

Agent type: sub-agents

Console

Codex planning phases with implementation roadmap — Codex creates a phased plan and starts dispatching explorer agents to analyze the specification.

Worker agents implementing remaining phases — Workers handling the remaining phases in parallel while the lead coordinates.

Screenshots

Shop homepage with hero, collections, and product grid — Storefront with hero banner, featured collections, and product cards.

Product detail page with variant selector — Product page with variant dropdown and related products.

Full checkout flow with shipping and payment — Checkout with shipping methods and payment options.

Admin dashboard with KPI cards — Admin dashboard with sales, orders, and product stats.

Admin discount management — Discount code management with status and usage tracking.

#04 Claude Code Team v2 (More Instructions)

Claude Code v2.1.41, Sub-Agents, Opus 4.6, Thinking: On, Reasoning: Max

Same specification, same technology, but this time with a tuned prompt and strict quality constraints. The prompt included mandatory PHPStan compliance at max level, Deptrac architectural boundary checks, Pest test coverage, QA self-verification against every acceptance criterion, and a fresh agent review cycle where a new agent instance re-evaluated the entire codebase.

Duration: 3h 0m

API Cost: $73.92

Files Created: 376

Feature Tests: 82 of 143 passed (66.8%)

LOC: 6,033

Tech debt: 5.2 h

Duplication: 3.8%

Agent type: sub-agents

Console

Claude Code with sub-agents and quality review — The agent uses sub-agents with PHPStan compliance, fresh agent code review, and QA verification steps.

Screenshots

Dark-themed storefront with collections — Storefront with dark theme, collections, and product browsing.

Multi-step checkout flow — 4-step checkout: Contact, Address, Delivery, Payment.

Admin dashboard with KPIs — Admin dashboard with sales overview and recent orders.

#05 Codex with Sub-Agents v2 (More Instructions)

OpenAI Codex v0.101.0, Sub-Agents (experimental), GPT-5.3-codex, Reasoning: xhigh

Same specification, same technology, but this time Codex received custom instructions with two additional quality tools: PHPStan (static analysis at max level) and Deptrac (architectural boundary checks). The idea was to see if giving Codex explicit quality constraints would produce measurably better code.

Duration: 3h 27m

API Cost: $28.40

Tool Calls: 898

Feature Tests: 70 of 143 passed (58.0%)

LOC: 7,178

Code smells: 113

Tech debt: 25.1 h

Duplication: 3.0%

Agent type: sub-agents

Screenshots

Minimal storefront homepage — Reduced storefront with single product focus due to quality-first instructions.

Product detail page — Product page with variant selection.

Checkout flow — Checkout with shipping and payment options.

Admin dashboard with KPIs and recent activity.

Admin discounts management — Discount management in the admin panel.

Conclusion

Five builds. Same spec. Same baseline. Same tooling. 143 end-to-end tests. Two independent runs. One question: can an agent take a detailed spec and produce a working multi-tenant commerce platform in a single run?

To be clear, though: the result is not production-ready and not a valid Shopify clone. Nor is this how agentic engineering should be done. It is an experiment to compare coding agent setups.

There Is a Clear Winner

Claude Code in Team Mode scored 85%. Second place: 57%. Last place: 37%. The gap between first and second is larger than between second and last. This was not a close race.

Team Mode Beats Sub-Agents

The decisive factor was not the model. It was orchestration. Sub-agents built great individual pieces but failed at the seams: variants that exist in the backend but never render, discounts defined in admin but not applied at checkout, orders created but not linked to customers. E-commerce is a chain of integrations. Sub-agents optimized locally. Team Mode optimized globally.

Simple Features Are Easy. Checkout Is Not.

All builds can render product listings and display collections. Very few can execute a full checkout with tax, shipping zones, discount logic, and inventory updates. Only the top build implemented magic card numbers for declined payments exactly as specified. Simple display features work everywhere. Transactional flows expose architectural weakness immediately.
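For readers unfamiliar with the term, "magic card numbers" are fixed test card numbers that a mock payment provider maps to a predetermined outcome, so decline paths can be exercised without a real gateway. The sketch below illustrates the idea only: the card numbers, error codes, and the names `MAGIC_CARDS`, `PaymentResult`, and `charge` are hypothetical and are not the values or interfaces defined in the specification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical magic numbers; the spec's actual values are not reproduced here.
MAGIC_CARDS = {
    "4000000000000002": "card_declined",
    "4000000000009995": "insufficient_funds",
}


@dataclass
class PaymentResult:
    success: bool
    error_code: Optional[str] = None


def charge(card_number: str, amount_cents: int) -> PaymentResult:
    """Mock charge: decline if the card is a known magic number, otherwise succeed."""
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    normalized = card_number.replace(" ", "")
    error = MAGIC_CARDS.get(normalized)
    if error is not None:
        return PaymentResult(success=False, error_code=error)
    return PaymentResult(success=True)


print(charge("4000 0000 0000 0002", 1999))  # declined with error_code='card_declined'
print(charge("4242 4242 4242 4242", 1999))  # succeeds
```

A checkout built against such a mock has to surface the error code to the customer, which is exactly the kind of end-to-end behavior the tests check.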

Surprising Findings

Seed data was decisive. One build failed 30+ tests simply because it seeded one product instead of 20. The seeder is not boilerplate. It is the data contract between the spec and the system.

Speed hurt. The fastest build (1.5 hours) scored 51%. The slowest (8 hours) scored 85%. In a one-shot scenario, thoroughness beats speed.

Static analysis did not predict success. Builds with strict quality gates (PHPStan, Deptrac, fresh agent review) scored lower than their unconstrained counterparts. You can have zero static violations and a broken registration form.

Even at 85%, no build implemented order timelines, fulfillment progression, or postal code validation correctly. There is still a gap between strong autonomous generation and production-grade completeness.
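The trade-off between speed, cost, and completeness can also be checked against the per-build figures listed above. The following sketch computes two simple ratios from the Duration, API Cost, and passed-test counts on this page. The metrics "cost per passed test" and "passes per hour" are my own illustration, not measures used in the original evaluation, and they ignore partial passes and code quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    name: str
    minutes: float
    cost_usd: float
    passed: int

    @property
    def cost_per_pass(self) -> float:
        return self.cost_usd / self.passed

    @property
    def passes_per_hour(self) -> float:
        return self.passed / (self.minutes / 60)


# Duration (converted to minutes), API cost, and passed tests as listed per build.
BUILDS = [
    Build("#14", 418, 422.68, 142),
    Build("#13", 174, 497.67, 134),
    Build("#12", 1429, 530.26, 126),
    Build("#11", 47.63, 18.85, 108),
    Build("#10", 96, 157.31, 51),
    Build("#09", 32.45, 22.63, 93),
    Build("#07", 219, 132.06, 121),
    Build("#06", 659, 284.52, 119),
    Build("#01", 66, 73.44, 126),
    Build("#02", 133, 61.97, 73),
    Build("#03", 104, 8.79, 89),
    Build("#04", 180, 73.92, 82),
    Build("#05", 207, 28.40, 70),
]

by_cost = sorted(BUILDS, key=lambda b: b.cost_per_pass)
for b in by_cost:
    print(f"{b.name}: ${b.cost_per_pass:.2f} per passed test, "
          f"{b.passes_per_hour:.0f} passes/hour")
```

On these figures build #03 has the lowest cost per passed test and build #12 the highest, while build #09 delivers the most passes per hour. None of this says anything about which failures matter most, so the ratios complement the feature breakdown rather than replace it.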

Final Verdict

Claude Code with Team Mode is the only build where a customer can browse products, select variants, apply discounts, complete checkout with three payment methods, see decline errors, and access their order history. That is a full commerce journey.

Orchestration pattern matters more than model choice. Integration quality matters more than code volume. Seed data fidelity matters more than scaffolding speed. If you want agents to build real systems end to end, the architecture of the agents themselves is the decisive variable.