One-Shot Shop Challenge

Let’s build a full-featured online shop (roughly Shopify-level) from a single prompt. We’ll repeat it with different coding agents so we can compare them.

Fabian Wesner

February 2026

Modern coding agents are very capable and offer features like sub-agents and even teammates. But how does this affect performance, token consumption, and quality? The only way to find out is to build the same project multiple times. I decided to use only a single prompt, so it's not influenced by my steering. Same specification, same technology, and (almost) the same prompt.

I prepared a comprehensive specification of ~50 pages, with acceptance criteria for every feature, but without any code snippets. The spec itself was not part of the one-shot run.

Why a shop? Simply because I have been building e-commerce platforms for years. I know what a proper implementation looks like, so I can actually evaluate the output rather than just checking if it runs.

The specification covers the typical e-commerce feature set: multi-store support, product catalog, checkout flow, payments, fulfillment, and more. You can browse the full feature overview.

Technology-wise, I decided to start with a fresh Laravel project with Livewire for dynamic UIs. Laravel is a batteries-included framework that enables the agent to build the entire system without additional libraries.

Everything is published: the specification, the prompts, the full session logs, analysis, and the resulting code. If you like, you can check out the main branch and re-run the same experiments yourself, maybe with another coding agent or even another technology (you might need to check the specs for references to Laravel).

The Setup

Specification: Full spec with acceptance criteria, no code snippets (tecsteps/shop/.../specs)

Codebase: Fresh Laravel template with Livewire

MCP Servers: Laravel Boost, Playwright

QA Verification: 143 E2E tests, two independent checks per build

The Builds

April 2026

Claude

#09 Claude Code Opus 4.7 (Same prompt as #01, latest Opus)

Claude Code v2.1.112 · Opus 4.7 · High Reasoning · Thinking On · 1M Context

Same prompt as build #1, just swapped to Opus 4.7. The run finished in 32m 27s for $22.63, by far the fastest on this page (build #1 took a full hour).

Two unilateral deviations stood out. Despite being told to use team mode, Opus 4.7 built everything as a single agent, explaining: “I judged direct execution to be faster and more coherent. That is a deviation from your ‘must use team mode’ instruction.” It also stopped before the spec was complete, ending with “Ready for your feedback on which of the listed gaps to close next.” Echoes of the Mythos behaviour Anthropic flagged last week.

Scope is the trade-off: 105 classes and 2,667 LOC (vs 140-170 classes and 5-7k LOC in bigger builds), and several admin areas from build #1 (analytics detail, themes, full fulfillment lifecycle, order timeline, address book) are missing. Feature completeness: 93 passing, 24 partial, 25 failing, 1 N/A (weighted ~74%). Still feels complete in the browser.
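For reference, the weighted score can be roughly reconstructed from these counts, assuming full credit for a pass, half credit for a partial, and N/A tests excluded. The published per-test weighting likely differs slightly, so treat this as an approximation:

```python
def weighted_score(passed: int, partial: int, failed: int) -> float:
    """Approximate weighted feature completeness: full credit for a pass,
    half credit for a partial, none for a fail; N/A tests are left out."""
    scored = passed + partial + failed
    return 100 * (passed + 0.5 * partial) / scored

# Build #9: 93 pass, 24 partial, 25 fail (1 N/A excluded)
print(f"{weighted_score(93, 24, 25):.1f}%")  # prints 73.9%, close to the reported ~74%
```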

Quality is strong: avg MI 75.4, 0 PHPMetrics violations, 0 SonarCloud bugs, all three SonarCloud ratings A, 13 code smells, 0.6% duplication. The SonarCloud quality gate failed only because three new-code security hotspots await review.

Impressive raw power from a single worker, but I do not like that it chose to ignore explicit instructions. A follow-up run with tighter guardrails is planned.

Duration: 32m 27s · API Cost: $22.63 · Classes: 105 · Active Agents: 1

Feature Tests: 93 of 143 passed (65%)

Feature Completeness (out of 143 tests): 93 pass, 24 partial, 25 fail
Code Quality: 2,667 LOC, 13 code smells, 2.6 h tech debt, 0.6% duplication
Teammates: 1 (single agent)

Claude

#08 Claude Code Team (Adaptive Thinking Off)

Claude Code v2.1.104 · Team Mode · Opus 4.6 · High Reasoning · 1M Context · Adaptive Thinking Off

Same prompt, same setup as build #1 and build #7, but this time with adaptive thinking explicitly disabled. Adaptive thinking has recently been suspected of making the model dial back its reasoning mid-run and produce weaker output. Would forcing Opus to stay in high reasoning mode fix that?

Short answer: no. The run landed at 106 of 143 tests passing with a 77.4% weighted score, meaningfully below build #1 (88.1%) and build #7 (88.8%) with the exact same prompt. Quality metrics are in the same ballpark (avg MI 87.8, 0 bugs, all SonarCloud A-ratings, 11.6 estimated Halstead bugs), but feature completeness took a clear hit: product and discount admin save flows return HTTP 500, cart discount codes are silently ignored, and there is no stock enforcement in the cart.

Efficiency landed in the middle: 2h 28m runtime with $105.73 in API cost. Faster and cheaper than build #7, but slower and more expensive than build #1, with worse output than both. On this specific prompt, turning adaptive thinking off did not help; it left the team with more half-built features, especially around admin forms and discount handling.

Digging into the session logs surfaced something unexpected. On build #8 every single one of the 268 lead API calls carried a thinking block. On build #1 (Claude Code v2.1.39, before adaptive thinking even existed) and build #7 (v2.1.81), the lead only sent thinking on roughly 1 in 5 calls. The other 80% were tool-result continuation calls that skipped thinking entirely. So this 100% isn't "adaptive off beating adaptive on". It's Claude Code's own per-call logic about when to include a thinking block, and forcing "high + adaptive off" made it emit thinking on every call.

Team lead thinking rate per API request

Build | API requests w/ thinking | Rate
#1 (v2.1.39, pre-adaptive) | 30 of 141 | 21%
#7 (v2.1.81, default) | 23 of 113 | 20%
#8 (v2.1.104, adaptive off) | 268 of 268 | 100%

Teammates, which do most of the actual coding, ran zero thinking blocks on every build. Across 29 teammates and 4974 assistant messages in build #8 there is not a single thinking block. Same pattern in builds #1 and #7.

Teammate thinking blocks

Build | Teammates sampled | Thinking blocks
#1 (v2.1.39) | 3 sampled | 0 of 19
#7 (v2.1.81) | 3 sampled | 0 of 1541
#8 (v2.1.104) | 29 total (4974 msgs) | 0 of 4974

Claude Code team mode does not propagate the lead's thinking config down to the teammates. Forcing high on the lead only affected orchestration decisions, not the implementation work. That is likely why the switch did not move the needle on feature completeness or quality.
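The counts above come from parsing the published session logs. A rough sketch of that analysis, assuming a JSONL log of messages with typed content blocks (the real Claude Code session format may differ in detail):

```python
import json

def lead_thinking_rate(log_path: str) -> tuple[int, int]:
    """Count assistant API turns that carry a thinking block.

    Assumes a JSONL session log where each line is a message dict with a
    'role' and a list of typed 'content' blocks -- an assumption about
    the log format, not a documented schema.
    """
    with_thinking = total = 0
    with open(log_path) as fh:
        for line in fh:
            msg = json.loads(line)
            if msg.get("role") != "assistant":
                continue
            total += 1
            if any(block.get("type") == "thinking"
                   for block in msg.get("content", [])):
                with_thinking += 1
    return with_thinking, total
```

Run it once per build's lead log and divide the two counts to reproduce the rates in the table.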

Duration: 2h 28m · API Cost: $105.73 · Files Created: 352 · Active Agents: 29

Feature Tests: 106 of 143 passed (74%)

Feature Completeness (out of 143 tests): 106 pass, 11 partial, 26 fail
Code Quality: 4,961 LOC, 75 code smells, 9.3 h tech debt, 3.1% duplication
Teammates: 30

March 2026

Claude

#07 Claude Code Team v4 (Same Prompt, 1M Context)

Claude Code v2.1.81 · Team Mode · Opus 4.6 · High Reasoning · 1M Context

Same specification, same technology, and this time the same original prompt as build #1. The idea was to check whether the increased 1M token context limit plus the 42 Claude Code releases since the first run (v2.1.39 to v2.1.81) produce a better result on their own, without any prompt engineering.

This also serves as a direct comparison to build #6 which used a very detailed prompt with instructions on HOW to build. Here, the agent gets no process instructions at all, only the specification of WHAT to build.

The result: 121 of 143 tests pass with an 88.8% weighted score, the highest across all builds. Zero SonarCloud bugs, all A-ratings, and 12.16 estimated Halstead bugs. The simple prompt with the upgraded context window outperformed the heavily engineered prompt from build #6.

Duration: 3h 39m · API Cost: $132.06 · Files Created: 389 · Active Agents: 34

Feature Tests: 121 of 143 passed (85%)

Feature Completeness (out of 143 tests): 121 pass, 15 partial, 7 fail
Code Quality: 4,537 LOC, 137 code smells, 24.8 h tech debt, 5.9% duplication
Teammates: 35

Claude

#06 Claude Code Team v3 (Advanced Prompt, 1M Context)

Claude Code v2.1.80 · Team Mode · Opus 4.6 · High Reasoning · 1M Context

Same specification, same technology, but this time with an advanced prompt combining several techniques: a thorough controller supporting the team lead, BDD and TDD, code reviews, and a dedicated QA teammate that actively tries to break the system. The prompt was designed to leverage the new 1M token context window.

The result speaks for itself: 119 of 143 E2E tests pass (83%), the highest since the original team mode run. However, this came at a steep price: about $285 in hypothetical API costs and almost 11 hours of runtime, with 176 active agents. The advanced prompt produced more robust code but consumed far more resources than any previous build.

Duration: 10h 59m · API Cost: $284.52 · Files Created: 482 · Active Agents: 158

Feature Tests: 119 of 143 passed (83%)

Feature Completeness (out of 143 tests): 119 pass, 13 partial, 7 fail
Code Quality: 5,708 LOC, 91 code smells, 14.3 h tech debt, 8.0% duplication
Teammates: 159

February 2026

Claude

#01 Claude Code with Team Mode

Claude Code v2.1.39 · Team Mode · Opus 4.6 · Thinking: On

Claude Code took a phased approach. The team lead read the full specification, broke it into 12 implementation phases with explicit dependencies, and then spawned specialized agents for each area: migrations, models, Livewire components, admin panel, seeders, and so on.

At peak, 31 agents were working simultaneously. The whole thing was done in just over an hour, producing 388 files across the full Laravel stack. Foundation first, then catalog, storefront, cart and checkout, payments, customer accounts, admin, search, analytics, and webhooks.

Duration: 1h 6m · API Cost: $73.44 · Files Created: 388 · Active Agents: 31

Feature Tests: 126 of 143 passed (88%)

Feature Completeness (out of 143 tests): 126 pass, 7 partial, 9 fail
Code Quality: 6,108 LOC, 168 code smells, 22.3 h tech debt, 2.9% duplication
Teammates: 32
Console
The team lead breaks down the specification into 12 phases with dependency tracking.
Specialized agents running in parallel, each responsible for a different part of the application.
Screenshots
Storefront homepage with collections, featured products, and newsletter signup.
Product page with size variants and add-to-cart.
Checkout with contact, shipping, and order summary.
Admin dashboard with KPIs and recent orders.
Discount code management in the admin panel.

Claude

#02 Claude Code with Sub-Agents

Claude Code v2.1.41 · Sub-Agents · Opus 4.6 · Thinking: On

Same specification, same prompt structure, but this time Claude Code ran with sub-agents instead of team mode. No specialized agent instructions were prepared; the prompt simply told it to use sub-agents. Claude spawned 20 sub-agents in total, 12 of which actively contributed code.

The build took about 2 hours and 13 minutes, producing 358 files. With an estimated API cost of $61.97, it sits between the team mode run ($73) and the Codex run ($8.79) in terms of cost.

Duration: 2h 13m · API Cost: $61.97 · Files Created: 358 · Active Agents: 12

Feature Tests: 73 of 143 passed (51%)

Feature Completeness (out of 143 tests): 73 pass, 27 partial, 43 fail
Code Quality: 6,033 LOC, 102 code smells, 13.4 h tech debt, 4.8% duplication
Sub-agents: 13
Console
Claude Code launches sub-agents to handle different parts of the implementation.
Screenshots
Storefront homepage with collections, products, and search.
Product page with variant options and add-to-cart.
Checkout with shipping, payment, and order summary.
Admin dashboard with KPIs and recent activity.
Discount code management in the admin panel.

OpenAI

#03 Codex with Sub-Agents

OpenAI Codex v0.99.0 · Sub-Agents (experimental) · GPT-5.3-codex · Reasoning: xhigh

Codex launched explorer agents to analyze the specification first, synthesized their findings into a phased roadmap, then delegated implementation to worker agents. The process took about 1 hour and 44 minutes with 16 sub-agents total.

On my first try, Codex finished after just a few minutes with a surprisingly good but familiar result. That was suspicious: the shop looked exactly like the one Claude had built. It turned out Codex had found the other branch in the repo and switched to it instead of building from scratch. I had to start over with slightly adjusted instructions.

Duration: 1h 44m · API Cost: $8.79 · Agents Spawned: 16 · Tool Calls: 357

Feature Tests: 89 of 143 passed (62%)

Feature Completeness (out of 143 tests): 89 pass, 10 partial, 39 fail
Code Quality: 6,037 LOC, 54 code smells, 12.7 h tech debt, 2.8% duplication
Sub-agents: 17
Console
Codex creates a phased plan and starts dispatching explorer agents to analyze the specification.
Workers handling the remaining phases in parallel while the lead coordinates.
Screenshots
Storefront with hero banner, featured collections, and product cards.
Product page with variant dropdown and related products.
Checkout with shipping methods and payment options.
Admin dashboard with sales, orders, and product stats.
Discount code management with status and usage tracking.

Claude

#04 Claude Code Team v2 (More Instructions)

Claude Code v2.1.41 · Sub-Agents · Opus 4.6 · Thinking: On · Reasoning: Max

Same specification, same technology, but this time with a tuned prompt and strict quality constraints. The prompt included mandatory PHPStan compliance at max level, Deptrac architectural boundary checks, Pest test coverage, QA self-verification against every acceptance criterion, and a fresh agent review cycle where a new agent instance re-evaluated the entire codebase.

The idea was to see if explicit quality instructions produce measurably better code. The result: significantly fewer estimated bugs (12.4 vs 18.1), higher maintainability (89.7 vs 79.7), zero SonarCloud bugs, and 60% fewer code smells. The quality gate still failed on duplication (7.5%) and unreviewed security hotspots, but the core metrics improved across the board. The trade-off was runtime: the agent took about 3 hours, partly because it did not terminate on its own when done.

A few features were deferred: search was moved to a roadmap item instead of being fully implemented. The storefront uses a dark theme this time.

Duration: 3h 0m · API Cost: $73.92 · Files Created: 376 · Active Agents: 29

Feature Tests: 82 of 143 passed (57%)

Feature Completeness (out of 143 tests): 82 pass, 27 partial, 30 fail
Code Quality: 6,033 LOC, 64 code smells, 8.7 h tech debt, 7.5% duplication
Sub-agents: 30
Console
The agent uses sub-agents with PHPStan compliance, fresh agent code review, and QA verification steps.
Screenshots
Storefront with dark theme, collections, and product browsing.
Product page with SKU variants and add-to-cart.
4-step checkout: Contact, Address, Delivery, Payment.
Admin dashboard with sales overview and recent orders.
Discount code management in the admin panel.

OpenAI

#05 Codex with Sub-Agents v2 (More Instructions)

OpenAI Codex v0.101.0 · Sub-Agents (experimental) · GPT-5.3-codex · Reasoning: xhigh

Same specification, same technology, but this time Codex received custom instructions with two additional quality tools: PHPStan (static analysis at max level) and Deptrac (architectural boundary checks). The idea was to see if giving Codex explicit quality constraints would produce measurably better code.

The result is mixed. On the positive side: zero SonarCloud bugs, zero vulnerabilities, and all A-ratings across reliability, security, and maintainability. On the other hand, the code smells tripled (153 vs 54) and estimated Halstead bugs jumped to 31.4 (vs 4.0 in v1). The quality focus seems to have shifted the agent towards more classes and more code, but with higher internal complexity in key controllers.

Only a single product was seeded, and the storefront UI is noticeably reduced compared to other builds. The quality-focused instructions appear to have consumed attention that would otherwise go to spec coverage and demo data. The admin panel, however, is functional with OAuth-based API authentication.

Duration: 3h 27m · API Cost: $28.40 · Agents Spawned: 53 · Tool Calls: 898

Feature Tests: 70 of 143 passed (49%)

Feature Completeness (out of 143 tests): 70 pass, 26 partial, 39 fail
Code Quality: 7,178 LOC, 153 code smells, 33.3 h tech debt, 4.2% duplication
Sub-agents: 54
Screenshots
Reduced storefront with single product focus due to quality-first instructions.
Product page with variant selection.
Checkout with shipping and payment options.
Admin dashboard with KPIs and recent activity.
Discount management in the admin panel.

Conclusion

Five builds. Same spec. Same baseline. Same tooling. 143 end-to-end tests. Two independent runs. One question: can an agent take a detailed spec and produce a working multi-tenant commerce platform in a single run?

Let's be clear, though: this is not production-ready and not a valid Shopify clone, and it is not how agentic engineering should be done. It's an experiment to compare coding-agent setups.

There Is a Clear Winner

Claude Code in Team Mode scored 85%. Second place: 57%. Last place: 37%. The gap between first and second is larger than between second and last. This was not a close race.

Team Mode Beats Sub-Agents

The decisive factor was not the model. It was orchestration. Sub-agents built great individual pieces but failed at the seams: variants that exist in the backend but never render, discounts defined in admin but not applied at checkout, orders created but not linked to customers. E-commerce is a chain of integrations. Sub-agents optimized locally. Team Mode optimized globally.
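These seam failures are exactly what a cross-module assertion catches: the total a customer pays must reflect the discount an admin defined. A minimal, self-contained illustration in Python (all names hypothetical, not code from any of the builds):

```python
from dataclasses import dataclass

@dataclass
class CartLine:
    unit_price: float  # hypothetical domain model for illustration
    qty: int

def checkout_total(lines: list[CartLine], discount_pct: float = 0.0) -> float:
    """Checkout total that actually applies the cart-level discount --
    the seam several sub-agent builds silently dropped."""
    subtotal = sum(line.unit_price * line.qty for line in lines)
    return round(subtotal * (1 - discount_pct / 100), 2)

cart = [CartLine(unit_price=20.0, qty=2), CartLine(unit_price=10.0, qty=1)]
assert checkout_total(cart) == 50.0
# A 10% code defined in the admin must survive through to the total:
assert checkout_total(cart, discount_pct=10.0) == 45.0
```

Each module in isolation can pass review while this end-to-end assertion fails, which is why seam tests, not unit tests, separated the builds.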

Simple Features Are Easy. Checkout Is Not.

All builds can render product listings and display collections. Very few can execute a full checkout with tax, shipping zones, discount logic, and inventory updates. Only the top build implemented magic card numbers for declined payments exactly as specified. Simple display features work everywhere. Transactional flows expose architectural weakness immediately.
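Magic card numbers are a small feature with outsized diagnostic value, because they force the gateway stub, the checkout flow, and the error UI to agree. A sketch of the pattern (the spec's actual numbers are not reproduced here; Stripe-style test values stand in as placeholders):

```python
# Hypothetical magic-card table; the spec's real numbers are not
# reproduced here, so Stripe-style test values stand in.
DECLINE_CARDS = {
    "4000000000000002": "card_declined",
    "4000000000009995": "insufficient_funds",
}

def charge(card_number: str, amount_cents: int) -> dict:
    """Fake gateway: decline the magic numbers, approve everything else."""
    reason = DECLINE_CARDS.get(card_number)
    if reason:
        return {"status": "declined", "error": reason}
    return {"status": "approved", "amount": amount_cents}

assert charge("4000000000000002", 5000) == {"status": "declined", "error": "card_declined"}
assert charge("4242424242424242", 5000)["status"] == "approved"
```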

Surprising Findings

Seed data was decisive. One build failed 30+ tests simply because it seeded one product instead of 20. The seeder is not boilerplate. It is the data contract between the spec and the system.
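One way to enforce that contract is a post-seed check against the spec's minimum counts. A sketch (the 20-product minimum is the figure mentioned above; the other entity minimums are made-up placeholders):

```python
# Hypothetical contract derived from a spec: at least 20 products (the
# figure from the failing build above); other minimums are placeholders.
SPEC_MINIMUMS = {"products": 20, "collections": 3, "discount_codes": 2}

def seed_gaps(counts: dict[str, int]) -> list[str]:
    """Return the entities whose seeded count falls short of the spec."""
    return [name for name, need in SPEC_MINIMUMS.items()
            if counts.get(name, 0) < need]

# The build that seeded a single product fails the contract immediately:
assert seed_gaps({"products": 1, "collections": 5, "discount_codes": 2}) == ["products"]
assert seed_gaps({"products": 20, "collections": 3, "discount_codes": 2}) == []
```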

Speed hurt. The fastest build (1.5 hours) scored 51%. The slowest (8 hours) scored 85%. In a one-shot scenario, thoroughness beats speed.

Static analysis did not predict success. Builds with strict quality gates (PHPStan, Deptrac, fresh agent review) scored lower than their unconstrained counterparts. You can have zero static violations and a broken registration form.

Even at 85%, no build implemented order timelines, fulfillment progression, or postal code validation correctly. There is still a gap between strong autonomous generation and production-grade completeness.

Final Verdict

Claude Code with Team Mode is the only build where a customer can browse products, select variants, apply discounts, complete checkout with three payment methods, see decline errors, and access their order history. That is a full commerce journey.

Orchestration pattern matters more than model choice. Integration quality matters more than code volume. Seed data fidelity matters more than scaffolding speed. If you want agents to build real systems end to end, the architecture of the agents themselves is the decisive variable.

Fabian Wesner

Enthusiastic Berlin-based entrepreneur. Former CTO at Rocket Internet and Project A. Co-founded Spryker and raised millions with ROQ. Today, SMEs and enterprises book me to help them adopt agentic engineering and leverage AI across all departments. I'm also looking for an exceptional founder team to join as tech co-founder and build a unicorn.