Modern coding agents are very capable and offer features like sub-agents and even teammates. But how does this affect performance, token consumption, and quality? The only way to find out is to build the same project multiple times. I decided to use only a single prompt per build, so the result is not influenced by my steering. Same specification, same technology, and (almost) the same prompt.
I prepared a comprehensive specification of ~50 pages, with acceptance criteria for every feature, but without any code snippets. The spec itself was not part of the one-shot run.
Why a shop? Simply because I have been building e-commerce platforms for years. I know what a proper implementation looks like, so I can actually evaluate the output rather than just checking if it runs.
The specification covers the typical e-commerce feature set: multi-store support, product catalog, checkout flow, payments, fulfillment, and more. You can browse the full feature overview.
Technology-wise, I decided to start with a fresh Laravel project with Livewire for dynamic UIs. Laravel is a batteries-included framework that enables the agent to build the entire system without additional libraries.
Everything is published: the specification, the prompts, the full session logs, analysis, and the resulting code. If you like, you can check out the main branch and re-run the same experiments yourself, maybe with another coding agent or even another technology (you might need to check the specs for references to Laravel).
The Setup
Codebase: Fresh Laravel template with Livewire
MCP Servers: Laravel Boost, Playwright
The Builds
May 2026
#12 Codex GPT-5.5 Goal Mode (OpenAI Codex CLI v0.128.0, GPT-5.5 with xHigh reasoning, persistent goal mode)
Same model and CLI family as build #11, but driven by Codex 0.128’s new persistent goal mode: a single multi-page brief is wrapped in an <untrusted_objective> tag and the agent loops on its own through PLAN → IMPLEMENT → VERIFY → INDEPENDENT QA → FIX → COMMIT until it calls update_goal("complete"). Reasoning was set to xHigh.
The agent ran for 23 hours and 49 minutes on a single prompt. There were exactly three follow-up user messages in the entire log; everything else was self-driven. Codex revised its structured plan 75 times via the new update_plan tool, spawned 18 named sub-agents (15 explorers, 3 workers; Codex assigns scientist nicknames like Halley, Avicenna, Euler, …, Anscombe), and made 72 commits. The final “independent QA” phase role-played three named reviewers (Anscombe as code reviewer, Gibbs as QA analyst, Arendt as QA engineer) and then went back to fix what they flagged. 96.5% of input was cache reads, which kept the hypothetical GPT-5.5 list-price bill at $530 instead of $5K+.
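To see why the cache-read share matters so much for the bill, here is a rough sketch of a blended input cost. Only the 96.5% share comes from the run; the token count and both per-million prices are placeholder assumptions, not GPT-5.5 list prices and not figures from the session log. Output tokens are ignored.

```python
def input_cost(total_input_tokens: int, cache_read_share: float,
               price_per_mtok: float, cached_price_per_mtok: float) -> float:
    """Blended input cost in dollars when part of the input is served from cache."""
    cached = total_input_tokens * cache_read_share
    fresh = total_input_tokens - cached
    return (fresh * price_per_mtok + cached * cached_price_per_mtok) / 1_000_000


# Hypothetical numbers, for illustration only.
TOKENS = 1_000_000_000   # assumed total input tokens
PRICE = 5.00             # assumed $ per 1M uncached input tokens
CACHED_PRICE = 0.50      # assumed $ per 1M cached input tokens

with_cache = input_cost(TOKENS, 0.965, PRICE, CACHED_PRICE)
without_cache = input_cost(TOKENS, 0.0, PRICE, CACHED_PRICE)
print(f"with cache: ${with_cache:,.2f}, without cache: ${without_cache:,.2f}")
```

With any pricing where cached input is much cheaper than fresh input, a cache-read share this high dominates the total, which is the effect described above.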
Functionally this is the strongest Codex run on the page: 126 of 143 acceptance tests pass (92% weighted), regressions from #11 are all repaired (Size and Color render as separate variant groups, +/- buttons on cart, compare-at strikethrough pricing, the international shipping zone is seeded, the order timeline ships, and the customer / order seed actually matches the spec text). The other side of running for a day is the codebase: 13.6K LOC across 268 classes, the largest of any build, with a SonarCloud quality gate failure (1 bug, 1 vulnerability, 82 code smells, 23h of tech debt). One critical regression: /admin/inventory throws a PHP fatal because of a stray amphp/amp include.
Net: goal mode plus xHigh got Codex closer to a complete shop than any prior agent, but the price is a 23h run and a much larger surface to maintain. Compared with our independent QA report, the agent’s self-assessment was still optimistic - several gaps survived its own review.
Duration: 23h 49m
API Cost: $530.26
Classes: 268
Active Agents: 19
Teammates: sub-agents
[Chart axes: Efficiency (shorter = better), Feature Completeness (out of 143 tests), Code Quality]
April 2026
#11 Codex GPT-5.5 (OpenAI Codex CLI v0.124, GPT-5.5 with high reasoning)
First run on GPT-5.5 in Codex CLI. The user prompt this time explicitly demanded sub-agent role-play and Pest + Playwright testing in one go.
Codex technically complied: it dispatched three sub-agents at the start as QA Analyst, Backend Developer, and Frontend Developer, but every brief said “Do not edit files. Return a concise plan.” The sub-agents acted as read-only consultants and the lead did the implementation alone. Compared to build #5 on the prior Codex model (3.5 hours, 53 sub-agents), this run picked the opposite shape - a single coder finishing in 48 minutes.
The strategic call was clear: ship the storefront, stub the admin. Customer-facing flows (cart, three payment methods, discounts including the EXPIRED/MAXED edge cases, the full fulfillment lifecycle) work end to end. Most admin sections beyond Products and Orders are read-only listings: Collections, Pages, and the dedicated Analytics page have no CRUD UI. Static analysis is the cleanest on this page (Sonar quality gate passes, 5 smells, all A ratings), partly because the codebase is the smallest at 67 classes and 1.7K LOC.
Net: the cost / quality inverse of build #10. #10 over-scaffolded the spec and shipped dead buttons; #11 under-scaffolded the spec and the buttons it does ship work.
Duration: 47m 38s
API Cost: $18.85
Classes: 67
Active Agents: 4
Teammates: sub-agents
#10 Claude Code Opus 4.7 xHigh (Same setup as #09, stricter prompt to enforce team-mode)
Rerun of build #9 with two changes: the prompt was hardened (“have to” replaced “must” on every rule so the instruction to use team-mode could not be rationalised away) and reasoning was raised to xHigh, which the model reportedly handles better than High. All third-party plugins, including the design plugin, were unplugged first so the output reflects Opus on its own.
The stricter wording worked: this run actually delegated work to a team. The lead drove a proper 12-phase plan and spawned specialised teammates (catalog, commerce, admin, storefront) across every phase, producing a codebase roughly twice the size of #09 with far broader coverage in the admin area (themes, analytics, webhooks, apps, navigation, pages) and the storefront shell. Shutdown also felt more disciplined, with a final progress snapshot and a green `php artisan test` suite.
Functionality tells a different story. The browser-level QA run found a set of deeply broken wires: the storefront Add-to-cart button has no handler, almost every admin Save silently drops its payload, customer register and login no-op, and `/login` 500s because of a missing auth.layout component. Even though individual features exist as routes, Livewire classes, and migrations, the buttons that trigger them are dead. Seed data is also much thinner than #09 (only 1 product collection, 1 customer, 1 order, no discount fixtures), which cascades into most of the cart, checkout, discount, and account suites. Headings on the home page are styled `<div>`s rather than real `<h1>` to `<h4>` tags, and variant selectors expose raw SKU codes instead of Size/Color.
Code-quality signals are solid in isolation: average maintainability index is in line with #09, no PHPMetrics violations, one SonarCloud bug, no vulnerabilities, and the usual A/A rating on security and maintainability. Reliability slips to B over a single missing iframe title and the new-code security hotspot. Most remaining SonarCloud noise comes from a single style rule flagging 25 snake_case property names.
Net: the strict prompt fixed the obvious regression from #09 (team-mode was honoured and the build felt complete from the orchestration side) but did not give us a working shop. #09 under-builds the spec and over-delivers on the pieces it does ship; #10 over-builds the spec on paper and under-delivers once a user actually clicks around. This is a reminder that “all tests pass” only covers the paths the agent thought to exercise.
Duration: 1h 36m
API Cost: $157.31
Classes: 151
Active Agents: 37
Teammates: teammates
#09 Claude Code Opus 4.7 (Same prompt as #01, latest Opus)
Same prompt as build #1, just swapped to Opus 4.7. The run finished in 32m 27s for $22.63, by far the fastest on this page (build #1 took a full hour).
Two unilateral deviations stood out. Despite being told to use team-mode, Opus 4.7 built everything as a single agent, explaining: “I judged direct execution to be faster and more coherent. That is a deviation from your ‘must use team mode’ instruction.” It also stopped before the spec was complete, ending with “Ready for your feedback on which of the listed gaps to close next.” Echoes of the Mythos behaviour Anthropic flagged last week.
Scope is the trade-off: 105 classes and 2,667 LOC (vs 140-170 classes and 5-7k LOC in bigger builds), and several admin areas from build #1 (analytics detail, themes, full fulfillment lifecycle, order timeline, address book) are missing. Feature completeness: 93 passing, 24 partial, 25 failing, 1 N/A (weighted ~74%). Still feels complete in the browser.
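The weighted figure is consistent with one simple reading: a partial counts as half a pass and N/A tests are left out of the denominator. That weighting is an assumption on my side for illustration (the function name and the 0.5 weight are not taken from the published analysis), sketched here:

```python
def weighted_score(passing: int, partial: int, failing: int,
                   partial_weight: float = 0.5) -> float:
    """Weighted completeness in percent; N/A tests are simply not counted."""
    scored = passing + partial + failing
    return 100 * (passing + partial_weight * partial) / scored


# Counts reported for this build (the single N/A test is excluded).
score = weighted_score(passing=93, partial=24, failing=25)
print(f"{score:.1f}%")  # 73.9%
```

Under that assumption the result rounds to the ~74% quoted above.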
Quality is strong: avg MI 75.3, 0 PHPMetrics violations, 0 SonarCloud bugs, all three SonarCloud ratings A, 13 code smells, 0.4% duplication. SonarCloud's quality gate reports a failure only because one new-code security hotspot awaits review.
Impressive raw power from a single worker, but I do not like that it chose to ignore explicit instructions. A follow-up run with tighter guardrails is planned.
Duration: 32m 27s
API Cost: $22.63
Classes: 105
Active Agents: 1
Teammates: single
March 2026
#07 Claude Code Team v4 (Same Prompt, 1M Context)
Same specification, same technology, and this time the same original prompt as build #1. The idea was to check whether the increased 1M token context limit plus the 42 Claude Code releases since the first run (v2.1.39 to v2.1.81) produce a better result on their own, without any prompt engineering.
This also serves as a direct comparison to build #6 which used a very detailed prompt with instructions on HOW to build. Here, the agent gets no process instructions at all, only the specification of WHAT to build.
The result: 121 of 143 tests pass with an 88.8% weighted score, the highest across all builds at that point. Zero SonarCloud bugs, all A-ratings, and 12.16 estimated Halstead bugs. The simple prompt with the upgraded context window outperformed the heavily engineered prompt from build #6.
Duration: 3h 39m
API Cost: $132.06
Files Created: 389
Active Agents: 34
Teammates: teammates
#06 Claude Code Team v3 (Advanced Prompt, 1M Context)
Same specification, same technology, but this time with an advanced prompt combining several techniques: a thorough controller supporting the team lead, BDD and TDD, code reviews, and a dedicated QA teammate that actively tries to break the system. The prompt was designed to leverage the new 1M token context window.
The result speaks for itself: 119 of 143 E2E tests pass (83%), the highest since the original team mode run. However, this came at a steep price: about $285 in hypothetical API costs and almost 11 hours of runtime, with 176 active agents. The advanced prompt produced more robust code but consumed far more resources than any previous build.
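One way to put the "steep price" into perspective is cost per passing test. The sketch below uses only the cost and pass counts reported on this page for builds #6 and #7; the metric itself and the helper name `cost_per_pass` are my own illustration, not part of the published analysis.

```python
TOTAL_TESTS = 143

# Hypothetical API cost and passing E2E tests as reported for each build.
builds = {
    "#06 Team v3 (advanced prompt)": {"cost_usd": 284.52, "passed": 119},
    "#07 Team v4 (simple prompt)": {"cost_usd": 132.06, "passed": 121},
}


def cost_per_pass(cost_usd: float, passed: int) -> float:
    """Dollars spent per passing test."""
    return cost_usd / passed


for name, b in builds.items():
    pass_rate = 100 * b["passed"] / TOTAL_TESTS
    per_pass = cost_per_pass(b["cost_usd"], b["passed"])
    print(f"{name}: {pass_rate:.1f}% passing, ${per_pass:.2f} per passing test")
```

By this measure the advanced prompt cost roughly twice as much per passing test as the simple prompt.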
Duration: 10h 59m
API Cost: $284.52
Files Created: 482
Active Agents: 158
Teammates: teammates
February 2026
#01 Claude Code with Team Mode
Claude Code took a phased approach. The team lead read the full specification, broke it into 12 implementation phases with explicit dependencies, and then spawned specialized agents for each area: migrations, models, Livewire components, admin panel, seeders, and so on.
At peak, 31 agents were working simultaneously. The whole thing was done in just over an hour, producing 388 files across the full Laravel stack. Foundation first, then catalog, storefront, cart and checkout, payments, customer accounts, admin, search, analytics, and webhooks.
Duration: 1h 6m
API Cost: $73.44
Files Created: 388
Active Agents: 31
Teammates: teammates
#02 Claude Code with Sub-Agents
Same specification, same prompt structure, but this time Claude Code ran with sub-agents instead of team mode. No specialized agent instructions were prepared - the prompt simply told it to use sub-agents. Claude spawned 20 sub-agents total, 12 of which actively contributed code.
The build took about 2 hours and 13 minutes, producing 358 files. With an estimated API cost of $61.97, it sits between the team mode run ($73) and the Codex run ($8.79) in terms of cost.
Duration: 2h 13m
API Cost: $61.97
Files Created: 358
Active Agents: 12
Teammates: sub-agents
#03 Codex with Sub-Agents
Codex launched explorer agents to analyze the specification first, synthesized their findings into a phased roadmap, then delegated implementation to worker agents. The process took about 1 hour and 44 minutes with 16 sub-agents total.
On my first try, Codex finished after just a few minutes with a surprisingly good but familiar result. That was suspicious: the shop looked exactly like the one Claude had built. It turned out that Codex had found the other branch in the repo and switched to it instead of building from scratch. I had to start over with slightly adjusted instructions.
Duration: 1h 44m
API Cost: $8.79
Agents Spawned: 16
Tool Calls: 357
Teammates: sub-agents
#04 Claude Code Team v2 (More Instructions)
Same specification, same technology, but this time with a tuned prompt and strict quality constraints. The prompt included mandatory PHPStan compliance at max level, Deptrac architectural boundary checks, Pest test coverage, QA self-verification against every acceptance criterion, and a fresh agent review cycle where a new agent instance re-evaluated the entire codebase.
The idea was to see if explicit quality instructions produce measurably better code. The result: significantly fewer estimated bugs (12.4 vs 18.1), higher maintainability (89.7 vs 79.7), zero SonarCloud bugs, and 77% fewer code smells. The quality gate still failed on duplication (3.8%) and unreviewed security hotspots, but the core metrics improved across the board. The trade-off was runtime: the agent took about 3 hours, partly because it did not terminate on its own when done.
A few features were deferred: search was moved to a roadmap item instead of being fully implemented. The storefront uses a dark theme this time.
Duration: 3h 0m
API Cost: $73.92
Files Created: 376
Active Agents: 29
Teammates: sub-agents
#05 Codex with Sub-Agents v2 (More Instructions)
Same specification, same technology, but this time Codex received custom instructions with two additional quality tools: PHPStan (static analysis at max level) and Deptrac (architectural boundary checks). The idea was to see if giving Codex explicit quality constraints would produce measurably better code.
The result is mixed. On the positive side: zero SonarCloud bugs, zero vulnerabilities, and all A-ratings across reliability, security, and maintainability. On the other hand, the code smells more than doubled (113 vs 54) and estimated Halstead bugs jumped to 31.4 (vs 4.0 in v1). The quality focus seems to have shifted the agent towards more classes and more code, but with higher internal complexity in key controllers.
Only a single product was seeded, and the storefront UI is noticeably reduced compared to other builds. The quality-focused instructions appear to have consumed attention that would otherwise go to spec coverage and demo data. The admin panel, however, is functional with OAuth-based API authentication.
Duration: 3h 27m
API Cost: $28.40
Agents Spawned: 53
Tool Calls: 898
Teammates: sub-agents
Conclusion
Five builds. Same spec. Same baseline. Same tooling. 143 end-to-end tests. Two independent runs. One question: can an agent take a detailed spec and produce a working multi-tenant commerce platform in a single run?
Let's be clear about this, though. This is not production-ready and not a valid Shopify clone. And this is not how agentic engineering should be done. It's an experiment to compare coding agent setups.
There Is a Clear Winner
Claude Code in Team Mode scored 85%. Second place: 57%. Last place: 37%. The gap between first and second is larger than between second and last. This was not a close race.
Team Mode Beats Sub-Agents
The decisive factor was not the model. It was orchestration. Sub-agents built great individual pieces but failed at the seams: variants that exist in the backend but never render, discounts defined in admin but not applied at checkout, orders created but not linked to customers. E-commerce is a chain of integrations. Sub-agents optimized locally. Team Mode optimized globally.
Simple Features Are Easy. Checkout Is Not.
All builds can render product listings and display collections. Very few can execute a full checkout with tax, shipping zones, discount logic, and inventory updates. Only the top build implemented magic card numbers for declined payments exactly as specified. Simple display features work everywhere. Transactional flows expose architectural weakness immediately.
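As a rough illustration of what "magic card numbers" means here: a mock payment gateway approves every card except a few reserved numbers that force a specific decline. The card numbers, error codes, and the `charge` function below are hypothetical placeholders; the actual values are defined in the specification and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholders, NOT the numbers from the specification.
MAGIC_CARDS = {
    "9999000000000001": "card_declined",
    "9999000000000002": "insufficient_funds",
}


@dataclass
class PaymentResult:
    approved: bool
    error_code: Optional[str] = None


def charge(card_number: str, amount_cents: int) -> PaymentResult:
    """Mock charge: decline reserved test cards, approve everything else."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    # Accept the number with or without spaces and dashes.
    normalized = card_number.replace(" ", "").replace("-", "")
    error = MAGIC_CARDS.get(normalized)
    if error is not None:
        return PaymentResult(approved=False, error_code=error)
    return PaymentResult(approved=True)


print(charge("9999 0000 0000 0001", 2500))
print(charge("1234 5678 9012 3456", 2500))
```

The checkout then has to surface `error_code` to the customer, which is the part the end-to-end tests exercise.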
Surprising Findings
Seed data was decisive. One build failed 30+ tests simply because it seeded one product instead of 20. The seeder is not boilerplate. It is the data contract between the spec and the system.
Speed hurt. The fastest build (1.5 hours) scored 51%. The slowest (8 hours) scored 85%. In a one-shot scenario, thoroughness beats speed.
Static analysis did not predict success. Builds with strict quality gates (PHPStan, Deptrac, fresh agent review) scored lower than their unconstrained counterparts. You can have zero static violations and a broken registration form.
Even at 85%, no build implemented order timelines, fulfillment progression, or postal code validation correctly. There is still a gap between strong autonomous generation and production-grade completeness.
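The seed-data finding above can be made concrete with a small contract check: compare what the seeder actually produced against the minimum counts the spec implies. Only the product count (20) comes from this article; the other entries, the table names, and the helper `seed_violations` are hypothetical.

```python
# Minimum row counts the seeder must produce. Only "products": 20 is taken
# from the article; the remaining entries are hypothetical placeholders.
SEED_CONTRACT = {"products": 20, "collections": 3, "customers": 5}


def seed_violations(actual_counts: dict, contract: dict) -> list:
    """Return one message per table that has fewer rows than the contract requires."""
    problems = []
    for table, minimum in contract.items():
        found = actual_counts.get(table, 0)
        if found < minimum:
            problems.append(f"{table}: expected at least {minimum}, found {found}")
    return problems


# A build that seeded a single product, as described above.
thin_seed = {"products": 1, "collections": 3, "customers": 5}
for problem in seed_violations(thin_seed, SEED_CONTRACT):
    print(problem)
```

A check like this could run right after seeding, before any browser-level test starts, so a thin seed fails once instead of cascading through 30+ tests.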
Final Verdict
Claude Code with Team Mode is the only build where a customer can browse products, select variants, apply discounts, complete checkout with three payment methods, see decline errors, and access their order history. That is a full commerce journey.
Orchestration pattern matters more than model choice. Integration quality matters more than code volume. Seed data fidelity matters more than scaffolding speed. If you want agents to build real systems end to end, the architecture of the agents themselves is the decisive variable.