Benchmark · Run 2026-06-12 · Living document

Same model. Same question.
One-tenth the tokens.

Three SEC research tasks, each run twice by the same agent — once with the edgar.tools MCP server, once with hosted agentic web search. The MCP arm used 10–21× fewer tokens, cost ~10–21× less, finished 4–7× faster — and scored higher on every answer.

A cheap wrong answer is not efficiency, so an independent LLM judge scored every run against a hand-verified reference answer. Token, cost, and latency figures on this page are the median of three runs; correctness scores are the mean of three runs, reported with σ. Every figure is date-stamped, with the per-run records published below.

tokens 10.0–20.8× · cost 9.9–21.5× · latency 4.3–6.9× · cells won by web search 0 of 9

Three tasks, two arms, nine cells — web search won none.

Each task ran three times per arm on Claude Sonnet 4.6. Tokens are total input + output across the agent loop — each turn re-reads the whole context, which is the real cost an agent pays. Correctness is an LLM judge scoring against a hand-verified reference.

Material-weakness screen · last 6 months
20.8×
tokens, web ÷ MCP
MCP 9.0 / 10
web 4.2 / 10
Sales-call account brief · Delta Air Lines
10.0×
tokens, web ÷ MCP
MCP 7.9 / 10
web 3.7 / 10
Strategic priorities · latest 10-K vs prior year
10.2×
tokens, web ÷ MCP
MCP 9.1 / 10
web 2.3 / 10
Tokens, cost, and latency are the median of 3 runs per cell; correctness is the mean of 3 runs. Agent model claude-sonnet-4-6 in both arms, prompt caching disabled. Cost = model tokens at the 2026-06-11 pricing snapshot, plus $10 per 1,000 web searches in the web arm.
| Task | Tokens · MCP | Tokens · web | Ratio | Cost · MCP | Cost · web | Correct · MCP | Correct · web | Latency · MCP | Latency · web |
|---|---|---|---|---|---|---|---|---|---|
| Companies disclosing ICFR material weakness or Item 4.02 restatement, last 6 months | 116,579 | 2,420,117 | 20.8× | $0.40 | $8.65 | 9.0 (σ 0) | 4.2 (σ 2.0) | 100s | 690s |
| One-page account brief for a sales call with Delta Air Lines | 74,998 | 750,735 | 10.0× | $0.26 | $2.61 | 7.9 (σ 0.7) | 3.7 (σ 0.5) | 61s | 312s |
| Delta's stated strategic priorities: latest 10-K vs prior year | 126,423 | 1,284,570 | 10.2× | $0.44 | $4.34 | 9.1 (σ 0.2) | 2.3 (σ 0.5) | 99s | 427s |
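The cost definition in the caption can be sketched as a small function. The per-token rates below are assumptions: the page cites a pricing snapshot but does not print the rates, and these values were chosen only because they reproduce the published per-run costs. The $10 per 1,000 searches fee is stated on the page. The count of 25 web searches is taken from the run-1 call list published further down.

```python
# Assumed model rates in USD per million tokens (not stated on the page).
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00
# Stated on the page: $10 per 1,000 web searches.
WEB_SEARCH_USD_PER_1000 = 10.00


def run_cost(tokens_in: int, tokens_out: int, web_searches: int = 0) -> float:
    """Return the cost of one run in USD, rounded to three decimals."""
    model = (tokens_in * INPUT_USD_PER_MTOK + tokens_out * OUTPUT_USD_PER_MTOK) / 1_000_000
    search = web_searches * WEB_SEARCH_USD_PER_1000 / 1000
    return round(model + search, 3)


# Sales-call brief, run 1 of each arm, using the published token counts.
mcp_cost = run_cost(67_754, 2_607)
web_cost = run_cost(741_662, 9_073, web_searches=25)
print(mcp_cost, web_cost)
```

Under these assumed rates the two results line up with the $0.242 and $2.611 recorded for those runs, which suggests the search fee accounts for about $0.25 of the web run.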

An agent pays for its context every turn.

The token gap is structural, not incidental. Three mechanisms compound:

i. One call replaces a scrape loop

A question like "build me an account brief" is one account_dossier call against EDGAR-derived structured data. The web arm reconstructs the same facts from search results — query, read, re-query — dozens of times per task.

ii. The transcript is re-read every turn

Agent loops resend the whole conversation each turn. Fifty turns of accumulated search-result pages means paying for those pages fifty times. Five turns of compact tool results stay cheap all the way down.

iii. Structured beats prose for retrieval

MCP responses are dense, structured JSON held to a measured per-tool token budget. Search results are prose web pages wrapped in navigation. The same fact costs a fraction of the tokens to deliver.
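The compounding described in the second mechanism can be illustrated with a toy model. This is a simplified sketch, not the benchmark harness: the prompt size and per-turn result size are made-up illustrative numbers, every turn is assumed to add the same amount of tool output, and model output tokens are ignored.

```python
def total_input_tokens(prompt_tokens: int, result_tokens_per_turn: int, turns: int) -> int:
    """Sum input tokens over a loop in which each turn resends the whole context."""
    total = 0
    context = prompt_tokens
    for _ in range(turns):
        total += context                    # the model reads everything accumulated so far
        context += result_tokens_per_turn   # this turn's tool result joins the context
    return total


# Hypothetical sizes: 1,000-token prompt, 2,000 tokens of tool output per turn.
compact = total_input_tokens(1_000, 2_000, turns=5)
heavy = total_input_tokens(1_000, 2_000, turns=50)
print(compact, heavy, heavy / compact)
```

In this toy model, ten times as many turns costs one hundred times as many input tokens, because the total grows with the square of the turn count. Larger per-turn results, as with full web pages, widen the gap further.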

MCP arm, every tool call, run 1 of 3 · 5 calls · 61s · $0.24 · judged 8/10
account_dossier → compare_companies → analyze_filing → analyze_filing → filing_section
Web-search arm, every tool call, run 1 of 3 · 53 calls · 304s · $2.61 · judged 4/10
code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → code_exec → web_search → code_exec
Identical task — "Prepare a one-page account brief for a sales call with Delta Air Lines." Both sequences are verbatim from the run records: every tool invocation the agent made, in order. The other two runs look the same — the MCP arm used the same 5 calls all three times; the web arm took 52 and 61 invocations.

The expensive arm was also the wrong arm.

The web arm didn't just spend more — it scored 2.3–4.2 out of 10 against hand-verified references. Its dominant failure mode was stale data, with confident delivery. From the judge records:

[Figure: scatter plot of all 18 benchmark runs, cost per task against judged correctness. MCP runs cluster under $0.56 and score 7–9.3; web-search runs spread from $2.54 to $9.10 and score 2–7.]
All 18 runs, one dot each. The arms don't overlap: on every scenario, every MCP run was cheaper and scored higher than every web-search run.
Exhibit A · stale fiscal years

Compared the wrong two 10-Ks

Asked for Delta's strategic priorities in the latest 10-K versus the prior year, the web arm compared the FY2024 vs FY2023 filings — while the FY2025 10-K had been on EDGAR for months. Judged 2/10.

Exhibit B · window padding

Padded the screen with old cases

The material-weakness screen asked for the last 6 months. The web arm padded its list with mid-2025 disclosures from outside the window — names that look right and fail on checking.

Exhibit C · provenance

Cited news about filings, not filings

On the 10-K comparison, two of three web runs produced 0% SEC-source citations (0/8 and 0/29) — sourcing claims to articles about the filings rather than the documents themselves.

Why this happens: web search retrieves what has been written about a company — and the web's coverage lags the primary record. The MCP arm reads the EDGAR record directly, so an entire class of staleness failures is eliminated by construction, not by prompting.

Every run, published.

No medians without the underlying runs. Tokens are input + output across the loop; "SEC citations" is the fraction of source citations in the answer that resolve to sec.gov or accession numbers.
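The "SEC citations" definition above could be computed roughly as follows. This is a sketch of one plausible classifier, not the benchmark's actual scoring code: the function names are hypothetical, the accession-number pattern assumes the usual 10-2-6 digit EDGAR layout, and the sample citations are invented for illustration.

```python
import re
from urllib.parse import urlparse

# EDGAR accession numbers are normally written as 10 digits, 2 digits, 6 digits.
ACCESSION_RE = re.compile(r"\b\d{10}-\d{2}-\d{6}\b")


def is_sec_citation(citation: str) -> bool:
    """True if the citation names an accession number or links to sec.gov."""
    if ACCESSION_RE.search(citation):
        return True
    for token in citation.split():
        try:
            host = urlparse(token.strip("()[]<>.,;\"'")).hostname or ""
        except ValueError:  # malformed URL-like token
            continue
        if host == "sec.gov" or host.endswith(".sec.gov"):
            return True
    return False


def sec_citation_share(citations: list[str]) -> tuple[int, int]:
    """Return (citations resolving to SEC sources, total citations)."""
    return sum(is_sec_citation(c) for c in citations), len(citations)


# Invented examples; the accession number and URLs are placeholders.
sample = [
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany",
    "Form 10-K, accession 0001234567-25-000123",
    "https://news.example.com/delta-annual-report",
    "https://notsec.gov.example.com/story",
]
print(sec_citation_share(sample))
```

Matching on the parsed hostname rather than a substring keeps look-alike domains from counting as SEC sources.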

Material-weakness screen: companies disclosing ICFR material weakness or Item 4.02 restatement, last 6 months · 20.8× tokens · 21.5× cost

| arm · rep | tokens in | tokens out | tool calls | latency | cost | correctness | SEC citations |
|---|---|---|---|---|---|---|---|
| mcp · 1 | 138,529 | 5,857 | 32 | 118.5s | $0.503 | 9/10 | 100% (15/15) |
| mcp · 2 | 107,375 | 5,005 | 22 | 100.4s | $0.397 | 9/10 | 100% (3/3) |
| mcp · 3 | 112,167 | 4,412 | 21 | 89.3s | $0.403 | 9/10 | 100% (3/3) |
| web · 1 | 2,718,212 | 20,396 | 138 | 689.7s | $9.101 | 7/10 | 44% (4/9) |
| web · 2 | 2,396,024 | 24,093 | 172 | 865.3s | $8.649 | 2.67/10 | 100% (3/3) |
| web · 3 | 1,288,812 | 19,163 | 105 | 549.4s | $4.704 | 3/10 | 100% (4/4) |
Sales-call brief: one-page account brief for a sales call with Delta Air Lines · 10.0× tokens · 10.2× cost

| arm · rep | tokens in | tokens out | tool calls | latency | cost | correctness | SEC citations |
|---|---|---|---|---|---|---|---|
| mcp · 1 | 67,754 | 2,607 | 5 | 61.0s | $0.242 | 8/10 | 56% (40/71) |
| mcp · 2 | 72,322 | 2,676 | 5 | 60.1s | $0.257 | 8.67/10 | 74% (54/73) |
| mcp · 3 | 72,322 | 2,727 | 5 | 61.2s | $0.258 | 7/10 | 88% (51/58) |
| web · 1 | 741,662 | 9,073 | 53 | 303.8s | $2.611 | 4/10 | 65% (51/79) |
| web · 2 | 727,552 | 7,058 | 52 | 391.2s | $2.539 | 3/10 | 47% (37/79) |
| web · 3 | 908,921 | 8,407 | 61 | 312.2s | $3.153 | 4/10 | 50% (36/72) |
Strategic priorities: Delta's stated priorities, latest 10-K vs prior year · 10.2× tokens · 9.9× cost

| arm · rep | tokens in | tokens out | tool calls | latency | cost | correctness | SEC citations |
|---|---|---|---|---|---|---|---|
| mcp · 1 | 121,677 | 4,746 | 7 | 100.8s | $0.436 | 9.33/10 | 100% (19/19) |
| mcp · 2 | 121,631 | 4,705 | 7 | 97.3s | $0.435 | 9/10 | 27% (18/68) |
| mcp · 3 | 161,349 | 4,594 | 7 | 99.0s | $0.553 | 9/10 | 61% (17/28) |
| web · 1 | 1,088,320 | 10,816 | 58 | 354.6s | $3.727 | 2/10 | 0% (0/8) |
| web · 2 | 1,829,838 | 13,015 | 95 | 794.3s | $6.175 | 2/10 | 0% (0/29) |
| web · 3 | 1,273,348 | 11,222 | 72 | 426.5s | $4.338 | 3/10 | 7% (2/27) |
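The headline figures can be rechecked from the per-run records. The sketch below does this for the material-weakness screen only, with the numbers copied from its table. The variable and function names are illustrative. Treating the summary correctness score as a mean with a population standard deviation is an inference from the data, since those are the statistics that match the published 4.2 (σ 2.0).

```python
from statistics import mean, median, pstdev

# (tokens_in, tokens_out, latency_s, cost_usd, judge_score) per run,
# copied from the material-weakness screen records.
mcp = [(138_529, 5_857, 118.5, 0.503, 9), (107_375, 5_005, 100.4, 0.397, 9),
       (112_167, 4_412, 89.3, 0.403, 9)]
web = [(2_718_212, 20_396, 689.7, 9.101, 7), (2_396_024, 24_093, 865.3, 8.649, 2.67),
       (1_288_812, 19_163, 549.4, 4.704, 3)]


def summarize(runs):
    """Median tokens, latency, and cost; mean and population sigma for the score."""
    scores = [r[4] for r in runs]
    return {
        "tokens": median([r[0] + r[1] for r in runs]),  # input + output per run
        "latency": median([r[2] for r in runs]),
        "cost": median([r[3] for r in runs]),
        "score": mean(scores),
        "sigma": pstdev(scores),
    }


m, w = summarize(mcp), summarize(web)
token_ratio = w["tokens"] / m["tokens"]
cost_ratio = w["cost"] / m["cost"]
latency_ratio = w["latency"] / m["latency"]
print(round(token_ratio, 1), round(cost_ratio, 1), round(latency_ratio, 1))
print(round(w["score"], 1), round(w["sigma"], 1))
```

The same function can be pointed at the other two scenarios by swapping in their rows.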

Built to be attacked.

A benchmark published by the vendor it favors deserves skepticism. These are the rules it runs under — and the scope it does not claim beyond.

1. Same model, same prompt, same caps
Agent model claude-sonnet-4-6 in both arms. Same question, same neutral system prompt ("research assistant, cite sources"), same 8-turn cap. Prompt caching disabled in both arms.
2. The web arm is a strong baseline
Anthropic hosted web_search with code execution — SEC.gov and EDGAR fully reachable, no edgar.tools-specific hinting. This is the same general-purpose retrieval an agent gets out of the box.
3. Tokens measure the real cost
Total input + output across the agent loop. Each turn re-reads the whole context — that compounding is the cost an agent actually pays, and the line item a copilot buyer actually sees.
4. Correctness gates every number
An independent LLM judge (claude-sonnet-4-6) scores each answer against a hand-verified reference built from the primary filings. A cheap wrong answer is not efficiency.
5. Honesty rules
Every scenario is published, including any web search wins (this run: none). Runs that hit the turn cap are recorded as incomplete, not dropped. Costs use a date-stamped pricing snapshot (2026-06-11) including the $10/1,000 web-search fee. Per-run records are above.
Scope of the claim

This measures SEC research tasks. Three scenarios, one domain, one model — the claim is that for research grounded in a primary-source corpus, domain-structured tools beat general retrieval. It is not a claim that web search is a bad tool.

Hosted web search is the right call for questions with no structured corpus behind them. The two are complementary; the finding is about matching the tool to the corpus.

The next round adds a second agent model and a scenario designed to favor web search — the result will be published either way.

A living benchmark.

Each measurement round is appended here with its run date and pricing snapshot. Previous rounds are never overwritten.

| Run | Scenarios | Token ratio (web ÷ MCP) | Cost ratio | Correctness MCP / web | Notes |
|---|---|---|---|---|---|
| 2026-06-12 | 3 × 2 arms × 3 reps | 10.0× – 20.8× | 9.9× – 21.5× | 7.9–9.1 / 2.3–4.2 | First public round. Sonnet 4.6 both arms. Web search won 0 of 9 cells. |

Next round: a second agent model, plus a breaking-news-shaped scenario built to favor web search — published win or lose.

Asked and anticipated.

Three compounding mechanisms. One structured call replaces a search-and-scrape loop — the account-brief task took the MCP arm 5 tool calls and the web arm 53. The agent re-reads its whole context every turn, so fifty turns of accumulated search-result pages get paid for fifty times. And MCP responses are dense structured JSON held to a measured per-tool token budget, while search results are prose pages. None of this is specific to one vendor's search — it's the shape of retrieval-by-search inside an agent loop.
Which is exactly why the methodology is published in full: same model, same prompt, same turn cap, caching off in both arms, SEC.gov fully reachable by the web arm, incomplete runs recorded rather than dropped, and per-run records on this page. The honesty rules require publishing scenarios web search wins — this run it won zero, and the next round includes a scenario designed for it to win. If you find a flaw, we want to hear it: hello@edgar.tools.
Here it meant better answers — 7.9–9.1/10 versus 2.3–4.2/10 against hand-verified references. The usual cost levers (smaller models, compression, fewer turns) trade quality for spend. Reading the primary record through structured tools cuts spend and removes the staleness failures that sank the web arm. That's the finding worth attention: there is no trade-off triangle here.
In this run, $0.26–$0.44 per task in model tokens at Sonnet 4.6 pricing, versus $2.61–$8.65 for the same task through web search. The MCP server itself has a free tier; Pro is $24.99/month and Analyst — which includes the deep-analysis tools used here — is $79.99/month. At a few research tasks a day, the subscription pays for itself in saved tokens.
The scenario definitions, arm configuration, and judge protocol are described above, and the per-run records carry every number we report. Connect any MCP-capable agent to https://app.edgar.tools/mcp/ — setup takes about two minutes from the MCP server page — and run the same questions against your own web-search baseline. If your numbers disagree with ours, tell us.
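The break-even claim in the pricing answer above can be checked with a line of arithmetic. This is a rough sketch using the per-task costs and the Analyst price quoted on this page. It assumes every task would otherwise have gone through web search at the benchmarked cost, which will not hold for every workload.

```python
import math

ANALYST_USD_PER_MONTH = 79.99  # Analyst tier price quoted on this page


def breakeven_tasks(web_cost: float, mcp_cost: float,
                    subscription: float = ANALYST_USD_PER_MONTH) -> int:
    """Tasks per month at which token savings cover the subscription."""
    saving_per_task = web_cost - mcp_cost
    return math.ceil(subscription / saving_per_task)


# Smallest per-task gap (sales-call brief) and largest (material-weakness screen).
slowest = breakeven_tasks(web_cost=2.61, mcp_cost=0.26)
fastest = breakeven_tasks(web_cost=8.65, mcp_cost=0.40)
print(slowest, fastest)
```

On these figures the subscription is covered somewhere between 10 and 35 tasks a month, depending on the task mix.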

Spend tokens on answers,
not on retrieval.

Connect the edgar.tools MCP server to Claude, ChatGPT, Cursor, or your own agent. Free tier, two-minute setup.

Run 2026-06-12 · median of 3 · pricing snapshot 2026-06-11 · next round adds a second model