Why does agentic web search use 10–21× more tokens than an MCP server for SEC research?

Three compounding reasons. One structured MCP call replaces a search-and-scrape loop: a task the MCP arm finished in 5 tool calls took the web arm 53 tool invocations. An agent re-reads its whole context every turn, so the web arm pays for its growing transcript of search results dozens of times over. And MCP responses are dense structured JSON kept on a measured token budget, while search results are prose web pages. In the 2026-06-12 run the web arm used 10.0–20.8× more tokens per task.

Is the benchmark fair to web search?

Both arms ran the same agent model (Claude Sonnet 4.6), the same question, the same neutral system prompt, and the same 8-turn cap, with prompt caching disabled in both. The web arm used Anthropic hosted web search with SEC.gov and EDGAR fully reachable and no edgar.tools-specific hinting. Runs that hit the turn cap are recorded as incomplete, not dropped. The honesty rules require publishing every scenario including any that web search wins — in this run it won zero — and the per-run records are published on the page.

Does cheaper mean lower quality?

No — in this benchmark the cheaper arm was also the more correct one. An independent LLM judge scored each answer against a hand-verified reference: the MCP arm scored 7.9–9.1 out of 10 per task while the web arm scored 2.3–4.2. The web arm's dominant failure mode was stale data — comparing fiscal-year 2024 vs 2023 10-Ks when the FY2025 10-K was already on file, and citing news articles about filings rather than the filings themselves.

What does an SEC research task cost through the edgar.tools MCP server?

In the 2026-06-12 run the MCP arm cost $0.26–$0.44 in model tokens per research task at Claude Sonnet 4.6 pricing, versus $2.61–$8.65 for the same task through agentic web search (model tokens plus web-search fees). The edgar.tools MCP server itself has a free tier; paid tiers are $24.99/month (Pro) and $79.99/month (Analyst).

How often is the benchmark re-run?

Each measurement round is appended to the round table on the page with its run date and pricing snapshot — previous rounds are never overwritten. The next round adds a second agent model and a scenario designed to favor web search, and the result will be published either way.

Benchmark · Run 2026-06-12 · Living document

Same model. Same question.
One-tenth the tokens.

Three SEC research tasks, each run twice by the same agent — once with the edgar.tools MCP server, once with hosted agentic web search. The MCP arm used 10–21× fewer tokens, cost ~10–21× less, finished 4–7× faster — and scored higher on every answer.

A cheap wrong answer is not efficiency, so an independent LLM judge scored every run against a hand-verified reference answer. Headline token, cost, and latency figures are medians of three runs and correctness figures are three-run means; all are date-stamped, with the per-run records published below.

tokens 10.0–20.8× · cost 9.9–21.5× · latency 4.3–6.9× · cells won by web search 0 of 9


01 Results

Three tasks, two arms, nine cells — web search won none.

Each task ran three times per arm on Claude Sonnet 4.6. Tokens are total input + output across the agent loop — each turn re-reads the whole context, which is the real cost an agent pays. Correctness is an LLM judge scoring against a hand-verified reference.

Material-weakness screen · last 6 months

20.8×

tokens, web ÷ MCP

MCP 9.0 / 10

web 4.2 / 10

Sales-call account brief · Delta Air Lines

10.0×

tokens, web ÷ MCP

MCP 7.9 / 10

web 3.7 / 10

Strategic priorities · latest 10-K vs prior year

10.2×

tokens, web ÷ MCP

MCP 9.1 / 10

web 2.3 / 10

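Tokens, cost, and latency are the median of 3 runs per cell; correctness is the mean of the 3 judged scores, shown with its standard deviation (σ). Agent model claude-sonnet-4-6 both arms, prompt caching disabled. Cost = model tokens at the 2026-06-11 pricing snapshot, plus $10 per 1,000 web searches in the web arm.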
Task	Tokens · MCP	Tokens · web	Ratio	Cost · MCP	Cost · web	Correct · MCP	Correct · web	Latency · MCP	Latency · web
Companies disclosing ICFR material weakness or Item 4.02 restatement, last 6 months	116,579	2,420,117	20.8×	$0.40	$8.65	9.0 σ 0	4.2 σ 2.0	100s	690s
One-page account brief for a sales call with Delta Air Lines	74,998	750,735	10.0×	$0.26	$2.61	7.9 σ 0.7	3.7 σ 0.5	61s	312s
Delta's stated strategic priorities: latest 10-K vs prior year	126,423	1,284,570	10.2×	$0.44	$4.34	9.1 σ 0.2	2.3 σ 0.5	99s	427s
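The cost rule in the caption can be sketched in a few lines. The per-token prices below are an assumption (this page cites a pricing snapshot but does not list the prices); $3 and $15 per million input and output tokens happen to reproduce the published run-1 costs for the sales-call brief. The 25 searches is our count of the web_search entries in the published run-1 tool sequence, and run_cost is a hypothetical helper, not part of the benchmark harness.

```python
# Assumed prices, not stated on this page: USD per million tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00
# Stated on this page: $10 per 1,000 web searches in the web arm.
WEB_SEARCH_USD_EACH = 10.00 / 1000


def run_cost(tokens_in: int, tokens_out: int, web_searches: int = 0) -> float:
    """Model-token cost plus the per-search fee, in USD."""
    model = (tokens_in * INPUT_USD_PER_MTOK
             + tokens_out * OUTPUT_USD_PER_MTOK) / 1_000_000
    return model + web_searches * WEB_SEARCH_USD_EACH


# Sales-call brief, run 1 of each arm (token counts from the per-run records).
mcp_run1 = run_cost(67_754, 2_607)
web_run1 = run_cost(741_662, 9_073, web_searches=25)
print(f"{mcp_run1:.3f} {web_run1:.3f}")  # prints "0.242 2.611"
```

Both figures match the per-run records ($0.242 and $2.611), which is the only reason the assumed prices are used here.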

02 Why the gap

An agent pays for its context every turn.

The token gap is structural, not incidental. Three mechanisms compound:

i.

One call replaces a scrape loop

A question like "build me an account brief" is one account_dossier call against EDGAR-derived structured data. The web arm reconstructs the same facts from search results — query, read, re-query — dozens of times per task.

ii.

The transcript is re-read every turn

Agent loops resend the whole conversation each turn. Fifty turns of accumulated search-result pages means paying for those pages fifty times. Five turns of compact tool results stay cheap all the way down.

iii.

Structured beats prose for retrieval

MCP responses are dense, structured JSON held to a measured per-tool token budget. Search results are prose web pages wrapped in navigation. The same fact costs a fraction of the tokens to deliver.
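The second mechanism is easiest to see as arithmetic. The sketch below is a toy model, not the benchmark harness: it assumes every turn resends the full transcript and that each turn appends a fixed number of tool-result tokens. The function name and the numbers (a 1,000-token starting prompt, 2,000 tokens added per turn) are illustrative assumptions, not measurements from this run.

```python
def loop_input_tokens(base_tokens: int, tokens_added_per_turn: list[int]) -> int:
    """Total input tokens billed when the whole transcript is resent each turn."""
    total = 0
    context = base_tokens
    for added in tokens_added_per_turn:
        total += context   # this turn re-reads everything accumulated so far
        context += added   # this turn's tool results join the transcript
    return total


short_loop = loop_input_tokens(1_000, [2_000] * 5)    # 5 turns
long_loop = loop_input_tokens(1_000, [2_000] * 50)    # 50 turns
print(short_loop, long_loop, long_loop // short_loop)  # prints "25000 2500000 100"
```

Under these assumptions, ten times as many turns costs a hundred times the input tokens, because the billed total grows roughly with the square of the turn count.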

MCP arm: every tool call, run 1 of 3 · 5 calls · 61s · $0.24 · judged 8/10

account_dossier → compare_companies → analyze_filing → analyze_filing → filing_section

Web-search arm: every tool call, run 1 of 3 · 53 calls · 304s · $2.61 · judged 4/10

code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → web_search → code_exec → code_exec → web_search → code_exec

Identical task — "Prepare a one-page account brief for a sales call with Delta Air Lines." Both sequences are verbatim from the run records: every tool invocation the agent made, in order. The other two runs look the same — the MCP arm used the same 5 calls all three times; the web arm took 52 and 61 invocations.

03 Quality

The expensive arm was also the wrong arm.

The web arm didn't just spend more — it scored 2.3–4.2 out of 10 against hand-verified references. Its dominant failure mode was stale data, with confident delivery. From the judge records:

Figure: scatter plot of all 18 benchmark runs, cost per task against judged correctness. MCP runs cluster under $0.56 and score 7–9.3; web-search runs spread from $2.54 to $9.10 and score 2–7.

All 18 runs, one dot each. The arms don't overlap: on every scenario, every MCP run was cheaper and scored higher than every web-search run.

Exhibit A · stale fiscal years

Compared the wrong two 10-Ks

Asked for Delta's strategic priorities in the latest 10-K versus the prior year, the web arm compared the FY2024 vs FY2023 filings — while the FY2025 10-K had been on EDGAR for months. Judged 2/10.

Exhibit B · window padding

Padded the screen with old cases

The material-weakness screen asked for the last 6 months. The web arm padded its list with mid-2025 disclosures from outside the window — names that look right and fail on checking.

Exhibit C · provenance

Cited news about filings, not filings

On the 10-K comparison, two of three web runs produced 0% SEC-source citations (0/8 and 0/29) — sourcing claims to articles about the filings rather than the documents themselves.

Why this happens: web search retrieves what has been written about a company — and the web's coverage lags the primary record. The MCP arm reads the EDGAR record directly, so an entire class of staleness failures is eliminated by construction, not by prompting.
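The window-padding failure in Exhibit B is mechanically checkable. A minimal sketch, assuming the window is the six calendar months ending on the run date (2026-06-12); the helper names, company names, and disclosure dates are hypothetical and only illustrate the check.

```python
import calendar
from datetime import date


def months_before(day: date, months: int) -> date:
    """Same day-of-month `months` earlier, clamped to the month's last day."""
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def split_by_window(disclosures, as_of: date, months: int = 6):
    """Partition (name, disclosed_on) pairs into inside/outside the window."""
    start = months_before(as_of, months)
    inside = [d for d in disclosures if start <= d[1] <= as_of]
    outside = [d for d in disclosures if not (start <= d[1] <= as_of)]
    return inside, outside


# Hypothetical answer rows, not taken from the run records.
claimed = [
    ("ExampleCo A", date(2026, 3, 2)),
    ("ExampleCo B", date(2025, 7, 15)),   # mid-2025: outside a 6-month window
    ("ExampleCo C", date(2025, 12, 12)),  # first day of the window
]
inside, outside = split_by_window(claimed, as_of=date(2026, 6, 12))
print([name for name, _ in outside])  # prints "['ExampleCo B']"
```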

04 Per-run records

Every run, published.

No medians without the underlying runs. Tokens are input + output across the loop; "SEC citations" is the fraction of source citations in the answer that resolve to sec.gov or accession numbers.
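One way to compute an "SEC citations" fraction like the one defined above is sketched here. This is our illustration, not the benchmark's scorer: it assumes a citation counts if its URL host is sec.gov (or a subdomain) or if it contains an EDGAR accession number in the usual 10-2-6 digit form. The function names and sample citations are hypothetical.

```python
import re
from urllib.parse import urlparse

# EDGAR accession numbers look like 0001234567-26-000123 (10-2-6 digits).
ACCESSION_RE = re.compile(r"\b\d{10}-\d{2}-\d{6}\b")


def is_sec_citation(citation: str) -> bool:
    """True if the citation points at sec.gov or names an accession number."""
    host = urlparse(citation).hostname or ""
    if host == "sec.gov" or host.endswith(".sec.gov"):
        return True
    return bool(ACCESSION_RE.search(citation))


def sec_citation_fraction(citations: list[str]):
    """Return (hits, total, fraction), or None when there are no citations."""
    if not citations:
        return None
    hits = sum(is_sec_citation(c) for c in citations)
    return hits, len(citations), hits / len(citations)


# Hypothetical citations for illustration only.
sample = [
    "https://www.sec.gov/Archives/edgar/data/0000000/example.htm",
    "Form 10-K, accession 0001234567-26-000123",
    "https://news.example.com/airline-files-annual-report",
    "https://notsec.gov/lookalike",
]
print(sec_citation_fraction(sample))  # prints "(2, 4, 0.5)"
```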

Material-weakness screen: companies disclosing ICFR material weakness or Item 4.02 restatement, last 6 months · 20.8× tokens · 21.5× cost

arm · rep	tokens in	tokens out	tool calls	latency	cost	correctness	SEC citations
mcp · 1	138,529	5,857	32	118.5s	$0.503	9/10	100% (15/15)
mcp · 2	107,375	5,005	22	100.4s	$0.397	9/10	100% (3/3)
mcp · 3	112,167	4,412	21	89.3s	$0.403	9/10	100% (3/3)
web · 1	2,718,212	20,396	138	689.7s	$9.101	7/10	44% (4/9)
web · 2	2,396,024	24,093	172	865.3s	$8.649	2.67/10	100% (3/3)
web · 3	1,288,812	19,163	105	549.4s	$4.704	3/10	100% (4/4)
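The headline figures for this task can be re-derived from the six rows above. A short sketch using only the published numbers; that σ is a population standard deviation is our inference from the figures, not something the page states.

```python
from statistics import mean, median, pstdev

# (tokens in, tokens out) per run, copied from the table above.
mcp_runs = [(138_529, 5_857), (107_375, 5_005), (112_167, 4_412)]
web_runs = [(2_718_212, 20_396), (2_396_024, 24_093), (1_288_812, 19_163)]
web_scores = [7, 2.67, 3]  # judged correctness, web arm


def median_total(runs):
    """Median of input + output tokens across the runs."""
    return median(tokens_in + tokens_out for tokens_in, tokens_out in runs)


mcp_median = median_total(mcp_runs)
web_median = median_total(web_runs)
ratio = web_median / mcp_median

print(mcp_median, web_median, round(ratio, 1))  # prints "116579 2420117 20.8"
print(round(mean(web_scores), 1), round(pstdev(web_scores), 1))  # prints "4.2 2.0"
```

The token columns reproduce as medians, while the correctness column reproduces as a mean with a population σ.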

Sales-call brief: one-page account brief for a sales call with Delta Air Lines · 10.0× tokens · 10.2× cost

arm · rep	tokens in	tokens out	tool calls	latency	cost	correctness	SEC citations
mcp · 1	67,754	2,607	5	61.0s	$0.242	8/10	56% (40/71)
mcp · 2	72,322	2,676	5	60.1s	$0.257	8.67/10	74% (54/73)
mcp · 3	72,322	2,727	5	61.2s	$0.258	7/10	88% (51/58)
web · 1	741,662	9,073	53	303.8s	$2.611	4/10	65% (51/79)
web · 2	727,552	7,058	52	391.2s	$2.539	3/10	47% (37/79)
web · 3	908,921	8,407	61	312.2s	$3.153	4/10	50% (36/72)

Strategic priorities: Delta's stated priorities, latest 10-K vs prior year · 10.2× tokens · 9.9× cost

arm · rep	tokens in	tokens out	tool calls	latency	cost	correctness	SEC citations
mcp · 1	121,677	4,746	7	100.8s	$0.436	9.33/10	100% (19/19)
mcp · 2	121,631	4,705	7	97.3s	$0.435	9/10	27% (18/68)
mcp · 3	161,349	4,594	7	99.0s	$0.553	9/10	61% (17/28)
web · 1	1,088,320	10,816	58	354.6s	$3.727	2/10	0% (0/8)
web · 2	1,829,838	13,015	95	794.3s	$6.175	2/10	0% (0/29)
web · 3	1,273,348	11,222	72	426.5s	$4.338	3/10	7% (2/27)

05 Methodology

Built to be attacked.

A benchmark published by the vendor it favors deserves skepticism. These are the rules it runs under — and the scope it does not claim beyond.

Same model, same prompt, same caps

Agent model claude-sonnet-4-6 in both arms. Same question, same neutral system prompt ("research assistant, cite sources"), same 8-turn cap. Prompt caching disabled in both arms.

The web arm is a strong baseline

Anthropic hosted web_search with code execution — SEC.gov and EDGAR fully reachable, no edgar.tools-specific hinting. This is the same general-purpose retrieval an agent gets out of the box.

Tokens measure the real cost

Total input + output across the agent loop. Each turn re-reads the whole context — that compounding is the cost an agent actually pays, and the line item a copilot buyer actually sees.

Correctness gates every number

An independent LLM judge (claude-sonnet-4-6) scores each answer against a hand-verified reference built from the primary filings. A cheap wrong answer is not efficiency.

Honesty rules

Every scenario is published, including any that web search wins (this run: none). Runs that hit the turn cap are recorded as incomplete, not dropped. Costs use a date-stamped pricing snapshot (2026-06-11) including the $10/1,000 web-search fee. Per-run records are above.
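The "incomplete, not dropped" rule could look like this in a harness. This is a hypothetical sketch, not the benchmark's code: step stands in for one agent turn and returns a final answer or None, and RunRecord is an assumed record shape. Only the 8-turn cap comes from this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_TURNS = 8  # the turn cap stated in the methodology


@dataclass
class RunRecord:
    status: str            # "complete" or "incomplete"
    turns: int             # turns actually used
    answer: Optional[str]  # None when the cap was hit first


def run_with_cap(step: Callable[[int], Optional[str]],
                 max_turns: int = MAX_TURNS) -> RunRecord:
    """Run turns until `step` returns an answer or the cap is reached."""
    for turn in range(1, max_turns + 1):
        answer = step(turn)
        if answer is not None:
            return RunRecord("complete", turn, answer)
    # Cap reached: keep the run in the results, flagged as incomplete.
    return RunRecord("incomplete", max_turns, None)


finished = run_with_cap(lambda turn: "final answer" if turn == 3 else None)
capped = run_with_cap(lambda turn: None)
print(finished.status, finished.turns, capped.status, capped.turns)
# prints "complete 3 incomplete 8"
```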

Scope of the claim

This measures SEC research tasks. Three scenarios, one domain, one model — the claim is that for research grounded in a primary-source corpus, domain-structured tools beat general retrieval. It is not a claim that web search is a bad tool.

Hosted web search is the right call for questions with no structured corpus behind them. The two are complementary; the finding is about matching the tool to the corpus.

The next round adds a second agent model and a scenario designed to favor web search — the result will be published either way.

06 Rounds

A living benchmark.

Each measurement round is appended here with its run date and pricing snapshot. Previous rounds are never overwritten.

Run	Scenarios	Token ratio (web ÷ MCP)	Cost ratio	Correctness MCP / web	Notes
2026-06-12	3 × 2 arms × 3 reps	10.0× – 20.8×	9.9× – 21.5×	7.9–9.1 / 2.3–4.2	First public round. Sonnet 4.6 both arms. Web search won 0 of 9 cells.

Next round: a second agent model, plus a breaking-news-shaped scenario built to favor web search — published win or lose.

07 Questions

Asked and anticipated.

Why does web search use 10–21× more tokens for the same task?

Three compounding mechanisms. One structured call replaces a search-and-scrape loop — the account-brief task took the MCP arm 5 tool calls and the web arm 53. The agent re-reads its whole context every turn, so fifty turns of accumulated search-result pages get paid for fifty times. And MCP responses are dense structured JSON held to a measured per-tool token budget, while search results are prose pages. None of this is specific to one vendor's search — it's the shape of retrieval-by-search inside an agent loop.

Is this fair to web search? You sell the other arm.

Which is exactly why the methodology is published in full: same model, same prompt, same turn cap, caching off in both arms, SEC.gov fully reachable by the web arm, incomplete runs recorded rather than dropped, and per-run records on this page. The honesty rules require publishing scenarios web search wins — this run it won zero, and the next round includes a scenario designed for it to win. If you find a flaw, we want to hear it: hello@edgar.tools.

Does cheaper mean worse answers?

Here it meant better answers — 7.9–9.1/10 versus 2.3–4.2/10 against hand-verified references. The usual cost levers (smaller models, compression, fewer turns) trade quality for spend. Reading the primary record through structured tools cuts spend and removes the staleness failures that sank the web arm. That's the finding worth attention: there is no trade-off triangle here.

What does a research task cost through the MCP server?

In this run, $0.26–$0.44 per task in model tokens at Sonnet 4.6 pricing, versus $2.61–$8.65 for the same task through web search. The MCP server itself has a free tier; Pro is $24.99/month and Analyst — which includes the deep-analysis tools used here — is $79.99/month. At a few research tasks a day, the subscription pays for itself in saved tokens.
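The break-even claim can be checked with the figures on this page. A minimal sketch, assuming the only saving counted is the per-task cost difference between the two arms and that every task would otherwise have gone through web search; break_even_tasks is a hypothetical helper.

```python
import math

# Median cost per task in USD (MCP arm, web arm), from the results table.
per_task = {
    "Sales-call brief": (0.26, 2.61),
    "Strategic priorities": (0.44, 4.34),
    "Material-weakness screen": (0.40, 8.65),
}
plans = {"Pro": 24.99, "Analyst": 79.99}  # USD per month


def break_even_tasks(monthly_fee: float, mcp_cost: float, web_cost: float) -> int:
    """Smallest number of tasks per month whose savings cover the fee."""
    saving = web_cost - mcp_cost
    return math.ceil(monthly_fee / saving)


for plan, fee in plans.items():
    for task, (mcp_cost, web_cost) in per_task.items():
        print(plan, task, break_even_tasks(fee, mcp_cost, web_cost))
```

On the smallest per-task saving (the sales-call brief), the Analyst tier breaks even at 35 tasks a month, a little over one a day; on the largest (the material-weakness screen), at 10.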

Can I reproduce this?

The scenario definitions, arm configuration, and judge protocol are described above, and the per-run records carry every number we report. Connect any MCP-capable agent to https://app.edgar.tools/mcp/ — setup takes about two minutes from the MCP server page — and run the same questions against your own web-search baseline. If your numbers disagree with ours, tell us.

Spend tokens on answers,
not on retrieval.

Connect the edgar.tools MCP server to Claude, ChatGPT, Cursor, or your own agent. Free tier, two-minute setup.


Run 2026-06-12 · median of 3 · pricing snapshot 2026-06-11 · next round adds a second model
