Three SEC research tasks, each run by the same agent in two arms: once with the edgar.tools MCP server, once with hosted agentic web search. The MCP arm used 10–21× fewer tokens, cost roughly 10–21× less, finished 4–7× faster, and scored higher on every answer.
A cheap wrong answer is not efficiency, so an independent LLM judge scored every run against a hand-verified reference answer. Token, cost, and latency figures on this page are the median of three runs; correctness is the mean of three runs with its standard deviation (σ). Every figure is date-stamped, with the per-run records published below.
Each task ran three times per arm on Claude Sonnet 4.6. Tokens are total input + output across the agent loop — each turn re-reads the whole context, which is the real cost an agent pays. Correctness is an LLM judge scoring against a hand-verified reference.
| Task | Tokens · MCP | Tokens · web | Ratio | Cost · MCP | Cost · web | Correct · MCP | Correct · web | Latency · MCP | Latency · web |
|---|---|---|---|---|---|---|---|---|---|
| Companies disclosing ICFR material weakness or Item 4.02 restatement, last 6 months | 116,579 | 2,420,117 | 20.8× | $0.40 | $8.65 | 9.0 σ 0 | 4.2 σ 2.0 | 100s | 690s |
| One-page account brief for a sales call with Delta Air Lines | 74,998 | 750,735 | 10.0× | $0.26 | $2.61 | 7.9 σ 0.7 | 3.7 σ 0.5 | 61s | 312s |
| Delta's stated strategic priorities: latest 10-K vs prior year | 126,423 | 1,284,570 | 10.2× | $0.44 | $4.34 | 9.1 σ 0.2 | 2.3 σ 0.5 | 99s | 427s |
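
The summary figures can be re-derived from the per-run records published on this page. The sketch below does this for the material-weakness task. It assumes the token and ratio columns are medians of input + output tokens and that correctness is a mean with a population standard deviation; the per-run rows are consistent with that reading, but the function and variable names are illustrative, not part of the benchmark harness.

```python
from statistics import mean, median, pstdev

# Per-run records for the material-weakness task, copied from the per-run table:
# (tokens in, tokens out, correctness score out of 10)
mcp_runs = [(138_529, 5_857, 9.0), (107_375, 5_005, 9.0), (112_167, 4_412, 9.0)]
web_runs = [(2_718_212, 20_396, 7.0), (2_396_024, 24_093, 2.67), (1_288_812, 19_163, 3.0)]


def summarize(runs):
    """Collapse three runs into one summary-table row (assumed aggregation)."""
    totals = [tokens_in + tokens_out for tokens_in, tokens_out, _ in runs]
    scores = [score for _, _, score in runs]
    return {
        "median_tokens": median(totals),        # tokens column
        "mean_score": round(mean(scores), 1),   # correctness column
        "sigma": round(pstdev(scores), 1),      # population standard deviation
    }


mcp = summarize(mcp_runs)
web = summarize(web_runs)
ratio = round(web["median_tokens"] / mcp["median_tokens"], 1)

print(mcp["median_tokens"], web["median_tokens"], ratio)  # 116579 2420117 20.8
print(web["mean_score"], web["sigma"])                    # 4.2 2.0
```

The same function applied to the other two tasks' rows should give the remaining summary rows.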
The token gap is structural, not incidental. Three mechanisms compound:
A question like "build me an account brief" is one account_dossier call against EDGAR-derived structured data. The web arm reconstructs the same facts from search results — query, read, re-query — dozens of times per task.
Agent loops resend the whole conversation each turn. Fifty turns of accumulated search-result pages means paying for those pages fifty times. Five turns of compact tool results stay cheap all the way down.
MCP responses are dense, structured JSON held to a measured per-tool token budget. Search results are prose web pages wrapped in navigation. The same fact costs a fraction of the tokens to deliver.
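The context re-reading effect can be made concrete with a toy model. The sketch below is a simplification with made-up sizes (a 1,000-token prompt and 2,000 tokens added per tool result), not a measurement from the benchmark. It assumes every turn sends the full history as input and then appends one tool result.

```python
def cumulative_input_tokens(turns: int, tokens_per_result: int, base_prompt: int) -> int:
    """Total input tokens billed when each turn resends the whole history.

    Toy model: sizes are hypothetical and every tool result is the same size.
    """
    total = 0
    context = base_prompt
    for _ in range(turns):
        total += context              # the full history is read as input this turn
        context += tokens_per_result  # the new tool result joins the history
    return total


# Hypothetical sizes chosen for illustration only.
long_loop = cumulative_input_tokens(turns=50, tokens_per_result=2_000, base_prompt=1_000)
short_loop = cumulative_input_tokens(turns=5, tokens_per_result=2_000, base_prompt=1_000)

print(long_loop)                # 2500000
print(short_loop)               # 25000
print(long_loop // short_loop)  # 100
```

In this model, ten times as many turns costs a hundred times as many input tokens, because the total grows with the square of the turn count. Smaller tool results shrink the per-turn term on top of that.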
The web arm didn't just spend more — it scored 2.3–4.2 out of 10 against hand-verified references. Its dominant failure mode was stale data, with confident delivery. From the judge records:
Asked for Delta's strategic priorities in the latest 10-K versus the prior year, the web arm compared the FY2024 vs FY2023 filings — while the FY2025 10-K had been on EDGAR for months. Judged 2/10.
The material-weakness screen asked for the last 6 months. The web arm padded its list with mid-2025 disclosures from outside the window — names that look right and fail on checking.
On the 10-K comparison, two of three web runs produced 0% SEC-source citations (0/8 and 0/29) — sourcing claims to articles about the filings rather than the documents themselves.
Why this happens: web search retrieves what has been written about a company — and the web's coverage lags the primary record. The MCP arm reads the EDGAR record directly, so an entire class of staleness failures is eliminated by construction, not by prompting.
No medians without the underlying runs. Tokens are input + output across the loop; "SEC citations" is the fraction of source citations in the answer that resolve to sec.gov or accession numbers.
Task: Companies disclosing ICFR material weakness or Item 4.02 restatement, last 6 months

| arm · rep | tokens in | tokens out | tool calls | latency | cost | correctness | SEC citations |
|---|---|---|---|---|---|---|---|
| mcp · 1 | 138,529 | 5,857 | 32 | 118.5s | $0.503 | 9/10 | 100% (15/15) |
| mcp · 2 | 107,375 | 5,005 | 22 | 100.4s | $0.397 | 9/10 | 100% (3/3) |
| mcp · 3 | 112,167 | 4,412 | 21 | 89.3s | $0.403 | 9/10 | 100% (3/3) |
| web · 1 | 2,718,212 | 20,396 | 138 | 689.7s | $9.101 | 7/10 | 44% (4/9) |
| web · 2 | 2,396,024 | 24,093 | 172 | 865.3s | $8.649 | 2.67/10 | 100% (3/3) |
| web · 3 | 1,288,812 | 19,163 | 105 | 549.4s | $4.704 | 3/10 | 100% (4/4) |

Task: One-page account brief for a sales call with Delta Air Lines

| arm · rep | tokens in | tokens out | tool calls | latency | cost | correctness | SEC citations |
|---|---|---|---|---|---|---|---|
| mcp · 1 | 67,754 | 2,607 | 5 | 61.0s | $0.242 | 8/10 | 56% (40/71) |
| mcp · 2 | 72,322 | 2,676 | 5 | 60.1s | $0.257 | 8.67/10 | 74% (54/73) |
| mcp · 3 | 72,322 | 2,727 | 5 | 61.2s | $0.258 | 7/10 | 88% (51/58) |
| web · 1 | 741,662 | 9,073 | 53 | 303.8s | $2.611 | 4/10 | 65% (51/79) |
| web · 2 | 727,552 | 7,058 | 52 | 391.2s | $2.539 | 3/10 | 47% (37/79) |
| web · 3 | 908,921 | 8,407 | 61 | 312.2s | $3.153 | 4/10 | 50% (36/72) |

Task: Delta's stated strategic priorities: latest 10-K vs prior year

| arm · rep | tokens in | tokens out | tool calls | latency | cost | correctness | SEC citations |
|---|---|---|---|---|---|---|---|
| mcp · 1 | 121,677 | 4,746 | 7 | 100.8s | $0.436 | 9.33/10 | 100% (19/19) |
| mcp · 2 | 121,631 | 4,705 | 7 | 97.3s | $0.435 | 9/10 | 27% (18/68) |
| mcp · 3 | 161,349 | 4,594 | 7 | 99.0s | $0.553 | 9/10 | 61% (17/28) |
| web · 1 | 1,088,320 | 10,816 | 58 | 354.6s | $3.727 | 2/10 | 0% (0/8) |
| web · 2 | 1,829,838 | 13,015 | 95 | 794.3s | $6.175 | 2/10 | 0% (0/29) |
| web · 3 | 1,273,348 | 11,222 | 72 | 426.5s | $4.338 | 3/10 | 7% (2/27) |
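
The "SEC citations" column is defined above as the fraction of source citations that resolve to sec.gov or to accession numbers. One way such a check could be written is sketched below. The source does not publish its scoring code, so the function names and matching rules here are assumptions; the accession-number pattern follows the standard EDGAR layout of 10, 2, and 6 digits separated by hyphens, and the sample citations are placeholders.

```python
import re
from urllib.parse import urlparse

# EDGAR accession numbers look like 0000000000-26-000001 (10-2-6 digits).
ACCESSION_RE = re.compile(r"\b\d{10}-\d{2}-\d{6}\b")


def is_sec_citation(citation: str) -> bool:
    """True if the citation carries an accession number or a sec.gov URL."""
    if ACCESSION_RE.search(citation):
        return True
    for token in citation.split():
        try:
            host = urlparse(token.strip("()[]<>.,;")).hostname or ""
        except ValueError:  # malformed URL-like token
            continue
        if host == "sec.gov" or host.endswith(".sec.gov"):
            return True
    return False


def sec_citation_fraction(citations: list[str]) -> float:
    """Share of citations that point at the primary SEC record."""
    if not citations:
        return 0.0
    return sum(is_sec_citation(c) for c in citations) / len(citations)


# Placeholder citations, not taken from any benchmark run.
sample = [
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany",
    "Form 10-K, accession 0000000000-26-000001",
    "https://example.com/news/delta-10k-analysis",
    "https://sec.gov.example.com/lookalike",
]
print(sec_citation_fraction(sample))  # 0.5
```

Matching on the parsed hostname rather than on a substring keeps lookalike domains from counting as SEC sources.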
A benchmark published by the vendor it favors deserves skepticism. These are the rules it runs under — and the scope it does not claim beyond.
claude-sonnet-4-6 in both arms. Same question, same neutral system prompt ("research assistant, cite sources"), same 8-turn cap. Prompt caching disabled in both arms.
The web arm uses web_search with code execution. SEC.gov and EDGAR are fully reachable, with no edgar.tools-specific hinting. This is the same general-purpose retrieval an agent gets out of the box.
An independent LLM judge (claude-sonnet-4-6) scores each answer against a hand-verified reference built from the primary filings. A cheap wrong answer is not efficiency.
This measures SEC research tasks. Three scenarios, one domain, one model: the claim is that for research grounded in a primary-source corpus, domain-structured tools beat general retrieval. It is not a claim that web search is a bad tool.
Hosted web search is the right call for questions with no structured corpus behind them. The two are complementary; the finding is about matching the tool to the corpus.
The next round adds a second agent model and a scenario designed to favor web search — the result will be published either way.
Each measurement round is appended here with its run date and pricing snapshot. Previous rounds are never overwritten.
| Run | Scenarios | Token ratio (web ÷ MCP) | Cost ratio | Correctness MCP / web | Notes |
|---|---|---|---|---|---|
| 2026-06-12 | 3 × 2 arms × 3 reps | 10.0× – 20.8× | 9.9× – 21.5× | 7.9–9.1 / 2.3–4.2 | First public round. Sonnet 4.6 both arms. Web search won 0 of 9 cells. |
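
A pricing snapshot matters because dollar costs are derived from token counts. The sketch below assumes rates of $3 per million input tokens and $15 per million output tokens. These rates are not stated on this page; they are an assumption that happens to reproduce the MCP-arm costs in the per-run table for the material-weakness task. Web-arm rows are left out because their reported costs come out higher than tokens alone would give at these rates.

```python
# Assumed rates in USD per million tokens. Not stated in the source.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def token_cost_usd(tokens_in: int, tokens_out: int) -> float:
    """Token-only cost of one run under the assumed rates (no caching)."""
    return (tokens_in * INPUT_USD_PER_MTOK + tokens_out * OUTPUT_USD_PER_MTOK) / 1_000_000


# MCP-arm rows from the per-run table: (tokens in, tokens out, reported cost)
mcp_rows = [
    (138_529, 5_857, 0.503),
    (107_375, 5_005, 0.397),
    (112_167, 4_412, 0.403),
]

for tokens_in, tokens_out, reported in mcp_rows:
    estimated = round(token_cost_usd(tokens_in, tokens_out), 3)
    print(estimated, reported)
```

If a later round runs under different rates, the same token counts would produce different dollar figures, so token ratios remain the more stable comparison across rounds.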
Next round: a second agent model, plus a breaking-news-shaped scenario built to favor web search — published win or lose.
Contact: hello@edgar.tools.
Connect your agent to https://app.edgar.tools/mcp/ (setup takes about two minutes from the MCP server page) and run the same questions against your own web-search baseline. If your numbers disagree with ours, tell us.
Connect the edgar.tools MCP server to Claude, ChatGPT, Cursor, or your own agent. Free tier, two-minute setup.