[Draft] Agents Can Reason. They Still Can't Really Search.
Agents have a search problem across the whole stack โ web, knowledge, tools, skills, and context. The recurring bottleneck is never reasoning. It is always retrieval.
Modern agents can write code, call APIs, draft a memo, and pass a benchmark. That part is real. Put one in front of a clean, well-scoped task and it can look genuinely magical.
Then you ask it to do something normal.
Find the pricing page for a competitor that just relaunched their site. Pull a clause from a regulatory filing hidden inside a government portal. Answer a question that requires connecting facts spread across three internal docs written by people who already left the company. Deploy to an infrastructure setup with custom flags, a weird CI config, and a workaround for a flaky pre-push hook that somebody documented once in a Notion page nobody can find. Or just pick the right tool from a catalog of sixty.
This is where things start falling apart.
Not because the model suddenly forgot how to reason. Not because the prompt is missing some sacred incantation. The failure is more basic than that, and once you see it, you start seeing the same bug everywhere.
The recurring bottleneck is search
Search here means one simple thing: before an agent can reason well, it has to find the right thing.
That "thing" might be:
- a source on the public web
- a useful chunk from your private docs
- the right tool or MCP action
- the right skill or procedure
- the relevant part of a long context window
Agents fail on real-world tasks because they keep running into this problem in different places. If any one of those breaks, the whole task usually breaks with it.

You can see the same pattern in a few different places:
- web and external search
- knowledge retrieval over private documents
- tool and MCP discovery
- skill and procedure loading
- navigation inside long context itself
That last category matters more than it seems. A context window is only useful if the model can find the right thing inside it at the right time. Bigger context windows do not remove search. They just move search inside the model.
The rest of this post walks through each one.
Problem 1: The web was not built for agents
Let's start with the obvious version of the problem: web search.
Agents need web search for very ordinary reasons:
- a personalized daily digest has to know what happened today, not at pretraining cutoff
- a market-monitoring agent has to track competitor pricing, product launches, and changelogs
- a research agent has to verify claims against primary sources
- a shopping or travel agent has to compare pages that change constantly
- a coding agent has to read the latest docs, issues, and release notes
In other words, the minute the task depends on freshness, verification, or public evidence, the agent needs the web.
Most teams assume this part is already solved. Add a built-in web search tool, get citations back, move on.
But web search for an agent is not a simple lookup. It is a pipeline:
- come up with the right query
- pick the right source
- actually load the page
- render it if JavaScript is involved
- extract the useful part from noisy HTML
- decide whether the evidence is enough
- refine the query and try again if needed
Any one of those steps can fail.
Consider a founder building a competitive intelligence agent. The agent finds the right company page. The page is JavaScript-rendered. Cloudflare is blocking the headless browser. The content that matters is behind a soft login wall. The web search tool returned the URL. Getting what is actually on the page is a different product entirely, which is why Browserbase sells stealth mode, CAPTCHA solving, proxies, and even highlights its Cloudflare signed agents work. That product exists because the failure mode is real and systematic.
Agents do not browse the web the way humans do. They negotiate with it.
The managed web search tools from frontier labs such as OpenAI, Anthropic, and Google are useful. They return citations, handle some of the pipeline, and are now billed as explicit line items separate from model tokens. OpenAI and Anthropic both price web search at $10 per 1,000 searches. That pricing signal matters. The industry has already admitted that retrieval is not some free background utility. It is its own product surface with its own cost structure.
But even with those tools, the hard part is not fully solved. Provider-native search is great when you want "an answer with citations." It is much weaker when you need repeated monitoring, raw page access, extraction from messy sites, deeper iteration, or a reliable fetch primitive inside your own agent stack. A competitor-tracking agent, for example, does not just need a summarized answer. It needs the actual pricing page, the changed sections, maybe the FAQ, maybe the release notes, and often the raw content for comparison over time.
That gap is exactly why Firecrawl, Exa, Tavily, and Parallel exist. Firecrawl's own search API exposes scrapeOptions because "find the page" and "get the useful content" are different operations. Parallel makes the same point from another angle: its Search API is pitched as collapsing the traditional search -> scrape -> extract pipeline into one API, and its Search MCP exposes web_search and web_fetch as the basic primitives for agents. Their product language is useful because it indirectly admits the same thing: agent search is not just ranking links. It is discovery plus access plus extraction plus compression for the next reasoning step.
Problem 2: RAG solved the easy slice
Now let's move one layer inward.
The first generation of retrieval-augmented generation made the problem look tractable. Embed your documents, store vectors, retrieve the top-k most similar chunks, append them to the prompt. For narrow, well-scoped, single-hop questions over a clean corpus, this works.
It breaks on anything harder.
Suppose you build a technical QA system over internal docs. Single-hop questions work well. Then someone asks a question that requires connecting a constraint described in one document with a definition from another and a caveat buried in a third. Cosine similarity returns three chunks that look individually relevant, but they do not compose into an answer. The model finds each piece, but the retrieval step never actually bridges the gap between them.
This failure is not accidental. It is structural. Similarity is not the same as usefulness. A chunk can be semantically close to a query and still be useless for the final answer. Another chunk can look semantically distant and still be essential for a reasoning step three hops later. This is exactly why IRCoT (interleaving retrieval with chain-of-thought, ACL 2023) and Self-RAG exist as research directions. One-shot retrieve-then-read hit a real ceiling, so the field moved toward iterative and adaptive retrieval.
So the evolution is straightforward:
- simple RAG: retrieve once, read once
- better RAG: retrieve, reflect, and try again
- agentic RAG: break the problem apart, search in parallel, merge evidence, decide whether more search is needed
This is why "agentic RAG" is now becoming a product surface, not just a paper idea. Azure AI Search now has agentic retrieval, where an LLM breaks a complex query into smaller subqueries, runs them in parallel, and merges the result. Their own example is basically a multi-hop retrieval problem in plain English: "find me a hotel near the beach, with airport transportation, and that's within walking distance of vegetarian restaurants." That kind of query is awkward for classic one-shot retrieval, but much better suited to query decomposition plus parallel search.
So yes, agentic RAG is solving a real problem. It is helping with multi-hop questions, multi-ask queries, and situations where the original user query is too broad or under-specified for one retrieval pass.
But it is still far from fully solved.
Even after you decompose a question well, a bunch of hard problems remain:
- the needed source might not be indexed at all
- the relevant page might be stale, contradictory, or poorly chunked
- the evidence might live across text, tables, and UI state instead of neat paragraphs
- one subquery can retrieve locally relevant passages that are still useless for the final answer
- the system still has to decide when it has enough evidence and when to keep searching
- each extra retrieval step adds latency and cost
Microsoft's own agentic retrieval docs say the LLM-based query planning adds latency, even if parallel execution helps compensate. That tradeoff is important. Agentic RAG is not a free accuracy upgrade. It is a better search policy with more moving parts.
A very normal real-world example is enterprise support. A user asks: "Does our enterprise plan support SSO for contractors, what changed in the last release, and are there regional limits for EU tenants?" The answer might live across pricing docs, old help-center pages, release notes, and an internal policy page. Agentic RAG is clearly better than one-shot top-k retrieval here because it can break the question apart. But it can still fail if one of those sources is stale, if the important caveat is hidden in a table, or if the retrieval system stops after finding something merely plausible.
And this gets worse as the organization gets bigger.
At small scale, RAG usually fails in understandable ways: bad chunking, weak embeddings, poor prompts. At big-company scale, it starts failing for more boring reasons:
- the same fact exists in five places, but only one copy is current
- permissions mean the best document exists, but the system cannot show it to this user
- different teams store knowledge in different tools with different metadata quality
- highly selective filters improve security but can hurt recall or latency
- constant document churn means the index is always racing reality
- vector storage and query cost stop being abstract and start becoming infrastructure constraints
This is why enterprise search products like Glean keep emphasizing 100+ connectors and real-time permissions-aware retrieval. They are not doing that for marketing decoration. They are reacting to the actual shape of the problem inside big companies: knowledge is fragmented across Slack, Confluence, Jira, Google Drive, Notion, wikis, tickets, PDFs, and internal apps, and the permission model is part of retrieval, not an afterthought.
Even the lower-level search infrastructure shows the same pain. Azure AI Search's vector filter documentation explicitly calls out a tradeoff between filtering, recall, and latency, and notes that some filter modes can produce false negatives for selective filters or small k. That matters a lot in enterprises because security and access control are often implemented as filters. So the retrieval system is not just trying to find the most relevant passage. It is trying to find the most relevant passage among the subset this user is allowed to see, while still being fast enough to feel interactive.
There is also a scale tax on the index itself. Azure documents vector index size limits and storage tradeoffs because large corpora consume memory and can require multiple stored copies depending on the workload. So even before the model starts reasoning, the retrieval layer is already trading off freshness, cost, recall, latency, and access control. A very normal enterprise question like "What is the current travel reimbursement policy for contractors in Germany?" can span an HR PDF, a newer policy page, a regional addendum, a legal exception in shared drive, and a stale Slack workaround. The hard part is not generating the answer. The hard part is finding the newest authoritative source and ignoring the plausible but outdated ones.
RAG treated retrieval like a database lookup. Agentic systems reveal that retrieval is closer to exploration.
Problem 3: MCP and tools moved the problem up the stack
The Model Context Protocol gave agents a standard way to connect to tools. This is genuinely useful. It also made something more obvious: tools themselves are now a search problem.
Once an agent has access to fifty or more tools, it runs into a familiar problem in a new form. Which tool is relevant? Which action name is correct? Is authentication already set up? Which capabilities should even be visible right now?
Anthropic's own advanced tool use documentation puts a number on this: large tool catalogs can push tool definitions past 50,000 tokens before the model has even read the user's request. Their recommended fix is to use a smaller retrieval model to return only the relevant tools based on user intent, and even use semantic search over tool descriptions. That recommendation is RAG. For actions.
And the ecosystem is already moving in that direction. Anthropic's own advanced tool use engineering post says agents should "discover and load tools on-demand" instead of stuffing every definition into context upfront. LangGraph added dynamic tool calling so the available tools can change at different points in a run. Their examples are telling: require an auth tool before exposing sensitive actions, start with a small toolset, then expand as the task evolves. Salesforce's DX MCP blog makes the same move with toolsets, noting that hosts can dynamically load only the tools they need to minimize memory use and improve performance.
That is the deeper point. The problem is not just "which tool should the model call?" The problem is also "which tools should even be attached right now?" Static attachment made sense when agents had a handful of tools. It breaks down when the catalog is large, sensitive, or step-dependent. So now we are seeing dynamic tool attachment, scoped tool exposure, and tool retrieval as separate design patterns.
We solved document overload by inventing retrieval. Now we are rebuilding the same fix for tools. Composio's Tool Router, which explicitly searches, plans, and authenticates across tool ecosystems, is basically a retrieval layer for actions. Even outside product docs, the ecosystem keeps describing the same pain: Apify recently summarized the MCP moment as context overload, auth pain, and failed tool calls everywhere. Once you have enough MCP servers, you need search to find your search tools.
Problem 4: Skills are workflow search
At this point, there is one more kind of thing the agent needs to find: workflow.
Agents do not just lack facts and tools. They also lack reusable, environment-specific know-how.
Consider a coding agent that needs to deploy to an internal infrastructure with custom build flags, a non-standard CI configuration, and a known workaround for a flaky pre-push hook. None of this is in pretraining. Without a skill, the agent has to rediscover the workaround by trial and error every time. It burns tokens, fails steps, and eventually needs help. With a skill, it loads the procedure on demand, executes it, and moves on.
Skills are what happens when you stop making the agent rediscover the same workflow every turn.
This is also where the ecosystem is starting to converge on a few file-level conventions.
At the project layer, we now have dedicated memory files such as AGENTS.md and CLAUDE.md. They look similar, but they are solving a slightly different problem than skills.
AGENTS.mdis emerging as a simple open format for repo-level instructions for coding agents- OpenAI explicitly recommends
AGENTS.mdfor Codex so the agent can learn repo conventions, testing commands, and project-specific gotchas - Anthropic uses
CLAUDE.mdas Claude Code's project memory, with a hierarchy that can include enterprise, project, and user-level memory files
These files are useful, but they are not the whole answer. They are mostly always-on project memory. Skills are more selective. They are a way to package a reusable capability so the agent can discover it and load it only when needed.
The core issue is simple: you cannot stuff every workflow into the prompt. OpenAI's own Codex engineering write-up says the "one big AGENTS.md" approach failed and that AGENTS.md works better as a map than as an encyclopedia. That is the same pattern we keep seeing everywhere else. Once the context gets large enough, the problem becomes navigation again.
So the stack is starting to separate into two layers:
AGENTS.md/CLAUDE.mdfor always-on project memorySKILL.md/skill.mdfor workflows that should be loaded on demand
That second layer is getting standardized too. OpenClaw treats skills as Agent Skills-compatible folders, the Agent Skills specification defines SKILL.md with progressive disclosure, Vercel's skills ecosystem is pushing the same format across agents, and Mintlify now auto-generates skill.md for docs. The reason this works is straightforward: Hermes uses progressive disclosure for skills because not every workflow should live in prompt context all the time. Some workflows need their own retrieval layer.
Documents answer what is true. Tools answer what can I do. Skills answer how should I do it here.
Problem 5: A bigger context window is a larger search space
Now for the part I think people still under-appreciate: context itself.
The common response to long-context failures is simple. If the model cannot find the relevant information, give it a bigger context window. This framing is almost exactly backwards.
A larger context window does not automatically improve the model's ability to locate what matters inside it. It increases the size of the space the model has to navigate. The bottleneck is not room. It is navigation.
Consider a research agent processing a 200-page technical report. The binding constraint appears on page four. The answer that depends on it is on page 180. The model can individually look at both sections and still fail to connect them. This is basically the "lost in the middle" problem: relevant information buried inside a long input is used less reliably than information near the edges.
This is why Recursive Language Models introduce explicit navigation operators such as peek, partition, grep, and zoom instead of just extending sequence length forever. Those are search operations over context. Related work like LCM makes the same point from another angle: long context and local search need to work together. Once you look at it this way, recursive context methods start looking less like magic context scaling and more like retrieval policies over an internal search space.
Context engineering is just search engineering with better marketing.
What serious agent stacks will look like
The market has already reached a verdict on this, even if the marketing language is all over the place. Frontier labs bill for search. Browser infrastructure companies sell anti-bot handling as a first-class feature. Tool vendors are building routers. Skill systems use progressive loading. Recursive context methods add navigation operators. Different layer, same problem.
I think every serious agent stack will converge on at least five retrieval layers running in parallel: web and external search, private knowledge retrieval, tool discovery, workflow skill loading, and context navigation. These are not optional extras you bolt on later. They are the conditions under which real tasks become solvable.
The winning agent products will look less like single models and more like retrieval orchestration systems โ systems that know what to search, where to search it, when the evidence is sufficient, and when to load a procedure rather than reason from scratch. The model sits in the middle. The retrieval infrastructure is the product.
Search did not go away when agents arrived. It expanded.
References:
- Apify X post on MCP pain: x.com/apify/status/2011556498477105383
- Agent Skills: Specification
- AGENTS.md: Open format
- Anthropic advanced tool use guide: Tool use implementation
- Anthropic engineering: Advanced tool use
- Anthropic Claude Code memory: CLAUDE.md memory
- Azure AI Search: Agentic retrieval
- Azure AI Search: Vector filters
- Azure AI Search: Vector index size
- Azure AI Search: Vector storage options
- Browserbase Cloudflare post: Browserbase + Cloudflare
- Composio Tool Router: Tool Router API
- Firecrawl search docs: Search API
- Glean: Enterprise search engine
- Hermes skills: Progressive disclosure
- IRCoT, Trivedi et al. (ACL 2023): Interleaving Retrieval with Chain-of-Thought Reasoning
- LCM: Long Context Models and local search
- LangGraph: Dynamic tool calling
- Mintlify: skill.md
- OpenAI: How OpenAI uses Codex to build Codex
- OpenAI: Harness engineering for an agent-centric world
- Self-RAG, Asai et al. (NeurIPS 2023): Self-RAG: Learning to Retrieve, Generate, and Critique
- Salesforce DX MCP: Dynamic toolsets
- Lost in the Middle, Liu et al. (TACL 2024): How Language Models Use Long Contexts
- OpenClaw skills: Skills docs
- Parallel Search API: Search quickstart
- Parallel Search MCP: Search MCP
- Parallel Search product: Parallel Search
- Recursive Language Models: Paper ยท Repo
- Vercel: Introducing skills
- Browserbase stealth mode: docs.browserbase.com/features/stealth-mode