Key takeaways
- Current MCP clients inject all tool definitions into every prompt, wasting tokens and context space on hundreds of unused tools, overhead that can cost $0.05-0.10+ per turn.
- Advertise-and-activate sends only tool summaries initially, loading full definitions only when requested — reducing token usage by up to 94% in common scenarios.
- This pattern requires no MCP protocol changes and can be implemented today by having clients manage tool injection intelligently with an activation function.
The Model Context Protocol (MCP) is an exciting development in how LLMs connect to external tools. MCP lets LLM-based assistants dynamically call into servers that expose APIs for everything from code generation to cloud management. It's a simple idea with major implications: instead of hard‑coding integrations, we can give LLMs a structured "tool belt."
But there's one thing MCP clients don't do particularly well — yet.
Mainstream clients (like Claude Code or Cursor) fetch all available tools from connected servers and inject all tool definitions into the LLM prompt on every single turn. That's fine for a small server with three or four functions. But what happens when a server exposes hundreds of tools? Or thousands?
The problem: bloat, tokens, and latency
Right now, every MCP-enabled LLM session pays a cost for how clients handle tool injection:
- Token cost for users: Every tool definition gets stuffed into the model's context. More text = more tokens. More tokens = more cost. Even if the LLM never calls those tools, the user is paying for the privilege of having them loaded. A single comprehensive MCP server could easily consume 10,000+ tokens just for tool definitions — that's $0.05-0.10 per prompt in unnecessary overhead on many models.
- The description dilemma: Tool use is far more effective when tools include comprehensive, even verbose descriptions explaining exactly what the tool does and how it should be used. Good tool authors write detailed descriptions with examples, parameter explanations, and usage guidelines — but this best practice directly conflicts with token efficiency. A well-documented tool might use 200-500 tokens for its description alone, and that cost multiplies as servers scale up.
- Latency hit: The client must format and inject all those schemas into every prompt. Even with local caching of definitions, the client still has to stuff that extensive list into the LLM prompt. This introduces significant latency that adds up, especially with large tool sets.
- Scaling pain for LLM providers: Every prompt is bigger. Bigger prompts mean more compute, higher memory pressure, and slower throughput at scale. For providers serving millions of requests, even a 5% reduction in prompt size can translate to significant infrastructure savings.
- Context window exhaustion: Modern models may have 100k+ token windows, but that space is precious. Tool definitions that consume 10-20% of the context window leave less room for actual conversation history and documents.
- Cognitive overload: Presenting hundreds of unrelated tools can impair model performance. When faced with an overwhelming array of options, even sophisticated models may struggle to identify the most relevant tools for the task at hand, leading to suboptimal tool selection or unnecessary confusion in their reasoning process.
This client behavior effectively assumes that every conversation might use every tool — which is rarely true. In practice, most conversations use only two or three tools out of potentially hundreds available.
The proposal: advertise, then activate
Here's the idea: let clients intelligently manage which tool definitions the LLM sees.
The best part? This can be implemented today without waiting for MCP protocol changes. Here's how:
1. Server provides tool summaries: MCP servers would provide a new optional field containing tool summaries alongside the full tool definitions:

```json
{
  "tools": [ /* existing full tool definitions */ ],
  "toolSummaries": {
    "aws": {
      "summary": "Manage AWS EC2, S3, IAM",
      "keywords": ["cloud", "infrastructure", "compute", "storage"],
      "toolCount": 147
    },
    "github": {
      "summary": "Manage repos, issues, PRs",
      "keywords": ["git", "version control", "code review"],
      "toolCount": 23
    },
    "slack": {
      "summary": "Send messages, manage channels",
      "keywords": ["messaging", "communication", "notifications"],
      "toolCount": 12
    }
  }
}
```
2. Client manages what the LLM sees: Smart MCP clients would:
- Initially send only the tool summaries to the LLM, not the full definitions
- Add their own built-in `activate_mcp_tools` function that the LLM can call
- When the LLM calls this activation function, start including those specific tool definitions in subsequent prompts
The LLM would see something like:
```json
{
  "available_tool_groups": {
    "aws": "Manage AWS EC2, S3, IAM (147 tools)",
    "github": "Manage repos, issues, PRs (23 tools)",
    "slack": "Send messages, manage channels (12 tools)"
  },
  "tools": [
    {
      "name": "activate_mcp_tools",
      "description": "Load full definitions for a group of MCP tools",
      "parameters": {
        "type": "object",
        "properties": {
          "group": {
            "type": "string",
            "description": "The name of the tool group to activate"
          }
        },
        "required": ["group"]
      }
    }
  ]
}
```
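To make the client-side bookkeeping concrete, here is a minimal sketch of what a client could do. Everything below is hypothetical: the class and method names are not part of the MCP specification or any SDK, and the shapes simply mirror the JSON above.

```typescript
// Minimal sketch of client-side state for advertise-and-activate.
// All names here (ToolInjectionManager, buildPromptTools, etc.) are hypothetical.

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: object;
}

interface ToolGroup {
  summary: string;
  keywords: string[];
  tools: ToolDefinition[];
}

// The client's own built-in activation tool, advertised to the LLM on every turn.
const activateToolDefinition: ToolDefinition = {
  name: "activate_mcp_tools",
  description: "Load full definitions for a group of MCP tools",
  inputSchema: {
    type: "object",
    properties: {
      group: { type: "string", description: "The name of the tool group to activate" },
    },
    required: ["group"],
  },
};

class ToolInjectionManager {
  private activeGroups = new Set<string>();

  constructor(private groups: Map<string, ToolGroup>) {}

  // Builds what the client injects into the prompt each turn:
  // one-line summaries for inactive groups, full definitions for active ones.
  buildPromptTools(): { availableToolGroups: Record<string, string>; tools: ToolDefinition[] } {
    const availableToolGroups: Record<string, string> = {};
    const tools: ToolDefinition[] = [activateToolDefinition];

    for (const [name, group] of this.groups) {
      if (this.activeGroups.has(name)) {
        tools.push(...group.tools);
      } else {
        availableToolGroups[name] = `${group.summary} (${group.tools.length} tools)`;
      }
    }
    return { availableToolGroups, tools };
  }

  // Invoked when the LLM calls activate_mcp_tools({ group: "..." }).
  handleActivation(groupName: string): string {
    const group = this.groups.get(groupName);
    if (!group) {
      return `Unknown tool group: ${groupName}`;
    }
    this.activeGroups.add(groupName);
    return `Activated ${group.tools.length} tools from "${groupName}".`;
  }
}
```

On each turn the client would call `buildPromptTools()` to decide what to inject, and route any `activate_mcp_tools` call to `handleActivation()` before building the next prompt.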
Why this matters
This small change has significant benefits:
- Lower token cost for users: Most users don't need every tool in every conversation. Why should they pay for the token overhead of functions they'll never touch? With advertise‑and‑activate, users only pay for what they use.
- Faster responses: Smaller prompts mean faster turns. Less JSON to inject means lower latency in every single interaction. For heavy MCP servers (think: one server for every AWS API), this could shave seconds off responses.
- Better scaling for LLM providers: Every extra token injected into context consumes GPU/TPU memory and compute cycles. Multiply that by millions of requests, and the impact on system throughput becomes significant. A leaner prompt means LLM providers can handle more requests with the same hardware — and pass those savings along.
- Improved model focus: By presenting only relevant tools when needed, models can maintain clearer reasoning paths. Instead of sifting through hundreds of irrelevant options, they can focus on the specific capabilities required for the task at hand, leading to more accurate tool selection and better performance.
- No more manual MCP juggling: Users can leave all their MCP servers connected without worrying about context window saturation. Instead of constantly enabling and disabling servers based on what they're working on, they can maintain a rich set of connected tools and let the client intelligently manage what the LLM sees.
- Enables feature-complete MCP servers: Right now, anyone building a comprehensive server has to worry about overwhelming context windows. A single MCP server exposing the entire AWS surface area is basically unusable. With this proposal, we can finally have feature-rich MCP servers — and the model won't drown in irrelevant schemas.
- Encourages better documentation: Tool authors can finally write the comprehensive, detailed descriptions that make tools effective without worrying about the token overhead. When only activated tools are injected, there's no penalty for thorough documentation. This removes the current perverse incentive to keep descriptions terse at the expense of usability.
A concrete example
Let's illustrate with a hypothetical scenario. Imagine an MCP server that provides access to your entire engineering stack:
Current client behavior:
Client connects to MCP server, receives all 300+ tool definitions
Client sends to LLM: All 300+ definitions (15,000 tokens) on every turn
User: "Can you check if our staging environment is healthy?"
LLM: [Searches through 300+ tools, uses 2 monitoring tools]
Total tokens per turn: ~15,500
With smart client behavior:
Client connects to MCP server, receives all tools AND summaries
Client sends to LLM: Only summaries + activation tool (150 tokens)
User: "Can you check if our staging environment is healthy?"
LLM: [Sees monitoring tools available, calls `activate_mcp_tools("monitoring")`]
Client: [Now includes 20 monitoring tool definitions in subsequent prompts]
LLM: [Uses 2 monitoring tools from the activated set]
Total tokens this turn: ~850
That could be a 94% reduction in token usage for this common task ((15,500 - 850) / 15,500 ≈ 94%). The client intelligently manages what the LLM sees, dramatically reducing costs while maintaining full functionality.
What changes are needed
The beauty of this approach is that it can be implemented today with the current MCP protocol:
Server changes (optional quality-of-life improvement):
- Add tool summaries: While clients can already extract tool names and descriptions from existing tool definitions to build their own summaries, servers could optionally provide pre-organized summaries via a new `toolSummaries` field. This is purely a convenience — the existing tool definitions already contain enough metadata for clients to implement this pattern.
- No breaking changes: Servers continue to send full tool definitions exactly as they do today.
Client changes (where the intelligence lives):
- Smart tool injection: Clients would choose to initially send only summaries to the LLM, dramatically reducing prompt size
- Built-in activation function: Clients add their own tool that lets LLMs request full definitions for specific tool groups
- State management: Track which tool groups are currently activated for each conversation
- Graceful degradation: When connecting to servers without summaries, fall back to current behavior
The key insight is that this pattern requires no MCP protocol changes — clients can implement it today by extracting summaries from existing tool definitions. The optional `toolSummaries` field would simply make implementation easier and more consistent. All the intelligence for managing what the LLM sees lives in the clients, where it naturally belongs.
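As a rough sketch of that fallback, a client could derive group summaries directly from the tool definitions a server already returns. The helper names below are hypothetical, and grouping by server name is an assumption; a client might instead group by a naming prefix such as `aws_` or `github_`.

```typescript
// Hypothetical fallback: derive a group summary from the tool definitions a
// server already returns, so the pattern works even when no toolSummaries
// field is provided. None of these helpers come from the MCP SDK.

interface McpTool {
  name: string;
  description?: string;
}

interface AdvertisedSummary {
  summary: string;
  toolCount: number;
}

// Cheap stand-in for a curated summary: first sentence of a few tool descriptions.
function summarizeServer(tools: McpTool[]): string {
  const sample = tools
    .slice(0, 3)
    .map((t) => (t.description ?? t.name).split(".")[0])
    .join("; ");
  return `${sample} (${tools.length} tools)`;
}

// Prefer server-provided summaries; otherwise fall back to a derived one.
function buildSummaries(
  serverName: string,
  tools: McpTool[],
  advertised?: Record<string, AdvertisedSummary>
): Record<string, string> {
  if (advertised) {
    return Object.fromEntries(
      Object.entries(advertised).map(([group, s]) => [group, `${s.summary} (${s.toolCount} tools)`])
    );
  }
  return { [serverName]: summarizeServer(tools) };
}
```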
Potential drawbacks and challenges
While advertise-and-activate offers compelling benefits, it's not without tradeoffs:
- Added complexity: Clients become more sophisticated. They must track which tool groups they've activated, manage the activation lifecycle, and handle the extra prompt engineering. This adds implementation complexity primarily on the client side.
- Activation overhead: The first time a model needs a capability, it must make an activation call before using the underlying tools. This adds an extra step to the conversation flow. Smart clients could mitigate this by pre-activating tool groups based on conversation context (a simple keyword heuristic is sketched after this list).
- Model confusion: LLMs need to learn a new pattern: they can't just call tools directly anymore. They must first recognize they need a capability, then activate it, then use it. This requires careful prompt engineering, and possibly additional fine-tuning for models that were trained on direct tool calling.
- Discovery challenges: With only summaries available, models might miss relevant tools they would have discovered by reading full descriptions. The quality of the group summaries becomes critical.
- Caching complexity: Clients might want to cache activated tool groups across turns or conversations, but this introduces cache invalidation challenges when tool schemas change.
- Migration path: Existing clients and servers would need updates. During the transition, we'd have a mix of servers that do and don't provide summaries, so clients must detect support and fall back gracefully.
- Security considerations: While the client has all tools available from the start, deferring their injection into the LLM context until activation time requires careful management of which tools are exposed when.
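One way to soften the activation overhead mentioned above is a simple pre-activation heuristic. The sketch below matches the user's message against the keywords advertised in the `toolSummaries` example earlier; the function name and shapes are hypothetical, not part of any MCP client today.

```typescript
// Hypothetical pre-activation heuristic: activate groups whose advertised
// keywords appear in the user's first message, before the initial model call.

interface GroupSummary {
  summary: string;
  keywords: string[];
}

function preActivateGroups(
  userMessage: string,
  summaries: Record<string, GroupSummary>
): string[] {
  const text = userMessage.toLowerCase();
  return Object.entries(summaries)
    .filter(([, s]) => s.keywords.some((k) => text.includes(k.toLowerCase())))
    .map(([group]) => group);
}

// Example: preActivateGroups("Is our staging infrastructure healthy?", summaries)
// could return ["aws"] if "infrastructure" is one of that group's keywords.
```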
Conclusion
At GoDaddy, we build at internet scale — serving millions of small businesses with domains, hosting, and commerce tools. Our engineering teams work across a complex ecosystem:
- Multi-cloud infrastructure: We operate across many cloud providers and our own data centers. An MCP server exposing all our infrastructure APIs would easily exceed 500+ individual tools.
- Developer productivity at scale: With thousands of engineers, even small improvements in development efficiency compound. If each engineer saves 10 minutes per day through better tooling, that equals dozens of full-time engineers worth of productivity.
- Cost consciousness: We're always looking for ways to optimize costs without sacrificing quality. Token costs might seem trivial per-conversation, but multiply that by thousands of daily engineering interactions and it becomes a real budget line item.
- Internal tool explosion: Like many large tech companies, we have hundreds of internal tools and services. Making these accessible through MCP without overwhelming context windows is crucial for adoption.
We've been experimenting with MCP internally for tasks like:
- Automated operational incident response and debugging
- Infrastructure provisioning and management
- Code review help with domain-specific knowledge
- Customer support agent augmentation
In each case, the tool surface area quickly grows beyond what's practical with the current all-or-nothing loading approach. The advertise‑and‑activate approach isn't just a nice optimization. It determines whether MCP stays limited to smaller deployments or becomes production‑ready at enterprise scale.
MCP is still young. This is exactly the time to propose ideas like this — ideas that make the protocol cheaper, faster, and more scalable for everyone. Intelligently managing which tool definitions are sent to the LLM may sound like a small tweak, but it has first‑order impacts on cost, performance, and usability.
This concept is already gaining traction in the community. Projects like MCP-Zero and open issues in tools like Roo Code (#5373) are exploring similar on-demand loading approaches. If the MCP community embraces advertise‑and‑activate, we can keep the protocol lightweight, user‑friendly, and ready for the next generation of LLM applications.