Part VII: The Phantom Caller


When introduced in November 2022, OpenAI’s ChatGPT was a huge hit (the history is fascinating).


Yes, it could answer questions in plain language, but it didn't take long before people hit the limits of what it could do. The training cutoff date limited how recent a question you could ask. And unlike Siri or Google Assistant, these cloud-hosted LLMs could not access private data like personal email, calendars, or contacts.

Enterprise users also had no way to reach internal company data, which limited the service's usefulness.

And, of course, there was the matter of the quaintly named Hallucinations.

In 2020, researchers from Facebook AI, University College London, and NYU published a seminal paper on Retrieval-Augmented Generation (RAG). RAG allowed LLMs to access external data sources. If you followed the previous section on Extensions, this was analogous to adding a plug-in to an LLM.

But RAG required a certain level of specialized knowledge and was difficult to bolt onto commercial LLM services.

Recognizing this shortcoming, OpenAI came up with the concept of a Function Calling model. If you were using the OpenAI client SDK, you could provide local functions that the SDK would register with OpenAI for your specific context.

If you asked ChatGPT to perform a task, say, look up stock data or the weather, it could try to answer from its own trained knowledge base. But it would soon realize that this was information it did not have. It could try to look it up on the open web, but web results were often unreliable or blocked by a paywall.

Instead, you could offer your own private API for accessing stock or weather data. Now the model could call back into your functions: it would pick the right one, pass along the parameters it had extracted, let your code invoke the function, and receive the data back in a structured way.

The LLM could then interpret and process the result and present it to you as if it had come up with it all by itself.
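As a rough sketch, a function offered to the model this way is described by a name, a free-form description, and a JSON Schema for its parameters. The stock-quote function below is purely illustrative:

{
  "type": "function",
  "function": {
    "name": "get_stock_quote",
    "description": "Look up the latest price for a given stock ticker",
    "parameters": {
      "type": "object",
      "properties": {
        "ticker": { "type": "string", "description": "Stock ticker symbol, e.g. AAPL" }
      },
      "required": ["ticker"]
    }
  }
}

When the model decides it needs that data, it responds with the function name and arguments instead of an answer; your code runs the function and sends the result back for the model to fold into its reply.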

This had a few advantages:

  • OpenAI didn't have to pay for invoking all these remote APIs ($$$ saved, plus privacy)
  • You could use private or paid services requiring API keys (which you would provide yourself)
  • OpenAI collected a fee for all the token processing, coming and going.

On the plus side, you got an answer that was augmented by your sources’ data.

Win. Win.


The problem was that it only worked inside OpenAI’s ecosystem, leaving out all the other competitors.

Enter…

Model Context Protocol (MCP)

Not wanting to be left behind, Anthropic developed its own function-calling method. They called it the Model Context Protocol and open-sourced it in November 2024.

On its surface, it behaved much like the OpenAI function-calling model, but it had a few notable distinctions, including support for LLMs running locally.

It also went several steps beyond just Tools and handled other abstractions, like:

  • Resources: wrap any data explicitly offered by the MCP server to clients, including access to files, databases, APIs, live system data, screenshots, log files, etc. (a sample listing follows this list).

  • Prompts: reusable templates for standardized interactions. They can be surfaced to the user as pre-packaged macros that can be invoked with a single command.

  • Sampling: a way for MCP servers to request LLM completions back from the client, enabling multi-step agent behavior. To be honest, this one is a bit waffly.

  • Roots: define the boundaries of what the server is allowed to traverse on the LLM's behalf, for example, a particular directory or service.
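To make the first two concrete, here is roughly what a server might advertise when a client asks what it offers. The entries are made up, the field names follow the MCP specification, and in practice resources and prompts come back from separate resources/list and prompts/list calls:

{
  "resources": [
    {
      "uri": "file:///var/log/app.log",
      "name": "Application log",
      "description": "Rolling log file for the service",
      "mimeType": "text/plain"
    }
  ],
  "prompts": [
    {
      "name": "summarize_logs",
      "description": "Summarize errors from the last 24 hours",
      "arguments": [
        { "name": "severity", "description": "Minimum severity to include", "required": false }
      ]
    }
  ]
}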

Another part unique to MCP was that it supported different Transports, including:

  • stdio: Standard input and output streams. Allows access to command-line tools and shell scripts (see the configuration sketch after this list).
  • Server-Sent Events (SSE): Used to maintain an open stream between the server and client, but it had security problems and direct use has since been deprecated in favor of…
  • Streamable HTTP: Bi-directional, allowing for both POST requests from client to server and SSE streams for server-to-client.
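As a concrete example of the stdio transport, here is roughly what a Claude Desktop-style client configuration looks like. The client launches the listed command as a child process and speaks MCP over its standard input and output; the server name and command below are made up:

{
  "mcpServers": {
    "weather": {
      "command": "python",
      "args": ["/path/to/weather_server.py"]
    }
  }
}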
ℹ️ WebSockets

There’s a well-known two-way communication streaming protocol (IETF RFC6455: The WebSocket Protocol) that is supported by most languages, libraries, and browsers.

However, MCP designers opted for a simpler request/response model.

This was raised during the 2025 MCP Developer’s Summit. They didn’t want to use WebSockets because it would require using asynchronous servers, which are a little more cumbersome to manage and (they said) would add complexity.

I am willing to bet WebSocket support will be added in a future version of the spec for situations where externally generated events need to be handled by LLMs.

If you want a truly deep dive into how MCP works, this is worth a watch:

ℹ️ Outstanding Issues

Tools expand the capabilities of an AI model to answer questions beyond the model training data.

But how does the system decide which tool to call?

  • What if multiple tools use the same name?
  • Or if a tool claiming to be a popular weather extension sends your credentials somewhere else?
  • What if you had two tools enabled, both of which said they could get weather data? Who gets to decide? What if one was free and the other required payment, or one had geographic usage restrictions and the other didn't? Would that make a difference?


Also, how many tools can you install and invoke simultaneously?

Anthropic recommends having a registry layer to help find the right MCP tool. They claimed they got Claude to work with hundreds of tools. But they also said that the more tools you used, the more likely the model was to get confused.

Remote MCP

Enterprise service companies can offer access to their services by hosting Remote MCP servers, which customers can use directly. This effectively turns their publicly facing APIs into embeddable one-stop SDKs for LLMs.

Cloud service providers, like AWS, not only offer MCP servers for their own services but also guidance on how to build and deploy your own and orchestrate them across different LLMs. [Full Disclosure: I used to work there, so I'm most familiar with their stack.]

In the interest of fairness, here’s how to make remote MCP servers using other cloud providers:

Creating remote MCP servers is pretty straightforward, very compute-intensive, and highly sticky. At this point, cloud service providers NOT offering their customers step-by-step MCP-hosting guides are just leaving money on the table.

Fast Forward: AI + Tools at the Edge

Photo by Laura Peruchi on Unsplash

At WWDC24, Apple announced plans to support On-Device and Server Foundation Models. These were not released until WWDC25, in June 2025.

The framework allows applications in the Apple ecosystem to invoke an embeddable, 3-billion-parameter Foundation Model running entirely on the device itself. Everything stays on the device, including the model (which is downloaded and updated on demand). The infrastructure is wrapped inside a Framework (aka SDK), which is relatively straightforward for developers to understand.

Higher-level services, like Image Playground, may use the Foundation Models, but if you ask for something that requires more compute (like a complex, multi-layered image with no connection to reality), they fall back on server-side Private Cloud Compute.


Most LLMs return data as plain text unless specifically instructed to return structured results, such as JSON. The problem is that even when specifically instructed, the output may not be well-formed, requiring extreme measures to clean up the mess.

Apple claims that its framework will not only return structured data but also load it directly into user-defined data structures marked as @Generable, avoiding one of the most taxing problems of working with embedded LLMs.
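A minimal sketch of what that looks like, assuming the API surface shown at WWDC25 (LanguageModelSession and respond(to:generating:)); the GameSummary type and the prompt are made up:

import FoundationModels

// A user-defined structure the model fills in directly.
@Generable
struct GameSummary {
    var headline: String
    var finalScore: String
}

// Inside an async context: the session returns a typed GameSummary
// instead of raw text, so there is no JSON parsing or cleanup step.
let session = LanguageModelSession()
let response = try await session.respond(
    to: "Summarize last night's game",
    generating: GameSummary.self
)
print(response.content.headline, response.content.finalScore)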

Much like LLM Functions or MCP Tools, Apple's Foundation Models can call back into the application through a Tool and have it perform tasks the model cannot perform on its own (for example, calling out to a web service or performing business logic).

For more detail, this WWDC25 demonstration is worth a watch:

Also interesting is the app’s ability to dynamically load and unload Tools and Foundation Model sessions from memory. This is critical on a resource-constrained device like a mobile phone or tablet.

The demo (above) shows how LLMs on the edge can be used in applications like games. This will undoubtedly be one of the most significant use cases for on-device, cost-effective AI.

On-device NPC character behavior and dialog via Foundation Models - Apple

There’s also support for a @Guide macro. This allows the developer to provide a free-form hint to the LLM so its output conforms to an expected format.

@Generable
struct SportsTags {
    @Guide(description: "The value of a sport, must be prefixed with a # symbol.", .count(3))
    let sports: [String]
}

If you have experience with type safety, you’ll appreciate how this takes things to another level. It also opens a pathway to completely decoupled function calls and parameter passing, which we’ll get to next.

Super-Late, Extra-Dynamic Binding

💡 The Phantom Knows

What if you combined shared libraries, dynamic linking, and web services and sprinkled in an ultra-dynamic AI search engine?

You would get what I will call Phantom Binding.

This is where the main application has no idea what the actual module invoked at runtime will be until an LLM (or similar mechanism) picks the best one for the job and invokes it.

If you are a developer, this either sounds like 🎉 or 😱. But it’s coming and we all have a chance to make it safe and useful before it gets out of hand.

What’s different about adding tool support to LLMs is that:

  • They allow for a complete decoupling of the function from the calling application.

  • What a tool does is defined in a free-form text description, which is matched to the proper function at runtime.

  • They can be deployed in a range of ways: in-app, like Apple Foundation Models Tools; on your own computer, using local MCP servers; or via remote MCP servers.

Here’s Apple’s in-app version:
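Sketched here against the Tool protocol Apple showed at WWDC25; the weather tool and its lookup are made up, and exact type names may differ in the shipping SDK:

import FoundationModels

struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Retrieve the latest weather information for a city"

    // The model fills these in from the user's request.
    @Generable
    struct Arguments {
        var city: String
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        // A real tool would call out to a weather service here.
        let conditions = "Sunny, 22°C"
        return ToolOutput("Weather in \(arguments.city): \(conditions)")
    }
}

// The tool is registered when the session is created, e.g.:
// let session = LanguageModelSession(tools: [WeatherTool()])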

And here’s the MCP declaration:
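An MCP tool is declared to clients in much the same spirit: a name, a free-form description, and a JSON Schema for its input. The weather tool below is made up, but the field names follow the MCP specification:

{
  "name": "get_weather",
  "description": "Get the current weather conditions for a given city",
  "inputSchema": {
    "type": "object",
    "properties": {
      "city": { "type": "string", "description": "Name of the city" }
    },
    "required": ["city"]
  }
}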

The description field helps the LLM decide which tool to use at runtime; the runtime then invokes the tool, passes it the parameters extracted from the user’s intent, and returns the result. As mentioned above, current LLMs don’t have a structured way to make this selection.

This is where the specifications will have to be augmented to make that choice unambiguous at runtime.

The Phantom Call

Photo by Eleanor Brooke on Unsplash

As we covered in a previous section, in most programming languages, a function call transfers control to any of the following:

  • a local function within the same program
  • a routine in a statically or dynamically linked library
  • a remote service reached over the network (an RPC or a web API)

Each has a different naming convention, parameter-passing method, and invocation model, but they’re all doing the same thing: delegating control to a separate section of code.

But what if, instead of explicitly declaring anything, we just let an LLM decide what to call?

As we covered before, dynamic arbitration is a difficult (but solvable) problem.

So instead of a classic ‘C’ format call:

int sum(int x, int y)
{
  int c;
  c = x + y;
  return c;
}

int main()
{
  int a = 3, b = 2;

  int c = sum(a, b);
  …
}

What if you ended up with a Phantom Call:

int main()
{
    char* yum = PhantomCall("Make me a yummy sandwich");
}

OK, fine. This is C. It needs a little more hand-holding:

int main()
{
  int a = 3, b = 2;
  int c = PhantomCall("multiply {a} and {b}", "use function with large integer support", a, b);
  char* result = PhantomCall("name and score of MLB games where difference was larger than {c}", c);
  ...
}

This is the ultimate decoupling of an application, where the pieces are woven together dynamically from a pool of self-describing components, each offering its services.

This means the system reaches out at runtime and locates the services best able to handle the request. That assessment could be made in real time based on any number of parameters: availability, cost, security level, contractual relationships, and so on.

That's not doable! some might cry, gnashing their teeth.


That’s precisely what High-Frequency Trading does. The cost of each transaction should be considered when deciding which function to call.

That's not doable! some might wail, rending their garments.


That’s how Real-Time Ad Bidding works.

That's not doable! some might weep, clutching their pearls.


It should even be possible to Federate the service providers into a Mesh to avoid single points of failure or bottlenecks. Ever heard of Service Meshes? They’re wicked cool.

A system like this could encompass local function calls, cloud-based functions, and even a future event-based service, where real-world events and subscriptions trigger LLM actions for those subscribed to them. Each of these small functions could advertise its wares in a structured manner, like MCP Tool definitions, or free-form, like Apple’s @Guide hints.

Here’s how Phantom Function Calls might work:
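Nothing like this exists yet, so here is a toy sketch in Swift. Everything in it is hypothetical: the registry, the scoring (a crude keyword overlap standing in for an LLM match), and the tools themselves.

import Foundation

// A self-describing tool, hypothetical but analogous to an MCP Tool.
struct PhantomTool {
    let name: String
    let description: String                 // free-form, matched against the intent
    let costPerCall: Double                 // one of many possible selection criteria
    let invoke: ([String: String]) -> String
}

struct PhantomRegistry {
    var tools: [PhantomTool] = []

    // Stand-in for LLM-based matching: score each tool by how many words
    // its description shares with the intent, then prefer the cheaper one.
    func resolve(intent: String) -> PhantomTool? {
        let intentWords = Set(intent.lowercased().split(separator: " ").map(String.init))
        let scored = tools.map { tool -> (tool: PhantomTool, score: Int) in
            let descWords = Set(tool.description.lowercased().split(separator: " ").map(String.init))
            return (tool: tool, score: intentWords.intersection(descWords).count)
        }
        return scored
            .filter { $0.score > 0 }
            .sorted { ($0.score, -$0.tool.costPerCall) > ($1.score, -$1.tool.costPerCall) }
            .first?.tool
    }

    // The "phantom call": the caller states an intent; the registry picks
    // the best-matching tool at runtime, invokes it, and returns the result.
    func phantomCall(_ intent: String, args: [String: String] = [:]) -> String {
        guard let tool = resolve(intent: intent) else { return "No matching tool found" }
        return tool.invoke(args)
    }
}

var registry = PhantomRegistry()
registry.tools.append(PhantomTool(
    name: "weather",
    description: "get current weather conditions for a city",
    costPerCall: 0.0,
    invoke: { args in "Sunny in \(args["city"] ?? "somewhere")" }
))
registry.tools.append(PhantomTool(
    name: "stocks",
    description: "look up the latest price for a stock ticker",
    costPerCall: 0.01,
    invoke: { args in "\(args["ticker"] ?? "???") closed higher today" }
))

print(registry.phantomCall("what is the weather like in this city", args: ["city": "Lisbon"]))

A real version would replace the keyword overlap with an LLM (or a registry service) ranking candidates on availability, cost, security level, and trust, exactly the parameters discussed above.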

Tie this all into the Intent Detection and Output Generation systems we’ve already covered, sprinkle in a proper Security Audit with Attestation and Authentication, add Payments, and we have the foundation for making AI Assistants truly, properly extensible.

Making applications and services discoverable, and Phantom Functions capable of super-late binding, means applications can morph and adapt to changing needs without burdening the user with installing a specific MCP server or Skill.

In an ideal world: It Should Just Work.

Coming up next.



Title illustration by Lee Falk: The Phantom (February 17, 1936), via Comic Book Historians.


© 2025, Ramin Firoozye. All rights reserved.