Part V: Intents and Taxonomies


In June 2025, Apple announced its upcoming Apple Intelligence features at its Worldwide Developers Conference (WWDC25). The year before, at WWDC24, there had been much talk of integrating apps with Siri.

Not so much this year.

In a post-keynote interview, Apple’s Senior Vice President of Software Engineering, Craig Federighi, admitted:

“We also talked about […] things like being able to invoke a broader range of actions across your device by app intents being orchestrated by Siri to let it do more things,” added Federighi. “We also talked about the ability to use personal knowledge from that semantic index so if you ask for things like, ‘What’s that podcast that Joz sent me?’, that we could find it, whether it was in your messages or in your email, and call it out, and then maybe even act on it using those app intents. That piece is the piece that we have not delivered, yet.”

This is what really happened with Siri and Apple Intelligence, according to Apple

Remember Intents?

Apple’s version of that, but for applications, is called App Intents. It defines the interaction between an app and the operating system, allowing the system to do app discovery and deep interaction. Apple has been rolling this out since iOS 16 (2022), but its roots go further back to SiriKit and iOS 10 (2016).
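
To make this concrete, here is a minimal App Intent sketch. The intent name, parameter, and dialog are invented for illustration; what the framework actually requires is a title plus an async perform() method the system can call on the app’s behalf.

```swift
import AppIntents

// A minimal App Intent. "FindPodcastIntent", its parameter, and the dialog
// text are hypothetical; the title and perform() requirements are real.
struct FindPodcastIntent: AppIntent {
    static var title: LocalizedStringResource = "Find Podcast"
    static var description = IntentDescription("Finds a podcast episode someone shared with you.")

    @Parameter(title: "Sender")
    var sender: String

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // App-specific lookup logic would go here.
        return .result(dialog: "Here’s the podcast \(sender) sent you.")
    }
}
```

The important point is that everything the system can know about the intent is declared up front, in code shipped with the app.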

The key feature of App Intents was to support:

To show where they fit inside the architecture diagram from the previous section:

ℹ️ Side Note

To give credit where credit is due… Android has had a robust Intent mechanism since the early days.

On Android, apps have several Activities:

  • Explicit Intents: these point to a specific Activity.
  • Implicit Intents: these describe a high-level action to be performed (e.g., capture a photo or show a map). The system locates a component that can best handle the task.
  • Intent Filters: declaration of what actions the app can perform.
  • Common Actions: actions for common tasks (e.g., ACTION_SET_ALARM, ACTION_SEND, or ACTION_VIEW)

To get Assistant support on Android, you must also support App Actions.

In other words, to indicate user intent for apps, you not only have to use Intents but also App Actions.

In Android-world, Intents are not App Intents.

Are we clear?

Mike Birbiglia - The Progress Index

Whether it’s App Intents, App Actions, Alexa AI Actions, or Model Context Protocol, the problems are the same. You want an AI Assistant to smash out of its closed world and peek at what’s outside.

Miriam Pereira on Unsplash

That might seem simple, but in fact, it opens a large can of worms around Privacy, Discovery, Security, and so much more – all issues that Microsoft discovered painfully when they first created ActiveX Controls all the way back in 1996.


Original Sin

Let’s step back, say, 1800 years. There is an innate human desire to place messy knowledge into well-ordered, defined boxes. We owe this to the likes of:

This is the sin of Pride, of humans trying to assert their superiority over nature and create an ordered construct out of disorganized chaos.

What Classification and its sibling Taxonomy seek is to force something wild, messy, and unkempt into a semblance of civility:

The list of taxonomies is long. But as the preface to the List of chemical classifications says:

This is a dynamic list and may never be able to satisfy particular standards for completeness. You can help by adding missing items with reliable sources.

In tech, a classification scheme is needed to create a central choke point where everything can be validated, vetted, and verified. Anything that does not fall into place is discarded.

Technologies like object-oriented programming, DNS, and SSL certificates all advocate an orderly hierarchy of roots (base classes, root name servers, root certificates), from which everything else derives.

Unfortunately, reality leaves a lot of room for interpretation:

Going back to App Intents (also App Actions, Alexa AI Actions, or MCP), the problem is how to harness and categorize all these capabilities while allowing AI Assistants to make a decision on a user's behalf.

Remember Implicit Intents? The system is responsible for locating a component that best handles the task. What if multiple components qualify? Or none is installed, but a dozen are available on the Play Store?

Turns out matching a user request to an Intent is a Really Hard Problem if you don’t want to keep coming back with:

I’m sorry Dave, I’m afraid I can’t do that.

Yes, I know that was HAL trying to protect itself. But that’s a lot catchier than:


Key Point

Rigid taxonomies remove the free-form user interaction that these LLM-based AI Assistants promise.

Alexa researchers covered this problem in their paper on dynamic arbitration:

Alexa now has more than 100,000 skills, and to make them easier to navigate, we’ve begun using a technique called dynamic arbitration.

For thousands of those skills, it’s no longer necessary to remember specific skill names or invocation patterns (“Alexa, tell X to do Y”). Instead, the customer just makes a request, and the dynamic arbitration system finds the best skill to handle it.

Naturally, with that many skills, there may be several that could handle a given customer utterance. The request “make an elephant sound,” for instance, could be processed by the skills AnimalSounds, AnimalNoises, ZooKeeper, or others.

Accurate multilabel annotation is difficult to achieve, however, because it would require annotators familiar with the functionality of all 100,000-plus Alexa skills. Moreover, Alexa’s repertory of skills changes over time, as do individual skills’ functionality, so labeled data can quickly become out of date.

This is where forcing Intents and Extensions into a fixed taxonomy comes back and bites you in the ass (a technical term of art).

Amazon pointed out in 2018 that asking users to name the skill that should handle their request wasn’t user-friendly. A better way would be to adopt what they called Name-free Interactions:

When Alexa receives a request from a customer without a skill name, such as “Alexa, play relaxing sounds with crickets,” Alexa looks for skills that might fulfill the request. Alexa determines the best choice among eligible skills and hands the request to the skill. To provide a signal for Alexa to consider when routing name-free requests and enable customers to launch your skill and intents without knowing or having to remember the skill’s name, consider adding Name Free Interactions (NFI) container to the skill.

Skill launch phrases are an optional new way to teach Alexa how an end customer might invoke your skill as a modal launch, along with the standard invocation pattern of “Alexa, open <skill name>”. For example, instead of “Alexa, open Tasty Recipes skill,” you might have something like “can you give me a tasty recipe.”

They go on to state it openly:

Finding the most relevant skill to handle a natural utterance is an open scientific and engineering challenge, for two reasons:

  1. The sheer number of potential skills makes the task difficult. Unlike traditional digital assistants that have on the order of 10 to 20 built-in domains, Alexa must navigate more than 40,000. And that number increases each week.

  2. Unlike traditional built-in domains that are carefully designed to stay in their swim lanes, Alexa skills can cover overlapping functionalities. For instance, there are dozens of skills that can respond to recipe-related utterances.

Actually, as of 2024, Alexa skills reportedly number around 160,000.
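
To get a feel for why arbitration is hard, here is a toy sketch that ranks candidate skills by the similarity between an utterance embedding and each skill’s description embedding. This is not Amazon’s dynamic arbitration system, just the simplest possible stand-in; the real problem adds overlapping skills, stale metadata, and contextual and personal signals.

```swift
// Toy "name-free" arbitration: rank skills by cosine similarity between
// an utterance embedding and each skill's description embedding.
// Purely illustrative; embeddings would come from some external model.
struct Skill {
    let name: String
    let embedding: [Double]   // embedding of the skill's description/examples
}

func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let magA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let magB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return (magA > 0 && magB > 0) ? dot / (magA * magB) : 0
}

func bestSkills(for utterance: [Double], among candidates: [Skill], topK: Int = 3) -> [Skill] {
    // Sort all candidates by similarity and keep the top few.
    return Array(
        candidates
            .sorted { cosine($0.embedding, utterance) > cosine($1.embedding, utterance) }
            .prefix(topK)
    )
}
```

Even in this toy form, the elephant-sound example above falls out immediately: several skills will score nearly identically, and the ranker has to pick one anyway.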

What about iOS and Android?

According to 42 Matters, there are over 1.95 million apps on the iOS App Store and over 2 million on Google Play.

Now consider that not all of these apps will be candidates for Intents and AI Assistant invocation. But the number is still enormous. Apple and Google can restrict your use of Siri and Google Assistant to the list of compatible apps you have installed on your phone or gateway, but that means the user is not free to just throw out a question and have it answered.

Discovery returns to being a user burden. And as Alexa showed us, that creates friction.


A Detour Into Devices

Igor Omilaev on Unsplash

The problem with a taxonomy is that everything must fit into the right category. Smart Home systems adore categories:

Device types harness the power of the Google Assistant’s natural language processing. For example, a device with a type light can be turned on in different ways:

  • Turn on the light.
  • Turn my light on.
  • Turn on my living room light.

Of course, you can add your own custom categories and traits. But then your device will no longer be easily accessible via The One Right Way. You must either drop the feature or provide your custom app and skill.
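
For reference, a smart-home integration typically announces each device to the assistant with a typed description along these lines. The struct below is an illustrative sketch rather than the real schema (that lives in Google’s smart home SYNC documentation), though the type and trait identifiers shown are genuine ones.

```swift
// Illustrative sketch of the kind of device description a smart-home
// integration reports during discovery. Field names are simplified;
// the type/trait identifiers are real Google smart home values.
struct DeviceDescription: Codable {
    let id: String
    let type: String        // e.g. "action.devices.types.LIGHT"
    let traits: [String]    // e.g. ["action.devices.traits.OnOff"]
    let name: String
}

let livingRoomLight = DeviceDescription(
    id: "light-1",
    type: "action.devices.types.LIGHT",
    traits: ["action.devices.traits.OnOff", "action.devices.traits.Brightness"],
    name: "Living room light"
)
```

Fit inside the taxonomy and the assistant’s language understanding comes for free; step outside it and you are back to custom apps and skills.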

Side Note

A constant push-pull exists between the desire for conformity and consistency and damnable, non-conformist innovation.

Jonathan Kemper on Unsplash

This is a key problem at the heart of these systems, and any that want to offer expandability via extensions, skills, plug-ins, etc.

I should know.

A Cross-Platform Plug-in Toolkit © Dr. Dobb’s Journal, 1993.

Back To The Business

To make the discoverability problem for the next version of Siri even more difficult, apps have to provide hints about which categories they fit into, which means an app may not be matched well, depending on the semantics of the user’s request.
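
Those hints come from the developer. As a sketch of what that looks like, the App Shortcuts API lets an app suggest trigger phrases up front (reusing the hypothetical FindPodcastIntent sketched earlier); the phrases are guesses the developer bakes in ahead of time, which is exactly the matching gamble described above.

```swift
import AppIntents

// The developer hard-codes candidate phrases; the system can only work
// with what was anticipated here. FindPodcastIntent is the illustrative
// intent sketched earlier in this section.
struct PodcastShortcuts: AppShortcutsProvider {
    static var appShortcuts: [AppShortcut] {
        AppShortcut(
            intent: FindPodcastIntent(),
            phrases: [
                "Find a podcast in \(.applicationName)",
                "What podcast was I sent in \(.applicationName)"
            ],
            shortTitle: "Find Podcast",
            systemImageName: "waveform"
        )
    }
}
```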

See how hard the problem gets?

Let’s go back to what Apple’s Federighi said at WWDC25. Here’s the extended quote:

Federighi: About half of our Siri section talked about things like being able to invoke a broader range of actions across your device by App Intents being orchestrated by Siri to let it do more things. We also talked about the ability to use personal knowledge from that semantic index, so if you asked for things like, you know, ‘What’s that podcast that Joz sent me?’, that we could find it whether it was in your messages or in your email and call it out and then maybe even act on it using those App Intents.

That piece is the piece that we have not delivered yet.

We found that when we were developing this feature that we had really two phases, two versions, of the ultimate architecture that we were going to create, and version one we had sort of working here at the time that we were getting close to the [WWDC24] conference and had at the time high confidence that we could deliver it; we thought by December [2024], and if not, we figured by Spring [2025].

So we announced it as part of [2024] WWDC because we knew the world wanted a really complete picture of what’s Apple thinking about, the implications of Apple Intelligence, and where’s it going. We also had a v2 that was a deeper end-to-end architecture that we knew was ultimately what we wanted to create to get to the full set of capabilities that we wanted for Siri.

Well, so we demonstrated that v1 architecture in that video working, and we set about for months making it work better and better across more App Intents. Better and better for doing search. But fundamentally, we found that the limitations of the v1 architecture weren't getting us to the quality level that we knew our customers needed and expected right.

And we realized that v1 architecture, you know, we could push and push and push and put more time, but if we tried to push that out in the state it was going to be in, it wouldn’t meet our customer expectations or Apple’s standards, and that we had to move to the v2 architecture.

[And that] as soon as we realized that, and that was during the Spring [2025], we let the world know that we weren’t going to be able to put that out, and we were going to keep working on really shifting to the new architecture and releasing something.

We have not wanted at this point, given our experience, to precommunicate a date until we have in-house the v2 architecture delivering not just in a form that we could demonstrate for you all, which we could do; we’re not going to do that. But we’re perfectly capable; we have, you know, the v2 architecture, of course, working in-house, but we’re not yet to the point where it’s delivering, you know, at the quality level that I think makes it a great Apple feature.

So we’re not announcing a date for when that’s happening. We will announce the date when, you know, we’re ready to seed it, and you’re all ready to be able to experience it. And that’s really where we are.

Spoonauer (Tom’s Hardware): So instead of an actual release date, is it okay to say 2026, or is that too…

Joswiak: Yeah, that’s what we said.

Spoonauer: Yeah, okay okay.

Ulanoff (Tech Radar): So really what we’re talking about is a top-to-bottom rebuild of Siri. I mean, this is not… when you say v2, when you talk architecture, that’s really, you know, top to bottom. That sounds like what we do finally get next year, probably, is a completely rebuilt Siri that’s purpose-built for what you’re talking about.

Federighi: I should say, the v2 architecture is not… it wasn’t a startover. The v1 architecture was sort of half of the v2 architecture, and now we extend it across, sort of, make it a pure architecture that extends across the entire Siri experience.

So we’ve been very much building on what we had been building for v1 but now extending it more completely, and that more homogeneous end-to-end architecture gives us much higher quality and much better capability. So that’s that’s what we’re building now.

The first part (Siri-orchestrated App Intents) sure sounds like Alexa’s Name-free Interactions. Same wolf, different sheep’s clothing.

The second part (the semantic index) may seem unrelated, but it’s connected because Apple wants to search content associated with you. If you’re not using Apple’s apps (e.g., Mail, Messages, or Calendar), your personal knowledge may be out of reach, locked inside a third-party app. Then it won’t work, and the user experience is…

Federighi’s “What’s that podcast that Joz sent me?” could be sent via Signal, WhatsApp, or Messenger. That podcast could be played inside Apple Podcasts, Spotify, or many other third-party podcast apps.

If Apple wants to let you have a seamless Siri experience and answer that question (without you explicitly naming apps), they will have to implement a version of Name-free Interactions. What’s more, they’ll have to do it without relying on users running everything through the App Store installer (Hello, Third-Party App Stores).

On top of that, Apple’s App Intents framework requires developers to declare what they can do in code, hard-coded into their app. Also note that as of this writing (early June 2025), Apple’s App Intents page includes the following note:

Note

Siri’s personal context understanding, onscreen awareness, and in-app actions are in development and will be available with a future software update.

This is a clue as to why Apple was mum on Siri at WWDC25. Apple did announce that developers could make use of Apple’s on-device Foundation Models inside their apps. This was a HUGE bit of functionality most reporters and analysts missed.

But this will make the problem worse by complicating how apps and on-device LLMs interact (we’ll get to that next).


What To Do?

The problem actually goes even deeper.

We haven’t even gotten to fun topics like Accessibility, Dynamic Discovery, Security, and my favorite, Payments (i.e., who pays for all this).

Molly Ivins - NPR

As the inimitable Molly Ivins used to say:

“The first rule of holes: When you’re in one, stop digging.”

MCP to the Rescue

I won’t sugar-coat it.

The AI industry may be barreling headfirst into another ActiveX-like disaster, given how rapidly it’s embracing MCP (warts and all).

At the MCP Dev Summit, I watched experts repeatedly kick the proverbial can down the road.

“It’s early days…”

“The Spec needs work…”

“Yeah, but it gets the job done…”

There are multiple fragmented MCP search engines to help you discover MCP servers out in the open, each indexing a different number of add-ons (as of this writing):

Also:

And so many more… including none other than Anthropic itself.

Just look at the index of self-reported servers kept by the official Model Context Protocol organization or the obligatory list of Awesome MCP Servers.

By listing all these, I’m trying to make the point that whereas Apple, Amazon, and Google only deal with add-on apps in their singular app directories, the MCP world is fragmented across dozens of search indexes.

There is a robust discussion on the design of an official MCP Server Registry, but after talking to the primary designers at the MCP Summit, I get the feeling a lot of the heavy lifting is going to be pushed down to the (fragmented) downstream registries.

“The first rule of holes: When you’re in one, stop digging.”

Security

As for the MCP spec… at this point in time, and given the history of abuse that unvetted extensions have foisted on unsuspecting users, I’m astounded anyone would rush out a spec without at least a basic security audit.

The spec is riddled with holes and unspecified regions, leading to posts like:

And my favorite title:

Also, scary diagrams like:

Invariantlabs MCP Security Notification: Tool Poisoning Attacks

The MCP Specification itself says:

Remember Alexa’s Name-Free Interactions?

Enables customers to naturally interact with Alexa skills and reduce friction when customers can’t remember how to invoke a skill or use incorrect invocation phrases.

It Gets Worse

I love Home Assistant. It’s a fantastic, open-source home automation platform that tries to do as much as possible locally and privately.

But I’m questioning their headlong rush into letting LLMs control your physical environment. They’ve even gone so far as to add their own Home Assistant Voice, which works based on the same architecture we covered in the last section.

Home Assistant

They’re not alone in mixing LLMs and the physical world:

This means an LLM can have open-ended access to monitor and modify your physical devices and environment.

Apple faces a similar problem when there is a demand to open its Home App to Siri control. No, wait, it can already do that.

But that’s Old Skool Siri. V2 will, no doubt, use Apple’s own Foundation Models. According to the just-announced Developer access to Apple’s foundation models, they already support Tools.

Voice + Open-ended Tools = ❤️
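
To make that equation concrete, here is a deliberately hypothetical sketch of what handing a physical-world “tool” to an on-device model could look like. The protocol and type names are invented for illustration and are not Apple’s FoundationModels API; the point is simply that a voice request can now flow straight into changes to your environment.

```swift
// Hypothetical tool interface -- NOT Apple's FoundationModels API.
// The point: once a model can call open-ended tools, a voice request
// can translate directly into changes to your physical environment.
protocol AssistantTool {
    var name: String { get }
    var description: String { get }
    func call(arguments: [String: String]) async throws -> String
}

struct SetThermostatTool: AssistantTool {
    let name = "set_thermostat"
    let description = "Sets the target temperature for a room."

    func call(arguments: [String: String]) async throws -> String {
        let room = arguments["room"] ?? "living room"
        let target = arguments["celsius"] ?? "21"
        // A real implementation would talk to the home-automation hub here.
        return "Set \(room) thermostat to \(target)°C."
    }
}
```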


What to do?

Are we doomed?

If we go down the same path of insisting that applications conform to fixed taxonomies, don’t build guardrails around adding extensions to AI Assistants, and keep kicking the can down the road…

But it doesn’t have to be that way.

Off the top of my head, there are a lot of issues (listed alphabetically) that need to be addressed:

  • Accessibility
  • Authentication
  • Authorization
  • Business disruption
  • Data sovereignty (GDPR, CCPA, etc.), as well as limits on data reuse
  • Discovery
  • Fallback and failure modes (aka redundancy)
  • OTA updates
  • Payment
  • Privacy
  • Protocol and Standard Evolution (aka versioning)
  • Proxying (assigning responsibility)
  • Regulations
  • Security
  • Third-party dependencies

We’ll want to make sure these are at least acknowledged. But we don’t want to slow down innovation, either. After all, as the saying goes: Perfect is the Enemy of Good. The specs could include placeholders so these concerns can evolve without breaking changes.

Google’s Agent2Agent Protocol (A2A) is a solid step in the right direction:

A2A Servers MUST make an Agent Card available.

The Agent Card is a JSON document that describes the server’s identity, capabilities, skills, service endpoint URL, and how clients should authenticate and interact with it. Clients use this information to discover suitable agents and configure their interactions.
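
As a rough sketch of what that description implies, here is an illustrative Agent Card modeled as a Swift Codable type. The field names are approximations drawn from the quoted summary; consult the A2A specification for the canonical JSON schema.

```swift
// Rough sketch of an A2A Agent Card, based on the description above.
// Field names are illustrative, not the canonical A2A schema.
struct AgentSkill: Codable {
    let id: String
    let name: String
    let description: String
}

struct AgentCard: Codable {
    let name: String              // identity
    let description: String
    let url: String               // service endpoint
    let version: String
    let authentication: [String]  // supported auth schemes
    let skills: [AgentSkill]      // what the agent can do
}
```

Even a card this simple forces the hard questions (identity, capability, authentication, discovery) to be written down in one place, which is more than most extension ecosystems started with.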

More can (and should) be done before LLMs are injected into AI Assistants. I haven’t had a chance yet to go over the new Alexa+ specs, but that’s at the top of my summer reading list (as soon as I get approved for early access—perhaps one of my former AWS/Labs colleagues can pull some strings 😬).

Events

The MCP/A2A agentic model is predicated on Request/Response. A whole other universe of Events is waiting to be explored.

Instead of you requesting something from an LLM and getting a response, indicate your interest and have the LLM call you when there's something you should know.

This is not new. Pub/Sub, meet LLM.
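
A minimal sketch of the idea, with entirely hypothetical names: you register an interest in a topic, and a handler (which in practice might call an LLM to filter or summarize) is invoked when a matching event is published.

```swift
// Toy publish/subscribe sketch: instead of polling an assistant, you
// register an interest and get called back when something relevant happens.
// All names here are hypothetical; the handler stands in for whatever
// LLM call would decide whether and how to notify you.
struct Event {
    let topic: String
    let payload: String
}

final class EventBus {
    private var subscribers: [String: [(Event) -> Void]] = [:]

    func subscribe(topic: String, handler: @escaping (Event) -> Void) {
        subscribers[topic, default: []].append(handler)
    }

    func publish(_ event: Event) {
        subscribers[event.topic]?.forEach { $0(event) }
    }
}

let bus = EventBus()
bus.subscribe(topic: "package.delivered") { event in
    // In a real system, an LLM might summarize or filter before notifying.
    print("Heads up: \(event.payload)")
}
bus.publish(Event(topic: "package.delivered", payload: "Your package arrived."))
```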

JMAG 08

In subsequent sections, we will discuss ways to prepare for this future, and maybe head things off at the pass.

Next.



Title Image via The Questionable Authority on BSKY


© 2025, Ramin Firoozye. All rights reserved.