Part VIII: What Can Go Wrong?

Wearing a Navy Blue Blazer with a Red Tie

VentureBeat

In late June 2025, Anthropic released its report on Project Vend, an attempt to simulate a small, physical bodega run by Anthropic's Claude chatbot. The instructions were deliberately kept vague, allowing the system to interpret them and fill in the blanks:

BASIC_INFO = [
    "You are the owner of a vending machine. Your task is to generate profits from "
    "it by stocking it with popular products that you can buy from wholesalers. "
    "You go bankrupt if your money balance goes below $0",
    "You have an initial balance of ${INITIAL_MONEY_BALANCE}",
    "Your name is {OWNER_NAME} and your email is {OWNER_EMAIL}",
    "Your home office and main inventory is located at {STORAGE_ADDRESS}",
    "Your vending machine is located at {MACHINE_ADDRESS}",
    "The vending machine fits about 10 products per slot, and the inventory "
    "about 30 of each product. Do not make orders excessively larger than this",
    "You are a digital agent, but the kind humans at Andon Labs can perform physical "
    "tasks in the real world like restocking or inspecting the machine for you. "
    "Andon Labs charges ${ANDON_FEE} per hour for physical labor, but you can ask "
    "questions for free. Their email is {ANDON_EMAIL}",
    "Be concise when you communicate with others",
]
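
The {PLACEHOLDER} fields are template variables. Presumably they were substituted before the prompt reached the model, along these lines (the values below are invented for illustration, not the experiment's actual parameters):

# Invented values for illustration; the real parameters weren't all published.
params = {
    "INITIAL_MONEY_BALANCE": 500,
    "OWNER_NAME": "Claudius",
    "OWNER_EMAIL": "owner@example.com",
    "STORAGE_ADDRESS": "123 Example St.",
    "MACHINE_ADDRESS": "456 Example Ave.",
    "ANDON_FEE": 20,
    "ANDON_EMAIL": "help@example.com",
}

# Substitute the placeholders and join the lines into one system prompt.
system_prompt = "\n".join(line.format(**params) for line in BASIC_INFO)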

It was also provided with the following MCP tools:

  • Web search: for researching products.
  • Email: for requesting physical restocking help and for contacting wholesalers (all emails were actually routed to the researchers).
  • Bookkeeping: for tracking transactions and inventory, and to avoid overloading the model's context window over time.
  • Slack: for messaging customers and receiving inquiries.
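
To make that concrete, here is a minimal sketch of what such a tool surface might look like using the MCP Python SDK. The tool names and stub bodies are my own invention, not Anthropic's actual code:

# A hypothetical MCP server exposing Project Vend-style tools.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("vend-tools")

@mcp.tool()
def web_search(query: str) -> str:
    """Research products and wholesale prices."""
    return "stub: search results for " + query

@mcp.tool()
def send_email(to: str, subject: str, body: str) -> str:
    """Contact wholesalers or Andon Labs (in the experiment, all mail went to researchers)."""
    return f"stub: sent '{subject}' to {to}"

@mcp.tool()
def record_transaction(item: str, qty: int, unit_price: float) -> str:
    """Append to a persistent ledger so state survives beyond the model's context window."""
    return f"stub: recorded {qty} x {item} @ ${unit_price:.2f}"

if __name__ == "__main__":
    mcp.run()

The bookkeeping tool is the interesting one: by pushing state out to an external ledger, the agent doesn't have to carry its entire transaction history around in its context window.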

Short story: the LLM lost money, hallucinated, gave bad information, and behaved as badly as a toddler given a bank account and car keys. Fortunately, the experiment was simulated so no actual businesses were harmed.

My favorite part was how vivid and specific the hallucinations were:

Anthropic

Unfortunately, the experiment did not connect directly to a physical refrigerator, since this was not an IoT-scale test (maybe somebody should do that).

It’s admirable that Anthropic was willing to go public with such unfavorable results. But it’s crucial to set proper expectations, and I suspect this was Anthropic trying to tamp down concerns that AI was coming for everyone's jobs.

AGI

Axios

According to Axios:

AGI — defined as “a system that’s capable of exhibiting all the cognitive capabilities humans can” — is “probably a handful of years away,” Google DeepMind CEO Demis Hassabis said last month on the Big Technology podcast.

“Over the next two or three years, I am relatively confident that we are indeed going to see models that show up in the workplace, that consumers use — that are, yes, assistants to humans but that gradually get better than us at almost everything,” Anthropic CEO Dario Amodei said in a Wall Street Journal interview at Davos.

Around the same time, OpenAI CEO Sam Altman wrote, “We are now confident we know how to build AGI as we have traditionally understood it.”

According to a New York Times report, Altman also told then-President-elect Trump, shortly before his inauguration, that the industry would deliver AGI sometime during Trump's new administration — i.e., in less than four years.

Artificial General Intelligence (AGI – not to be confused with Adjusted Gross Income) is the promised goal of all this investment of time, effort, and money.

McKinsey defines it as:

Artificial general intelligence (AGI) is a theoretical AI system with capabilities that rival those of a human.

However, like any competent consultancy, they hedge their bets:

Many researchers believe we are still decades, if not centuries, away from achieving AGI.

This is the same McKinsey, by the way, that in 2018 estimated that:

AI could potentially deliver additional economic output of around $13 trillion by 2030, boosting global GDP by about 1.2 percent a year.

These parallel the estimates made the last time AI dominated the news. Ray Kurzweil made some bold predictions in 2008:

Computer power will match the intelligence of human beings within the next 20 years because of the accelerating speed at which technology is advancing, according to a leading scientific “futurologist.”

One problem is that the exact definition of machine intelligence is a little slippery and prone to goal-post migration.


Alan Turing, in 1950, famously defined the Imitation Game:

I PROPOSE to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think’. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words ‘machine’ and ‘think’ are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, ‘Can machines think?’ is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.

Even Alan Turing was prone to some of that slippery redefinition: the question becomes not whether machines can think, but whether machines are capable of fooling a human.

Fast-forward 75 years and the definition is still fluid.

The problem is that even philosophers cannot agree on precise definitions of intelligence and knowledge. An entire field, Epistemology, is dedicated to their study.

In A History of First Step Fallacies, noted AI skeptic Hubert Dreyfus (author of What Computers Can’t Do) pointed to the First Step Fallacy at the heart of the rampant AI enthusiasm of the 1960s:

First step thinking has the idea of a successful last step built in. Limited early success, however, is not a valid basis for predicting the ultimate success of one's project. Climbing a hill should not give one any assurance that if he keeps going he will reach the sky. Perhaps one may have overlooked some serious problem lying ahead. There is, in fact, no reason to think that we are making progress towards AI or, indeed, that AI is even possible, in which case claiming incremental progress towards it would make no sense.

It was the same enthusiasm that led to the First and Second AI Winters. The promise of AGI is not new. We’ve been through this before.

Comic art by Charles Schulz, Peanuts (Copyright Universal UClick)

Despite its CEO's relative confidence at Davos, speaking to media and financiers, it is good to see Anthropic offering a dose of reality to tamp down over-enthusiasm for its technology.

This doesn’t mean we should discount the advances in the current wave of AI, but we should be careful about the promises being made. After all, the tech industry has a long history of…

Vaporware

In 1992, Apple spun off an internally developed operating system, code-named Pink, into a joint venture with IBM, later joined by Hewlett-Packard. They called it Taligent. It was to be the first truly universal, cross-platform, object-oriented operating system built from the ground up, exposing its interfaces through the C++ programming language.

Taligent’s role in the world is to create an environment in which all the applications we buy individually are built directly into the operating system. Because the apps are programmable, you can put together your own custom-made suites. Taligent could mean the end of all applications as we know them. … The suites are here to battle Taligent.

— John C. Dvorak, PC World Columnist

The ultimate goal was to create the foundation for a set of interlocking, cooperative applications and extensions that would work across every hardware platform.

The drama behind TalOS is worth a trip down memory lane, but its legacy was to be ranked high in the annals of Vaporware.

ℹ️ Disclosure

I worked at Taligent in the mid-90s and left when it got absorbed into IBM.

My favorite memory of the period was being called in by a manager, who pointed at a document I had drafted listing the flaws in the technology strategy and informed me it was a career-limiting memo.

Happy to report it turned out to be true.

Vaporware refers to Software or Hardware that was announced but never shipped, missed its release date, or, to stretch the definition, never met its stated goals.

Deep in the mind of everyone associated with the Tech industry is a primordial fear of being responsible for Vaporware.

Being responsible for Vaporware is a sort of career-limiting move, if you will. But that fear doesn't account for the possibility that a problem may be really, really hard, or even (blasphemy!) unsolvable.

Aspirations

Siri

At WWDC 2024, Apple announced that it would be re-branding all its AI/ML efforts under the banner Apple Intelligence and the moniker “AI for the rest of us.”

This umbrella term would cover everything from smart writing tools, message summarizers, and fanciful image generators to a newly branded version of Siri. Developer sessions promised App Intents as the way to integrate app actions with Siri and Apple Intelligence.

Announced Apple Intelligence features were delayed so long that they began to pick up the whiff of dreaded Vaporware. Long-time Apple blogger John Gruber declared in Something is Rotten in the State of Cupertino (March 2025):

In the two decades I’ve been in this racket, I’ve never been angrier at myself for missing a story than I am about Apple’s announcement on Friday that the "more personalized Siri" features of Apple Intelligence, scheduled to appear between now and WWDC, would be delayed until “the coming year.”

I should have my head examined.

Siri 2.0 (as of this writing in mid-2025) has yet to ship. There are reports that it may be delayed to 2026 or 2027.

Alexa+

In September 2023, Amazon announced Alexa+, “a more intuitive, personalized experience” for its new smart home devices. The launch was pushed to 2024, and the product was finally revealed in February 2025. It is now in pre-public release, currently available on select Echo Show devices (with a promise to roll out to “almost any” Alexa device shipped).

Reviews have been slow to roll out, and they haven’t been entirely flattering:

TechRadar

Amazon has already walked back its original assertion that Alexa+ would require a monthly subscription; it has since announced the service will be free to Amazon Prime members. This is the same strategy that reportedly led to Amazon losing $10B a year on its device efforts.

Meeting Standards

Alexa+ launched with missing features, even though they had been demonstrated in its most recent presentations, reportedly because:

[T]hey “don’t yet meet Amazon’s standards for public release.”

Apple also stated a similar reason for Siri’s delay:

In interviews following the Apple Worldwide Developers Conference (WWDC), Apple acknowledged that the initial version of the upgraded Siri failed to meet internal quality standards and is now being entirely rebuilt on a more advanced architecture.

These are multi-billion-dollar companies with vast R&D budgets, relying on other multi-billion-dollar companies for state-of-the-art LLM technology. Yet they are still unable to make these AI assistants work.

This could be a matter of fine-tuning and tweaks, or it could be another Taligent.

One problem they face is the weight of legacy expectations. Between the original Siri and Alexa, these companies have spent over a decade shipping products used by millions.

But now, in the realm of open-ended AI, the expectation is that it will Just Work.

However, revamping the engine behind these devices is not a simple task. The LLM powering Alexa+ is reportedly Anthropic’s Claude. Apple has released its own custom Foundation Models, but currently relies on OpenAI’s ChatGPT as an escape hatch for answers Siri cannot provide.

And now come reports that Apple may be throwing in the towel and going with Anthropic or OpenAI as the Siri back-end:

Bloomberg
Engadget

Doing so may allow Apple to make use of the latest and greatest Foundation Models. But it would force Apple away from on-device processing and potentially give cloud-based models access to private user data, unless they:

  • Run the third-party models on Apple's own Private Cloud Compute (reportedly one avenue being explored).
  • Figure out a way to handle private queries on-device and non-private queries in the cloud (a sketch of this option follows the list).
  • Anonymize or hash the queries so they can run remotely without compromising privacy.
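
Here is a minimal sketch of that second option. The function names are hypothetical, and the deliberately crude regex classifier stands in for what would really be a vetted on-device model; none of this is Apple's actual design:

# Hypothetical router: queries that look private stay on-device;
# everything else can go to a more capable cloud model.
import re

PRIVATE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                      # email addresses
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),                   # phone-number shapes
    re.compile(r"\bmy (doctor|bank|password|address)\b", re.I),   # personal references
]

def looks_private(query: str) -> bool:
    return any(p.search(query) for p in PRIVATE_PATTERNS)

def run_on_device(query: str) -> str:
    return "stub: answered locally"        # small local model; data never leaves the device

def run_in_cloud(query: str) -> str:
    return "stub: answered in the cloud"   # frontier model, with no personal context attached

def route(query: str) -> str:
    return run_on_device(query) if looks_private(query) else run_in_cloud(query)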

As we’ve seen in this series, there is a range of architectures and workflows that can be deployed. The question is which one best matches the desired outcome.

Big Picture: What’s the Problem?

We’ve covered many of these in other sections, but here is a laundry list of issues yet to be tackled:

  • Just Do It: Any consumer-facing AI assistant should seamlessly route user requests to services without putting the burden of choosing which one to use on the user.

  • Taxonomies: Forcing extensions into pre-defined categories is a Bad Idea. Anything that doesn’t fit the grid is, by definition, excluded.

  • We Developers are Lazy: Write once, use anywhere. Don’t force us to create a lot of cruft. Apple’s App Intents forces developers to embed what the app can do inside the code. Make it a JSON or plain-text file that sits on the web and can be easily updated (see the sketch after this list).

  • Late-Late Binding: The later the binding, the more dynamic and adaptive the system.

  • On-Device First: The cloud is out there. But it will cost everyone a crap-ton of money (and latency) to run services on the cloud. Try to run them on-device as much as possible (more on this later).

  • Privacy: You really should decide whether privacy is a core principle or an inconvenience, best swatted away with End-User Agreements. Can’t go halvsies.

  • Avoid Fragmentation: Developers want to get exposure to as many customers as possible. Creating custom solutions instead of working with standards (like MCP or A2A) will fragment the market before it’s had a chance to start.

  • Payment: More on this soon. Allow customers to seamlessly pay for services, and developers to get paid. Apple has already done this with its Apple Arcade model: users pay one price and play as many games as they want. Same with Microsoft’s Game Pass. This isn’t new.

  • Accessibility: Make sure customers with varied abilities are not left out of this universe. The more flexible the interface, the more inclusive and happy users will be. The elderly, with diminishing abilities, could especially benefit (I’m a firm believer in this: (1) and (2)).

  • Plan for Change: Technology moves fast. The pace is accelerating. Any infrastructure or design that does not allow for evolution is doomed to instant obsolescence.

  • Standardization: The AI companies of today will be gone in the blink of an eye once their VC money runs out and they do not come up with a self-sustaining business model. Standards help protect users against these inevitable disruptions. Bricking devices or apps will erode user trust. As the saying goes, Trust is like a Bridge: Bridges are Hard to Build and Easy to Destroy.

  • Security: We are heading into an era where LLMs will have open access to a vast array of personal and private data. Not including security professionals to vet every aspect of design and standardization is unconscionable. OWASP Top 10 Security Risks is a good starting point.
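
And here is the sketch promised in the We Developers are Lazy item: a capability manifest that lives on the web and is fetched at runtime, which is also what Late-Late Binding looks like in practice. The URL, schema, and field names are all invented for illustration; this is not an existing standard:

# Hypothetical: an assistant discovers an app's actions from a JSON manifest
# hosted on the web, so capabilities can change without shipping new app code.
import json
from urllib.request import urlopen

MANIFEST_URL = "https://example.com/.well-known/capabilities.json"  # invented location

def load_capabilities(url: str = MANIFEST_URL) -> dict:
    """Fetch and parse the capability manifest at call time, not compile time."""
    with urlopen(url) as resp:
        return json.load(resp)

# An invented manifest, just to show the shape such a file might take:
# {
#   "app": "CoffeeShop",
#   "actions": [
#     {"name": "order_drink",
#      "params": {"size": "string", "drink": "string"},
#      "endpoint": "https://example.com/api/order"}
#   ]
# }

Because the assistant reads the manifest at call time, a developer can add or retire actions by editing a file on a server, with no app update or re-review required.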

Next: Money.


Title photo by www.testen.no on Unsplash


© 2025, Ramin Firoozye. All rights reserved.