Part IX: Payment

Money, it’s a crime
Share it fairly, but don’t take a slice of my pie
Money, so they say
Is the root of all evil today
One thing that doesn’t seem to come up as much is the…
Cost
Building and running AI Foundation Models comes at a significant cost:

This will likely increase as publishers of data, media, and content begin to realize the value of their content and start extracting licensing fees for training access.

Step By Step
There are several steps to creating and operating them, each requiring its own complex cost spreadsheet to model.
These include:
- Data Acquisition
- Training
- Distillation/Fine-tuning
- Inference Deployment and Scaling
- Networking
- Storage
- Power and Cooling
- Customer Support
- Re-training
But wait, now there’s also…
- Tools Invocation
- Agentic Workflows
Let’s look at some of these.
Data Acquisition
There are a large number of public datasets to train a model on. Once you’ve used all those, model builders turn to classic web-spidering techniques pioneered by early search engines to acquire more content. At some point, you may veer into proprietary content, leading to wholesale blocking of bots and inevitable accusations of plagiarism, or worse, lawsuits.
This may lead to managing bot access with tools from Cloudflare or TollBit, open-source tarpits, or my own modest contribution, RoboNope.
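A well-behaved crawler’s first stop is robots.txt; tools like the ones above exist precisely because many crawlers don’t stop there. Here is a minimal sketch of how a polite spider would check its own access in Python, using the standard library. The domain and path are placeholders; GPTBot and CCBot are the published user agents of OpenAI’s and Common Crawl’s crawlers. This is a generic check, not how RoboNope or any of the commercial tools work internally.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether each crawler user agent may fetch a given path.
for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = rp.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent:10s} allowed: {allowed}")
```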

If curating datasets for specific domains, the task is even more arduous. Medical, insurance, and other regulated-domain data are difficult to obtain. There are also privacy and regional rules to navigate:

Whatever the case, despite the mounting piles of open-domain text and datasets subsidized by “unrestricted gifts” from Tech companies, the cost of acquiring data is certain to rise. Just ask Meta. Using synthetically generated data may not work so well.
Training

Other estimates suggest it is heading even higher:
The cost of training frontier AI models has grown by a factor of 2 to 3x per year for the past eight years, suggesting that the largest models will cost over a billion dollars by 2027.
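To see how “2 to 3x per year” reaches a billion dollars, here is a toy compounding sketch. Both the base cost and the growth factor are illustrative assumptions, not reported figures:

```python
# Toy compounding model: cost after n years = base_cost * growth ** n.
# Both numbers below are assumptions chosen purely for illustration.
base_cost = 100e6  # hypothetical frontier training run in the base year (USD)
growth = 2.5       # midpoint of the "2 to 3x per year" estimate above

for year in range(1, 5):
    cost = base_cost * growth ** year
    print(f"base year +{year}: ~${cost / 1e9:.2f}B")
# At 2.5x per year, a ~$100M run crosses the $1B mark within three years.
```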
These are not things you can run on your desktop. They require a large bank of hardware. NVidia’s high-end hardware is not for the faint of heart – reportedly as much as USD 300K per box.

You can rent time by the hour, or save money by downgrading to the older generation H100s (also reportedly used by DeepSeek in training their model).
And if you think you can get away with just one box, let’s listen to what Mark Zuckerberg has to say:
“We are building an absolutely massive amount of infrastructure to support this by the end of this year. We will have around 350,000 Nvidia H100 or around 600,000 H100 equivalents of compute if you include other GPUs,” Zuckerberg said.
Tweaking
Once a model has been trained, there’s the matter of fine-tuning it for specific tasks. This often involves semi-supervised training, which can be labor-intensive and costly.

You can save some time and effort by using a technique called distillation, where you transfer the knowledge from a general-purpose model to a smaller, compressed model. Reportedly, DeepSeek’s R1 model was trained cost-effectively by distilling existing Foundation Models, like Meta’s Llama.
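As a rough illustration of the idea (not a description of how DeepSeek or anyone else actually trained their models), here is a minimal soft-target distillation loss in PyTorch: the small student model is trained to match the softened output distribution of the large teacher.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-target distillation: the student mimics the teacher's
    softened output distribution."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 examples over a 10-token vocabulary
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```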
Deployment
When the model is ready, you enter the realm of inference: asking the model questions and getting answers back. At scale, this often requires multiple large-scale data centers, located as close to customers as possible to reduce network latency.

Each of these data centers would require thousands of NVidia chips, which explains:

The cost of purchasing and operating these GPUs has pushed cloud and AI companies into creating their own, custom-designed chipsets (alphabetically):
Unlike training costs, the cost of inference is not bounded. It scales up with the number of users, the workload, and how flexible the infrastructure is when it comes to demand-based scaling.

Even more if your users are polite and were raised to say “please” and “thank you”.
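To make the scaling point concrete, here is a back-of-the-envelope sketch. Every number in it is a made-up assumption for illustration, not a vendor figure:

```python
# Why inference cost is unbounded: it grows with users and usage.
# All figures below are illustrative assumptions.
users = 5_000_000                # daily active users (hypothetical)
requests_per_user = 10           # chats per user per day (hypothetical)
tokens_per_request = 1_500       # prompt + completion tokens (hypothetical)
cost_per_million_tokens = 2.00   # USD, blended serving cost (hypothetical)

daily_tokens = users * requests_per_user * tokens_per_request
daily_cost = daily_tokens / 1e6 * cost_per_million_tokens
print(f"~{daily_tokens / 1e9:.0f}B tokens/day, "
      f"~${daily_cost:,.0f}/day, "
      f"~${daily_cost * 365 / 1e6:.0f}M/year")
```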

One way to reduce these costs is to optimize each step of inference, starting with splitting user input into tokens (aka tokenization).
This is an area where replacing one module, for example OpenAI’s BPE TikToken Tokenizer with, say, TokenDagger, could yield significant savings:
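For a concrete look at what the tokenizing step actually does (and what you are ultimately billed for), here is a minimal example using OpenAI’s open-source tiktoken package. It only shows the step itself; it is not a benchmark of one tokenizer against another:

```python
import tiktoken  # OpenAI's open-source BPE tokenizer: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "How much does it cost to answer this question politely? Thank you!"
tokens = enc.encode(text)

print(len(tokens), "tokens")        # the unit you are billed/budgeted in
print(tokens[:8], "...")            # the first few integer token ids
print(enc.decode(tokens) == text)   # decoding round-trips to the original
```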
There are [other](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/) techniques, but these may require significant changes to underlying architectures. Hardware vendors like NVidia would just as soon sell larger, more performant hardware.

This is an area where some of the cost savings can be passed down to consumers.
Other costs
There’s more to the cost of running AI:
- High-speed Networking
- Storage
- Power
- Cooling, and
- Staffing
McKinsey has eye-watering estimates.

But Wait…

On top of all this, we’re now adding the cost of:
- Voice-to-Text
- Text-to-Speech
- Remote MCP calls
- Agentic Workflows
Those costs sure add up. Even something as small as Voice Activity Detection can yield significant savings at scale.
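As a sketch of why VAD saves money: if silent audio frames never leave the device, they never incur network, transcription, or inference cost. The energy threshold below is arbitrary and real products use trained VAD models, but the cost logic is the same:

```python
import numpy as np

def is_speech(frame: np.ndarray, energy_threshold: float = 0.01) -> bool:
    """Crude energy-based VAD: treat a frame as speech only if its RMS
    energy exceeds a threshold (real systems use trained VAD models)."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > energy_threshold

# Toy usage: 30 ms frames of 16 kHz audio (480 samples each)
silence = np.zeros(480)
speech = 0.1 * np.random.randn(480)
print(is_speech(silence), is_speech(speech))  # False True
# Frames classified as silence are dropped on-device and cost nothing upstream.
```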
What if you could skip the whole thing?
On-Device: Pro
The operating costs of running an agent on-device are incomparably less than running inference in the cloud.
But that only works if there is enough processing power on the device. If a significant number of users are not in possession of devices that can do this, you are stuck with a tough product decision:
- Focus on only new users with the latest devices. Cut off owners of older devices.
- Forget on-device. Just do it all on the cloud.
- Create a hybrid service that works for both scenarios.
- Amazon’s Alexa+ service has gone with #1. But strangely, it’s not because the newer devices have local processing power; based on the specs, they don’t.
- Apple’s Siri 1.0 is clearly #2.
- Siri 2.0 is going with #3, which is probably why it’s delayed.
All the other assistants are waiting to see which version makes more sense. From a financial point of view, #1 makes the most sense. Just run it all on-device. This has so many other advantages:
- User privacy (data doesn’t leave the device). This also helps with GDPR/CCPA and other data sovereignty issues.
- Lower response latency. Come on, who doesn’t love that?
- Less need for those expensive data centers for inference. They’ll still be needed for all the heavy training, but that cost doesn’t scale without bound the way inference does.
- For the many users who are on metered plans, less network traffic.
- Force users to upgrade to the latest, greatest flagship device.
From a device manufacturer’s point of view, that last point is a welcome bonus.
Why isn’t everyone running on-device?
On-Device: Con
To run inference on-device, the user needs to have enough:
- Flash storage: for the LLM. Even small ones run multi-gigabytes.
- RAM: 8 or 16GB isn’t enough any more.
- Shared processor memory (GPU/NPU/TPU).
- Power: running multiple threads of execution could quickly drain the battery and generate heat.
For a home AI Assistant, power is less of a concern, unless it starts racking up a noticeable amount of usage and shows up on the home electricity bill.
Regardless of device, the increased Bill of Materials (BOM) cost means having to increase the price to the consumer and, perhaps, moving up to higher-end electronic component suppliers, QA, manufacturers, certifications, etc.
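To tie the Pro and Con sections together, here is a hedged sketch of how a hybrid assistant (option #3 above) might decide between on-device and cloud inference. The thresholds and the Device fields are hypothetical, not any vendor’s API:

```python
from dataclasses import dataclass

@dataclass
class Device:
    free_storage_gb: float  # room for multi-gigabyte model weights
    ram_gb: float           # weights + KV cache must fit in memory
    has_npu: bool           # shared GPU/NPU/TPU memory for inference
    battery_pct: float      # avoid draining an already-low battery

def choose_backend(device: Device) -> str:
    """Route to on-device inference only when the hardware can handle it;
    otherwise fall back to cloud inference (hypothetical policy)."""
    can_run_locally = (
        device.free_storage_gb >= 4
        and device.ram_gb >= 12
        and device.has_npu
        and device.battery_pct >= 20
    )
    return "on-device" if can_run_locally else "cloud"

print(choose_backend(Device(64, 16, True, 80)))  # on-device
print(choose_backend(Device(8, 8, False, 50)))   # cloud
```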
What about Google?
From a functional point of view, Google and Apple devices are similar. Apple has the new Foundation Models framework. Google has had the MediaPipe LLM Inference API. For the sake of brevity, I’ve focused on Apple, but to stay in the good graces of my Google friends, I am duty-bound to point out that Google’s got all the same features (and problems).
Google actually goes one step further, by generously offering MediaPipe running via WASM inside web browsers, a fact that I am taking advantage of in the Project Mango gaming engine.
However, Google will be torn between the Scylla and Charybdis of whether to support on-device inference (and save costs), or collect user data to train its models.

Based on recent reports, it looks like they have chosen to do both.
Payment Models
How would AI Assistants generate revenue?
Harvard Business School counted up to four business models. Pretty sure we can top that.
Software Assistants
These are software-only assistants that run either inside a custom mobile app or in a browser. We can consider the cost of the hardware (a phone, tablet, or computer) a sunk cost (aka CAPEX).
Method | Description | Reference |
---|---|---|
Subscription | monthly / annual | Comparing LLM Subscription Plans |
Tiered Subscription | Pro vs. Regular vs. Team vs. Enterprise | Compare AI models pricing side by side |
Advertising | Embedding ads inside chats | Google Places Ads inside AI Chatbots With AI Startups |
Affiliate Links | Return results with links that generate affiliate fees | Transforming Search and Advertising with Generative AI |
Integrated / API | Allow access to LLMs by bringing your own keys | IBM: LLM APIs: Tips for bridging the gap |
Embedded | Access LLM inside app without providing an API key | Use ChatGPT with Apple Intelligence on iPhone |
Direct | Pay one-time for an app | Lifetime Deals for AI Apps! |
Pay-as-you-go / Transactional | Per-token; see the sketch after this table | Breaking Down the Cost of Large Language Models. Also: LLM API Pricing Calculator |
Tiered Pay-as-you-go | Faster response / more tokens at higher tiers | Cursor: Max Mode |
E-Commerce | Direct recommendation and shopping | A new solution to monetize AI-powered chat experiences |
Subsidized | Other product pays | Try All the Great AI Models and Tools on HIX AI |
Hybrid | Some features one model and premium features at another | How to Monetize AI |
Data Monetization | Re-selling accumulated data to others | Bain: Unlocking Hidden Value: A New Approach to Data Monetization with AI |
MCP Access | Any of the above, inside an extension | |
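As a worked comparison of the Subscription and Pay-as-you-go rows, here is a small break-even sketch. The subscription price and per-token rate are illustrative placeholders, not any vendor’s actual pricing:

```python
# Illustrative comparison of a flat subscription vs. per-token billing.
# Both prices are hypothetical placeholders.
subscription_per_month = 20.00    # USD, flat monthly plan (assumed)
price_per_million_tokens = 3.00   # USD, blended API rate (assumed)

breakeven = subscription_per_month / price_per_million_tokens * 1e6
print(f"Break-even at ~{breakeven / 1e6:.1f}M tokens/month")

for monthly_tokens in (0.5e6, 2e6, 10e6):
    payg = monthly_tokens / 1e6 * price_per_million_tokens
    cheaper = "pay-as-you-go" if payg < subscription_per_month else "subscription"
    print(f"{monthly_tokens / 1e6:>4.1f}M tokens: API ~${payg:.2f} -> cheaper: {cheaper}")
```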
Hardware Assistants
Hardware assistants offer extended functionality beyond the software/app model. They do not require a separate data plan. They can be inexpensive, offer a shared user-experience, and significantly, be
They may support any of the above models, but also a few others specific to devices:
Method | Description | Reference |
---|---|---|
One-time: Basic | Simple speaker | Wired: Best Budget Speaker |
One-time: Enhanced | Speaker with features like screen, camera, etc. | Wired: Best Smart Display Speaker |
One-time: Super-Enhanced | All on-device | On-Device AI: Building Smarter, Faster, And Private Applications |
Multi-room | Hotel Assistants | Aiello Voice Assistant for Hospitality (AVA) |
Repeater / Mesh | Assistants distributed around home | Thread SmartHome |
Device + Subscription | Pay for devices, then monthly/annual subscription | Josh |
SaaS | Pay for device, get service for free | Siri |
Embedded | Device is inside another device, like a vehicle | Stellantis to launch AI-powered in-car assistant |
Hardware Subsidized | Cost of hardware covers service cost | Rabbit R1 - Unlimited AI |
Service Subsidized | Cost of other service covers LLM | Meet the new Alexa - Free with Prime |
Insurance Subsidized | Paid by insurance company | Implementing large language models in healthcare while balancing control, collaboration, costs and security |
Other Revenue Models
There are business models that have yet to be explored. These include:
Method | Description | Reference |
---|---|---|
Dynamic Pricing | Adjusted based on demand | Data-driven dynamic pricing |
Value-based | Benefit provided to the user | Pricing approaches for generative AI applications |
Outcome-based | Based on outcomes from the solution | AI Is Driving A Shift Towards Outcome-Based Pricing |
Generative | Relative to content produced | What is Generative AI Pricing? |
Consulting | Professional Services | Top 10: AI Consulting Companies |
Infrastructure Sub-Leasing | Usage time on platform | Amazon Bedrock |
As consumers, we may want the lowest-cost option. But if a company cannot sustain itself with revenue, it will inevitably go out of business. Think about whether a company’s chosen business model is likely to lead to its continued existence.

By way of example: the Rabbit R1 ($199 for the device + Unlimited AI) sets off so many red flags for me. I wish them good luck, but even if their per-device server/cloud cost is USD 5/month (USD 60/year), after 3 years (USD 180) we are almost at the breakeven point for the hardware (USD 199 - 180 = 19). If the cost of developing and manufacturing the hardware exceeds USD 19, the company starts to inch into the red. And this assumes their development, manufacturing, and marketing costs haven’t gone up at all in the intervening 3 years.
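Here is the same arithmetic as a tiny sketch, so you can plug in your own guesses. The USD 5/month figure is my hypothetical, not anything Rabbit has published:

```python
# The arithmetic from the paragraph above, as a tiny sketch.
# The monthly cloud cost is a hypothetical assumption, not a published figure.
device_price = 199.00        # one-time revenue per unit (USD)
cloud_cost_per_month = 5.00  # assumed server/cloud cost per device (USD)
years = 3

cloud_cost_total = cloud_cost_per_month * 12 * years   # 180
hardware_budget = device_price - cloud_cost_total      # 19
print(f"After {years} years: ${cloud_cost_total:.0f} spent on cloud, "
      f"${hardware_budget:.0f} left to cover hardware, development, and marketing")
```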
The purveyors of many LLMs have realized their starting
These usually cost
No matter what, one thing is true…
Title Photo by Dmitrii E. on Unsplash