Part IX: Payment

Money, it’s a crime
Share it fairly, but don’t take a slice of my pie
Money, so they say
Is the root of all evil today
One thing that doesn’t seem to come up as much is the…
Cost
Building and running AI Foundation Models comes at a significant cost:

This will likely increase as publishers of data, media, and content begin to realize the value of their content and start extracting licensing fees for training access.

Step By Step
There are several steps to creating and operating them, each requiring its own complex cost spreadsheet to model.
These include:
- Data Acquisition
- Training
- Distillation/Fine-tuning
- Inference Deployment and Scaling
- Networking
- Storage
- Power and Cooling
- Customer Support
- Re-training
But wait, now there’s also…
- Tools Invocation
- Agentic Workflows
Let’s look at some of these.
Data Acquisition
There are a large number of public datasets to train a model on. Once you’ve used all those, model builders turn to classic web-spidering techniques pioneered by early search engines to acquire more content. At some point, you may veer into proprietary content, leading to wholesale blocking of bots and inevitable accusations of plagiarism, or worse, lawsuits.
This may lead to managing bot access with tools from Cloudflare or TollBit, open-source tarpits, or my own modest contribution, RoboNope.
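A well-behaved crawler’s first stop is robots.txt; tools like the ones above exist precisely because many crawlers don’t stop there. Here is a minimal sketch of how a polite spider would check its own access in Python, using the standard library. The domain and path are placeholders; GPTBot and CCBot are the published user agents of OpenAI’s and Common Crawl’s crawlers. This is a generic check, not how RoboNope or any of the commercial tools work internally.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether each crawler user agent may fetch a given path.
for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = rp.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent:10s} allowed: {allowed}")
```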

If curating datasets for specific domains, the task is even more arduous. Medical, insurance, and other regulated-domain data are difficult to obtain. There are also privacy and regional rules to navigate:

Whatever the case, despite the mounting piles of open-domain text and datasets subsidized by “unrestricted gifts” from Tech companies, the cost of acquiring data is certain to rise. Just ask Meta. Using synthetically generated data may not work so well.
Training

Other estimates suggest it is heading even higher:
The cost of training frontier AI models has grown by a factor of 2 to 3x per year for the past eight years, suggesting that the largest models will cost over a billion dollars by 2027.
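To see how “2 to 3x per year” reaches a billion dollars, here is a toy compounding sketch. Both the base cost and the growth factor are illustrative assumptions, not reported figures:

```python
# Toy compounding model: cost after n years = base_cost * growth ** n.
# Both numbers below are assumptions chosen purely for illustration.
base_cost = 100e6  # hypothetical frontier training run in the base year (USD)
growth = 2.5       # midpoint of the "2 to 3x per year" estimate above

for year in range(1, 5):
    cost = base_cost * growth ** year
    print(f"base year +{year}: ~${cost / 1e9:.2f}B")
# At 2.5x per year, a ~$100M run crosses the $1B mark within three years.
```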
These are not things you can run on your desktop. They require a large bank of hardware. NVidia’s high-end hardware is not for the faint of heart – reportedly as much as USD 300K per box.

You can rent time by the hour, or save money by downgrading to the older generation H100s (also reportedly used by DeepSeek in training their model).
And if you think you can get away with just one box, let’s listen to what Mark Zuckerberg has to say:
“We are building an absolutely massive amount of infrastructure to support this by the end of this year. We will have around 350,000 Nvidia H100 or around 600,000 H100 equivalents of compute if you include other GPUs,” Zuckerberg said.
Tweaking
Once a model has been trained, there’s the matter of fine-tuning it for specific tasks. This often involves semi-supervised training, which can be labor-intensive and costly.

You can save some time and effort by using a technique called distillation, where you transfer the knowledge from a general-purpose model to a smaller, compressed model. Reportedly, DeepSeek’s R1 model was trained cost-effectively by distilling existing Foundation Models, like Meta’s Llama.
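As a rough illustration of the idea (not a description of how DeepSeek or anyone else actually trained their models), here is a minimal soft-target distillation loss in PyTorch: the small student model is trained to match the softened output distribution of the large teacher.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-target distillation: the student mimics the teacher's
    softened output distribution."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 examples over a 10-token vocabulary
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```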
Deployment
When the model is ready, you enter the realm of inference: asking the model questions and getting answers back. At scale, this often requires multiple large-scale data centers, located as close to customers as possible to reduce network latency.

Each of these data centers would require thousands of NVidia chips, which explains:

The cost of purchasing and operating these GPUs has pushed cloud and AI companies into creating their own, custom-designed chipsets (alphabetically):
Unlike training costs, the cost of inference is not bounded. It scales up with the number of users, the workload, and how flexible the infrastructure is when it comes to demand-based scaling.

Even more if your users are polite and were raised to say “please” and “thank you”.
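To make the scaling point concrete, here is a back-of-the-envelope sketch. Every number in it is a made-up assumption for illustration, not a vendor figure:

```python
# Why inference cost is unbounded: it grows with users and usage.
# All figures below are illustrative assumptions.
users = 5_000_000                # daily active users (hypothetical)
requests_per_user = 10           # chats per user per day (hypothetical)
tokens_per_request = 1_500       # prompt + completion tokens (hypothetical)
cost_per_million_tokens = 2.00   # USD, blended serving cost (hypothetical)

daily_tokens = users * requests_per_user * tokens_per_request
daily_cost = daily_tokens / 1e6 * cost_per_million_tokens
print(f"~{daily_tokens / 1e9:.0f}B tokens/day, "
      f"~${daily_cost:,.0f}/day, "
      f"~${daily_cost * 365 / 1e6:.0f}M/year")
```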

One way to reduce these costs is to optimize each step of inference, starting with splitting user input into tokens (aka tokenization).
This is an area where replacing one module, for example OpenAI’s BPE TikToken Tokenizer with, say, TokenDagger, could yield significant savings:
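For a concrete look at what the tokenizing step actually does (and what you are ultimately billed for), here is a minimal example using OpenAI’s open-source tiktoken package. It only shows the step itself; it is not a benchmark of one tokenizer against another:

```python
import tiktoken  # OpenAI's open-source BPE tokenizer: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "How much does it cost to answer this question politely? Thank you!"
tokens = enc.encode(text)

print(len(tokens), "tokens")        # the unit you are billed/budgeted in
print(tokens[:8], "...")            # the first few integer token ids
print(enc.decode(tokens) == text)   # decoding round-trips to the original
```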
There are [other](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/) techniques, but these may require significant changes to underlying architectures. Hardware vendors like NVidia would just as soon sell larger, more performant hardware.

This is an area where some of the cost savings can be passed down to consumers.
Other costs
There’s more to the cost of running AI:
- High-speed Networking
- Storage
- Power
- Cooling, and
- Staffing
McKinsey has eye-watering estimates.

But Wait…

On top of all this, we’re now adding the cost of:
- Voice-to-Text
- Text-to-Speech
- Remote MCP calls
- Agentic Workflows
Those costs sure add up. Even something as small as Voice Activity Detection can yield significant savings at scale.
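As a sketch of why VAD saves money: if silent audio frames never leave the device, they never incur network, transcription, or inference cost. The energy threshold below is arbitrary and real products use trained VAD models, but the cost logic is the same:

```python
import numpy as np

def is_speech(frame: np.ndarray, energy_threshold: float = 0.01) -> bool:
    """Crude energy-based VAD: treat a frame as speech only if its RMS
    energy exceeds a threshold (real systems use trained VAD models)."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > energy_threshold

# Toy usage: 30 ms frames of 16 kHz audio (480 samples each)
silence = np.zeros(480)
speech = 0.1 * np.random.randn(480)
print(is_speech(silence), is_speech(speech))  # False True
# Frames classified as silence are dropped on-device and cost nothing upstream.
```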
What if you could skip the whole thing?
On-Device: Pro
The operating costs of running an agent on-device are incomparably less than running inference in the cloud.
But that only works if there is enough processing power on the device. If a significant number of users are not in possession of devices that can do this, you are stuck with a tough product decision:
- Focus on only new users with the latest devices. Cut off owners of older devices.
- Forget on-device. Just do it all on the cloud.
- Create a hybrid service that works for both scenarios.
- Amazon’s Alexa+ service has gone with #1. But strangely, it’s not because the newer devices have local processing power; based on the specs, they don’t.
- Apple’s Siri 1.0 is clearly #2.
- Siri 2.0 is going with #3, which is probably why it’s delayed.
All the other assistants are waiting to see which version makes more sense. From a financial point of view, #1 makes the most sense. Just run it all on-device. This has so many other advantages:
- User privacy (data doesn’t leave the device). This also helps with GDPR/CCPA and other data sovereignty issues.
- Lower response latency. Come on, who doesn’t love that?
- Less need for those expensive data centers for inference. They’ll still be needed for all the heavy training, but that cost doesn’t scale without bound the way inference does.
- For the many users who are on metered plans, less network traffic.
- Force users to upgrade to the latest, greatest flagship device.
From a device manufacturer’s point of view, that last point is a welcome bonus.
Why isn’t everyone running on-device?
On-Device: Con
To run inference on-device, the user needs to have enough:
- Flash storage: for the LLM. Even small ones run multi-gigabytes.
- RAM: 8 or 16GB isn’t enough any more.
- Shared processor memory (GPU/NPU/TPU).
- Power: running multiple threads of execution could quickly drain the battery and generate heat.
For a home AI Assistant, power is less of a concern, unless it starts racking up a noticeable amount of usage and shows up on the home electricity bill.
Regardless of device, the increased Bill of Materials (BOM) cost means having to increase the price to the consumer and, perhaps, moving up to higher-end electronic component suppliers, QA, manufacturers, certifications, etc.
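To tie the Pro and Con sections together, here is a hedged sketch of how a hybrid assistant (option #3 above) might decide between on-device and cloud inference. The thresholds and the Device fields are hypothetical, not any vendor’s API:

```python
from dataclasses import dataclass

@dataclass
class Device:
    free_storage_gb: float  # room for multi-gigabyte model weights
    ram_gb: float           # weights + KV cache must fit in memory
    has_npu: bool           # shared GPU/NPU/TPU memory for inference
    battery_pct: float      # avoid draining an already-low battery

def choose_backend(device: Device) -> str:
    """Route to on-device inference only when the hardware can handle it;
    otherwise fall back to cloud inference (hypothetical policy)."""
    can_run_locally = (
        device.free_storage_gb >= 4
        and device.ram_gb >= 12
        and device.has_npu
        and device.battery_pct >= 20
    )
    return "on-device" if can_run_locally else "cloud"

print(choose_backend(Device(64, 16, True, 80)))  # on-device
print(choose_backend(Device(8, 8, False, 50)))   # cloud
```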
What about Google?
From a functional point of view, Google and Apple devices are similar. Apple has the new Foundation Models framework. Google has had the MediaPipe LLM Inference API. For the sake of brevity, I’ve focused on Apple, but to stay in the good graces of my Google friends, I am duty-bound to point out that Google’s got all the same features (and problems).
Google actually goes one step further, by generously offering MediaPipe running via WASM inside web browsers, a fact that I am taking advantage of in the Project Mango gaming engine.
However, Google will be torn between the Scylla and Charybdis of whether to support on-device inference (and save costs), or collect user data to train its models.

Based on recent reports, it looks like they have chosen to do both.
Payment Models
How would AI Assistants generate revenue?
Harvard Business School counted up to four business models. Pretty sure we can top that.
Software Assistants
These are software-only assistants that run either inside a custom mobile app or in a browser. We can consider the cost of the hardware (a phone, tablet, or computer) a sunk cost (aka CAPEX).
Method | Description | Reference |
---|---|---|
Subscription | monthly / annual | Comparing LLM Subscription Plans |
Tiered Subscription | Pro vs. Regular vs. Team vs. Enterprise | Compare AI models pricing side by side |
Advertising | Embedding ads inside chats | Google Places Ads inside AI Chatbots With AI Startups |
Affiliate Links | Return results with links that generate affiliate fees | Transforming Search and Advertising with Generative AI |
Integrated / API | Allow access to LLMs by bringing your own keys | IBM: LLM APIs: Tips for bridging the gap |
Embedded | Access LLM inside app without providing an API key | Use ChatGPT with Apple Intelligence on iPhone |
Direct | Pay one-time for an app | Lifetime Deals for AI Apps! |
Pay-as-you-go / Transactional | Per-token; see the sketch after this table | Breaking Down the Cost of Large Language Models. Also: LLM API Pricing Calculator |
Tiered Pay-as-you-go | Faster response / more tokens at higher tiers | Cursor: Max Mode |
E-Commerce | Direct recommendation and shopping | A new solution to monetize AI-powered chat experiences |
Subsidized | Other product pays | Try All the Great AI Models and Tools on HIX AI |
Hybrid | Some features one model and premium features at another | How to Monetize AI |
Data Monetization | Re-selling accumulated data to others | Bain: Unlocking Hidden Value: A New Approach to Data Monetization with AI |
MCP Access | Any of the above, inside an extension | |
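As a worked comparison of the Subscription and Pay-as-you-go rows, here is a small break-even sketch. The subscription price and per-token rate are illustrative placeholders, not any vendor’s actual pricing:

```python
# Illustrative comparison of a flat subscription vs. per-token billing.
# Both prices are hypothetical placeholders.
subscription_per_month = 20.00    # USD, flat monthly plan (assumed)
price_per_million_tokens = 3.00   # USD, blended API rate (assumed)

breakeven = subscription_per_month / price_per_million_tokens * 1e6
print(f"Break-even at ~{breakeven / 1e6:.1f}M tokens/month")

for monthly_tokens in (0.5e6, 2e6, 10e6):
    payg = monthly_tokens / 1e6 * price_per_million_tokens
    cheaper = "pay-as-you-go" if payg < subscription_per_month else "subscription"
    print(f"{monthly_tokens / 1e6:>4.1f}M tokens: API ~${payg:.2f} -> cheaper: {cheaper}")
```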
Hardware Assistants
Hardware assistants offer extended functionality beyond the software/app model. They do not require a separate data plan. They can be inexpensive, offer a shared user-experience, and significantly, be
They may support any of the above models, but also a few others specific to devices:
Method | Description | Reference |
---|---|---|
One-time: Basic | Simple speaker | Wired: Best Budget Speaker |
One-time: Enhanced | Speaker with features like screen, camera, etc. | Wired: Best Smart Display Speaker |
One-time: Super-Enhanced | All on-device | On-Device AI: Building Smarter, Faster, And Private Applications |
Multi-room | Hotel Assistants | Aiello Voice Assistant for Hospitality (AVA) |
Repeater / Mesh | Assistants distributed around home | Thread SmartHome |
Device + Subscription | Pay for devices, then monthly/annual subscription | Josh |
SaaS | Pay for device, get service for free | Siri |
Embedded | Device is inside another device, like a vehicle | Stellantis to launch AI-powered in-car assistant |
Hardware Subsidized | Cost of hardware covers service cost | Rabbit R1 - Unlimited AI |
Service Subsidized | Cost of other service covers LLM | Meet the new Alexa - Free with Prime |
Insurance Subsidized | Paid by insurance company | Implementing large language models in healthcare while balancing control, collaboration, costs and security |
Other Revenue Models
There are business models that have yet to be explored. These include:
Method | Description | Reference |
---|---|---|
Dynamic Pricing | Adjusted based on demand | Data-driven dynamic pricing |
Value-based | Benefit provided to the user | Pricing approaches for generative AI applications |
Outcome-based | Based on outcomes from the solution | AI Is Driving A Shift Towards Outcome-Based Pricing |
Generative | Relative to content produced | What is Generative AI Pricing? |
Consulting | Professional Services | Top 10: AI Consulting Companies |
Infrastructure Sub-Leasing | Usage time on platform | Amazon Bedrock |
As consumers, we may want the lowest-cost option. But if a company cannot sustain itself with revenue, it will inevitably go out of business. Think about whether a company’s chosen business model is likely to lead to its continued existence.

By way of example: the Rabbit R1 ($199 for the device + Unlimited AI) sets off so many red flags for me. I wish them good luck, but even if their per-device server/cloud cost is USD 5/month (USD 60/year), after 3 years (USD 180) we are almost at the breakeven point for the hardware (USD 199 - 180 = 19). If the cost of developing and manufacturing the hardware exceeds USD 19, the company starts to inch into the red. And this assumes their development, manufacturing, and marketing costs haven’t gone up at all in the intervening 3 years.
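Here is the same arithmetic as a tiny sketch, so you can plug in your own guesses. The USD 5/month figure is my hypothetical, not anything Rabbit has published:

```python
# The arithmetic from the paragraph above, as a tiny sketch.
# The monthly cloud cost is a hypothetical assumption, not a published figure.
device_price = 199.00        # one-time revenue per unit (USD)
cloud_cost_per_month = 5.00  # assumed server/cloud cost per device (USD)
years = 3

cloud_cost_total = cloud_cost_per_month * 12 * years   # 180
hardware_budget = device_price - cloud_cost_total      # 19
print(f"After {years} years: ${cloud_cost_total:.0f} spent on cloud, "
      f"${hardware_budget:.0f} left to cover hardware, development, and marketing")
```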
The purveyors of many LLMs have realized their starting
These usually cost
No matter what, one thing is true…
Title Photo by Dmitrii E. on Unsplash