Here, we start getting into the technical details of these AI assistants and the underlying technology.
If you arrived here looking for a trip down memory lane and videos, it may be best we part ways. But stick around if you are curious and do not mind peeking into the sausage-making process.
The purpose of the wake word is to prevent the device from having to parse every utterance, which would waste bandwidth and processing power. Anything said without the wake word is ignored.
In the iFixit Amazon Echo teardown, there are seven microphones placed in a circular array. This allows the user to stand anywhere around the device. Noise cancellation and beam-selection algorithms enable speech to be picked out even in a noisy environment.
ℹ️
Side Note
There are conspiracy theories that these microphones always listen and transmit all the data to the cloud. The truth is that this would be expensive on the server side and could incur extra costs for customers on metered bandwidth ISP plans. Recording voice data without user consent also has potential legal implications.
Wake words help solve all that.
The trained wake-word data is loaded on-device and compared against incoming sound without ever touching the network. If there is no match, nothing happens. But as soon as the wake word is detected, the system moves into the next stage.
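To make that concrete, here is a minimal sketch of the on-device loop in Python. The `score_wake_word()` model and the threshold are hypothetical stand-ins, not any vendor's actual implementation:

```python
from collections import deque

FRAME_MS = 20            # analysis frame length, a common choice
WAKE_THRESHOLD = 0.85    # hypothetical confidence threshold

def score_wake_word(window) -> float:
    """Stand-in for the small on-device wake-word model. A real one runs a
    compact neural network over the last ~1 second of audio and returns a
    confidence score; here we return 0.0 so the sketch stays inert."""
    return 0.0

def wake_word_loop(frame_source):
    """Consume audio frames locally; nothing leaves the device until the
    wake word is detected."""
    window = deque(maxlen=1000 // FRAME_MS)   # roughly 1 second of frames
    for frame in frame_source:
        window.append(frame)
        if score_wake_word(window) >= WAKE_THRESHOLD:
            return list(window)               # hand off to audio capture
    return None
```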
2. Audio Capture
The user’s voice command needs to be captured and converted into a digital stream. The problem is where to store and process that data, since most AI assistants have limited storage and processing capability. The audio is converted from sound waves into data by a circuit called an Analog-to-Digital Converter (ADC). Various on-device audio processing algorithms clean up the sound to help the detection algorithms work better. A modest amount of this digital data is buffered and then continuously streamed to the cloud.
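A rough sketch of that buffer-and-stream step, again with placeholders (the `upload_to_cloud()` call stands in for whatever vendor-specific endpoint the device actually talks to):

```python
import threading

CHUNK_FRAMES = 16   # how many digitized frames to buffer per upload

def upload_to_cloud(chunk: bytes) -> None:
    """Placeholder for the network call that streams audio to the assistant's
    cloud service; the real endpoint and protocol are vendor-specific."""
    pass

def capture_and_stream(adc_frames, stop_event: threading.Event) -> None:
    """Buffer frames coming off the ADC, then stream them in chunks."""
    buffer: list[bytes] = []
    for frame in adc_frames:
        if stop_event.is_set():
            break
        buffer.append(frame)
        if len(buffer) >= CHUNK_FRAMES:
            upload_to_cloud(b"".join(buffer))
            buffer.clear()
    if buffer:                        # flush whatever is left over
        upload_to_cloud(b"".join(buffer))
```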
ℹ️
Side Note
During this stage, the microphones may capture other sounds in a room and send them to the cloud. This could be music, TV, children, pets, friends, or any other inadvertent sound not filtered out by the noise-isolating system.
You may want to be careful what you say once someone has uttered the wake word.
The system has to tell whether the user has stopped talking or has merely paused (for breath). Background noise can mask the silence and delay stop detection. At some point, the Assistant needs to stop listening and start recognizing. This end-of-utterance detection is one of the most difficult algorithms to fine-tune when recognizing commands.
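One common, simplified approach is an energy-based endpointer with a silence timeout. The thresholds and frame sizes in this sketch are made up; production systems use far more sophisticated voice-activity detection:

```python
def rms_energy(frame: list[float]) -> float:
    """Crude loudness estimate for one audio frame."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def detect_end_of_utterance(frames, silence_threshold=0.01, max_silent_frames=40):
    """Stop listening after ~0.8 s of silence (40 x 20 ms frames).
    Background noise raises the measured energy and can delay the cutoff,
    which is exactly the tuning problem described above."""
    captured, silent_run = [], 0
    for frame in frames:
        captured.append(frame)
        if rms_energy(frame) < silence_threshold:
            silent_run += 1
            if silent_run >= max_silent_frames:
                break                 # the user is probably done talking
        else:
            silent_run = 0            # speech resumed; it was just a pause
    return captured
```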
Moving on. The stream of data is sent to the cloud service for content detection.
3. Audio Intent Recognition
The core function of an AI Assistant is to suss out what the user is asking for. It must do this without per-user training, relying on Natural Language Processing (NLP) algorithms. If you recall from Part I, the need for pre-training was one of the obstacles to early speech-to-text systems.
There are several NLP algorithms in use, but they are mostly variations of some form of Neural Network.
The output of the first step in this process is a string of text matching what the user said. If the user has a strong accent, speaks in a whisper, or there is a lot of cross-talk, the system might stop at this point and return an error. Depending on the Voice Assistant, it will either politely report a failure or quietly do nothing.
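In code, that first step might look roughly like this, with a placeholder `transcribe()` standing in for the cloud speech-to-text model and an arbitrary confidence cutoff:

```python
def transcribe(audio_chunk: bytes) -> tuple[str, float]:
    """Placeholder for the cloud speech-to-text model; returns the
    recognized text plus a confidence score between 0 and 1."""
    return "", 0.0

def speech_to_text(audio_chunk: bytes, min_confidence: float = 0.6) -> dict:
    text, confidence = transcribe(audio_chunk)
    if not text or confidence < min_confidence:
        # Strong accents, whispering, or heavy cross-talk land here.
        return {"error": "Sorry, I didn't catch that."}
    return {"text": text}
```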
The text of the utterance needs to be turned into a User Intent using an Intent Classification algorithm. Think of this as a black box that accepts the command string, analyzes it using a series of Machine Learning algorithms, user preferences, and interaction history (aka Context), and returns a structured command that a computer program can invoke.
This is another complex problem to overcome since human language is generally vague. Words may have multiple meanings or vary depending on the context of a conversation. For the system to work as expected, it should provide a best guess as to what the user may mean.
Some systems may engage in multiple rounds of query/response to ask for clarification regarding the user’s intent. In that case, the context from the first request may be saved and referenced in the conversation’s second, third, or more rounds.
Either way, the output of this stage will be a User Intent, which may be cast as an Action Verb and one or more Nouns (aka parameters). Some systems only allow a single parameter. Others may try to match the utterance to a series of pre-determined patterns and slots to find the best match.
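Here is a toy illustration of the pattern-and-slot idea, using regular expressions instead of the Machine Learning models a real classifier would use. The intent names and patterns are invented for the example:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Intent:
    action: str                                  # the Action Verb
    slots: dict = field(default_factory=dict)    # the Nouns / parameters

# A few hypothetical utterance patterns with named slots.
PATTERNS = [
    (re.compile(r"set (?:a )?timer for (?P<duration>.+)"), "SetTimer"),
    (re.compile(r"what(?:'s| is) the weather in (?P<city>.+)"), "GetWeather"),
    (re.compile(r"play (?P<track>.+)"), "PlayMusic"),
]

def classify(utterance: str, context: dict | None = None) -> Intent | None:
    """Match the transcribed text against pre-determined patterns and slots.
    A production system would use ML models plus user history instead of regexes."""
    text = utterance.lower().strip()
    for pattern, action in PATTERNS:
        match = pattern.search(text)
        if match:
            return Intent(action, match.groupdict())
    return None   # no confident match; ask the user to clarify

print(classify("Set a timer for 10 minutes"))
# Intent(action='SetTimer', slots={'duration': '10 minutes'})
```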
In 1983, Steve Jobs gave a talk at the International Design Conference in Aspen (IDCA). The theme was The Future Isn’t What It Used To Be. During an audience Q&A session, he was asked about voice recognition’s future.
He replied:
‘Voice recognition is about, it’s going to be a better part of a decade away. We can do toy recognition now. The problem is that it isn’t just recognizing the voice. When you talk to somebody, understanding language is much harder than understanding voice. We can sort of sort out the words, but what does it all mean. Most language is exceptionally contextually driven.’
‘One word means something in this context and something entirely different in another context, and when you’re talking to somebody, people interact. It’s not a one-way communication: saying yup, yup, yup, yup voice. It grows when it interacts; it goes in and out of levels of detail, and boy, this stuff’s hard. So, I think we’re really looking at a better part of the decade before it’s there.’
ℹ️
Side Note
The rest of the talk is fascinating too. Jobs brings up a lot of ideas that came to fruition decades later.
4. Intent Execution
The user’s string of words has to be turned into a command, also known as an Intent, which can be matched against the list of capabilities known to the system. This may mean looking up data in a database or calling an API (e.g., getting the current weather, adding an item to a shopping cart, requesting sports results, setting a timer, and so on).
Depending on the system’s design, the list of operations may be limited by the Intent Execution system, for example, by using a [Domain Specific Language](https://martinfowler.com/dsl.html) that targets specific problems.
One of the common side-effects of Intent Execution is that the output is saved back into an internal data structure called Context or Memory as part of a sequence of requests. This allows the Intent Classification system to better react to user requests and the system to discern what the user means when they reference that or this during a series of interactions.
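A bare-bones sketch of that dispatch-and-remember flow, with invented handler functions standing in for real database lookups or API calls:

```python
context: list[dict] = []   # the running Memory for this conversation

# Hypothetical handlers; real ones would hit databases or external APIs.
def set_timer(duration: str) -> dict:
    return {"status": "OK", "speech": f"Timer set for {duration}."}

def get_weather(city: str) -> dict:
    return {"status": "OK", "speech": f"It is sunny in {city}."}

HANDLERS = {"SetTimer": set_timer, "GetWeather": get_weather}

def execute(action: str, slots: dict) -> dict:
    handler = HANDLERS.get(action)
    if handler is None:
        return {"status": "ERROR", "speech": "Sorry, I can't do that yet."}
    result = handler(**slots)
    # Save the exchange so a follow-up like "cancel that" can resolve "that".
    context.append({"action": action, "slots": slots, "result": result})
    return result

print(execute("GetWeather", {"city": "Lisbon"}))
```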
5. Skills
As noted above, Skills extend the functionality of an Assistant. Each skill has to provide sufficient information for the Intent Execution flow to decide whether to delegate control to it.
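Conceptually, a skill registration might look something like the sketch below, where each skill declares up front which intents it claims. The structure is illustrative, not any particular vendor's skill API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    intents: set[str]                       # intents this skill claims to handle
    handler: Callable[[str, dict], dict]    # called with (action, slots)

SKILLS: list[Skill] = []

def register(skill: Skill) -> None:
    """Third-party skills declare their intents up front so the
    Intent Execution flow can decide whether to delegate to them."""
    SKILLS.append(skill)

def delegate(action: str, slots: dict) -> dict | None:
    for skill in SKILLS:
        if action in skill.intents:
            return skill.handler(action, slots)
    return None                             # no skill claimed this intent
```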
How to best integrate third-party Skills may be one reason Alexa’s AI Upgrade was delayed. The old skills will no longer work with the new service.
Similar criticisms were leveled against Apple for failing to deliver the new Siri upgrade demoed at the 2024 Apple Worldwide Developers Conference (WWDC).
This year, at WWDC 2025, the absence was even more noticeable.
In a subsequent section, we’ll cover why merging general-purpose generative AI with extensible third-party Skills is a complex problem.
But let’s keep moving on. The system has executed the Intent either internally or via a Skill and has a response.
6. Text-to-speech (TTS)
The result of executing the command will likely be a data structure with either a status code or some type of data (e.g., OK, a weather report, or a sports score). Now comes the reverse of the speech-to-text process, where text is converted back into speech.
Much research has been done on converting text into an audio stream using a choice of voices, intonations, and accents. Whatever the mechanism, we want custom phrases pronounced correctly (for example, the name of a town, a famous actor, or a sports figure). Many systems also offer a selection of voices with different genders and regional accents. This is an area where the algorithms are constantly being tweaked and tuned.
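For a quick local experiment with voice selection and speaking rate, the third-party pyttsx3 package exposes the basic knobs (assuming it and a platform speech engine are installed); commercial assistants use far more capable neural TTS systems:

```python
import pyttsx3   # offline TTS wrapper; needs a platform speech engine installed

engine = pyttsx3.init()

# Pick a voice and speaking rate; the available voices vary by platform.
voices = engine.getProperty("voices")
if voices:
    engine.setProperty("voice", voices[0].id)
engine.setProperty("rate", 170)       # roughly words per minute

engine.say("Today's forecast is sunny with a high of 22 degrees.")
engine.runAndWait()
```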
In some languages, there are Heteronyms: words that share the same spelling but are pronounced differently. In English, take Bass, pronounced “base” when referring to the musical instrument (electric or classical), or “bahss” when referring to the fish.
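One common way to steer pronunciation is a phoneme hint, for example via SSML’s `<phoneme>` tag. The lexicon and sense labels below are invented for illustration, and picking the right sense is the hard part:

```python
# Map (word, sense) -> IPA pronunciation; choosing the sense usually
# falls out of the intent or the surrounding words.
PRONUNCIATIONS = {
    ("bass", "music"): "beɪs",
    ("bass", "fish"): "bæs",
}

def ssml_for(word: str, sense: str) -> str:
    """Wrap a heteronym in an SSML <phoneme> tag so the synthesizer
    uses the intended pronunciation."""
    ipa = PRONUNCIATIONS.get((word.lower(), sense))
    if ipa is None:
        return word                    # let the engine guess
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'

print(ssml_for("bass", "fish"))
# <phoneme alphabet="ipa" ph="bæs">bass</phoneme>
```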
The response text has now been converted into a binary audio stream and returned to the Assistant in digital form. This is then converted back into analog audio waveforms using the reverse of the ADC circuit, a Digital-to-Analog Converter (DAC). The sound is then played out of the speaker.
If the user command was to play some music, the device may play back the acknowledgment and then open a separate channel to a music streaming service.
The sequence is done, and the Assistant system can return to its starting state. Some systems may briefly return to listening mode, allowing the user to have a normal conversation without repeating the wake word each time. On others, certain common words like Stop are treated as a combination wake-word/command.
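That wake/listen/follow-up behavior can be pictured as a small state machine. The states, the barge-in word list, and the transitions below are illustrative; real systems differ by vendor:

```python
import enum

class State(enum.Enum):
    IDLE = enum.auto()        # waiting for the wake word
    LISTENING = enum.auto()   # capturing and executing a command
    FOLLOW_UP = enum.auto()   # briefly listening again, no wake word needed

BARGE_IN_WORDS = {"stop"}     # act as wake word and command in one step

def next_state(state: State, heard_wake_word: bool, heard_speech: str | None,
               follow_up_expired: bool) -> State:
    """Tiny dialogue state machine; timings and exact transitions vary by vendor."""
    if heard_speech and heard_speech.lower() in BARGE_IN_WORDS:
        return State.LISTENING            # e.g. "Stop" while music is playing
    if state is State.IDLE:
        return State.LISTENING if heard_wake_word else State.IDLE
    if state is State.FOLLOW_UP:
        if heard_speech:
            return State.LISTENING        # the conversation continues
        if follow_up_expired:
            return State.IDLE             # window closed, go back to sleep
    return state
```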
❗
Speed
‘The ideal response latency for humanlike conversation flow is generally considered to be in the 200–500 milliseconds (ms) range, closely mimicking natural pauses in human conversation.’
One way to reduce the cost of an individual Assistant device is to limit the processing and storage needed on that device. An inexpensive device like the Google Nest Mini (USD 50 at this writing) or the Echo Dot (5th generation) (USD 49.99, on sale for USD 34.99) will likely consist of only the basics: the microphone array, the ADC/DAC circuits, a small on-device wake-word model, a speaker, and a network connection to the cloud service.
For a higher-priced device like a flagship mobile phone that costs over USD 1000, ML models designed to fit on-device may be deployed. These models can take advantage of onboard accelerator chips such as Google’s Tensor chip or the Apple Neural Engine.
At WWDC 2025, Apple announced a series of on-device Foundation Models that allow many AI workloads to be performed on a Mac, iPhone, or iPad without having to go to the cloud.
This has several advantages in terms of data privacy, latency, and bandwidth costs. Google, Samsung, and other premium-device makers follow a similar flow, with their own on-device variations.
The same architecture can also apply to higher-end gateway devices based on technologies from Qualcomm and MediaTek.
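At a very high level, the routing decision between on-device and cloud processing might be sketched like this; the token budget and function names are invented for illustration:

```python
def call_cloud_service(request: str) -> str:
    """Placeholder for the vendor's cloud endpoint."""
    return f"(answered in the cloud) {request}"

def answer(request: str, on_device_model=None, estimated_tokens: int = 0) -> str:
    """Prefer the local model when one is present and the job is small enough;
    otherwise fall back to the cloud. The budget below is made up."""
    ON_DEVICE_TOKEN_BUDGET = 4096          # hypothetical local context limit
    if on_device_model is not None and estimated_tokens <= ON_DEVICE_TOKEN_BUDGET:
        return on_device_model(request)    # private, low-latency path
    return call_cloud_service(request)     # bigger models, network round-trip
```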
The Future is Here
This was a high-level tour of how current AI Assistants work (and how the sausage is made). However, things are rapidly changing. There is an expectation that people will be able to talk to ChatGPT and its peers using their own voice. OpenAI supports Voice Mode, and so does Anthropic.
These are highly compute-intensive systems that will not be able to run on-device. However, they allow user interactions to be more flexible and fluid. Users may no longer want to be restricted to the limited vocabulary of a Google Nest, Apple Siri, or Alexa.
Amazon has already announced and is slowly rolling out their next-generation Alexa+ device, now with generative AI support. However, the old skills will no longer work and third-party integrations will need to be rebuilt. There is a good chance these will run only on the cloud, given the strong push to deploy LLM MCP extensions on AWS.
This means a new wave of AI Assistants, whether backed by full-featured, server-class LLMs or running on-device, may be coming. They may come from the large vendors listed here or from a completely different direction, given OpenAI’s acquisition of Jony Ive’s hardware venture.