Part III: Rise of Smart Devices

Posted in voiceassistantalexaecho on AI/ML

Alexa

In 1998, Amazon founder and CEO Jeff Bezos gave a speech at Lake Forest University:

I firmly believe that at some point the vast majority of books will be in electronic form. I also believe that that’s a long way in the future, many more than 10 years in the future. And the reason that that gets held back today is that paper is just the world’s best display device. It turns out that today with the state of the art in display devices, dead trees just make the best display devices. They’re higher resolution, they’re portable, [they’re] high contrast and so some day when computer displays will catch up with that and then I think electronic books will be extremely successful.

A little short of ten years later, in 2007, that prophecy led to Amazon introducing an electronic ebook-reading device: The Amazon Kindle.

The idea of a digital book reader has been around for a while. But the Kindle was innovative in that it offered:

An always-on E-Ink display, offering a week of reading without needing a charge.
A user interface that differed from PCs or phones and focused on distraction-free reading.
A digital bookstore where new books could be discovered and purchased, then wirelessly downloaded to the device and read offline.

When network access was needed (to download more books or, eventually, browse the Internet), Kindle used a new type of network Amazon called Whispernet that used an AnyData EVDO wireless modem. It allowed users to browse for books on-device and download them whenever they wanted without signing up for a cellular service or entering account credentials.

It just worked, and it was included at no extra charge.

Whispernet was the equivalent of building a wireless phone into each Kindle. No one had ever done this before. And if that wasn’t enough, Bezos decided that Amazon would cover the cost of the data plan. While establishing relationships with wireless carriers was difficult, the total cost wasn’t as onerous as the team expected. E-book files were relatively small, resulting in very modest fees. In an interview two years after the first Kindle was launched, Bezos reflected on Whispernet’s role in the Kindle’s success: I believe that’s a big, a big part of the success of Kindle. Because it makes it a seamless device where you don’t even need a computer. And you don’t ever have to hook it up to your computer with a cable. You don't have to fuss with any of that.

Source

There were many other technical innovations:

Amazon’s portable, handheld reader, which allows users to download digital versions of books, newspapers, and magazines, represents one of the first consumer uses of a low-power, easy-to-read electrophoretic display. The $399 device is a breeze to use, and though the company has not disclosed sales numbers, demand quickly outstripped supply. However, the success of the Kindle may depend on consumers’ willingness to bear the price of using it: though e-books, at $9.99, cost less than most physical books, newspapers, blogs, and other content available free on the Internet will cost money (for instance, $1.99 per month for Slashdot and $13.99 per month for the New York Times).

In a 2009 interview, Bezos described the design objectives:

“And the key the key feature of a physical book is that it gets out of the way. It disappears. When you are in the middle of a book, you don’t think about the ink and the glue and the stitching and the paper. You are in the author’s world. And this device, we knew four years ago when we set about to design it, that that was the number one design objective. Kindle also had to get out of the way and disappear so that you could enter the author’s world. And the design team accomplished that.”

Why is the Kindle relevant here?

Because:

It established that connected devices and custom user interactions could appeal to consumers.
All that talk of ‘getting out of the way’ meant that its design was user-centric.
It allowed the functionality to be expanded through purchased ‘add-on’ content (in this case, books, but later, multimedia and apps).

Lab126

In 2004, Amazon established Lab126 in Sunnyvale, California, as a standalone R&D lab to develop consumer hardware.

The actual book software for the Kindle was developed by the French company Mobipocket (hence the file suffix .mobi on Kindle E-book files). The company was purchased by Amazon in 2005.

The concept for a downloadable digital bookstore was

In 2009, Amazon added text-to-voice – what ended up as a controversial feature in the writing and publishing community – to its Kindle 2 e-book reading device. The technology was based on the Nuance system.

Text-to-voice conversion unlocked yet another user-interaction mode.

Ivona

📺 Meet the Ivona Team - Amazon Lab126

📘 Watch on Facebook

In 2002, ‘five guys from Gdansk’ (Poland) were inspired by the movie 2001: A Space Odyssey and set out to create a speech synthesis system they called Ivona. The technology used fragments of actual recorded speech utterances, dynamically reassembled based on the input text. Four years later, they had a system that beat competing voice systems from IBM and Microsoft and many international research projects in head-to-head competition.

Amazon approached the company in 2010, ostensibly to license the technology for a subsequent edition of Kindle. But it ended up outright acquiring Ivona Software. Ivona was a competitor to Nuance and supported 44 voices in 17 languages, which gave it broad international reach. Ivona’s work was instrumental in the development of the Echo AI Assistant and, later, the Amazon Polly voice-generation service.

Watson

In 2011, IBM’s Watson Computer became a Jeopardy TV Quiz show champion, beating two former champions, Brad Rutter and Ken Jennings. This helped smooth the way for public acceptance of computer-based voice recognition and knowledge systems.

Echo

Work on the Amazon Echo (internally called Project D) began in 2011. The first generation was released to Amazon Prime members in 2014. It was a standalone device with speaker-independent voice recognition, the ability to translate voice to text, transform text to user intent, fetch responses, and perform text-to-speech of the result in near-real time.

ℹ️ Side Note

Full Disclosure: I owned one of those early Echo devices and have owned many more since.

In 2019, I joined Amazon Lab126 and worked on technologies related to Echo, FireTV, and Kindle devices. I left to join Amazon Web Services, where I worked on connected devices. Occasionally, I would go back to Lab126 and give seminars on the latest cloud technologies.

Obviously, all descriptions and materials in this series are based on publicly available sources that I have directly linked.

OK, Google

Google’s foray into voice and search goes back to 2007 and the GOOG-411 service. It pre-dated smartphones but used speech recognition to search for data and connect users to local businesses. The ostensible goal of the project was to let Google collect a database of phoneme data that could be used to train machine-learning models. Having accomplished that goal, it was mothballed in 2010.

The technology was incorporated into Voice Search (search Google instead of typing), Voice Input (fill any text field in Android), and Voice Actions (control Android commands). Voice Actions was later renamed Google Voice Search and launched in 2012.

In 2010, Google introduced Voice Actions for Android as a series of speaker-independent voice-activated commands for Android phones.

ℹ️ Side Note

If you’re as confused as I was, please don’t be. Google has a habit of creating many identical-sounding products and services and adding Google to the name. This has much to do with the company’s organizational divisions between Search, Mobile/Android, and Google Labs, which itself was closed after a re-org in 2011.

If it’s any consolation, try to keep track of the ‘Amazon’ vs. ‘AWS’ prefix in the list of AWS Products and Services, i.e., Amazon API Gateway vs. AWS Cloud Control API.

Moving on…

Google Now

In 2012, Google introduced Google Now, based on Google Voice Search and a consolidation of several services. This was to run as a built-in part of the Search service on Android and as a companion app on iOS. What was unique about it was the creation of custom Activity Cards, which were built on top of the Google Knowledge Graph database.

Activity Cards presented various types of data relevant to the context of that information. Visually, a card could be represented with unique formats:

Cards were designed to adapt to different form factors and interaction models, including voice and wearables:

Three years later, in 2015, Google added support for third-party apps, enabling them to generate Custom Cards. This launched with 40 different app integrations, allowing Android users to interact with those services through voice.

This opened the door to…

Google Assistant

It wasn’t until 2016 that Google launched Google Assistant, a whole Siri-like experience.

The technology was integrated into Google Home Smart Speakers (later renamed Google Nest after purchase of Nest Labs in 2014). The Smart Speaker was a direct competitor of the Amazon Echo and functionally identical to the Echo Dot.

Cortana

In 2009, Microsoft began work on their own personal assistant code-named Cortana, named after and voiced by the same actor as a character in Halo, the gaming franchise.

The software was first integrated into Windows 10 in July 2015 and eventually into Windows Phone and the Xbox Gaming Console.

However, unlike Amazon and Google, Microsoft did not create its own branded Smart Speaker device. Instead, it partnered with well-known speaker manufacturer Harman Kardon to make the Invoke. Like Alexa, Cortana also had a third-party skills store.

Leading the Cortana effort was Larry Heck, who had also worked on the SRI CALO project and R&D at Nuance. Heck went on to Google, where he worked on voice recognition for the Google Assistant and then Samsung's own virtual assistant, Bixby.

“The base technologies for a virtual personal assistant include speech recognition, semantic/natural language processing, dialogue modeling between human and machines, and spoken-language generation,” [Heck] says. “Each area has in it a number of research problems that Microsoft Research has addressed over the years. In fact, we’ve pioneered efforts in each of those areas.”

Cortana is tightly integrated with Microsoft’s own Bing search engine, making integrated search directly available to users. Google could also do this, but competitors Apple (Siri) and Samsung (Bixby) did not have their own search engines to tie into. Like the other services, Cortana could also tie into email and calendars via Outlook and Office 365 integrations.

Despite its ubiquity inside the Microsoft ecosystem (and eventually Android and iOS versions), Cortana failed to distinguish itself from all the other assistants. It failed to gather enough third-party and manufacturer support and was eventually retired in 2023.

Bixby and Viv

Samsung announced Bixby in 2017 as a replacement for the S Voice assistant (released in 2012), which itself was initially based on the Vlingo voice recognition system and then the Nuance engine.

What made Bixby unique was that it had early support for individual voices and was integrated inside Samsung’s vast range of products, including cameras and home appliances.

In 2016, Samsung also purchased Viv Labs, a startup begun by Dag Kittlaus and Adam Cheyer, two of Siri’s founders who left shortly after Apple’s purchase of the product from Nuance. Viv had been announced to great fanfare and was purchased by Samsung a mere five months after its first release. It was built on the foundation of integrating with third-party extensions, allowing it to perform multi-part sequences (what we call agentic today).

Viv Labs filed several patents that may be relevant to future agentic integrations:

The patents related to third party developers may have relevance to future attempts to extend AI Assistants through Extensions, something that we will cover in a later section.

A year after the Viv acquisition, Samsung announced that Bixby 2.0 would be rebuilt on top of Viv’s technology and would be headed by (small world) Larry Heck, who had joined Samsung after his stint working on Cortana at Microsoft, and Assistant at Google.

‘M’ is for manual

Not to be left behind, in 2015, Facebook entered the fray by purchasing wit.ai and using it to build a service they code-named M. What made M different was that it could complete a sequence of complex tasks, but when it hit a block or could not complete the steps (reportedly, some 70% of the time), it took a different track:

M is so smart because it cheats.
It works like Siri in that when you tap out a message to M, algorithms try to figure out what you want. When they can’t, though, M doesn’t fall back on searching the Web or saying “I’m sorry, I don’t understand the question.” Instead, a human being invisibly takes over, responding to your request as if the algorithms were still at the helm. (Facebook declined to say how many of those workers it has, or to make M available to try.)

Having humans in the loop to complete tasks was not scalable, and the service was discontinued in 2018 after stating:

We learned a lot.

IN 2019, the NLP service developed by wit.ai would be offered for free to developers to add to business pages.

❗ Side Note

A similar technique would go on to be used by Amazon’s ‘Just Walk Out’ shopping and more recently, the high-flying startup, Builder.ai.

Skills

ℹ️ Note

The following section focuses on Amazon’s Alexa service for illustration purposes. Similar functionality, like Google Actions, Bixby Capsules, or SmartThings Custom Capabilities are available in other assistants.

Alexa’s Skill Kit allows third-parties to extend the functionality of the basic Assistant.

Skills support phrases with blank slots that are filled with a user’s utterances. This allows the system to determine the user’s Intent along with specific parameters. For example, a user asking for weather for a specific location and time might ask for:

Alexa. What’s the weather forecast for San Francisco tomorrow?

Once the voice had been converted into text, a processor would take over to extract what the user was asking for. It would look for a skill designated to handle the Intent weather. That skill would have a list of defined phrases with slots for location and time. The system would then invoke the code for the skill and pass it those named parameters.

Once it had the response, it would return the data in tagged name/values. The result would be converted back to text in the user’s chosen voice and returned.

To create more complex interactions, or if not everything needed to proceed with a task has been collected, you would need to create a complex maze of inter-connected skills and callbacks. Or you can let the Alexa engine figure it out and prompt for the next steps using the Auto Delegation Process

One problem with the skill mechanism is that the list of parameters may vary, depending on the context of what is being said. This is why Dynamic Entity support was added.

The other problem is that if the user says the phrase in a different combination, an unknown form, or with a heavy accent, the request may fail and the user would become frustrated with the product.

This is where modern LLMs help. They are far more tolerant about what a phrase should be, can better handle ambiguity in pronunciation, and can be used to generate much more natural responses that don’t sound like canned text.

Physical World

Alexa skills can be extended to Internet of Things (IoT) devices that can be controlled remotely via WiFi and Bluetooth.

This allows voice control of Smart Home devices and the physical world. There have been other efforts to automate the physical world through projects like Google Home/Google Nest, Apple Home, SmartThings (subsequently purchased by Samsung), and the open-source Home Assistant.

These follow the same general pattern (with some variations), until Project Matter came around. Diving into Matter is out of scope for AI Assistants, but it is a fascinating example of trying to catalog and categorize interactions. We’ll cover Taxonomies and Intents in a subsequent section.

ℹ️ Personal Note 1

I was one of the early backers of SmartThings on Kickstarter. The programming was quirky and more of a DIY thing, but it was one of the first scriptable smart home platforms.

For many years (until it was EOL’d), SmartThings helped save me from heart-attacks by sending push notification every time someone (sneaky spouses and children with quiet feet) walked past a motion sensor outside my home office.

Home Assistant has taken this to a whole new level.

ℹ️ Personal Note 2

As I write this, I have one of each of the following devices wired into my home lab, along with a raft of Zigbee, ZWave, Bluetooth, and Matter sensors and peripherals:

Google Nest Hub

Echo Show

Echo Dot

Apple Homepod Mini

Ikea Dirigera Hub

Home Assistant Hub (on a Pi4)

ESP32-S3-BOX voice assistant

ESP-AVS DevKit

Yes, it is an illness.

ℹ️ Personal Note 3

The urge to add voice to any device would lead to devices like the Alexa-enabled AmazonBasics Microwave Oven.

To this day, I am still baffled by it.

Amazon

Tidying Up

The common through-line in all these assistants was:

Speaker-indpendent voice input

Live Cloud connectivity

Integration with day-to-day services (scheduling, shopping list, etc.)

Access to third-party services and applications

Until now, assistants have had some key limitations:

How well free-form conversations were handled outside of pre-defined slots.

Getting to work with third-party services.

How was the cost of operating the service be paid?

Then Suddenly…

In November 2022, OpenAI announced ChatGPT:

The world of AI Assistants got upended.

Enough history. Let’s dive into the underlying technologies.

Title Photo by Sylwia Bartyzel on Unsplash

Part III: Rise of Smart Devices

Alexa

Lab126

Ivona

Watson

Echo

OK, Google

Google Now

Google Assistant

Cortana

Bixby and Viv

‘M’ is for manual

Skills

Physical World

Tidying Up

Then Suddenly…

Ramin.Work

Error

Alexa

Lab126

Ivona

Watson

Echo

OK, Google

Google Now

Google Assistant

Cortana

Bixby and Viv

‘M’ is for manual

Skills

Physical World

Tidying Up

Then Suddenly…

Templates (for web app):

Error