Part II: Hey, Siri

Apple Speech-to-Text

In 1984, Steve Jobs introduced the Macintosh by having it speak out loud and crack pre-written jokes.


Apple’s love affair with voice synthesis goes further back, starting with the Apple IIe and the Votrax Voice Synthesizer.

ℹ️ Side Note

The voice of the WOPR/Joshua computer in the 1983 movie WarGames was not generated by a Votrax but was a recording of the actor John Wood (who played Professor Falken). He recorded the words by reading them backward, and the audio was then post-processed to sound more computerized.

Joshua/WOPR

To understand where voice synthesizers came from, however, we have to travel a little further back in time…

Press * for operator

John E. Karlin (Credit: Alcatel-Lucent)

In November 1963, the Bell System (AT&T) introduced Dual-Tone Multi-Frequency (DTMF) signaling under the trademark Touch-Tone. Until then, telephones had used rotary dials, which signaled each digit as a train of pulses. Here’s what a phone dial looked like, kids:

Credit: Dhscommtech at English Wikipedia

Touch-Tone research had been conducted by a team led by Bell Labs industrial psychologist John E. Karlin. Karlin was head of the Human Factors Engineering group, the first of its kind at an American company. Factors like the keypad’s rectangular shape, the order of the buttons, and their shape and size had all been refined through meticulous user testing before the final form factor was settled on.

Bell Labs

Dr. Karlin, considered the father of human-factors engineering, had an eclectic range of interests. He had obtained a bachelor’s degree in philosophy, psychology, and music and a master’s degree in psychology, going on to earn a doctorate in mathematical psychology. In addition to training as an electrical engineer, he was a professional violinist!

Karlin was also fond of recounting being called "[T]he most hated man in America" for his work on the Touch-Tone.

DTMF unleashed a flood of creative uses as a ubiquitous Man-Machine Interface accessible to the masses. The technology was also instrumental in bringing together Steve Jobs and Steve Wozniak, future co-founders of Apple, whose first joint venture was selling "blue boxes" that mimicked the phone network’s in-band signaling tones to place free calls.
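
The encoding itself is elegantly simple: each key press plays two sine waves at once, one from a low-frequency group (the keypad row) and one from a high-frequency group (the column). Here is a minimal sketch in Swift; the frequency grid is the actual Touch-Tone standard, while the sample rate, duration, and key map are illustrative choices:

```swift
import Foundation

// Each key maps to one low-group (row) and one high-group (column)
// frequency. These are the real Touch-Tone values in Hz.
let dtmfFrequencies: [Character: (low: Double, high: Double)] = [
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
]

/// Generates raw audio samples for one key press: two summed sine waves.
func dtmfSamples(for key: Character,
                 duration: Double = 0.2,
                 sampleRate: Double = 8_000) -> [Double]? {
    guard let f = dtmfFrequencies[key] else { return nil }
    let count = Int(duration * sampleRate)
    return (0..<count).map { i in
        let t = Double(i) / sampleRate
        // Equal-amplitude sum, scaled to stay within [-1, 1].
        return 0.5 * (sin(2 * .pi * f.low * t) + sin(2 * .pi * f.high * t))
    }
}

// Dialing a number is just a sequence of these tone bursts.
let tone = dtmfSamples(for: "8")
```

Because the two tones always come from disjoint frequency groups, a receiver can decode key presses reliably with a simple filter bank, even over a noisy line.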

IVR

In the 1990s, Visioneer and ScanSoft (a Xerox PARC spinoff) were the largest competitors in the sheet-fed document scanning business. And they were adrift. The scanning business was decent enough, but growth had slowed, forcing the two to merge.

The big opportunity was in touch-tone phones and Interactive Voice Response (IVR) services.

Enterprises had an insatiable appetite for ways to convert paper records into digital data and store them in a structured database. Once the data was digital, it could be served back over the phone: customers called a number where a synthesized voice read out options, entered commands using the phone’s buttons, and navigated the menu using DTMF tones.

The navigation workflow for IVR systems could get complex, and a large industry sprung up around helping create and manage these interactions.

Sample IVR Flowchart
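
To make the flowchart concrete, here is a toy sketch in Swift of the kind of menu tree an IVR system walks. The prompts, options, and structure are invented for illustration; real deployments of the era were authored and managed with specialized tooling:

```swift
// An IVR menu is a tree: interior nodes prompt for a key press,
// leaves perform an action (or hand off to an operator).
indirect enum IVRNode {
    case menu(prompt: String, options: [Character: IVRNode])
    case action(String)
}

let mainMenu = IVRNode.menu(
    prompt: "For hours press 1, for balances press 2, press 0 for an operator.",
    options: [
        "1": .action("Read store hours"),
        "2": .menu(prompt: "Enter your account number followed by the pound key.",
                   options: ["#": .action("Read account balance")]),
        "0": .action("Transfer to a live operator"),
    ]
)

/// Follows a sequence of DTMF key presses down the tree.
func navigate(_ node: IVRNode, keys: [Character]) -> String {
    switch node {
    case .action(let result):
        return result
    case .menu(let prompt, let options):
        guard let key = keys.first, let next = options[key] else {
            return prompt  // Re-prompt on invalid or missing input.
        }
        return navigate(next, keys: Array(keys.dropFirst()))
    }
}

print(navigate(mainMenu, keys: ["2", "#"]))  // "Read account balance"
```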

These applications were dubbed Customer Service Assistants. They were self-serve and saved companies a lot of time and money, especially when customers called for everyday tasks: directions, business hours, account balances, or dynamic data like movie times, traffic reports, and the current weather.

Tasks that required multiple steps or were too complex to automate were shunted to trained operators waiting on standby.

In the U.S., calling 1-800 toll-free numbers encouraged customer engagement. But there was also a burgeoning market in 1-900 numbers that charged callers by the minute or transaction.

1-900 numbers proved consumers would pay for services on a per-transaction basis.

The Ghost in the Machine

In 2003, the U.S. military took note of advances in speaker-independent voice recognition technology.

To turn those advances into a working system, DARPA approached SRI International about leading a five-year, 500-person research effort. At the time, it was the largest AI project in history.

DARPA called its project CALO (short for Cognitive Assistant that Learns and Organizes). The name was inspired by the Latin word calonis, meaning “soldier’s servant.”

After a half-decade of research, SRI International decided to spin off a startup called “Siri” (a phonetic version of the company’s name).


Project CALO was based on the PAL (Personal Assistant that Learns) framework from DARPA.

One of the offshoots of the CALO project was the CALO Meeting Assistant (CALO-MA), which was used to digitally assist with business meetings and to test natural language and speech processing technologies. Meetings are multiparty and full of domain-specific language, which made developing the system challenging. CALO-MA’s components included speech recognition software based on hidden Markov models (HMMs), the same approach behind another SRI invention, the DECIPHER speech recognition system.
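
At the core of an HMM recognizer like DECIPHER is a dynamic-programming search for the most likely sequence of hidden states (think phonemes) given the acoustic observations, known as the Viterbi algorithm. Here is a toy Viterbi decoder in Swift; the states, observations, and probabilities in a real recognizer come from trained acoustic and language models, not the hand-written tables used here:

```swift
/// Returns the most likely hidden-state sequence for a run of observations.
func viterbi(observations: [Int],
             initial: [Double],       // P(state at t=0)
             transition: [[Double]],  // P(state j | previous state i)
             emission: [[Double]]     // P(observation o | state i)
) -> [Int] {
    let n = initial.count
    // best[t][s] = probability of the best path ending in state s at time t.
    var best: [[Double]] = [Array(repeating: 0.0, count: n)]
    var backPointer: [[Int]] = []

    for i in 0..<n {
        best[0][i] = initial[i] * emission[i][observations[0]]
    }
    for t in 1..<observations.count {
        var row = Array(repeating: 0.0, count: n)
        var ptr = Array(repeating: 0, count: n)
        for s in 0..<n {
            for prev in 0..<n {
                let p = best[t - 1][prev] * transition[prev][s]
                        * emission[s][observations[t]]
                if p > row[s] { row[s] = p; ptr[s] = prev }
            }
        }
        best.append(row)
        backPointer.append(ptr)
    }
    // Trace the best final state back to the start.
    let lastRow = best[best.count - 1]
    var state = lastRow.indices.max { lastRow[$0] < lastRow[$1] }!
    var path = [state]
    for ptr in backPointer.reversed() {
        state = ptr[state]
        path.append(state)
    }
    return path.reversed()
}

// Two toy states ("silence", "speech") and three observed frames.
let path = viterbi(observations: [0, 1, 1],
                   initial: [0.8, 0.2],
                   transition: [[0.7, 0.3], [0.4, 0.6]],
                   emission: [[0.9, 0.1], [0.2, 0.8]])
// path == [0, 1, 1]: silence followed by speech.
```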

Perhaps the most famous CALO descendant is the phone-based digital assistant Siri, which is now part of Apple iOS but was originally another SRI International development.

The CALO Meeting Assistant System.

The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription, and semantic analysis of multiparty meetings and is part of the larger CALO architecture and its speech recognition and understanding components, which include real-time and offline speech transcription, dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization.

Integrating voice control into personal calendars and email was one of Siri’s key selling points.

Enter Vlingo

In 2006, Vlingo of Cambridge, Massachusetts, began offering a voice-recognition system that could be integrated into other products. Vlingo had great success, gaining clients like BlackBerry, Nokia, and Samsung, as well as most TV platforms.

The combined Visioneer/ScanSoft company acquired another SRI voice spinoff, Nuance Communications, and took its name: the merged entities became known simply as Nuance.

ℹ️ Side Note

SRI had also invented the acoustic coupler modem, an early way for computers to communicate over ordinary telephone lines.

Voice recognition was a logical extension of IVR/DTMF Touch-Tone input. Why limit users to dial-pad input when you could convert spoken requests into text commands (aka Intents)?
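
Conceptually, intent extraction maps free-form recognized text onto a small, fixed command vocabulary. Here is a deliberately naive Swift sketch; the intent names and keyword rules are invented, and production systems of the era used statistical grammars rather than substring matching:

```swift
// A toy command vocabulary: the "intents" a system knows how to act on.
enum Intent {
    case accountBalance, storeHours, transferToOperator, unknown
}

/// Maps recognized text to an intent via naive keyword matching.
func classify(_ utterance: String) -> Intent {
    let text = utterance.lowercased()
    if text.contains("balance")  { return .accountBalance }
    if text.contains("hours")    { return .storeHours }
    if text.contains("operator") { return .transferToOperator }
    return .unknown
}

let intent = classify("What's my checking account balance?")  // .accountBalance
```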

Nuance had filed over 4,000 patents in the voice-recognition domain, and it now used them to go after competitors like Vlingo, which it acquired in 2011 under threat of patent litigation. It wasn’t pretty, and it generated a lot of press questioning the fairness of the patent system.

Nuance itself went on to be acquired by Microsoft in 2022 for $19.7B, largely on the strength of its medical voice-transcription business.

Siri App

Nuance already had an Automatic Speech Recognition (ASR) speech-to-text engine. This was incorporated into early versions of the Siri product and released as a standalone app for iOS in 2010 (preceding the Vlingo acquisition). After the Vlingo purchase, Nuance had two separate voice-recognition products. It could afford to spin one off. It chose Siri.

A mere two months after Nuance’s Siri app was released on the App Store, Apple announced that it had purchased the underlying technology, after Steve Jobs had relentlessly pursued the acquisition.

A year later, Apple had integrated the technology into the iPhone 4S and the iPhone operating system. The new, built-in version of Siri was announced at a special event on October 4, 2011.

By then, Jobs was too ill to attend. He died a day later, on October 5, 2011.

The response to the fully integrated Siri was universally positive:

But the honeymoon wouldn’t last long…

Knowledge Navigator and Newton

The positive Siri reviews must have been gratifying, given the high bar set by Apple two decades earlier.

The vision had been rolled out in then-CEO John Sculley’s keynote address at the 1987 Educom conference. This is where he brought up the concept of Intelligent Agents, followed by the concept video for The Knowledge Navigator:

A lesser-known follow-up video was designed by the same team.

Sculley also presided over the introduction of the Apple Newton MessagePad, the first Personal Digital Assistant (PDA) with handwriting recognition, another form of Human Interaction.

Newton’s handwriting system used an Artificial Neural Network character classifier along with Context-Driven Search to perform recognition with minimal prior training. But the results were far from perfect, leading to Apple getting skewered in national media:

Garry Trudeau - 1993 Universal Press Syndicate

The blast radius of the Newton failure had a long-lasting effect. The Siri announcement in 2011 did not acknowledge the Knowledge Navigator despite the direct parallels. Another factor may have been Jobs’ lingering animosity towards Sculley, who had ousted Jobs from Apple years earlier.

Jobs later offered this explanation to his biographer, Walter Isaacson, on his decision to kill the Newton project:

If Apple had been in a less precarious situation, I would have drilled down myself to figure out how to make it [Newton] work. I didn’t trust the people running it. My gut was that there was some really good technology, but it was fucked up by mismanagement. By shutting it down, I freed up some good engineers who could work on new mobile devices. And eventually we got it right when we moved on to iPhones and the iPad.

ℹ️ Side Note

I left my first job in Palo Alto to move to San Francisco, then began commuting to Cupertino to work as a consultant for Apple, working on the MPW C++ compiler.

That opened the door to joining Taligent (a joint Apple, IBM, HP venture) in the mid-90s, building a common operating system that could run on any number of hardware platforms.

I left to start a startup above a swimsuit shop in a Los Altos strip mall alongside three other ex-Apple employees. We built a web browser with a custom Animation Markup Language that extended HTML. This was before Flash took over web multimedia. The company was later sold to Microsoft.

Along the way, I ended up meeting Steve Jobs at NeXT HQ in Redwood City and getting yelled at, mostly about how awful Apple had been to him.

Fun times.

(PS: I wrote about it a few years later)

Speaking of Knowledge Navigator… many of its predictions have come true.

Siri, however, didn’t fare too well:

A decade ago, on October 4, 2011, a remarkable thing happened: Apple launched Siri.

It started off a bit shaky, but with 10 years of technological advancement, it defied all odds. Instead of fixing any of its problems, creating anything new, or actually answering any of our questions with helpful answers, Siri simply maintained. For a decade, it's continued to suck.

Extending Siri

Siri’s abilities were designed to integrate with the iPhone operating system. During a conversation, you could ask the system to set reminders or send text messages, with deep integration into Apple’s bundled applications. For anything beyond that, Siri would get hopelessly lost and only recite what it had found on the web.

For the first few years, Siri was a closed system. It wasn’t until iOS 10 in 2016, and the introduction of SiriKit, that Siri was opened up to third parties. The first version of SiriKit only supported a fixed list of categories.

iOS 10 Release Notes

iOS 10.2

Siri now works with the following types of apps:

  • Messaging apps to send, search and read back text messages
  • VoIP apps to place phone calls
  • Photos apps to search for images and photos
  • Ride service apps to book rides
  • Payment apps to make personal payments
  • Fitness apps to start, stop, and pause workouts
  • CarPlay automaker apps to adjust climate, radio, seat, and personal settings

iOS 10.3

  • Support for paying and checking status of bills with payment apps
  • Support for scheduling with ride booking apps
  • Support for checking car fuel level, lock status, turning on lights and activating horn with automaker apps
  • Cricket sports scores and statistics for Indian Premier League and International Cricket Council

In iOS 11, Apple enabled integration through App Intent Extensions.
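
For developers, the iOS 10-era SiriKit model meant implementing a handler protocol for one of the supported domains inside an Intents app extension. Here is a rough sketch for the messaging domain; INSendMessageIntentHandling and the response types are Apple’s, while the class name and the message-sending stub are hypothetical:

```swift
import Intents

// A sketch of an iOS 10-era SiriKit handler for the messaging domain.
class SendMessageHandler: NSObject, INSendMessageIntentHandling {

    // Siri asks the app to confirm who the user meant before handling.
    func resolveRecipients(for intent: INSendMessageIntent,
                           with completion: @escaping ([INPersonResolutionResult]) -> Void) {
        let recipients = intent.recipients ?? []
        completion(recipients.map { INPersonResolutionResult.success(with: $0) })
    }

    // Invoked for a request like "Send a message to Ana with <YourApp>".
    func handle(intent: INSendMessageIntent,
                completion: @escaping (INSendMessageIntentResponse) -> Void) {
        // Hand the parsed recipients and content to the app's own
        // messaging layer here, then report success back to Siri.
        completion(INSendMessageIntentResponse(code: .success, userActivity: nil))
    }
}
```

Note the division of labor: Siri owns the speech recognition and parsing, and the app only ever sees a structured intent object, never the raw audio or transcript.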

ℹ️ Side Note

Keep this in mind. Extensibility is a big upcoming theme.

With each release of iOS, the quality of voice recognition seemed to diminish, leading to Siri’s slow fall from grace:

By 2014, the AI Assistant industry had moved on to standalone devices.

More in the next chapter.


Title Photo by omid armin on Unsplash


© 2025, Ramin Firoozye. All rights reserved.