Part I: Genesis


Chapter 1: Can You Elaborate?

Joseph Weizenbaum hated how his invention, ELIZA (aka DOCTOR), had been cast as an intelligent human therapist.

In his letter to Edward ‘Father of Expert Systems’ Feigenbaum in 1970, Weizenbaum wrote:

The thinking I’ve done so far tells me that the earlier ELIZA system (while canned in some respects) is grossly inefficient in many others. Its total ignorance of grammar, for example, cannot be underestimated. Also, the structure of ELIZA scripts (e.g., absence of threads) argues from an entirely new perspective, i.e., from that of belief structures.

That did not stop the media from dramatizing the public reaction.

Six years later, in his 1976 book Computer Power and Human Reason, Weizenbaum pleaded:

The shocks I experienced as DOCTOR became widely known and “played” were due principally to three distinct events.

  1. A number of practicing psychiatrists seriously believed the DOCTOR computer program could grow into a nearly completely automatic form of psychotherapy. […]
  2. I was startled to see how quickly and how very deeply people conversing with DOCTOR became emotionally involved with the computer and how unequivocally they anthropomorphized it. […]
  3. Another widespread, and to me surprising, reaction to the ELIZA program was the spread of a belief that it demonstrated a general solution to the problem of computer understanding of natural language. In my paper, I had tried to say that no general solution to that problem was possible, i.e., that language is understood only in contextual frameworks, that even these can be shared by people to only a limited extent, and that consequently even people are not embodiments of any such general solution. But these conclusions were often ignored.

The reality was much more prosaic.


ELIZA used a basic pattern-matching system to extract the user’s intent and repeat back a variation of what they had typed. Its talent was reframing whatever you entered as a question, as if the answer were at the tip of its tongue – if you could just divulge a little more…

Despite not knowing grammar rules, it correctly adjusted the tense and subject of a response to give its discourse the aura of an educated conversationalist.

What made it different was that it could pick out core nuggets of what you had entered, remember them, and regurgitate them later in the conversation. What in today’s parlance may be called context and memory.

If you asked ELIZA a complex question, tried to challenge it, or reached a conversational dead-end, it simply ignored what you were saying and flipped back to a previous point as if recalling a memory. It was a neat parlor trick, as if it were paying attention and remembering what you had told it, like a Good Listener.
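To make the trick concrete, here is a minimal sketch in modern Python: a handful of regex patterns, a reflection table for the subject/tense flip, and a small memory list for the dead-end fallback. The patterns and phrasing are invented for illustration; the real ELIZA ran from a much larger script, and certainly not in Python.

```python
import random
import re

# Flip subject/tense so "I am sad" can come back as "Why are you sad?"
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your", "am": "are",
    "you": "I", "your": "my", "are": "am", "was": "were",
}

# (pattern, reply templates) – a tiny, made-up script
RULES = [
    (r"i need (.*)", ["Why do you need {0}?", "Would getting {0} really help you?"]),
    (r"i am (.*)",   ["How long have you been {0}?", "Why do you think you are {0}?"]),
    (r"because (.*)", ["Is that the real reason?"]),
    (r"(.*)\?",      ["Why do you ask that?", "What do you think?"]),
]

memory = []  # remembered fragments, replayed at dead-ends

def reflect(fragment: str) -> str:
    return " ".join(REFLECTIONS.get(w, w) for w in fragment.lower().split())

def respond(text: str) -> str:
    for pattern, templates in RULES:
        match = re.match(pattern, text.lower().strip())
        if match:
            fragment = reflect(match.group(1))
            memory.append(fragment)                 # stash a nugget for later
            return random.choice(templates).format(fragment)
    if memory:                                      # dead-end: recall a memory
        return f"Earlier you mentioned {memory.pop(0)}. Can you elaborate?"
    return "Please, go on."

print(respond("I need a vacation"))   # e.g. "Why do you need a vacation?"
print(respond("Blah blah blah"))      # dead-end: replays the remembered fragment
```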

💡 Is something troubling you?

Feel free to try a modern re-implementation of ELIZA.


ELIZA was an attempt at codifying the famous Turing Test (aka Imitation Game) to see if a computer could fool a human. But anyone spending more than a few minutes with ELIZA could see the repeating patterns and its failure to meaningfully answer questions.

Weizenbaum famously decried how this could be conflated with true intelligence:

[O]nce a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible.

ℹ️ Side Note

My first encounter with ELIZA was in the early 80s, running on a Digital VAX 11/750 computer (by then, the size of a half-height small refrigerator). It was easy to find the logical holes in the program and quickly get it into a loop (reminiscent of modern Coding Assistants and Dead Loops).

The pattern-based scheme, however, was my inspiration for developing a PoetryBot. It used Chomsky’s Grammar to generate (literally) reams and reams of poetry, spit out on Z-fold paper off a DECWriter teletype dot-matrix printer. A primitive grading system tried to assess whether the output was sound. It often failed.

Any sane human would gauge its output as utter crap, but every once in a while, there were genuine surprises. I wish I had kept some of the output, but paper was expensive, and the flip side of a page could be used to print TPS Reports.
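For flavor, here is roughly what that kind of grammar-driven generation looks like: a tiny context-free grammar expanded at random, plus a crude grading pass. The grammar, word lists, and grading rule are invented for illustration; the original ran on a VAX and printed to a DECWriter, not a Python interpreter.

```python
import random

# A tiny, made-up context-free grammar: each symbol expands to one of its productions.
GRAMMAR = {
    "LINE":  [["NP", "VP"], ["NP", "VP", "ADV"]],
    "NP":    [["DET", "ADJ", "NOUN"], ["DET", "NOUN"]],
    "VP":    [["VERB", "NP"], ["VERB"]],
    "DET":   [["the"], ["a"]],
    "ADJ":   [["silent"], ["electric"], ["pale"]],
    "NOUN":  [["moon"], ["terminal"], ["river"], ["printer"]],
    "VERB":  [["hums"], ["forgets"], ["devours"]],
    "ADV":   [["slowly"], ["again"]],
}

def expand(symbol: str) -> list[str]:
    """Recursively expand a grammar symbol into a list of words."""
    if symbol not in GRAMMAR:          # terminal word
        return [symbol]
    words = []
    for part in random.choice(GRAMMAR[symbol]):
        words.extend(expand(part))
    return words

def grade(lines: list[str]) -> bool:
    """A primitive 'grading' pass: keep poems whose lines are neither too short nor too long."""
    return all(3 <= len(line.split()) <= 7 for line in lines)

# Generate until a poem passes the grader (most attempts are, frankly, crap).
while True:
    poem = [" ".join(expand("LINE")) for _ in range(3)]
    if grade(poem):
        print("\n".join(poem))
        break
```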

The AI world was obsessed with exploring Lisp’s support for symbolic programming and going down the rabbit hole of the works of Minsky, Papert, and McCarthy.

And then… suddenly, the first wave of AI was declared dead, and we entered The AI Winter.

Everyone doing research moved off their Symbolics and LMI Lisp Machines and on to the first wave of Virtual Reality devices (also funded by Marvin Minsky, who surely recognized a new trend when he saw it).

Marvin Minsky in the early 1980s. RGB Ventures/SuperStock/Alamy

I’m Sorry, Dave…

The AI Assistant that captured the public imagination arrived a mere two years after ELIZA, in 1968, in the form of HAL 9000. The evil sentient computer in Stanley Kubrick’s seminal 2001: A Space Odyssey (and the novel of the same name by Arthur C. Clarke) was exciting and unnerving. It clearly contradicted Isaac Asimov’s benevolent Three Laws of Robotics:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

HAL sparked the imagination of a generation of computer scientists and roboticists who looked past its ultimate fate.

It helped that the science advisor on the film was none other than Marvin Minsky, the eminent co-founder of the MIT Artificial Intelligence Laboratory (a forerunner of today’s CSAIL).


Voice Recognition

For decades, voice recognition had been considered a key component of understanding humans. Let’s take a walk down memory lane. After all:

“Those that fail to learn from history are doomed to repeat it.” – Winston Churchill


Audrey (1952)

Bell Labs

Audrey used custom voice recognition hardware built by AT&T Bell Labs to recognize the spoken digits ‘0’ to ‘9’. It had to be trained on a specific person’s voice, but amazingly, it was 97% accurate. The room-sized circuitry was hidden away behind a wall.


IBM Shoebox (1962)

Shown at the IBM Pavilion at the 1962 Seattle World’s Fair, the device could recognize 16 words, including the digits 0 through 9.


PARRY (1972)

Short for ‘paranoia’, PARRY simulated a patient with paranoid schizophrenia. It was based on the same core technology as ELIZA.

At one point, PARRY had a head-to-head conversation with DOCTOR that might seem familiar to today’s observers of online discussion groups:

[PARRY] *I used to go to the track every week.

[ELIZA] What does that suggest to you?

[PARRY] *I avoid racetracks now because the races are fixed.

[ELIZA] Is that the real reason?

[PARRY] *I’d rather not discuss it.

[ELIZA] What else would you like to discuss?

[PARRY] *That’s something I don’t want to go into.

[ELIZA] Does that trouble you?


HARPY (1972)

Developed at Carnegie Mellon as part of DARPA’s Speech Understanding Research program, Harpy could recognize connected speech drawn from a vocabulary of roughly 1,000 words.

Its test results were impressive, and a deeper dive that included several other contemporary systems showed Harpy to be a formidable bit of tech.


Dragon NaturallySpeaking (1975–1997)

This was one of the first commercial continuous speech-to-text recognition systems, but it did need speaker-specific training. It initially had no semantic knowledge of the content and operated based on matching speech patterns.

As it happened, James K. Baker, a rising force in the world of speech recognition technology, was finishing his PhD thesis at Carnegie Mellon during the DARPA-funded research boom. In his landmark 1975 dissertation, “Stochastic Modeling as a Means of Automatic Speech Recognition,” Baker explored the use of Hidden Markov Models to recognize words from sequences of acoustic observations. This foundational research led to the first commercially viable speech recognition software.
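The core idea, very loosely: treat the words (or phonemes) as hidden states and the acoustic observations as noisy evidence, then find the most likely state sequence. Below is a toy Viterbi decoder over made-up states and probabilities, an illustration of the principle rather than anything resembling Baker’s actual model.

```python
# Toy Hidden Markov Model decoding with the Viterbi algorithm.
# States, observations, and probabilities are invented for illustration only.
states = ["silence", "word_yes", "word_no"]
start_p = {"silence": 0.6, "word_yes": 0.2, "word_no": 0.2}
trans_p = {
    "silence":  {"silence": 0.5, "word_yes": 0.25, "word_no": 0.25},
    "word_yes": {"silence": 0.4, "word_yes": 0.5,  "word_no": 0.1},
    "word_no":  {"silence": 0.4, "word_yes": 0.1,  "word_no": 0.5},
}
# Crude observation symbols standing in for the acoustic features a real system derives from audio frames.
emit_p = {
    "silence":  {"quiet": 0.8, "ee_sound": 0.1, "oh_sound": 0.1},
    "word_yes": {"quiet": 0.1, "ee_sound": 0.8, "oh_sound": 0.1},
    "word_no":  {"quiet": 0.1, "ee_sound": 0.1, "oh_sound": 0.8},
}

def viterbi(observations):
    """Return the most likely hidden-state sequence for the observations."""
    # trellis[t][s] = (best probability of reaching state s at time t, best predecessor)
    trellis = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        trellis.append(column)
    # Backtrack from the best final state.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    path = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["quiet", "ee_sound", "ee_sound", "quiet"]))
# -> ['silence', 'word_yes', 'word_yes', 'silence']
```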

In 1982, Baker and his wife, Janet MacIver Baker, formed Dragon Systems Inc. In 1997, Dragon released the first consumer-grade commercial voice recognition product, Dragon NaturallySpeaking. This software’s selling point was that for the first time in decades of speech recognition research and development, the user did not need to speak haltingly with unnatural pauses for the benefit of the machine. Dragon’s software was the first to process continuous natural speech and remains in use today.

Source

The company had to go through a series of acquisitions and mergers, starting with Lernout & Hauspie, then ScanSoft and Nuance, before finally landing at Microsoft.

Incidentally, the product is still available today.


Jabberwacky (1982)

Jabberwacky (a play on Lewis Carroll’s poem “Jabberwocky”) began as a conversational program on a Sinclair ZX81, but it evolved over time to learn from human conversation.

“It will then start to have a home in physical objects, little robots that are a talking pet.” […]

"If I have my way, people will be walking around, sitting, cooking and more with one on their shoulder, talking in their ear."

Rollo Carpenter - Creator of Jabberwacky

You can try a modern version for yourself.


Talking Moose (1986)

Talking Moose was an early Mac companion that popped onto the screen, narrating what was happening on the system with humorous quips. It used MacinTalk text-to-speech technology and made a good novelty demo to show friends. What made it especially unique was that it could access system menus and windows.

This is where an assistant encroaches on the enclosing operating system, a feature Apple later added to Siri and incorporated into iOS and macOS.


Dr. Sbaitso (1992)

Developed by Creative Labs (of Sound Blaster fame) to show off the capabilities of their PC Sound Cards, it was one of the first chatbots to marry ELIZA-style interactions with text-to-voice output.

Dr. Sbaitso was later turned into a game. There is an emulated version you can try.


ALICE (Artificial Linguistic Internet Computer Entity) aka Alicebot (1995)

ALICE was a rule-based chatbot, famously inspiring the 2013 Oscar-nominated Spike Jonze movie Her. The movie featured Scarlett Johansson as the AI Chatbot Samantha. A decade later, emulating her voice would cause a legal dust-up between Johansson and OpenAI.


Microsoft Bob (1995)

Microsoft was looking to simplify the PC’s user experience and make it more user-friendly. Bob was a re-imagining of the operating system interface. It featured several animated assistant characters, including a dog called Rover.

Bob was based on Microsoft Agent technology, which incorporated speech recognition, text-to-speech, and access to Office and Windows environments.

ℹ️ NASA Digital Library

Experimenting with human-machine interfaces was big in the mid-90s. Trying to break out of the all-too-familiar office desktop metaphor or industrial dashboards chock full of sliders and gauges.

Back then, I worked at NASA Ames Research Center to devise a way to search for and access large amounts of satellite imagery, day-to-day images, and video data. A sort of Digital Library, if you please.

The Proof of Concept was a live, 3D-rendered Digital Library with wood-grain media racks, a librarian’s desk, printers, and CD writers. You could walk through virtual stacks and look for, say, GEOS Satellite Images or historical lunar data. You could always search by text keywords, but the idea was to capture the serendipity and discovery that comes with walking through a 3D VR space and then turning around to find something interesting.

However, the budget for a full rollout was (in retrospect, sensibly) denied by NASA HQ. 🤷🏻‍♂️


Clippy (1997)

Clippy was Microsoft’s attempt at integrating an embedded assistant to help new users unfamiliar with Microsoft Office. Clippy used the same Microsoft Agent technology as Bob and unfortunately faced similar criticisms. Clippy, however, was foisted onto millions of standard Windows computers running Office 97, much like U2’s Songs of Innocence was crammed onto iTunes without users asking for it.

⚙️ Dots

Under its animated agent interface, Clippy worked as an extension to Microsoft Office. We’ll cover extensions in depth later.


Prody Parrot (1999)

A Windows assistant that flew around the screen, squawking as it read aloud messages and offered to help with desktop tasks.

⚙️ Side Note

Notice a pattern? Most of these were variations on embedding an assistant inside the operating system to help with complex tasks.

What we might call agentic today.


SmarterChild (2001)

This was an add-on Instant Messenger bot for AIM and MSN. It used conversational AI alongside real-time data like stocks and weather.

💡 External Services

This was the first time an assistant managed to integrate with external services and receive fresh data. One of the use cases for MCP today.
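The integration pattern, in spirit, looks something like the sketch below: match an intent in the incoming message, call out to an external service for fresh data, and reply. The intent patterns and the fetch_weather / fetch_stock helpers are hypothetical stand-ins, not SmarterChild’s actual implementation.

```python
import re

# Placeholders for real external service calls (SmarterChild pulled live stocks, weather, etc.).
def fetch_weather(city: str) -> str:
    canned = {"seattle": "rain, 52F", "austin": "sunny, 95F"}
    return canned.get(city.lower(), "no data")

def fetch_stock(symbol: str) -> str:
    canned = {"IBM": "84.12", "MSFT": "26.50"}
    return canned.get(symbol.upper(), "unknown symbol")

# Intent patterns: the conversational half routes messages to the data half.
HANDLERS = [
    (re.compile(r"weather in (\w+)", re.I),
     lambda m: f"Weather in {m.group(1)}: {fetch_weather(m.group(1))}"),
    (re.compile(r"quote for (\w+)", re.I),
     lambda m: f"{m.group(1).upper()} is trading at {fetch_stock(m.group(1))}"),
]

def handle_message(text: str) -> str:
    for pattern, responder in HANDLERS:
        match = pattern.search(text)
        if match:
            return responder(match)
    return "I'm not sure about that. Try asking about the weather or a stock quote."

print(handle_message("what's the weather in Seattle?"))
print(handle_message("can I get a quote for msft"))
```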

But then again…

The Second AI Winter

https://media.makeameme.org/created/Brace-yourself-Winter.jpg

It was around 2001 that AI Assistant technology took a nearly decade-long break. The Internet started taking off, and as we all know, the iPhone dropped and shook up the tech world.

ArnoldReinhold, CC BY-SA 4.0 via Wikimedia Commons https://creativecommons.org/licenses/by-sa/4.0
ℹ️ iPhone Announcement

I was at Macworld Expo 2007 in San Francisco when the iPhone dropped. It set me off on a decade-long run of mobile apps and hardware adventures. Once the App Store opened to third-party apps, I got a booth at Macworld Expo and ended up with a Best of Show award and an App Store front-page feature that paid off all the development costs.

The App Store was a means for adding approved binary extensions (apps) to a running system (iOS). Let’s keep that in mind.

On the Assistant front, there wasn’t much going on. Winter and all…

Meanwhile…

Text-to-speech and speech-to-text technology in the 1980s needed to get much, much better.

And they did.

Enter DECTalk, a standalone hardware device to which you could send a string of text (with embedded ‘escape’ sequences) over a serial port. Inline commands could change voices, intonation, timing, pauses, and other variables. The technology behind it was hardcore fascinating.

DECTalk was useful to the burgeoning Interactive Voice Response (IVR) market. Enterprises were looking to save call center costs and allow customers to use their touch-tone phones to navigate phone menus on their own. IVR applications would accept the user input (in the form of digits 0-9, * and #), look up something from a database, fill in a template, and have DECTalk speak it back in a human-sounding voice.
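The flow is simple enough to sketch: take the caller’s touch-tone digits, look up a record, fill a template, and push the resulting text out a serial port to the synthesizer. Everything here (the account data, the port name, the pyserial dependency) is a placeholder for illustration; this is not actual DECTalk command syntax or a real IVR stack.

```python
import serial  # pyserial; the port name below is a placeholder

# Fake back-end lookup standing in for the call center's database.
ACCOUNTS = {
    "1234": {"name": "J. Smith", "balance": "142.17"},
    "5678": {"name": "A. Jones", "balance": "9.03"},
}

def build_utterance(digits: str) -> str:
    """Fill a spoken-response template from the caller's touch-tone input."""
    record = ACCOUNTS.get(digits)
    if record is None:
        return "Sorry, I could not find that account. Please try again."
    # A real DECTalk string would also carry inline voice/pause commands here.
    return (f"Hello {record['name']}. "
            f"Your current balance is {record['balance']} dollars. "
            f"Thank you for calling.")

def speak(text: str) -> None:
    """Send the text to the hardware synthesizer over a serial line."""
    with serial.Serial("/dev/ttyS0", baudrate=9600, timeout=1) as port:
        port.write(text.encode("ascii") + b"\r\n")

if __name__ == "__main__":
    digits = "1234"            # in a real IVR these digits arrive as DTMF tones
    speak(build_utterance(digits))
```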

It was also used by the National Weather Service and, famously, Stephen Hawking.

Good text-to-speech was a necessary step on the path to having Assistants like Alexa respond in a natural voice.
ℹ️ Side Note

My first job out of college was near Page Mill Road and Foothill Expressway in Palo Alto, down the street from Xerox PARC. I would regularly bump into people from Hewlett Packard, SRI, and DEC-WRL (DEC Western Research Lab). This was an era of technical seminars at Stanford, book readings at Printer’s Inc. bookstore, and later, drinks at The Varsity, the Oasis, or Antonio’s Nut House.

My day job was to work on new, industrial Man-Machine Interfaces. Being a DEC shop, we got access to early-release versions of DECTalk. We spent weeks going deep into pronouncing technical terms and custom utterances. The documentation wasn’t that great, and it was a slow slog, full of trial and error, especially when it came to Vocal Tract Models.

On weekends, I moonlighted as a bartender at the Varsity Theater on University Avenue in downtown Palo Alto. There was live music and a bar where you could have conversations about anything from the music of Michael Hedges (who played there regularly) to the nature of the cosmos, and applications of new tech.

One of my favorite memories was overhearing several patrons at the bar mentioning their work at DEC-WRL down the street. I’m sure they were amused to have their bartender grill them about the inner workings of DECTalk. The next workday, those tips helped me figure out the problem.


Coming up next… Part II: Hey, Siri.

Further Reading


Title Photo by Brett Jordan on Unsplash


© 2025, Ramin Firoozye. All rights reserved.