Many years ago, back in the late 1980s, I worked at a speech recognition startup company. We had some really smart cookies working there and thought that we had continuous speech recognition solved. Which is to say, we created a product that could listen to you speak and output a text transcription of your speech, with very few errors. Our product ran in near real time, which was quite fast for the day. We did use custom hardware and firmware to do it, which meant that our product was not the cheapest on the market, but it was one of the best.
We broke the speech audio up into phoneme-like chunks and used Hidden Markov Models to look at the chunks of speech recognized so far in order to predict the most likely next chunk. We made a few proprietary tweaks to the process to make it faster and smarter, which was our competitive "secret sauce".
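To make that concrete, here is a tiny, purely illustrative sketch of the general approach (not our actual code; the phoneme inventory, observation labels, and probabilities are all made up): a Hidden Markov Model whose hidden states are phoneme-like units, with the Viterbi algorithm picking the most likely unit sequence for the acoustic observations seen so far.

```python
import numpy as np

# Toy HMM over phoneme-like hidden states. Everything here is illustrative.
states = ["sil", "k", "ae", "t"]              # hidden phoneme-like units
obs_symbols = ["quiet", "burst", "vowel"]     # crude acoustic observation labels

start_p = np.array([0.7, 0.1, 0.1, 0.1])      # P(first state)
trans_p = np.array([                          # P(next state | current state)
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.6, 0.2, 0.1, 0.1],
])
emit_p = np.array([                           # P(observation | state)
    [0.8, 0.1, 0.1],   # "sil" mostly sounds quiet
    [0.1, 0.8, 0.1],   # "k" sounds like a burst
    [0.1, 0.1, 0.8],   # "ae" sounds like a vowel
    [0.2, 0.7, 0.1],   # "t" sounds burst-like
])

def viterbi(observations: list[str]) -> list[str]:
    """Return the most likely hidden phoneme sequence for the observations."""
    obs = [obs_symbols.index(o) for o in observations]
    # Log-probability of the best path ending in each state, plus backpointers.
    scores = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = []
    for o in obs[1:]:
        cand = scores[:, None] + np.log(trans_p)   # score of (prev state, next state)
        back.append(np.argmax(cand, axis=0))       # best previous state for each next state
        scores = np.max(cand, axis=0) + np.log(emit_p[:, o])
    path = [int(np.argmax(scores))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(["quiet", "burst", "vowel", "burst"]))  # -> ['sil', 'k', 'ae', 't']
```

The real system worked over a much larger unit inventory and richer acoustic features, of course, but the shape of the problem is the same: the decoded path so far constrains what the most likely next chunk can be.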
But the reality was that we hit a wall right around 85% recognition accuracy. We could improve things by building speaker-specific models (which users hated, since they had to spend a lot of time repeating phrases and sentences into a microphone), by training users on careful mic placement, and by using context to constrain the likely words and general grammar our system would encounter (such as making a special version for mammography radiologists dictating their x-ray image readings).
For some users, with all these improvements, we would see 95% accuracy or even better. But for most users, things didn't improve that much, and our system was rather frustrating to use, since it was effectively getting one word out of every seven to ten wrong. Not great for a dictation system, if the user had to go back through the transcription and hand-edit dozens of mistakes. We still made a lot of sales, since our systems were sexy "AI" and just about the best on the market.
So now here I am playing with Whisper.cpp, this truly remarkable and completely free speech recognition system that runs on a low-end desktop PC (or even a Raspberry Pi 5, with some limitations), runs in almost real time, works with a cheap microphone that I've stuck in a plastic pineapple, uses the sexy magic of "LLM AI", doesn't need a special speaker-trained model to work, and ... gets around 85% recognition accuracy.
Wow, how familiar.
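If you want to try something similar, the basic invocation is simple. Here's a minimal sketch (not a prescription) of driving the whisper.cpp example binary from Python; the model and audio file paths are placeholders for whatever you've built and downloaded locally, and the `main` program with its `-m`/`-f` flags is what the whisper.cpp README documents (newer builds ship it as `whisper-cli`).

```python
import subprocess

# Run the whisper.cpp example binary on a WAV file and print the transcript
# it writes to stdout. Paths and model choice are placeholders.
result = subprocess.run(
    ["./main", "-m", "models/ggml-base.en.bin", "-f", "recording.wav"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # timestamped transcript lines
```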
Basically, in the nearly 40 years since I worked at a speech recognition company, we've:

- gone from expensive custom hardware and firmware to a low-end desktop PC (or even a Raspberry Pi);
- dropped the need for speaker-specific training and fussy microphone placement;
- driven the price of the software itself down to free; and
- ended up at ... roughly the same 85% recognition accuracy.
As for where things are going, I found myself nodding my head at a lot of the points MIT's Rodney Brooks raised during this interview with TechCrunch. Especially the last few sentences of the article, since they relate directly to what I'm aiming to do with Wanda:
Brooks acknowledges that LLMs could help at some point with domestic robots, where they could perform specific tasks, especially with an aging population and not enough people to take care of them. But even that, he says, could come with its own set of unique challenges.
“People say, ‘Oh, the large language models are gonna make robots be able to do things they couldn’t do.’ That’s not where the problem is. The problem with being able to do stuff is about control theory and all sorts of other hardcore math optimization,” he said.
Brooks explains that this could eventually lead to robots with useful language interfaces for people in care situations. “It’s not useful in the warehouse to tell an individual robot to go out and get one thing for one order, but it may be useful for eldercare in homes for people to be able to say things to the robots,” he said.
And maybe, with projects like Wanda, it will be possible to help people with self-eldercare, in a personal, privacy-protecting, and even fun way. That's my hope.