How AI Could Soon Take Human-Computer Interaction to New Levels

It was a typical Friday afternoon at the end of a long week of work on our project developing a radically new concept and app for molecular graphics in augmented and virtual reality, when I found myself in a heated discussion with my friend and colleague. He is a "hardcore" engineer, web programmer, and designer who has been in the trenches of web development for over a decade. As someone who prides himself on efficiency and control over every line of code, and who above all always has the user and the user experience in mind, my friend scoffed at my idea of voice interfaces soon becoming the norm…
"Speech interfaces? They're immature, awkward, and frankly, a little creepy", he said, not in these exact words but certainly meaning them, voicing a sentiment that many in the tech community share. And this was after I had already half-convinced him, maybe by 30–50%, that our augmented/virtual reality tool for molecular graphics and modeling absolutely needs this kind of human-computer interaction: since the users' hands are busy grabbing and manipulating molecules, there's no other way for them to control the program, for example to run commands and the like.
More broadly, speech-based interfaces (or Voice User Interfaces, VUIs) can be a game-changer for work or entertainment situations where the hands are busy, and they can facilitate accessibility for people with disabilities: combined with regular GUIs, they would make software inclusive even for visually, hearing, and motion-impaired users. All of this makes the topic important to discuss and evaluate from the viewpoint of technology and UX design, and we must do so often, given how fast technology evolves. Moreover, as I will discuss here, I think the technology is getting to the point where it can already be pushed for, contrary to my colleague's viewpoint, which remains quite negative.
I do acknowledge, though, that my colleague's concerns aren't unfounded. He argues that speech interaction with computers is still plagued by inaccuracies, a frustrating need to repeat oneself, and a general lack of fluidity. And to an extent, I do know he's right. (But… read on!)
A short but relevant detour: Voice User Interfaces as imagined by Star Trek
While I debated with my colleague about the current limitations of speech-based interfaces / VUIs, I couldn't help but think of Daley Wilhelm's articles exploring the future of UX, in particular her insightful piece titled Did Star Trek Predict the Future of UX?
(and by the way, I also recommend her article Designers: you need to read science fiction)
In her article on Star Trek predicting the future of UX, Daley Wilhelm discusses how the VUIs in Star Trek set user expectations for technology, shaping a big chunk of how we interact with our devices today. The seamless, intuitive voice commands that the crew of the Enterprise use to control their ship represent an ideal of what human-computer interaction could be… talking to the computer just like to another human. Star Trek got iPads and hand gestures right, and even some aspects of multitouch displays, so… did it guess the future of VUIs right, too?
The same goes, in even more advanced form, for Lt. Commander Data, the highly sophisticated android from Star Trek: The Next Generation, and the Emergency Medical Hologram Doctor from Star Trek: Voyager, both capable of sustaining very complex conversations – and even of using speech-based thinking themselves (Further detour: Are human/artificial language models linked to human/artificial intelligence?).
Back to Daley Wilhelm, her key point is that while Star Trek's vision of the future was ahead of its time, our real-world technology hasn't quite caught up – at least, not in the way the series imagined. In Star Trek, the crew interacts with the ship's computer largely through voice commands, whether to access information, control ship functions, or even replicate food and beverages – yet with limitations, as she exemplifies.
This vision of a future where voice interfaces are the primary mode of human/robot/hologram-computer interaction is captivating and, for many like myself, an aspirational goal. And leaving aside my subjective opinion, there are all the advantages I outlined in the opening paragraphs.
In Star Trek, the conveyed ability to issue complex, context-rich commands and receive accurate, timely responses seems like a natural extension of technology's potential. For example, Captain Picard could request a specific flavor of tea, at a specific temperature, and instantly receive exactly what he wanted – no fuss, no misunderstandings. But as Daley Wilhelm points out, modern voice assistants like Siri, Alexa, and Google Assistant struggle to meet these expectations, and by quite a margin. Today's users often find these systems falling short of the conversational, context-aware interactions that Star Trek made us dream of. On the other hand, Daley Wilhelm presents an example of Star Trek's computer not really understanding the user: when Geordi La Forge asks the computer for music with a "gentle Latin beat", the computer initially fails to deliver the exact type of music he had in mind, highlighting the challenges of ambiguity in natural language processing. I quote this specific example from her article because I will come back to it later on in the context of modern (real-world, 2024) technology.
But my point is that the limitations discussed by Daley Wilhelm resonate, at first look, with many users and developers today, including my colleague. Unlike the seamless interactions depicted in Star Trek, our current VUIs often stumble over complex queries, struggle to understand context, and sometimes return irrelevant or incorrect responses. The reliance on recall, where users need to know exactly what they want to ask or command, contrasts sharply with the more natural recognition-based interaction that users typically expect. Thus, when using modern VUIs we often find ourselves needing to adapt to the technology – learning specific commands or phrasing questions in ways the system can understand – rather than the technology adapting to us. But my point, to be developed below, is that current technology has much more to offer and probably isn't that far behind.
In particular, note that since Daley Wilhelm published her article, the technological landscape has evolved quite rapidly. By January 2023, when her article was published, OpenAI's first really large and "smart" language model, GPT-3, was still relatively new – and I had tried it through the API, before ChatGPT came out, astonished at the possibilities it could open up for more fluid and natural VUIs:
Control web apps via natural language by casting speech to commands with GPT-3
Then, we all know what followed GPT-3: even more advanced large language models (LLMs) from OpenAI's GPT family, plus Meta's Llama, Google's Gemini, and many others, all appearing smarter every month. AI took over, and it did so through LLMs.
Moreover, the AI revolution also hit the field of speech recognition, with models like OpenAI's Whisper V3 (and simpler-to-deploy wrappers like Gladia's) that work much better than models from just a couple of years ago. Modern speech recognition models like Whisper V3 (and there are more, see here) are capable of not just transcribing a person's oral input but also detecting the various speakers (diarization), labeling them, adding timestamps, automatically detecting and switching languages, etc.
Technologies for speech recognition have been around for a long time and are even built into some products like Google Chrome (allowing programmers to add speech recognition and also speech synthesis capabilities to their web apps very easily!), but the new systems that came out in the last 2–3 years brought reality much closer to the expectations set by Star Trek.
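To give an idea of how accessible this already is, here is a minimal sketch of browser-based speech recognition through the Web Speech API as exposed by Chrome; the #talk-button element is just an assumed button on the page, and a real app would of course do more with the transcript than log it:

```typescript
// Minimal sketch: speech recognition in the browser via the Web Speech API
// (Chrome exposes it as webkitSpeechRecognition).
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";         // language of the expected speech
recognition.interimResults = false; // only report finalized transcripts
recognition.continuous = false;     // stop after one utterance

recognition.onresult = (event: any) => {
  // Take the transcript of the last result and hand it to the app
  const transcript = event.results[event.results.length - 1][0].transcript;
  console.log("User said:", transcript);
};

recognition.onerror = (event: any) => console.warn("Recognition error:", event.error);

// Recognition must usually be started from a user gesture, e.g. a button click
document.querySelector("#talk-button")?.addEventListener("click", () => recognition.start());
```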
The same goes for speech synthesis, which has also existed for some time but, let's be honest, with somewhat creepy voices that have now improved a lot (in particular, I can't help but recommend Talkify.net). I am of the opinion that all these advancements are truly beginning to fill the gaps that Daley Wilhelm highlighted, bringing us closer to an idealized VUI experience that relies largely on natural speech.
Moreover, systems that are at advanced stages of development but not massively released as of August 2024, such as OpenAI's voice-enabled GPT model that the company rolled out to just a small number of users for testing, have the built-in capacity to process input and output sounds and language at the core of an "omni" model. And to be clear, "sounds" means not only speech but, in principle, any other kind of sound, thanks to the fact that speech recognition, speech synthesis, and language processing are not decoupled into separate models (as one could do before, and as I showed above in one example) but handled by a single "brain" that processes everything natively.
Four "technological musts" for VUIs to fit in
Having reflected on VUIs for quite some time and run many tests myself, I have arrived at a minimum of four key elements required to achieve truly fluid human-computer interaction via speech, each of which must reach a level of sophistication where they can all work seamlessly together. These are speech recognition, understanding, text processing, and speech synthesis. Each plays a crucial role in creating an experience that feels natural and intuitive.
First, speech recognition must be accurate and adaptable. It is not just about transcribing spoken words into text; it is about doing so with a high degree of precision, regardless of the speaker's accent, dialect, or use of specialized jargon. This technology thus needs to be tunable, for example extendable with grammar pertinent to the system's specific applications, and perhaps capable of learning from its interactions with specific users to improve over time by adapting to their accents, word choices, and other nuances. Only then can speech recognition become reliable enough for everyday use, where even slight misunderstandings can lead to significant frustration. Many modern AI-powered systems for speech recognition have such capabilities: they can be prompted to pay attention to specific words or content, fine-tuned to be more sensitive to certain grammar, etc.
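As a concrete illustration of this kind of priming, here is a minimal sketch using OpenAI's hosted transcription endpoint, whose prompt parameter can bias the transcription toward particular spellings and jargon; the jargon list and the audioBlob variable are made up for the example, and I assume a Node 18+ environment where fetch, FormData, and the OPENAI_API_KEY variable are available:

```typescript
// Sketch: biasing a Whisper-based transcription toward domain jargon
// via the prompt parameter of OpenAI's /v1/audio/transcriptions endpoint.
async function transcribeWithJargon(audioBlob: Blob): Promise<string> {
  const form = new FormData();
  form.append("file", audioBlob, "utterance.webm");
  form.append("model", "whisper-1");
  // Hypothetical priming terms from a molecular graphics context
  form.append("prompt", "PDB, ribbon representation, van der Waals surface, ligand docking");

  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  const data = await res.json();
  return data.text; // the transcribed utterance
}
```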
Next, there's the matter of understanding: the system's ability to grasp the meaning behind the words. Whether this understanding is genuine (a big thing in itself, more related to Artificial General Intelligence systems, AGIs, and the trillion dollar question about whether we can recreate intelligence in computers) or a sophisticated illusion that passes the Turing test, it must be convincing enough to make the user feel they are having a real conversation. This involves not only parsing individual commands but also understanding context, intent, and the subtleties of human language, such as tone, humor, or implied meanings.
Text processing is equally vital, as it serves as the backbone for input, output, and, for the moment, for quite a bit of the "thinking". The ideal piece of text-processing technology must be capable of keeping track of an ongoing conversation, maintaining context, and responding appropriately as the dialogue evolves. Again, it is not just about processing words but about doing so in a way that aligns with the natural flow of the conversation, allowing the interaction to feel coherent and relevant throughout.
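For example, with today's LLM APIs the simplest way to keep that context is to resend the running message history on every turn. Here is a minimal sketch, assuming OpenAI's chat completions endpoint; the system prompt and model name are just illustrative choices:

```typescript
// Sketch: maintaining multi-turn conversational context by resending
// the accumulated message history with every new user utterance.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const history: ChatMessage[] = [
  { role: "system", content: "You are the voice assistant of a molecular graphics app." },
];

async function converse(userUtterance: string): Promise<string> {
  history.push({ role: "user", content: userUtterance });

  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages: history }),
  });
  const reply = (await res.json()).choices[0].message.content as string;

  history.push({ role: "assistant", content: reply }); // remember what was said
  return reply;
}
```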
Finally, speech synthesis must produce output that feels natural to the listener. Synthetic speech should mirror human speech patterns, including rhythm, intonation, and emotional nuance, to avoid a kind of "uncanny valley of sounds" (check this to learn about this fascinating subject). When speech synthesis reaches a sufficiently human-like level, the interaction will no longer feel robotic, leaving the uncanny valley behind and enhancing the overall user experience.
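For short spoken replies, even the synthesis built into modern browsers (which I come back to below) does a reasonable job. A minimal sketch with the standard speechSynthesis API, picking whatever English voice the browser happens to offer:

```typescript
// Sketch: speaking a short reply with the browser's built-in speech synthesis.
function speak(text: string): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = "en-US";
  utterance.rate = 1.0;  // default speaking rate
  utterance.pitch = 1.0; // tweakable, though naturalness depends mostly on the voice

  // Pick a concrete voice if one is available; names vary per browser and OS
  const voice = window.speechSynthesis.getVoices().find(v => v.lang.startsWith("en"));
  if (voice) utterance.voice = voice;

  window.speechSynthesis.speak(utterance);
}

speak("The molecule has been loaded. What would you like to do next?");
```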
When these four technologies are ripe enough and work together at their best, the result should be a VUI that we humans will actually want to use. Still, the problem of integration with other software and hardware systems will remain if we intend to achieve a VUI like that of Star Trek's computer; and this might never fully happen, due to the problems of privacy and cross-software and cross-device compatibility discussed further below.
My assessment of how ripe these four technologies are
Speech Recognition: 7/10
Speech recognition has made significant progress, especially in recent years, becoming quite reliable in many scenarios, particularly in quiet environments and with speakers who have standard accents. However, challenges persist in accurately recognizing speech in noisy settings or when dealing with heavy accents or speech impediments – although I must say some modern models like Whisper V3 are surprisingly good at times (and yes, failing at others!). While the technology is robust and increasingly effective, it still falls short in certain nuanced or complex situations. In other words, it is barely getting there – as of 2024.
Understanding (and thinking): 8/10
Understanding, particularly through the lens of LLMs, has reached an impressive level. And this applies to several models, probably at least to the top 10–20 in the LMSYS Chatbot Arena leaderboard.
As I will develop below, modern LLMs are capable of handling complex dialogues, grasping context, and generating human-like responses. This capability allows them to mimic human understanding effectively, even in multifaceted conversations.
LLMs can also "think", either directly or by writing and executing code, which makes them quite powerful. Moreover, this feature provides the plug that could interface the VUI with other software and hardware systems for full integration (I sketch this idea in code right after these assessments).
Despite this, the technology isn't flawless, especially when deeper comprehension of context or ambiguity is required. Overall, it's quite good, thanks to the advancements in LLMs.
Text Processing: 9/10
Text processing, again especially when powered by LLMs, has become highly advanced. These systems can maintain context, manage multi-turn conversations, adapt to varied (even changing) languages, and generate responses that are not only contextually appropriate but also coherent over extended dialogues. The technology excels in understanding and producing text, making it probably the most ready-to-deploy pillar of the four required for fluid VUIs.
Speech Synthesis: 5/10
Speech synthesis, while capable of producing intelligible and sometimes emotionally resonant speech, often falls short of sounding truly natural. Many synthetic voices still have a rather "robotic" quality, lacking the subtlety and spontaneity of human speech. This can be particularly noticeable over longer interactions, where the synthetic nature of the voice becomes more apparent. For now, my view is that there's considerable room for improvement before it reaches the level of fluid, natural conversation.
This said, I think that some speech synthesis systems, like the one built into Chrome or the one offered by Talkify.net, are pretty decent for VUIs that only need to speak short sentences.
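To make the "plug" I mentioned under Understanding more concrete, here is a minimal sketch of LLM tool calling used to bridge a voice command to actual app functionality. It assumes OpenAI's chat completions endpoint with its tools mechanism, and the load_molecule function is a hypothetical command of a molecular graphics app like ours:

```typescript
// Sketch: letting an LLM translate a spoken request into a call that the
// surrounding software then executes (the function itself is hypothetical).
const tools = [
  {
    type: "function",
    function: {
      name: "load_molecule",
      description: "Load a molecule into the 3D viewer by its PDB identifier",
      parameters: {
        type: "object",
        properties: { pdbId: { type: "string", description: "4-character PDB ID" } },
        required: ["pdbId"],
      },
    },
  },
];

async function handleUtterance(utterance: string) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: utterance }],
      tools,
    }),
  });
  const message = (await res.json()).choices[0].message;

  // If the model decided to call an app function, dispatch it
  for (const call of message.tool_calls ?? []) {
    if (call.function.name === "load_molecule") {
      const { pdbId } = JSON.parse(call.function.arguments);
      console.log("The app would now load molecule", pdbId);
    }
  }
}
```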
The role of Artificial Intelligence
You have seen above how my discussion and arguments quickly turned to AI. No wonder.
One of the key issues with traditional VUIs is their limited ability to handle complex conversational language, especially at the stages of speech recognition and of "thinking" to produce a response (speech synthesis, in turn, works more reliably, although it is not very natural yet). As I hinted at the end of the previous section, LLMs are changing the game by providing a level of understanding and context-dependent processing that was previously unattainable, and simply unthinkable a decade ago.
Modern LLMs can process and generate language in ways that can certainly make interactions feel more natural and less "robotic". These models "understand" context, recognize nuances in speech, jargon, and content, and can follow the flow of a conversation – all elements that address most criticisms about the current state of VUIs.
For instance, and back to the Star Trek examples, with the integration of LLMs a modern voice assistant could perfectly handle Geordi's request for music with a "gentle Latin beat", actually with much greater sophistication than what Star Trek depicted for successful human-computer interaction via speech. If you asked a modern LLM for "something with a gentle Latin beat, maybe a Spanish guitar", I am sure it would not let you down, not only understanding the request but possibly also following up with something like "Would you like something upbeat or more relaxed?", thus demonstrating an ability to engage in meaningful, very reasonable conversations. Taking this to the extreme, there are quite good AI models that can generate music… so it is feasible that a program built with 21st-century technology could even make up a song for Geordi, fully tailored to his request!
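As a sketch of what such a music assistant could do under the hood (the JSON schema and prompt here are my own assumptions for illustration, not how any existing assistant actually works), the LLM can be asked to return either a concrete query or a clarifying question:

```typescript
// Sketch: turning a free-form request into either a music query or a
// clarifying follow-up question, as structured JSON the app can act on.
async function interpretMusicRequest(request: string) {
  const messages = [
    {
      role: "system",
      content:
        "You control a music player. Reply with JSON only: " +
        '{"action":"play","query":"<search terms>"} if the request is specific enough, ' +
        'or {"action":"clarify","question":"<follow-up question>"} if it is ambiguous.',
    },
    { role: "user", content: request },
  ];

  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages }),
  });
  // In a real app one would validate or repair the JSON before parsing it
  return JSON.parse((await res.json()).choices[0].message.content);
}

// interpretMusicRequest("Something with a gentle Latin beat, maybe a Spanish guitar")
// might return: {"action":"clarify","question":"Would you like something upbeat or more relaxed?"}
```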
Similarly, advances in AI-powered speech recognition have also significantly reduced the inaccuracies and frustrations associated with voice commands. Systems like Whisper V3, open-sourced by OpenAI and made available for easy deployment by a number of companies that provide the infrastructure and help to tune and run the model, have made it possible for voice interfaces to accurately transcribe and understand spoken language, even in noisy environments, with accents, with multiple people talking at once, and other complications. Moreover, modern AI-based speech recognition systems can automatically detect the language even if it changes on the fly, distinguish different speakers, assign timestamps, tell homophones apart, remove swearing, and be primed to detect certain words, thus allowing jargon and local expressions to be incorporated:
Gladia – A review of the best ASR engines and the models powering them in 2024
All these features of modern technologies make speech recognition much more flexible than what assistants like Siri can do today, certainly pushing us closer to the seamless interaction envisioned by Star Trek. Although, I agree, we aren't there yet.
Could the next big thing, that is, multimodal models or AGIs (as I briefly discussed above regarding the next GPT model with integrated sound processing), be the last turn of the key?
Closing the gap completely
Leaving aside the fact that speech recognition is almost but not quite there yet, and that speech synthesis has only started to feel natural in the last few years, there is still one big issue that Daley Wilhelm identified and discussed in her Star Trek-inspired article. As she notes, today's VUIs (and I would add even what modern technology can offer today) still lack the deep integration across systems that would allow for the kind of command-and-control functionality depicted in the series. While modern AI models can perfectly understand requests and even "think" about them and proceed accordingly, they still fall short when it comes to actually carrying out tasks that exceed their embodiments. Thus, apparently complex but actually rather simple functions become difficult if not impossible, simply due to the lack of integration of the software with other software running on the device, let alone with other devices – except when it is specifically coded, as we are doing for our AR/VR app for molecular graphics.
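To illustrate what "specifically coded" means in practice, here is a minimal sketch of the kind of hand-wired command routing one ends up writing; the command names and handlers are hypothetical, loosely inspired by what a molecular graphics app needs:

```typescript
// Sketch: routing a recognized transcript to app-specific handlers,
// i.e., integration that works only because it was explicitly coded.
type Handler = (args: string) => void;

const commands: Record<string, Handler> = {
  load:   (args) => console.log("Loading structure:", args),
  rotate: (args) => console.log("Rotating view:", args),
  color:  (args) => console.log("Coloring selection:", args),
};

function dispatch(transcript: string): void {
  const lower = transcript.toLowerCase().trim();
  for (const [keyword, handler] of Object.entries(commands)) {
    if (lower.startsWith(keyword)) {
      handler(lower.slice(keyword.length).trim());
      return;
    }
  }
  console.log("Sorry, I did not recognize that command:", transcript);
}

dispatch("Load hemoglobin"); // -> Loading structure: hemoglobin
```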
Following up on that integration problem, one way to go beyond a software-only AI model is to give it direct physical access to the world, essentially embodying it in a robot. This is pretty much what OpenAI is trying to achieve together with Figure and its series of "AGI robots", where AGI stands for Artificial General Intelligence. Check out this video if you don't know what I'm talking about!
But with such robots, or with any other device deeply integrated with other devices, or even with software integrated with the rest of the software on a single device, privacy concerns surface rapidly. These always-on, always-listening devices might be too invasive, and must certainly be regulated – a difficult terrain I won't get into here, but one of crucial relevance. Users need to feel secure that their data is protected and that their devices aren't silently recording everything they say. The issue is thus a big barrier to the widespread adoption of VUIs in every aspect of daily life.
So, are we and computers ready for routine interaction via speech?
My take is that humans might not be too ready, but computers are almost reaching the point!
As we continue to refine voice interface technologies, the potential for them to become a routine part of human-computer interaction is increasingly within reach. The advancements we have seen in speech recognition, language processing, and speech synthesis are transforming the way we think about interacting with our devices. What was once the realm of science fiction is now edging closer to our daily reality: for the last 5–10 years, many mid-range cars have accepted spoken commands that let drivers dial numbers or pick up calls, just as Siri, Alexa, and other home systems let users quickly find information online via natural language, ask for music, control lights, and do a few other things. Our phones are becoming increasingly sensitive to our needs, taking input via speech commands as well as reading output out loud. Medium, the platform where you are reading this, incorporated around a year ago a quite natural speech synthesis system that reads stories to you. And I'm sure you can find more examples from your own environment where even basic VUIs are quite useful.
Programming these capabilities is itself becoming easier, allowing developers to incorporate better, larger, more sophisticated VUIs. I showed you some examples earlier, and here you have some more – all with a web twist, as I like to do for my projects, since this allows my prototypes and apps to run right away on all devices:
A Web App for Automated E-mail Writing From Voice Notes, Using GPT-3
Note the role of technology integration, which is particularly easy and fruitful in web-based content and web apps:
Coupling Four AI Models to Deliver the Ultimate Experience in Immersive Visualization and Modeling
The future ahead
Reflecting on the trajectory of these technologies, it is clear that while we have not yet fully reached the seamless experience depicted in futuristic visions, we are certainly getting closer. The gap between user expectations and what VUIs can actually deliver is narrowing. With ongoing developments, we might soon see a time when voice interaction isn't just a novelty but a preferred method of engaging with technology – perhaps even in ways that would make the crew of the Enterprise proud!
In fact, I think technology has already evolved so much in this direction that computers are now more ready than humans for speech-based interaction.
Maybe in a decade or so, today's virtual assistants like Siri and Alexa, or the one in your car, will be remembered as the "grumpy grandparents" who started it all. They laid the groundwork for voice interfaces, making it possible to control devices, ask questions, and perform at least some basic tasks using just our voices. The technologies I have discussed here, still under very rapid development, especially the next generation of multimodal LLMs and the anticipated AGIs, hold (I think, or I wish, or I hope…) the promise of VUIs that are not just functional but truly conversational, and that do not limit themselves to speech but offer the right combination of speech- and graphics-based functionality. These systems could, or rather I'm sure they will, handle complex commands, understand nuances in language, and respond in ways that feel natural and engaging.
Human-computer interaction is about to start evolving again, after a long dominance of keyboards, mice, and flat screens. And we had better embrace it.
Imagine a world where interacting with technology through speech becomes as second nature as typing or tapping on a screen. From dictating emails to controlling smart homes, the applications are vast and varied. Accessibility will be taken to new levels, allowing for more inclusive software and also meaning fewer distractions while, for example, driving or working.
Of course, challenges remain. There's still work to be done in improving accuracy, handling diverse accents and dialects, and ensuring that these systems can maintain context over extended conversations. But as we saw above, LLMs and modern speech recognition systems are close to mastering them. Yet, there's the ever-present need to balance technological advancement with user comfort and privacy concerns.
So, are we and computers ready for routine interaction via speech? I believe we are closer than ever, and I'm sure that we can't escape it for certain applications – in augmented and virtual reality systems, in applications that assist people with disabilities, in vehicles, in certain industrial settings, you name it.
It is likely only a matter of time before speaking to our devices becomes a common, seamless part of our daily lives. And as developers, designers, and users, we should embrace these changes, address the challenges, and, overall I would say, we must help shape this exciting future.
If you liked this, you will also like…
Exquisite hand and finger tracking in web browsers with MediaPipe's machine learning models
Powerful Data Analysis and Plotting via Natural Language Requests by Giving LLMs Access to…
Low-cost, Low-latency, Customizable Chatbots for Your Websites and Web Apps Using GPT-4o mini
Provocatively, Microsoft Researchers Say They Found "Sparks of Artificial Intelligence" in GPT-4