In the era of artificial intelligence (AI), computer vision dominates many research efforts in academic labs, tech giants, and startups. We strive to enhance object and image recognition algorithms so that AI systems can accurately recognize people and objects in images and videos and, in more advanced stages, understand what is going on in a scene. Currently, even state-of-the-art object and image recognition algorithms struggle with this, something that comes so effortlessly to humans.

Perhaps the problem lies in the way that machines learn to recognize images. What if another type of sensory data (audio in this case) were incorporated into machine learning to enhance its recognition rate? After all, young children learn new words through both viewing pictures and sounding out words. Learning multiple representations of an object provides a stronger, joint portrayal and distinguishes it from other objects. Humans and machines can benefit from these enhanced representations. Although AI system development strongly benefits from machine vision developments, building other sensory capabilities provides strong benefits as well. To better understand why, it is important to step back and take a brief look at the perceptual human brain.

Humans rely on sensory information (e.g., vision, audition, touch, smell, and taste, to name a few; yes, there are more) to navigate their environment. Over the last 30 years, extensive evidence from brain research has shown that the brain is multimodal, or multisensory (for a review, see The Handbook of Multisensory Processes). There is ample evidence that humans often use more than one sense to perceive objects and that they benefit from multimodal input. Speech perception has been used countless times to demonstrate this.

Imagine you are standing inside a busy café with a friend. You are trying to continue the conversation you started when you walked in but find it difficult amidst the roaring espresso machines and baristas calling out drink orders. Before you entered the café you glanced at the vibrant daffodils at the entrance but now find yourself intently concentrating on your friend’s face (and lips) to follow what she is saying. Despite the noise level, you are miraculously able to understand virtually every word. This example shows how seeing a talking face can enhance the audio signal when it is degraded.

As audiovisual input aids human communication, it can also help machines understand and process speech. How many times have you tried to ask your phone’s voice assistant to do something only to wind up saying (in exasperation), ‘Never mind’? Sure, the average voice assistant is decent at responding to simple requests like ‘remind me to turn on the dishwasher tonight’ or ‘call my friend Sarah.’ But accuracy decreases dramatically with more complicated requests that humans find very simple. Building a system that learns the sound of your voice as well as how your lips move when you speak will reflect more biologically plausible processes and aid human-computer communication and interaction.

An AI system that features multisensory input is likely to overcome training challenges more quickly than a unimodal system and to avoid the technical limitations of a unimodal framework. Using a training set that includes both images and audio clips linked to the same object provides stronger associations for the AI system to learn. Training the system on more data per object can reduce the number of training errors and lead to a faster path to accuracy.
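To make the idea of stronger associations concrete, here is a minimal sketch of one simple way to combine two modalities, often called late fusion: each modality produces its own class probabilities, and the two are multiplied and renormalized. This is an illustration only, not a description of any particular system. The function name `fuse_modalities`, the class labels, and the probability values are all hypothetical, and the sketch assumes the image and audio evidence are conditionally independent given the object and that every probability is greater than zero.

```python
import math


def fuse_modalities(image_probs, audio_probs):
    """Combine per-class probabilities from two modalities (hypothetical helper).

    Assumes both dicts share the same labels, all probabilities are > 0,
    and the two modalities are conditionally independent given the class.
    """
    # Add log-probabilities, which is the same as multiplying probabilities.
    log_scores = {
        label: math.log(image_probs[label]) + math.log(audio_probs[label])
        for label in image_probs
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Made-up example: the image alone slightly favors "wolf",
# while the audio clip (a bark) favors "dog".
image_probs = {"dog": 0.45, "wolf": 0.50, "cat": 0.05}
audio_probs = {"dog": 0.70, "wolf": 0.20, "cat": 0.10}

fused = fuse_modalities(image_probs, audio_probs)
best = max(fused, key=fused.get)
print(best, round(fused[best], 2))
```

In this made-up case the image on its own would be mislabeled, but the audio evidence tips the combined decision toward the correct label. Real systems often fuse modalities earlier, inside the model itself; multiplying output probabilities is simply the easiest version to read.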

Multisensory perception has been applied in the area of social robotics, but it can cover much more ground, including human-computer interaction and other AI capabilities.

In recent years, some research has addressed multimodal input. In 2000, Wachsmuth and colleagues modeled the interaction of image and speech processing by using multimodal input to represent visual scenes. Over a decade later, Viciana-Abad and colleagues tested an interactive robotic system that detected and tracked speakers using multimodal information, specifically audio and visual signals.

AI systems with biologically inspired mechanisms and sensory-like qualities allow for a broader range of capabilities. An AI system built on a multisensory framework has profound implications not only for human-computer interaction but also for high-risk decisions such as medical diagnoses and legal procedures.

At DimensionalMechanics, we embrace this notion in our development of a scalable artificial general intelligence platform. We integrate a diverse set of physical and mental properties into the platform to enhance its ability to form associations and intuitions like the human mind. Ultimately, for machines to understand humans, machines will need to understand and interact with our inherent multisensory nature.


Viciana-Abad, R., Marfil, R., Perez-Lorenzo, J.M., Bandera, J.P., Romero-Garces, A., & Reche-Lopez, P. (2014). Audio-visual perception system for a humanoid robotic head. Sensors, 14(6), 9522-9545.

Wachsmuth, S., Socher, G., Brandt-Pook, H., Kummert, F., & Sagerer, G. (2000). Integration of vision and speech understanding using Bayesian networks. Journal of Computer Vision Research, 1(4), 62-83.

Follow me on Twitter @artsci00
