Researchers at the Massachusetts Institute of Technology (MIT) have created an algorithm called Speech2Face that, working only from the sound of a person’s voice, can generate a rough facsimile of that person’s physical appearance.
This research was posted on arXiv, the electronic preprint repository where scientific papers are often shared before peer review, back in May and has already caused something of a stir.
The MIT team trained the software on millions of educational videos from the internet, featuring over 100,000 different people talking, intending for it to learn the association between specific physical and vocal characteristics just from monitoring the sound and appearance of the speakers.
And it appears to have done the trick: Speech2Face can listen to short clips of speech and, from these, draw its own image of a closely matching face. Of course, these matches aren’t perfect, but they hit significant indicators of gender, body size, age, and ethnicity with frightening accuracy.
Part of Speech2Face’s skill can be explained by the fact that it is built on a neural network, a computing system loosely modeled on the human brain.
Rather than being programmed with a set of instructions, these networks are presented with data and learn to excel at the associated task, even though no human has ever spelled out for them how to carry it out.
This is how AIs based on neural networks can master the famously complex game of Go, or (less impressively) classify cat photos.
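To make the idea of learning from examples rather than instructions concrete, here is a minimal sketch in Python. It is not Speech2Face's code and is vastly simpler than the network described in the paper: a single artificial neuron is shown four labeled examples and adjusts its own weights until it reproduces them. The function names (train_neuron, predict) and the toy data are illustrative assumptions, not anything taken from the MIT project.

```python
import math


def train_neuron(examples, epochs=2000, lr=0.5):
    """Fit one sigmoid neuron to (inputs, label) pairs by gradient descent.

    Hypothetical helper for illustration only; not from Speech2Face.
    """
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            # Forward pass: weighted sum squashed into the range (0, 1).
            z = sum(w * x for w, x in zip(weights, inputs)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the log loss with respect to z.
            error = pred - label
            # Nudge each weight against the error it contributed to.
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, inputs):
    """Return 1 if the neuron's weighted sum is positive, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


# Toy data: the labels happen to follow logical OR, but that rule is
# never written anywhere in the code. The neuron only sees examples.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in examples]
```

After training, predictions matches the labels in examples, even though the only thing the program was ever given was the examples themselves. Real systems such as Speech2Face apply the same principle with millions of adjustable weights and far richer data.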
Rather than trying to identify specific individuals, Speech2Face has been taught to recognize characteristics and, as such, the images it creates are of ‘averaged-looking’ faces, based on a mean of aspects like height, weight, age, and gender.
Its portrait of Daniel Craig, generated from his voice, is neither flattering nor easily mistaken for Daniel Craig but, in general terms, Speech2Face was freakishly accurate at capturing the characteristics it has learned to look for.
If this all sounds a bit too uncanny, it should also be noted that humans aren’t terrible at this themselves. A study by Nottingham Trent University in England found that participants made an almost identical analysis of a sample of individuals based on either the sound of their voice or their appearance.
This study built on similar research by Robert Krauss of Columbia University in the US, indicating that participants were very accurate at guessing the age and height of people from their voice alone.
However, a team led by Susan Hughes at the State University of New York went one step further, with research that could be described as ‘rate my voice.’
Hughes’ study suggested that participants’ ratings of how attractive a voice sounded tallied closely with the symmetry of the voice owner’s face.
This might suggest, as the Nottingham Trent study seemed to, that there is something hardcoded into human DNA to make us match our voices to how we present physically, or it may mean, as Hughes claimed, that the voice is a ‘multi-dimensional fitness indicator’ for finding a perfect genetic mate.
However, it may also simply mean that humans have evolved to be very good at picking up on subtle auditory clues regarding body size or type, throat length, height, gender and so on, as they come across in a voice.
But for machines to be able to do this is a refreshing new departure and may present exciting possibilities (and concerns) in the near future. In fact, the MIT team are first in line to point out the potential dangers.
In a post on the GitHub page for the Speech2Face project they address the fears they have in the possible usage of such algorithms, saying “Although this is a purely academic investigation, we feel that it is important to explicitly discuss in the paper a set of ethical considerations due to the potential sensitivity of facial information…any further investigation or practical use of this technology will be carefully tested to ensure that the training data is representative of the intended user population.”
If we are to take some comfort from its weaknesses, Speech2Face was less useful when guessing ethnicity if the speaker switched from Chinese to English (it immediately changed the speaker to make them white) and disproportionately associated high voices with females and low voices with males.
In their defense, the MIT team explained that, because YouTube was their source material, their training data didn’t equally represent the population of the world. This issue of ensuring appropriate input to avoid undesirable outcomes from the AI’s output raises ethical questions of its own.
As has been seen already, AIs can find themselves unwittingly reaffirming unjust biases by merely following on the models of practice set by human example.
Ethical issues were also raised, regarding the use of MIT’s source material, by Nick Sullivan (head of Cryptography at Cloudflare), who was featured in one of the many videos Speech2Face was made to view, and whose approximate likeness sprang up in one of the images it created.
Having signed a waiver to appear on YouTube, he wondered whether this waiver then extends to research done using these videos. There is no definitive legal answer yet, but it seems unlikely that AI researchers will stop using publicly available videos and tweets any time soon.
MIT isn’t the first in the game at linking faces to voices; a group from Carnegie Mellon University has already used similar algorithms with similar outcomes, in research presented to the World Economic Forum.
As such, impetus is clearly behind the race for accurate voice-face recognition, with implications for law enforcement in particular where such software has already been trialed, although with a good deal of controversy around the potential for error, abuse and wrongful arrest.
Perhaps it would be advisable to wear a mask next time you post to YouTube. Or, failing that, just put on a silly voice.