Could Hidden Voice Commands Make Hacking Your Phone Easier?

By James O Malley

“OK Google, are you hacking into my phone?”

Voice assistants are increasingly ubiquitous in our homes and in our lives, and as with any new technology there is the inevitable worry: How can the bad guys use it?

One potentially terrifying attack vector is the idea of using hidden sounds as a way in.

With normal malware, the way it infects your device is by, say, exploiting a bug or a hole in the web browser, enabling an attack that runs “malicious code” which hasn’t been approved by the app or your operating system. Once a hole has been found, all the attacker needs to do is figure out how to get you to open a web page that will launch the dodgy code.

A traditional way to do this is by sending official-looking emails which contain links to dodgy web pages. Unfortunately for the hackers, we’re getting pretty wise to this, and software is getting better at spotting it.

And this is where voice hacking comes in:

“We wanted to generate audio which to a person sounds like noise but to a computer sounds like speech”, says Nicholas Carlini, one of the scientists behind the project. “The idea is that if we can do this then we can launch attacks that are stealthy to people, [but] that your phone or Amazon Echo or whatever will listen and hear something else.”

Using Google Now, we can order our phones to open a specific web page with a voice command. But what if it isn’t us issuing the command? And what if the command sounds to humans like indecipherable noise, but is perfectly clear to Google’s algorithms, so it can trigger the Google Assistant without us even realising?

Nicholas’s colleague, Tavish Vaidya, described the origins of the project: “You had all the Siri and Google Now and all that stuff becoming really popular in, like, iPhones, your Apple Watches. And then the question we obviously asked is, like, ‘Okay, this seems pretty awesome, but there doesn’t seem to be any security in it’”, he said.

The potential risks could be huge: using their technique, they can conceivably take any voice command and encode it so that it is hard for humans to make out. This means your phone could be ordered to do everything from opening a dodgy web page to sending an email or text, without you realising.

This two-step process of receiving the malicious command and opening the web page is important: The good news is that the voice commands on their own don’t contain the danger. “There’s no way to, like, transmit malicious data over the audio channel”, Nicholas explains, “You have to use the audio to make the phone go to some other website and then use the fact that you’re on a website to actually download the malicious data. And it’s important to note that this only works if your phone has a separate vulnerability”.

In other words, as long as you keep your devices patched with the latest firmware updates, they should be more secure, as discovered vulnerabilities will have been fixed. But in a world where a huge proportion of Android devices are not running the latest software, this is far from a given.

How It Works

So how do you actually go about designing a sound that can be understood by a machine, but not by us humans? It turns out that the answer is rather complicated.

Essentially though, the researchers tried two different methods: White Box and Black Box. Under the White Box approach, they make the assumption that anyone looking to build a voice hack knows how the recognition algorithm works, meaning that the sound can be specifically chopped about to target the characteristics of the sound wave that the algorithm uses to figure out which words are being said.

With the Black Box method, the assumption is that the attacker doesn’t know how the algorithm works, so they are at a relative disadvantage. Perhaps one analogy is building an IKEA shelf: it can conceivably be done without instructions (Black Box), but it’s a hell of a lot easier if you have something telling you how the different pieces of wood and screws fit together (White Box).


In both cases, the criteria for building a successful voice hack are the same, as the audio has to satisfy two distinct tests: Can Google’s Assistant (or Alexa, or Siri, and so on) work out what is being said? (Ideally yes.) And can a human? (Ideally no.) The challenge, then, is to take the waveform of a voice command and strip away as much information from it as possible to make it harder for humans to understand, without damaging the signal that is received by the machines.

To do this, they put a voice command through a piece of software they’ve created called the “audio mangler”. This uses MFCCs (mel-frequency cepstral coefficients), the compact representation of sound that speech recognition systems work from, to figure out what can be removed from the wave. “What we do is we reduce the dimensionality by throwing away information that is not relevant to the speech recognition and only keep information which is relevant, and then add some white noise on top of it”, says Tavish.
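The researchers’ actual mangler isn’t spelled out in this article, but a minimal sketch of the idea, round-tripping a clip through MFCCs with the librosa library and layering white noise on top, might look something like this (the parameter values and filenames are illustrative assumptions, not their settings):

```python
# A minimal sketch of the "audio mangler" idea: keep only the information an
# MFCC-based recogniser cares about, discard the rest, then add white noise.
# Assumes the librosa, soundfile and numpy packages; parameter values are
# illustrative guesses, not the researchers' actual settings.
import librosa
import numpy as np
import soundfile as sf

def mangle(in_path, out_path, n_mfcc=13, noise_level=0.005):
    # Load the original voice command (resampled to 16 kHz mono).
    y, sr = librosa.load(in_path, sr=16000)

    # Reduce the waveform to a small number of MFCCs -- roughly the features
    # a speech recogniser keys on -- throwing everything else away.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Rebuild audio from those coefficients alone. The result is much harder
    # for humans to parse, because most of the perceptual detail is gone.
    y_mangled = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)

    # Layer some white noise on top to obscure it further.
    y_mangled = y_mangled + noise_level * np.random.randn(len(y_mangled))

    sf.write(out_path, y_mangled, sr)

mangle("ok_google.wav", "ok_google_mangled.wav")
```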

Using the mangler, the scientists can dial the amount of loss up or down, and then generate a number of voice hack “candidates”: possible recordings which appear to satisfy the two criteria.

To figure out which of the candidates is most effective, rather than using trial and error on a live target device, they fed the audio directly into Google’s voice recognition API to test what it could cope with.
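The article doesn’t say exactly how that testing was scripted, but as a rough illustration, the third-party Python speech_recognition package can send a clip to Google’s web speech API and report back whether the target phrase was picked up (the filenames and phrase below are placeholders, and this is not the researchers’ actual test harness):

```python
# A rough illustration of automatically checking which mangled candidates a
# speech recogniser still understands, using the third-party
# speech_recognition package (which wraps Google's web speech API).
import speech_recognition as sr

def machine_hears(path, target="ok google"):
    recogniser = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recogniser.record(source)
    try:
        transcript = recogniser.recognize_google(audio)
    except sr.UnknownValueError:
        # The recogniser couldn't make anything of the clip.
        return False
    return target in transcript.lower()

candidates = ["candidate_01.wav", "candidate_02.wav", "candidate_03.wav"]
usable = [c for c in candidates if machine_hears(c)]
print("Candidates the machine still understands:", usable)
```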

But what about humans? Surely it is harder to figure out what humans can and can’t detect? It’s here they hit upon a brilliant solution: Mechanical Turk.

Amazon’s Mechanical Turk platform is designed to enable crowdsourcing on a massive scale: by offering a small cash incentive for humans to complete simple tasks, it lets you drop human judgement into an otherwise automated process. So the scientists sent the different candidate samples to 400 people using Mechanical Turk, and asked them to write down what they heard.
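Scoring the responses that come back is then simple arithmetic: for each sample, count the fraction of workers whose transcription contains the intended phrase. A toy sketch, with made-up response data, might look like this:

```python
# A toy sketch of scoring Turk responses: for each audio sample, count what
# fraction of workers transcribed the intended phrase. The response data and
# phrase here are invented for illustration.
from collections import defaultdict

def success_rates(responses, target="ok google"):
    """responses: list of (sample_name, worker_transcript) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, transcript in responses:
        totals[sample] += 1
        if target in transcript.lower():
            hits[sample] += 1
    return {sample: hits[sample] / totals[sample] for sample in totals}

example = [
    ("normal.wav", "OK Google"),
    ("normal.wav", "okay google"),
    ("obfuscated.wav", "just static noise?"),
    ("obfuscated.wav", "ok google"),
]
print(success_rates(example))  # e.g. {'normal.wav': 1.0, 'obfuscated.wav': 0.5}
```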

When testing the samples with the Turk audience, they found that roughly 95% of listeners could successfully transcribe the normal “OK Google” recording, but with the specially obfuscated recording this dropped to 22%. And though one in five might still sound high, that figure is partly down to the way our brains work: Mechanical Turk participants had been primed, before hearing anything, to expect words they could transcribe. In a real attack scenario, the obfuscated command would likely just sound like noise to us.

Mitigation

So this is potentially quite a scary new technological threat. If hidden voice commands can be sneakily ordering our devices to do stuff, this could create all sorts of security headaches. Imagine, for example, if one such hidden command was broadcast on TV or over the speakers at a music festival: Thousands of people could simultaneously have their phones compromised… and they wouldn’t even know. So is there anything that can be done?

There is one obvious thing to do, but it could also be frustrating. “The easiest thing that people can do is that this attack requires that the device is always listening, and so if people just disable the speech recognition, the always listening part, until they actually want to give a command, then there’s absolutely no way that the attacker can get through”, Nicholas explains. But there’s a problem: “Of course, that sort of defeats the entire purpose of the Amazon Echo. But for your phone at least, most of the time when it’s in your pocket you don’t want to be talking to it, and so if you make it so you require some kind of physical interaction with your device, then it makes an attack very very hard.”

In other words, make sure your device’s voice assistant can only be activated by pressing a button, rather than having it always listening.

But there are also things that can be done by the companies that power our voice assistants. Most obviously, this includes developing technology to recognise who is speaking, so that your phone will only respond to commands from its rightful owner. But this isn’t easy.

“There are a lot of practical constraints that may be limiting it from being an actual deployed solution”, Tavish explains, “It requires training. You need to have enough data on the user’s voice such that every time the user gives an input you can correctly identify the user. And also this creates a usability challenge. Like, you don’t want your user to stand in front of the device for half an hour and train the device.”

So one barrier to making this a reality is the usability problem: Apple, Google and Amazon don’t want a laborious setup procedure. But there’s also a more practical problem: Your voice isn’t like your fingerprint, and it isn’t always exactly the same. Our voices change over time (ask anyone who has moved to another country), and we get sore throats and colds far more often than we scrape the skin off the tips of our fingers.

And finally, there is one other thing the companies can do, and that’s build in defences analogous to conventional anti-malware protections: Don’t just blindly follow whatever commands are heard, but instead carry out sanity checks, like making sure that the command being given isn’t a known attack, or noticing when the same command is being heard over and over, which could be a sign that something malicious is happening.
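What those checks would look like in a shipping assistant is anyone’s guess, but a very rough sketch of the idea, rejecting clips that match a known-attack fingerprint and refusing commands repeated suspiciously often, could be something like this (the fingerprinting scheme and thresholds are illustrative assumptions only):

```python
# A speculative sketch of sanity checks an assistant could run before acting
# on a voice command: reject clips whose fingerprint matches a known attack,
# and refuse commands repeated suspiciously often in a short window.
import hashlib
import time
from collections import deque

# Hashes of obfuscated clips reported as attacks. (A real system would use a
# fuzzy acoustic fingerprint rather than an exact hash.)
KNOWN_ATTACK_FINGERPRINTS = {
    "3c9d2f...",  # placeholder
}

recent_commands = deque()  # (timestamp, command_text)

def fingerprint(audio_bytes):
    return hashlib.sha256(audio_bytes).hexdigest()

def should_obey(audio_bytes, command_text, window_seconds=60.0, max_repeats=3):
    # 1. Is this clip a known attack recording?
    if fingerprint(audio_bytes) in KNOWN_ATTACK_FINGERPRINTS:
        return False

    # 2. Has the same command been heard over and over recently?
    now = time.time()
    while recent_commands and now - recent_commands[0][0] > window_seconds:
        recent_commands.popleft()
    repeats = sum(1 for _, cmd in recent_commands if cmd == command_text)
    if repeats >= max_repeats:
        return False

    recent_commands.append((now, command_text))
    return True
```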

Ultimately then, it seems the security threat from hidden voice commands is another tech security arms race waiting to happen. And though the voice commands may be obfuscated, one thing is clear: It is time for industry and customers to start listening carefully to security concerns.

You can read the full academic paper here.