By Caitlin Lander
My entire day is dictated by the personal assistant on my phone. In fact, I rely on it so much that when I am told something important, I ask my Google Assistant to set a reminder for me. I am not the only one who relies this much on our small handheld devices; society as a whole has become infatuated with the ease of use and access our cell phones supply. It doesn't matter whether we call our personal assistant Bixby, Cortana, Google, or Siri; we rely on them. But are we doing so at our own risk?
Most of these assistants and other voice-command AIs are trained to the user's specific voice, supplying us with a false sense of security. I too fell into this trap of thinking my voice was secure: the first time I ever used a Google Assistant, I made my friends and family say "Ok, Google" to prove how good these algorithms were. Three years later, with a security focus in mind and new technology arriving faster than ever, I wanted to put these algorithms to the true test. I tested four major phone AIs and LG's voice recognition to see if an avatar created from a user's voice could trick their algorithms.
To create a voice avatar, I used the software behind a fast-growing technology called deepfake voices. A deepfake voice is a recreation of a person's voice by an AI that learns from recordings and generates a voice avatar. I used a free online service, lyrebird.ai, to create my personal deepfake voice. It took only 8 minutes to record the 20 accepted predetermined sentences, with 7 recordings discarded along the way. After this, I had full access to my voice avatar and tested it against 5 different phones. Each phone I tested used a different AI, and the results were not what I expected to find. What I thought was going to be a dud came to life with each device I tested.
Each AI responded to the voice avatar either fairly well or not at all. The first time I heard a phone respond to my deepfake voice, I was in disbelief. It wasn't until the phone kept responding that I fully understood what this meant: our voice security, which we had assured ourselves was sound, had been broken by a machine-like voice that only occasionally sounded similar to the voice it was trying to imitate. Figure 1 shows the results of the devices side by side after training each assistant to respond to my voice. For the baseline test, I tried to prompt the AI 20 times on each device. After running the baseline, I ran further tests on the devices that responded well to the voice avatar.
Figure 1 – Number of successful voice recognitions broken down by device.
The first AI I tested was the Google Assistant, since it is the most commonly used AI on Android phones. I used a Pixel 2 XL for my test because it had the latest updates and was physically accessible, being my own phone. It is also the device I ran the most extensive tests on. The Google Assistant instantly responded to the voice avatar, and this was when I knew I might have found something. It responded 17 out of 20 times during initial testing. Once I was assured Google would respond the majority of the time, I started to use commands like "how is the weather" and other commonly asked questions. I did run into the problem of the phone requiring the PIN to access some of the commands, which is a security feature in the newer Google Assistant. I was able to get it to text and call a number, but it took a lot of time to figure out how to make the voice avatar slow down enough for the phone to pick it up properly. For a phone number to be heard correctly, I had to break it down by the dashes, 123-123-1234, separate each digit within a group by a space, and then play each generated result right after another, as shown in Figure 2. When the phone was unlocked, though, commands were executed by the assistant fairly accurately.
Figure 2 – An example of how to break down a phone number for recognition.
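The grouping trick described above is simple enough to sketch in code. This is a minimal, hypothetical Python helper (my own illustration, not part of any assistant's or Lyrebird's API) that turns a dash-formatted number into the spaced digit groups the avatar had to read one after another:

```python
def chunk_phone_number(number):
    """Split a dash-formatted phone number (e.g. "123-123-1234") into
    the digit groups that get fed to the voice avatar one at a time."""
    groups = number.split("-")
    # Space out the digits within each group so the synthesized voice
    # reads them individually ("1 2 3" instead of "one hundred twenty-three").
    return [" ".join(group) for group in groups]

print(chunk_phone_number("123-123-1234"))
# → ['1 2 3', '1 2 3', '1 2 3 4']
```

Each returned string would then be generated as its own clip and played back-to-back, which is what finally got the Google Assistant to transcribe the number correctly.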
Video 1 – The Google Assistant responding to my actual voice, followed by the voice avatar.
After testing the Google Assistant with great success, I decided to test the other most-used AI assistant, Siri. To test Siri, I used an iPhone XS. As excited as I was from my prior success, I was unable to get Siri to respond to the voice avatar. I played around with different punctuation to see if any of it would change the result, but nothing I tried worked. I do want to point out, though, that "hey Siri" in the voice avatar sounded nothing like how a human would pronounce the name Siri.
Cortana was the AI I was most excited to test, since the Windows 10 phone I tested was not voice trained. This was the one I wanted to try to exploit the most. Cortana responds to any voice that says "hey Cortana," which does not feel like the most secure design to begin with. It was no surprise that it responded to my voice avatar 19 of the 20 times tested. The one hurdle I did find with Cortana on the Windows phone was that it seemed underdeveloped. Even when responding to a regular human voice, more often than not commands went straight to an internet search rather than executing what was asked. The voice avatar was, however, able to get Cortana to execute the better-known commands like "call [name from contacts]". I was a little disappointed that I couldn't exploit it further, since it seemed like weak security to begin with. I would like to make it clear that the only Cortana AI I tested was on the phone, not the other Microsoft software.
The last AI assistant I tested was the often forgotten or underestimated Bixby. Many people I know with Samsung phones default to the Google Assistant over Bixby. This was a hard one to test, since many of the settings in the newest Bixby update on the Samsung S10e were hidden, and both I and the phone's owner had trouble finding how to retrain the voice recognition. After we did find it, I was once again blown away by the success rate of Bixby responding to my voice avatar. Even with some fairly loud background noise, Bixby still responded, unlike Google, which needed to be close to my laptop's speaker. I again tested the simple commands, to which Bixby responded well, even for commands that seemed hard to understand by ear. Bixby responded in 17 out of the 20 tests.
The last device I tested was the LG V30, because I knew it has a specific voice-unlock feature. After getting positive responses from the majority of the AIs, I thought I would finally hit gold and be able to unlock and manipulate a phone. I was let down a little: even a word that the voice avatar pronounced extremely similarly to how I said it would not unlock the phone. That was the first time the word "trampoline" had let me down. It is good to see, though, that the voice recognition is tuned well enough that voice-command unlocking only responded to the human voice.
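For a quick side-by-side view, the counts reported in this write-up (the same data plotted in Figure 1) can be tallied with a short script; the device labels are my own shorthand:

```python
# Successful responses to the voice avatar out of 20 prompts per device,
# as reported in the text above.
results = {
    "Google Assistant (Pixel 2 XL)": 17,
    "Siri (iPhone XS)": 0,
    "Cortana (Windows 10 phone)": 19,
    "Bixby (Samsung S10e)": 17,
    "LG voice unlock (LG V30)": 0,
}

TRIALS = 20
for device, hits in results.items():
    print(f"{device}: {hits}/{TRIALS} ({hits / TRIALS:.0%})")
```

The split is stark: every assistant either responded at least 85% of the time or never responded at all.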
So what does all this mean for the security of these devices? At this point, not a lot. None of the AI assistants I tested can actually unlock the phone, so your commands are very limited. If the phone were already unlocked, you could do some damage, but at that point using the voice avatar would be pointless. You could get a phone number by sending a text to yourself and then attempt a phishing attack to gain access to the phone, but even that chance is probably slim. It would be hard to record a person's voice well enough, matching the sentences the software requires, to actually create an accurate avatar (unless, of course, they have a Windows phone, but they probably don't). This means the odds of creating a voice avatar of the person whose phone you are trying to access are slim without some amazing social engineering skills. Until someone finds a vulnerability in the voice commands and how they access your phone from the lock screen, triggering commands with a voice avatar would be more of an annoyance than an actual security issue. It should, however, make you warier of trusting any feature the phone has, even one that seems to have a good security measure added.
One other note from running this experiment is that Google itself is looking into Lyrebird to let people simply type text and have a voice avatar speak to a Google Home. I think this could pose a large security risk that, if it comes to fruition, I would love to research.
The other is that there is a lot of better voice synthesis software out there that isn't readily accessible or easily used, such as WaveNet. It would be interesting to see whether it could trick Siri or the LG V30, which couldn't be tricked by Lyrebird. The last note is that Lyrebird admits to not being the best, but it is continuing work on its AI to make more accurate voice avatars.
Overall, I had a lot of fun and learned much more than I expected while performing these experiments. I was shocked by many of my findings, but also relieved that Siri and the LG V30 were not duped by an artificial voice. I will definitely keep looking into this idea of tricking AIs, and I hope other people do as well, even if it is just because you are bored. I would like to test male vs. female voices and other software too. The biggest takeaway, though, was to think far outside the box when it comes to security, because something that seems fun and trendy could actually lead to uncovering a vulnerability few would think of.