Adversarial Examples for Humans — An Introduction
This article is based on a twenty-minute talk I gave at the TrendMicro Philippines Decode Event 2018. It’s about how malicious people can attack deep neural networks. A trained neural network is a model; I’ll be using the terms network (short for neural network) and model interchangeably throughout this article.
Essentially, a neuron takes a bunch of inputs and outputs a value. It computes the weighted sum of its inputs (plus a number called a bias) and feeds the result to a non-linear activation function. The function’s output can then be used as one of the inputs to other neurons.
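Here’s a minimal sketch of a single neuron in Python; the input values, weights, bias, and the choice of ReLU as the activation are just illustrative:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs plus a bias,
    fed through a non-linear activation (ReLU here)."""
    z = np.dot(weights, inputs) + bias   # weighted sum + bias
    return max(0.0, z)                   # ReLU: output 0 for negative z

# Illustrative values only
x = np.array([0.5, -1.0, 2.0])           # inputs (e.g. outputs of other neurons)
w = np.array([0.8, 0.1, 0.4])            # weights learned during training
print(neuron(x, w, bias=0.2))            # this value can feed other neurons
```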
You can connect neurons in various (usually complicated!) ways and you get a neural network. We call how the neurons are connected the architecture of the neural network. If you have many layers of neurons between the inputs and the outputs, then this is a deep neural network.
When properly trained, deep neural networks can produce a correct set of outputs given a set of inputs.
Training a deep neural network means we use techniques to get the weights (and biases) for the artificial neurons. Recall from earlier that, to produce its output, each neuron first computes the weighted sum of its inputs (plus the bias).
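To make that concrete, here is a hedged sketch of gradient-descent training for a single linear neuron; the data point, learning rate, and squared-error loss are all made up for illustration:

```python
import numpy as np

x, y_true = np.array([1.0, 2.0]), 1.0   # toy input/target pair (made up)
w, b, lr = np.zeros(2), 0.0, 0.1        # untrained weights, bias, learning rate

for _ in range(100):
    y_pred = np.dot(w, x) + b           # the neuron's current output
    error = y_pred - y_true
    w -= lr * error * x                 # nudge weights to reduce squared error
    b -= lr * error                     # nudge bias the same way

print(w, b, np.dot(w, x) + b)           # prediction is now close to y_true
```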
Deep neural networks trained with deep learning techniques are very good at finding patterns in huge amounts of data. They are popular because, in theory, they can learn at different levels of abstraction. It’s as if the deep neural network learned how to map inputs to outputs! That’s why it’s called deep learning!
For example, layers near the input could learn simple features like lines and curves of different orientations. The middle layers could use these as inputs to distinguish more complex shapes like face parts (eyes, nose, etc.), and the layers near the output could use that information to recognize specific faces based on facial structure.
You can imagine how hard it would be for humans to make face recognition software without deep learning. Because deep learning is great at finding patterns, it can automate tasks that many previously thought impossible. Say we feed an image of Jessica to a state-of-the-art model, and it says it’s 70% sure it’s Jessica. On the right of the image below is another image of Jessica. As you can see, it looks identical to the previous one. When we feed this other image to the same model, is it possible that the model will say it’s 99% Kris Aquino? Yes, it’s possible!
In 2013, researchers published papers with images like the ones below.
As you can see, each pair of images looks identical. However, while the images on the left were classified correctly, each image on the right was confidently classified as an ostrich. It turns out you can slightly change the input and the model could say it’s 99% sure that a drastically wrong output is correct. You can slightly change the image of Jessica and a well-trained model could say it’s 99% sure it’s Kris Aquino!
The two funny-looking images below are from a highly referenced 2015 paper. The well-trained model said it’s 99% sure that the left image is a bikini. It’s also 99% sure that the right image is an assault rifle.
There are many weird images in that paper. Below are some of them, and all of them are classified with 99% confidence as something specific, like an African chameleon or the number nine.
It seems like these high-accuracy models don’t really understand what they’re doing. In 2016, one paper demonstrated how you can trick commercial face recognition software into thinking you are someone else. You just have to wear intentionally designed fake glasses. While the presented results are not that robust, they’re promising, and I imagine this will only get better with time, like almost all other technologies.
Here’s a video (Nov 2, 2017) from MIT where they tricked a deep learning model into classifying a 3D-printed turtle as a rifle for most orientations.
Adversarial examples aren’t limited to images. Consider sentiment analysis, which can be used as feedback on, say, how well a company is doing. For example, if autistic Jessica could hack all the private conversations of her officemates, she could give herself a daily satisfaction or trust rating and use it to improve how she interacts with others.
A 2017 paper demonstrated that by simply replacing certain words with their synonyms, a model could be made to say that a negative sentiment is a positive one, or vice versa.
It’s a disaster for autistic Jessica if she makes decisions on how to act based on opposite information!
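Here is a toy sketch of that synonym-swap idea; the synonym table is made up, predict_sentiment is a placeholder for any trained sentiment model, and a real attack searches candidate replacements far more carefully:

```python
# Made-up synonym table for illustration only
SYNONYMS = {"terrible": "dreadful", "boring": "tedious", "bad": "poor"}

def synonym_attack(sentence, predict_sentiment):
    """Swap one word for a synonym and check whether the (placeholder)
    sentiment model flips its prediction."""
    original = predict_sentiment(sentence)
    words = sentence.split()
    for i, word in enumerate(words):
        if word in SYNONYMS:
            candidate = words[:i] + [SYNONYMS[word]] + words[i + 1:]
            flipped = " ".join(candidate)
            if predict_sentiment(flipped) != original:
                return flipped          # meaning unchanged, prediction flipped
    return None                         # no single swap fooled the model
```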
Here’s a paper about adversarial examples applied to reading comprehension systems: “The accuracy of sixteen published models drops from an average of 75% F1 score to 36%.”

In theory, it is also possible to make small changes to a malware file such that it remains malware but the model says it is not. This is dangerous and obviously defeats the entire purpose of the malware detector.
Many experiments, including some from Google Brain, have found that an adversarial input to one network is most likely also adversarial to another, even if the networks have different architectures. They call this property Adversarial Sample Transferability.
Also, you can create adversarial inputs to any known network today. Many algorithms have been specifically developed for this purpose alone, as we will discuss shortly.
The actual noise added to the image below is designed to fool this specific network; it is not random noise. We can create this noise if we know the architecture and the values of all the parameters (weights and biases) of the network. This is a “white-box” approach.
In normal training, we have fixed inputs and we tweak the parameters in a good direction in order to get a high score for the correct class. That’s what good training algorithms do.
To create adversaries, one idea is to flip this process: we have fixed parameters and we tweak the inputs. We can tweak the original input in the opposite direction, a tiny bit at a time, until we get a high score for the wrong class. This becomes the adversarial version of the original input.
These changes are so tiny that they’re unobservable to humans but observable to the network. It’s like we’re exploiting the very basis of all network training algorithms. A lot of techniques use this idea, and there are many others: the Jacobian-based Saliency Map Attack (JSMA), the L0, L2, and L-infinity attacks, DeepFool, Fast Gradient Sign, Iterative Gradient Sign, and so on. The point is, it can be done.
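As a hedged sketch of the simplest of these, here is the Fast Gradient Sign idea in PyTorch: take the gradient of the loss with respect to the input (not the weights) and step the input a tiny bit in the direction that increases the loss. The model, labels, and epsilon value below are placeholders:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, true_labels, epsilon=0.01):
    """Fast Gradient Sign Method (white-box): nudge every input value by
    epsilon in the direction that increases the loss for the true label."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), true_labels)
    loss.backward()                             # gradient w.r.t. the input
    x_adv = x + epsilon * x.grad.sign()         # tiny step in the "wrong" direction
    return x_adv.clamp(0, 1).detach()           # keep pixels in a valid range

# Usage (trained_model, images, labels are assumed to already exist):
# adversarial_images = fgsm(trained_model, images, labels, epsilon=0.007)
```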
But what if we don’t know the target network’s internals? This is the “black-box” setting. The key there is to exploit Adversarial Sample Transferability: if we can fool a substitute network, most likely we can also fool the target network we want to attack.
Given the sets of inputs and outputs that we got from the target network, we can use this to train a substitute network. Because we know everything about our substitute network (weights, biases, architecture), we can create adversaries for this network. And because of transferability, these adversarial inputs will most likely also be adversarial to the network we want to attack.
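A hedged sketch of that black-box recipe: query_target stands in for whatever API lets us get the target model’s predictions, the substitute architecture and the number of classes are arbitrary assumptions, and a real attack would also augment its query data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_substitute(query_target, inputs, epochs=20, lr=1e-3):
    """Label our own inputs by querying the target (black-box) model, then
    train a substitute network on those pairs. Adversarial examples crafted
    against the substitute often transfer to the target."""
    labels = query_target(inputs)                # assumed to return class indices
    substitute = nn.Sequential(                  # any architecture we like
        nn.Flatten(),
        nn.Linear(inputs[0].numel(), 128), nn.ReLU(),
        nn.Linear(128, 10))                      # 10 classes assumed
    opt = torch.optim.Adam(substitute.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(substitute(inputs), labels).backward()
        opt.step()
    return substitute   # now attack it with a white-box method like FGSM above
```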
Here’s a recent video from a research paper by MIT researchers where they showcase that their black-box algorithm is 1000x faster than existing black-box attack algorithms (6 Apr 2018, Query-Efficient Black-box Adversarial Examples, arXiv:1712.07113).
It’s hard to create a defense against adversarial inputs because networks would need to produce the correct output for every possible input, yet in practice they only ever encounter a small subset of those inputs.
Some Google researchers believe that although neural networks can be highly nonlinear globally, the commonly used activation functions are almost linear, which makes neural networks locally linear. They argue that these almost-linear activation functions help not only with training but also with performing well.
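A rough sketch of that linearity argument, in the spirit of the Google Brain paper (“Explaining and Harnessing Adversarial Examples”): for a (locally) linear score, a perturbation that is tiny per pixel can still shift the score by a lot when there are many pixels.

```latex
% Perturb the input x by \eta, with every component bounded by \epsilon:
\tilde{x} = x + \eta, \qquad \|\eta\|_\infty \le \epsilon
% The (locally) linear score changes by
w^\top \tilde{x} - w^\top x = w^\top \eta
% which is maximized by choosing \eta = \epsilon \, \mathrm{sign}(w), giving
w^\top \eta = \epsilon \|w\|_1
% For an n-dimensional input, \epsilon \|w\|_1 grows with n even though
% each individual input value changes by at most \epsilon.
```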
One may also conclude that the model families we use are intrinsically flawed: ease of optimization has come at the cost of models that are easily misled. It seems like the very properties that make neural networks effective are also what make them vulnerable to adversarial attacks.
There are also tools to test the robustness of your neural network against adversarial examples.
One of these tools, CleverHans, is named after a horse that people thought could answer math questions. In reality, the horse had just learned to read social cues from the humans around it. It is a metaphor for high-accuracy neural networks that do not really understand what they’re doing.
Deep learning is used, and can be used, to automate tasks previously thought to be impossible. Adversarial examples/inputs are hard-to-defend attacks against neural networks. They cause the network to make mistakes that humans wouldn’t.
If there’s only one thing you take away from this article, it’s this: along with needing huge amounts of data to answer questions correctly, adversarial examples suggest that deep neural networks don’t think, learn, or reason at a high level the way humans do.
There is still a lot of work to be done.