In machine learning, adversarial attacks are techniques used to manipulate or deceive models such that they make incorrect predictions or decisions. They exploit vulnerabilities in the model’s design or training data.

The goal: a small perturbation δ is chosen for an image x such that a neural network f misclassifies x + δ. We use an optimisation process to minimise the probability that f correctly classifies x + δ, i.e., we choose the δ that makes the network most likely to misclassify the image.

i.e., instead of optimising for the model’s weights, we optimise for δ.
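
As a concrete illustration, the one-step fast gradient sign method (FGSM) follows the gradient of the loss with respect to the input rather than the weights. This is only a minimal PyTorch sketch: the classifier `model`, inputs `x` (pixels in [0, 1]) and true labels `y` are hypothetical placeholders, not something defined in this post.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-step FGSM: move x in the direction that increases the loss on
    the true labels y, keeping the perturbation within an L-infinity ball
    of radius eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()                       # gradient w.r.t. the input, not the weights
    delta = eps * x.grad.sign()           # the perturbation δ we optimise for
    x_adv = (x + delta).clamp(0.0, 1.0)   # keep pixels in a valid range
    return x_adv.detach()
```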

A non-targeted attack minimises the probability that the prediction f(x + δ) is correct. A targeted attack maximises the probability that f(x + δ) is a chosen target class.
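
Continuing the same hypothetical setup, the targeted variant differs only in the label used and the sign of the gradient step: we descend the loss towards the chosen target class instead of ascending it on the true class.

```python
import torch.nn.functional as F

def targeted_fgsm(model, x, y_target, eps=0.03):
    """Targeted FGSM: step *down* the loss for a chosen target class,
    pushing the model towards predicting y_target."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)
    loss.backward()
    # Subtracting the signed gradient makes y_target more likely,
    # whereas the non-targeted version above adds it.
    x_adv = (x - eps * x.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```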

A white-box attack assumes the model’s architecture and weights are known; we use this information to optimise δ. A black-box attack does not have access to the architecture or weights. Instead, we use a known, differentiable substitute model that mimics the target model, since adversarial attacks often transfer across models.
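
A minimal sketch of such a transfer-based black-box attack, assuming a differentiable `substitute` model we control and a `target_model` we can only query (both hypothetical), and reusing the `fgsm_attack` sketch above:

```python
import torch

def transfer_attack(substitute, target_model, x, y, eps=0.03):
    """Craft adversarial examples on the white-box substitute, then test
    whether they transfer to the black-box target model."""
    x_adv = fgsm_attack(substitute, x, y, eps)   # white-box step on the substitute
    with torch.no_grad():                        # black-box: we can only query the target
        preds = target_model(x_adv).argmax(dim=1)
    transferred = preds != y                     # True where the target is also fooled
    return x_adv, transferred
```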

Types of attacks include:

  • Evasion attacks, which manipulate input data to evade detection or classification.
  • Poisoning attacks, which modify training data to compromise the model’s performance. For example, injecting random noise into the training set can cause the resulting model to misclassify (see the sketch after this list).
  • Replay attacks, which use previously recorded input data to manipulate the model. This may be applicable for time-series problems.
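
To illustrate the poisoning bullet, here is a minimal sketch that injects random noise into a fraction of the training inputs before a model is trained on them; the tensors `inputs` (pixels in [0, 1]) and `labels` are hypothetical placeholders.

```python
import torch

def poison_with_noise(inputs, labels, fraction=0.1, noise_scale=1.0, seed=0):
    """Input-poisoning sketch: overwrite a random fraction of training
    inputs with heavily noised versions (labels unchanged), so that a
    model trained on this data performs worse."""
    g = torch.Generator().manual_seed(seed)
    n = inputs.shape[0]
    idx = torch.randperm(n, generator=g)[: int(fraction * n)]
    poisoned = inputs.clone()
    noise = noise_scale * torch.randn(poisoned[idx].shape, generator=g)
    poisoned[idx] = (poisoned[idx] + noise).clamp(0.0, 1.0)
    return poisoned, labels
```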

Defences

Defences are still an active area of research; no method is yet known that reliably prevents adversarial attacks. Notably, these methods fail: