In machine learning, adversarial attacks are techniques used to manipulate or deceive models such that they make incorrect predictions or decisions. They exploit vulnerabilities in the model’s design or training data.

The goal: a small perturbation δ is chosen for an image x such that a neural network f misclassifies x + δ. We use an optimisation process to minimise the probability that f correctly classifies x + δ, i.e., we choose the δ that makes the network most likely to misclassify the image.

i.e., instead of optimising for the model’s weights, we optimise for δ.
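
As a concrete illustration, the one-step fast gradient sign method (FGSM) follows the gradient of the loss with respect to the input rather than the weights. This is only a minimal PyTorch sketch: the classifier `model`, inputs `x` (pixels in [0, 1]) and true labels `y` are hypothetical placeholders, not something defined in this post.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-step FGSM: move x in the direction that increases the loss on
    the true labels y, keeping the perturbation within an L-infinity ball
    of radius eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()                       # gradient w.r.t. the input, not the weights
    delta = eps * x.grad.sign()           # the perturbation δ we optimise for
    x_adv = (x + delta).clamp(0.0, 1.0)   # keep pixels in a valid range
    return x_adv.detach()
```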

A non-targeted attack minimises the probability that the prediction f(x + δ) is correct. A targeted attack maximises the probability that f(x + δ) is a chosen target class.
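
Continuing the same hypothetical setup, the targeted variant differs only in the label used and the sign of the gradient step: we descend the loss towards the chosen target class instead of ascending it on the true class.

```python
import torch.nn.functional as F

def targeted_fgsm(model, x, y_target, eps=0.03):
    """Targeted FGSM: step *down* the loss for a chosen target class,
    pushing the model towards predicting y_target."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)
    loss.backward()
    # Subtracting the signed gradient makes y_target more likely,
    # whereas the non-targeted version above adds it.
    x_adv = (x - eps * x.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```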

A white-box attack assumes the model’s architecture and weights are known; we use this information to optimise δ. A black-box attack does not have access to the architecture or weights. Instead, we use a known, differentiable substitute model that mimics the target model, since adversarial attacks often transfer across models.
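
A minimal sketch of such a transfer-based black-box attack, assuming a differentiable `substitute` model we control and a `target_model` we can only query (both hypothetical), and reusing the `fgsm_attack` sketch above:

```python
import torch

def transfer_attack(substitute, target_model, x, y, eps=0.03):
    """Craft adversarial examples on the white-box substitute, then test
    whether they transfer to the black-box target model."""
    x_adv = fgsm_attack(substitute, x, y, eps)   # white-box step on the substitute
    with torch.no_grad():                        # black-box: we can only query the target
        preds = target_model(x_adv).argmax(dim=1)
    transferred = preds != y                     # True where the target is also fooled
    return x_adv, transferred
```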

Types of attacks include:

  • Evasion attacks, which manipulate input data to evade detection or classification.
  • Poisoning attacks, which modify training data to compromise the model’s performance. For example, injecting random noise into the training set can cause the resulting model to misclassify (see the sketch after this list).
  • Replay attacks, which use previously recorded input data to manipulate the model. This may be applicable for time-series problems.
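
To illustrate the poisoning bullet, here is a minimal sketch that injects random noise into a fraction of the training inputs before a model is trained on them; the tensors `inputs` (pixels in [0, 1]) and `labels` are hypothetical placeholders.

```python
import torch

def poison_with_noise(inputs, labels, fraction=0.1, noise_scale=1.0, seed=0):
    """Input-poisoning sketch: overwrite a random fraction of training
    inputs with heavily noised versions (labels unchanged), so that a
    model trained on this data performs worse."""
    g = torch.Generator().manual_seed(seed)
    n = inputs.shape[0]
    idx = torch.randperm(n, generator=g)[: int(fraction * n)]
    poisoned = inputs.clone()
    noise = noise_scale * torch.randn(poisoned[idx].shape, generator=g)
    poisoned[idx] = (poisoned[idx] + noise).clamp(0.0, 1.0)
    return poisoned, labels
```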

Defences

Defences are still an active area of research; no method is yet known that reliably prevents adversarial attacks. Notably, these methods fail: