To 1 or to 0: Pixel Attacks in Image Classification
Hi there!
This year, I took part in my first Capture The Flag (CTF) competition by AI Village @ DEFCON 31, and the experience was intriguing, to say the least. The challenges, particularly those involving pixel attacks, caught my attention and are the main focus of this post. While I initially intended to share a simple version of a pixel attack I performed during the competition, the goal of this post is to also delve into strategies for strengthening ML models to better withstand pixel attacks like the ones encountered in the competition.
Before we dive into the theory, let's set the scene with a scenario that'll grab your attention.
Picture this: our company, MM Vigilant, is on a mission to develop a cutting-edge object detection product. The concept is simple yet revolutionary – customers snap a picture of the desired item, and it is delivered to their doorstep a few days later. As the brilliant data scientist behind the scenes, you've crafted the ultimate image-based object classification model. The classification results are impeccable, the model evaluation metrics are top-notch, and stakeholders couldn't be happier. The model hits production, and customers are delighted – until a wave of complaints rolls in.
Upon investigation, it turns out someone is meddling with the images before they reach the classifier. Specifically, every image of a clock is being mischievously classified as a mirror. The consequence? Anyone hoping for a clock is receiving an unexpected mirror at their door. Quite the unexpected twist, isn't it?
Our stakeholders at MM Vigilant are both concerned and intrigued by how this mishap occurred and, more importantly, what measures can be taken to prevent it.
The scenario we just explored is a hypothetical one, though image tampering is a very real possibility, especially if there are vulnerabilities in the model.
So let's take a closer look at one such manipulation of images…
Pixel Attacks in Image Classification
Pixel attacks, specifically in the context of image classification, aim to deceive a machine learning (ML) classifier into categorizing an image as something other than its actual class. While one can sarcastically argue that a subpar model already exhibits such behavior, the goal here is to outsmart state-of-the-art models. Needless to say, there are numerous methods and motivations for these attacks, but this post, limited in scope, will narrow its focus to black box, targeted, pixel attacks and the precautions involved.
Let's begin by approaching this concept intuitively. Essentially, any input to a neural network undergoes a series of mathematical operations per pixel to classify an image as X. Altering a pixel therefore changes the result of these operations and, with it, the prediction score. Push this far enough and, if a dominant, "central to classification" pixel is manipulated, its influence on the prediction score can be large enough that the outcome is a misclassification, as illustrated in the figure below.
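To make the intuition concrete, here is a toy sketch (purely illustrative, not the challenge model): a tiny linear "classifier" whose first class leans heavily on a single pixel, so flipping that one pixel flips the prediction.
import numpy as np

pixels = np.array([0.2, 0.9, 0.1])          # a flattened three-pixel "image"
weights = np.array([[0.05, 2.0, 0.05],      # class 0 leans heavily on pixel 1
                    [0.30, 0.1, 0.30]])     # class 1 barely uses it
print(weights @ pixels)                     # [1.815, 0.18] -> class 0 wins

pixels[1] = 0.0                             # flip the dominant pixel to 0
print(weights @ pixels)                     # [0.015, 0.09] -> class 1 now wins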
In the realm of image classification attacks, there are two well-known approaches, depending on the desired outcome of misclassification:
- Targeted attacks
- Untargeted attacks
Targeted attacks
Targeted pixel attacks involve a purposeful transformation with the goal of having the image classified as a specific class. For instance, imagine a deliberate attempt to classify a bear as a boat or an apple as a koala. These attacks have dual objectives: minimizing the score for the original class while maximizing the score for the target class.
Untargeted attacks
Conversely, untargeted pixel attacks operate on the premise of avoiding the classification of the image as its original class. The task simplifies to minimizing the prediction score for the specified class. In other words, the aim is to ensure that a bear, for example, is classified as anything other than a bear.
It's worth noting that every targeted attack can be considered an untargeted attack, but the reverse is not necessarily true.
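To make the distinction concrete, here is a minimal sketch of the two objectives an attacker would minimize, assuming a hypothetical scores dictionary that maps class names to prediction scores:
def untargeted_loss(scores, original_class):
    # only the original class score needs to go down
    return scores[original_class]

def targeted_loss(scores, original_class, target_class):
    # push the original class down and the target class up at the same time
    return scores[original_class] - scores[target_class]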
In addition to attack types, there are two distinct methodologies for carrying out these attacks, depending on whether the attacked model itself is available (a traditional/white box approach) or only its resulting scores (a black box methodology).
Traditional Attacks
In traditional or white box attacks, the model is usually available. Gradient information can be obtained and employed in attacks like the Fast Gradient Sign Method (FGSM). This method involves perturbing the input data by a small amount in the direction of the gradient, causing misclassification without significant alteration of the image's visual appearance.
A simple github implementation of the approach can be found in the following repository.
GitHub – ymerkli/fgsm-attack: Implementation of targeted and untargeted Fast Gradient Sign Method
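As a rough sketch of the idea, assuming a PyTorch classifier and a normalized input tensor (this is not the repository's implementation):
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    # `image` is a (1, C, H, W) tensor in [0, 1]; `label` holds the true class index
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # step a small amount in the direction of the gradient sign to raise the loss
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()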
Black box Attacks
Black box attacks, conversely, rely solely on the model predictions. Techniques such as differential evolution can be employed for executing this type of attack.
Differential Evolution is an optimization algorithm that mimics natural selection. It works by creating and combining potential solutions in iterations, choosing the best-performing ones from a population based on a set criterion. This approach is effective for exploring solution spaces and is commonly employed in adversarial attacks on machine learning models.
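For illustration, here is a minimal sketch of a one-pixel attack built on scipy's differential_evolution; predict_fn is a hypothetical black box function that returns an array of class scores for an image.
import numpy as np
from scipy.optimize import differential_evolution

def attack_objective(params, image, predict_fn, target_class):
    # params = [row, col, r, g, b] describing a single perturbed pixel
    row, col = int(params[0]), int(params[1])
    candidate = image.copy()
    candidate[row, col] = params[2:5]
    # differential_evolution minimizes, so return the negative target score
    return -predict_fn(candidate)[target_class]

def one_pixel_attack(image, predict_fn, target_class):
    h, w, _ = image.shape
    bounds = [(0, h - 1), (0, w - 1), (0, 255), (0, 255), (0, 255)]
    return differential_evolution(attack_objective, bounds,
                                  args=(image, predict_fn, target_class),
                                  maxiter=20, popsize=10)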
Given that our challenge focused on a black box targeted pixel attack, let's jump right into the implementation.
The CTF challenge
For the CTF challenge, the objective was to misclassify an unmistakable image of a wolf as a Granny Smith apple – a nod to the Red Riding Hood narrative. With a dataset featuring around 1000 classes and a high-resolution image at 768×768 pixels, surpassing the resolutions of MNIST, CIFAR, and even ImageNet, the difficulty lay in deceiving the model by identifying the minimum number of pixels that would lead to the target misclassification. It's worth noting that despite the intricacies of high-resolution images, classification still boils down, as mentioned above, to the non-intuitive task of reducing images to mere sets of values and running them through mathematical operations that depend heavily on those individual values.
Before we dive into the code, let's look at the original image of our wolf. Doesn't it seem like it has the potential to pull off an apple disguise? Those green eyes, round face, and the green background – all the makings of a fruity impostor.
Embarking on the journey, where the initial score from the black box model was about 0.29 for the class "timber wolf" and 0.0005 for "granny smith", I initially considered applying scipy's differential evolution, a method that has demonstrated success in pixel attacks on the CIFAR and ImageNet datasets. The technique starts with n random samples, representing the population size, and at each step the best offspring are chosen based on the model scores, eventually leading to the desired outcome. However, given my time constraints and the fact that the task involved changing the score for only a single image, I opted for a more straightforward strategy.
The Approach
I started by dividing the original image into progressively smaller blocks, starting from 2×2 and reaching up to 16×16. Focusing on the targeted granny smith apples (green), I changed the values in each block to a shade of apple green and observed the impact on the scores for both the timber wolf and granny smith classes. I then handpicked 2–3 of the 16×16 blocks and applied a version of differential evolution within each block: changing one pixel at a time, for about 50–75 iterations of randomly selected pixels in the region.
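For illustration, here is a rough sketch of that coarse block scan; the helper name and the exact shade of green are illustrative, and it reuses the get_scores function shown later in this post.
def scan_blocks(image, block_size, apple_green=(80, 200, 0)):
    h, w, _ = image.shape
    impact = {}
    for top in range(0, h, block_size):
        for left in range(0, w, block_size):
            candidate = image.copy()
            candidate[top:top + block_size, left:left + block_size] = apple_green
            s = get_scores(candidate)
            # record how much this block shifts the two scores of interest
            impact[(top, left)] = s['Granny Smith'] - s['timber wolf']
    # the blocks that move the scores the most are candidates for the pixel search
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)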
While I couldn't pinpoint the notorious single pixel within the given two days, I achieved a highly pixelated attack that altered the classification of the wolf to that of a granny smith apple, earning the flags for two sub-problems of a three-part task.
Now that we have context, let's jump into a bit of code so you take something away from this post.
Python Code
I treated it as a black box problem: the query, when given an image, returned a list of predictions for all classes. The predictions were sorted by score, so the predicted class was the first entry in the list.
import requests
import base64
import cv2
import numpy as np
import matplotlib.pyplot as plt
def query(input_data):
    response = requests.post({link to get the blackbox score},
                             json={'data': input_data})
    return response.json()
The get_scores function fed the image to the query in the correct format and, for the most part, returned the requisite results in a dictionary.
def get_scores(input_image):
    # Some preprocessing since the query accepted only bytes
    _, input_image = cv2.imencode('.png', input_image)
    image_bytes = input_image.tobytes()
    input_data = base64.b64encode(image_bytes).decode()
    result = query(input_data)
    """
    The result is a json dict with the key 'output' or 'flag'.
    The output consists of scores for 1000 classes, two of which are
    timber wolf and granny smith. Initially the score for timber wolf
    is around 0.29 and the score for granny smith is 0.0005.
    """
    dict_score = {"timber wolf": 0, "Granny Smith": 0}
    try:
        print(result['flag'])
    except KeyError:
        pass
    # the scores in the output are always sorted, so the first entry
    # is always the predicted class and its score
    dict_score["predicted_class"] = result['output'][0][1]
    dict_score["predicted_score"] = result['output'][0][0]
    # next we get the scores for our target and our original class
    count = 0
    for sublist in result['output']:
        score, class_name = sublist
        if class_name == "timber wolf":
            dict_score['timber wolf'] = score
            count += 1
        elif class_name == "Granny Smith":
            dict_score["Granny Smith"] = score
            count += 1
        if count == 2:
            break
    return dict_score
The relevant code
The core idea was to pick pixels within the RGB range of an apple's color and test about 50–75 candidates to find the one that maximized the "granny smith" class score and minimized the "timber wolf" class score. I gradually increased the size of the selected section and modified the optimization criterion as needed. For instance, once the granny smith score crossed the timber wolf score, I accepted any pixel that increased the granny smith score, as long as it stayed above the timber wolf score, instead of also requiring the timber wolf score to decrease; this obviously sped things up a little.
Despite not finding the elusive single pixel, I successfully executed a highly pixelated attack.
# Load your image
input_image = cv2.imread('/timber_wolf.png')

# Get the dimensions of the original image
image_height, image_width, _ = input_image.shape

# Define the size of the window (dxd)
# initially I had a large window size for testing purposes
# to identify regions of high interest
window_size = 1  # image_height//64

# get the initial scores
scores = get_scores(input_image)

dict_pixels = {'pixels': []}
best_score_tw = scores['timber wolf']   # the current/best score for timber wolf
best_score_gs = scores['Granny Smith']  # the current/best score for granny smith
print(best_score_tw, best_score_gs)

max_iter = 75
pixel_count = -1      # number of pixels that have been changed
max_pixel_count = 40  # number of pixels we want to change
temp_image = input_image
rand_red_best, rand_green_best = (0, 0)
row_best, col_best = (0, 0)

while pixel_count < max_pixel_count:
    iter_1 = -1
    while iter_1 < max_iter:
        # pick a random location in the region of interest
        # (although I did change these bounds from time to time)
        row = np.random.randint(192, 388)
        col = np.random.randint(192, 388)
        iter_1 += 1
        output_image = input_image.copy()
        left = row
        upper = col
        right = min(row + window_size, image_width)
        lower = min(col + window_size, image_height)
        # the pixel values for RGB were kept close to the color of the apple
        rand_red = np.random.randint(0, 153)
        rand_green = np.random.randint(170, 255)
        rand_blue = 0  # np.random.randint(0,255)
        output_image[upper:lower, left:right] = [rand_red, rand_green, rand_blue]
        scores = get_scores(output_image)
        # I also changed this criterion a couple of times depending on where the output was:
        # if (scores['timber wolf'] - scores['Granny Smith']) < min_score:
        # Initially I wanted pixels that bridged the gap between both classes the most.
        # Once the granny smith score crossed the timber wolf score, I only cared about
        # increasing the granny smith score as long as timber wolf stayed below it.
        if (best_score_tw > scores['timber wolf']) and (best_score_gs < scores['Granny Smith']):
            temp_image = output_image
            best_score_tw = scores['timber wolf']
            best_score_gs = scores['Granny Smith']
            rand_red_best = rand_red
            rand_green_best = rand_green
            min_diff = scores['timber wolf'] - scores['Granny Smith']
            row_best, col_best = row, col
            print(iter_1, [rand_red, rand_green, 0], ':', row, col, ';\n', min_diff, '\n')
    pixel_count += 1
    input_image = temp_image
    scores = get_scores(input_image)
    print(pixel_count,
          '\n', row_best, col_best, [rand_red_best, rand_green_best, 0],
          '\n', scores, '\n')
    dict_pixels['pixels'].append(([row_best, col_best], [rand_red_best, rand_green_best, 0]))
    np.save('/output_image.npy', input_image)
    np.save('/pixel_data.npy', dict_pixels)

scores = get_scores(input_image)
best_score_tw = scores['timber wolf']
best_score_gs = scores['Granny Smith']
print(best_score_tw, best_score_gs)
The resulting outcome looks something like this; it was classified as granny smith.
The Magnified results
My wolf has pretty obviously been tampered with, but this was a high-resolution image and the attack was a success. I'm sure that, given a little more time, far fewer pixels could have achieved better deception.
A word to the wise…
Having witnessed a potential version of a pixel attack that resulted in a misclassification by the model, based solely on prediction scores and trial and error, let's delve a bit further into how to avoid this.
Certainly, the objective here isn't to encourage performing pixel attacks, unless, of course, it's on your own model as a resilience check. The essence of exploring the intricacies of adversarial ML practices is to cultivate awareness of how to safeguard your model from succumbing to such approaches.
So let's delve into potential fortifications to avoid these scenarios…
Possible weaknesses of Pixel Attacks
Pixel attacks, especially in a black box setting, already involve a significant amount of trial and error, but various strategies can further enhance the robustness of models against these attacks.
1. Using higher resolution images
Higher resolution images are harder to attack: they demand more resources and a larger number of changed features/pixels, and are thus more challenging to tamper with subtly.
Clarification: For example, a 32×32 image from CIFAR has few pixels, making it more susceptible to tampering. In contrast, higher-resolution images are less prone to pixel attacks due to their increased pixel count. On the other hand, such images, while harder to tamper with subtly, may incur higher computational costs during training, necessitating a balance between security and computational efficiency.
2. Increasing the prediction score threshold for accepted results
Given that attacked images tend to have lower prediction scores, a score threshold can be used to detect potential adversarial attacks.
Clarification: For instance, setting a threshold below which predictions are considered inconclusive provides an added layer of security against adversarial attacks.
Again, it's worth pointing out that this is a trade-off: a higher threshold enhances confidence but may limit the classifier's sensitivity. Finding the right balance is crucial to avoid rejecting valid predictions while thwarting adversarial attacks.
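As a minimal sketch, assuming the classifier returns a list of (score, class) pairs sorted from highest to lowest, like the CTF query above, and using an illustrative threshold value:
CONFIDENCE_THRESHOLD = 0.6  # illustrative; tune against your own validation data

def classify_with_rejection(sorted_scores):
    top_score, top_class = sorted_scores[0]
    if top_score < CONFIDENCE_THRESHOLD:
        return "inconclusive"   # flag for manual review or a re-capture
    return top_class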
3. Considering CNNs' robustness against attacks for critical applications
It turns out that, while not immune, Convolutional Neural Networks (CNNs) are less susceptible to such adversarial attacks, given that they exploit spatial hierarchies.
Clarification: In simple terms, while an average model treats pixels as individual inputs, CNNs consider predefined associations through kernel windows, enhancing robustness against adversarial manipulations.
4. Preprocessing images before prediction
It may be worth applying a robust preprocessing technique to images before feeding them into neural networks for prediction, thus limiting black box attacks.
Clarification: Image compression, for instance, aids in reducing the effects of tampering, while computer vision algorithms can identify distortions or anomalies in images. Additionally, interpolation techniques can be applied since manipulated pixels may not closely match the colors or patterns of the original image.
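As a small sketch of such a preprocessing step, assuming OpenCV (the JPEG quality and blur kernel are illustrative choices):
import cv2

def preprocess(image, jpeg_quality=75):
    # re-encode as JPEG to wash out isolated adversarial pixels
    ok, buf = cv2.imencode('.jpg', image, [int(cv2.IMWRITE_JPEG_QUALITY), jpeg_quality])
    compressed = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    # a small median blur further smooths single-pixel outliers
    return cv2.medianBlur(compressed, 3)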
Secure ML Models
The above approaches, while effective, are not one-size-fits-all. Ultimately, securing an ML model against adversarial attacks involves rigorous testing and validation under various conditions, including exposure to potential adversarial inputs.
Deciding how much security to add and how often to update the model depends on how critical the model is and the types of threats it might face. Still, being aware of ethical considerations and understanding possible threats can help reduce the risks from attacks.
Wrapping Up…
While it's true that pixel attacks or any manipulation of images can be a big problem for image-based AI systems, there's also a lot we can do to protect against them. Attackers can mess with individual pixels to trick models into making mistakes, jeopardizing the reliability of crucial applications like image recognition and security systems. This not only leads to security breaches but also undermines trust from customers and stakeholders.
On the flip side, ML practitioners have tools at their disposal to make sure models aren't vulnerable to such attacks.
In this post, I explored pixel attacks, inspired by a CTF challenge, and delved into some of the intricacies of deceiving image classification models. While the wolf did morph into a Granny Smith apple, it took a lot of computation and trial and error, and had the model employed some precautions, the attack might well have been unsuccessful.
I leave a few resources on similar approaches below, and hope you find the topic useful in keeping your models safe.
Resources
GitHub – Hyperparticle/one-pixel-attack-keras: Keras implementation of "One pixel attack for…
GitHub – max-andr/square-attack: Square Attack: a query-efficient black-box adversarial attack via…
GitHub – kenny-co/procedural-advml: Task-agnostic universal black-box attacks on computer vision…