Build a Convolutional Neural Network from Scratch using NumPy

As Computer Vision applications are now present everywhere in our daily lives, it is fundamental for every Data Science practitioner to understand their functioning principles and familiarize themselves with them.
In a previous article, I built a Deep Neural Network without relying on popular modern deep learning libraries like TensorFlow, PyTorch, and Keras, and used it to classify images of handwritten digits. While the achieved results didn't reach state-of-the-art levels, they were nevertheless satisfactory. Now, I want to take a further step and develop a Convolutional Neural Network (CNN) using only the Python library NumPy.
Python deep learning libraries, like the ones mentioned above, are extremely powerful tools. However, as a downside, they shield Data Science practitioners from understanding the low-level functioning principles of Neural Networks. This is especially true with CNNs, as their processes are less intuitive compared with the classical fully connected networks. The only way to address this issue is to get our hands dirty and implement CNNs ourselves: this is the motivation behind this task.
This article is intended as a practical, hands-on guide rather than a comprehensive treatment of CNN functioning principles. As a consequence, the theoretical part is concise and mostly serves the understanding of the practical section. To compensate, you will find an exhaustive list of resources at the end of this post. I warmly invite you to check them out!
Convolutional Neural Networks
Convolutional Neural Networks use a specific architecture and operations that make them well-suited for tasks related to images, such as image classification, object localization, image segmentation, and more. Their design roughly mirrors the human visual cortex, where each biological neuron responds to only a small portion of the visual field. Moreover, higher-level neurons react to the outputs of lower-level neurons.
While classical fully connected networks can handle image-related tasks, their effectiveness degrades significantly when applied to medium or large images due to the large number of parameters they require. For instance, a 200×200 pixel image contains 40,000 pixels, and if the first layer of the network has 1,000 units, that single layer alone requires 40 million weights. CNNs greatly alleviate this challenge by employing partially connected layers and weight sharing.
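As a quick sanity check of that arithmetic:

# Weights required by a single fully connected layer on a 200x200 image:
pixels = 200 * 200     # 40,000 input values
units = 1000           # units in the first dense layer
print(pixels * units)  # 40000000 weights for one layer alone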
The main components of a Convolutional Neural Network are:
- Convolutional layers
- Pooling layers
Convolutional Layer
A convolutional layer consists of a set of filters, also known as kernels. When applied to the input of the layer, these filters modify the original images in specific ways.
A filter can be described as a matrix whose elements' values define the kind of modification applied to the original image. For instance, some 3×3 kernels highlight the vertical edges of the image, while others accentuate the horizontal ones.
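As an illustration, a classic pair of such kernels is the Sobel operators. Note that these are standard textbook examples, not necessarily the exact values a trained network would learn:

import numpy as np

# The Sobel operators: two classic 3x3 edge-detection kernels.
vertical_edge_kernel = np.array([[-1, 0, 1],
                                 [-2, 0, 2],
                                 [-1, 0, 1]])

horizontal_edge_kernel = np.array([[-1, -2, -1],
                                   [ 0,  0,  0],
                                   [ 1,  2,  1]])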
It is important to note that the values of the elements of these kernels are not manually chosen but are parameters that the network learns during the training process.
The main function of convolutions is to isolate and highlight the different features present in the image. Later on, dense layers will use these features.
Pooling Layer
Pooling layers are simpler than convolutional layers. Their purpose is to reduce the computational load and memory usage of the network. They achieve this by downsizing the input image's dimensions, which in turn reduces the number of parameters that the CNN has to learn.
Pooling layers also employ a kernel, typically of dimension 2×2, to aggregate a section of the input image into a single value. For example, a 2×2 max pooling kernel extracts 4 pixels from the input image and outputs only the pixel with the maximum value.
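To make the aggregation concrete, here is a minimal sketch of 2×2 max pooling on a toy 4×4 image (using a NumPy reshape trick rather than the loop-based class defined later):

import numpy as np

# A 4x4 single-channel image.
image = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]])

# Split the image into non-overlapping 2x2 blocks and keep each block's maximum.
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 8]
#  [3 4]]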
Python Implementation
You can find all the code shown in this section in my GitHub repository: andreoniriccardo/CNN-from-scratch.
The idea behind this implementation is to create Python classes that represent the convolutional and max pooling layers. Furthermore, as this CNN will be applied to the famous open-source MNIST dataset, I also create a specific class for the Softmax layer.
Within each class, I define the methods that perform the forward propagation and backpropagation steps.
As a final step, the layers are appended to a list to build the final Convolutional Neural Network.
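As a minimal sketch, assuming 16 kernels of size 3×3, 2×2 pooling, and 28×28 MNIST inputs (the repository may use different sizes), that list could look like this, with the classes defined in the following sections:

layers = [
    ConvolutionLayer(16, 3),         # 28x28 input -> 26x26x16 output
    MaxPoolingLayer(2),              # 26x26x16 -> 13x13x16
    SoftmaxLayer(13 * 13 * 16, 10),  # flatten -> 10 digit probabilities
]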
Convolutional Layer Implementation
The code defining a convolutional layer is the following:
import numpy as np

class ConvolutionLayer:
    def __init__(self, kernel_num, kernel_size):
        self.kernel_num = kernel_num
        self.kernel_size = kernel_size
        # Generate random filters of shape (kernel_num, kernel_size, kernel_size),
        # divided by the squared kernel size for normalization.
        self.kernels = np.random.randn(kernel_num, kernel_size, kernel_size) / (kernel_size**2)

    def patches_generator(self, image):
        # Yield every kernel_size x kernel_size patch of the image together
        # with the coordinates of its top-left corner.
        image_h, image_w = image.shape
        self.image = image
        for h in range(image_h - self.kernel_size + 1):
            for w in range(image_w - self.kernel_size + 1):
                patch = image[h:(h + self.kernel_size), w:(w + self.kernel_size)]
                yield patch, h, w

    def forward_prop(self, image):
        # Convolve every kernel over the image (stride 1, no padding).
        image_h, image_w = image.shape
        convolution_output = np.zeros((image_h - self.kernel_size + 1, image_w - self.kernel_size + 1, self.kernel_num))
        for patch, h, w in self.patches_generator(image):
            convolution_output[h, w] = np.sum(patch * self.kernels, axis=(1, 2))
        return convolution_output

    def back_prop(self, dE_dY, alpha):
        # Accumulate the gradient of the loss with respect to each kernel.
        dE_dk = np.zeros(self.kernels.shape)
        for patch, h, w in self.patches_generator(self.image):
            for f in range(self.kernel_num):
                dE_dk[f] += patch * dE_dY[h, w, f]
        # Update the kernels with a gradient descent step.
        self.kernels -= alpha * dE_dk
        return dE_dk
The constructor of the ConvolutionLayer class takes as inputs the number of kernels of the convolutional layer and their size. I assume only square kernels of size kernel_size by kernel_size are used. Then, I generate random filters of shape (kernel_num, kernel_size, kernel_size) and, for normalization, I divide each element by the squared kernel size.
The patches_generator() method is a generator: it yields the portions of the image on which each convolution step is performed.
The forward_prop() method carries out the convolution for each patch generated by the method above.
Finally, the back_prop() method computes the gradient of the loss function with respect to each weight of the layer and updates the weights accordingly. Note that the loss mentioned here is not the global loss of the network: it is the gradient of the loss that the max pooling layer passes back to the preceding convolutional layer.
To demonstrate the actual effect of this class, I create an instance of ConvolutionLayer with 32 filters, each of size 3×3. I then apply the forward propagation method to an image, resulting in an output of 32 slightly smaller images.
The initial input image has size 28×28 pixels and is depicted below:

After applying the forward_prop() method of the convolutional layer, I obtain 32 images of size 26×26 pixels. One of them is the following:

As you can see, the image has been reduced in size, and the clarity of the handwritten digit is worse. It is important to note that this operation was carried out by a filter containing random values, and therefore, it does not accurately represent the actual step performed by a trained CNN. Still, you can grasp the idea of how these convolutions yield smaller images where the distinctive features of the object are isolated.
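For reference, the demonstration just described amounts to a few lines (here image is assumed to be a 28×28 NumPy array holding one MNIST digit):

conv_layer = ConvolutionLayer(kernel_num=32, kernel_size=3)
output = conv_layer.forward_prop(image)
print(output.shape)  # (26, 26, 32): 32 feature maps of 26x26 pixels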
Max Pooling Layer Implementation
I used NumPy to define the Max Pooling layer class as follows:
class MaxPoolingLayer:
    def __init__(self, kernel_size):
        self.kernel_size = kernel_size

    def patches_generator(self, image):
        # Yield the non-overlapping kernel_size x kernel_size patches of the image.
        output_h = image.shape[0] // self.kernel_size
        output_w = image.shape[1] // self.kernel_size
        self.image = image
        for h in range(output_h):
            for w in range(output_w):
                patch = image[(h * self.kernel_size):(h * self.kernel_size + self.kernel_size),
                              (w * self.kernel_size):(w * self.kernel_size + self.kernel_size)]
                yield patch, h, w

    def forward_prop(self, image):
        # Keep only the maximum value of each patch, channel by channel.
        image_h, image_w, num_kernels = image.shape
        max_pooling_output = np.zeros((image_h // self.kernel_size, image_w // self.kernel_size, num_kernels))
        for patch, h, w in self.patches_generator(image):
            max_pooling_output[h, w] = np.amax(patch, axis=(0, 1))
        return max_pooling_output

    def back_prop(self, dE_dY):
        # Route each incoming gradient back to the position of the maximum
        # value in its patch; every other position gets a zero gradient.
        dE_dk = np.zeros(self.image.shape)
        for patch, h, w in self.patches_generator(self.image):
            image_h, image_w, num_kernels = patch.shape
            max_val = np.amax(patch, axis=(0, 1))
            for idx_h in range(image_h):
                for idx_w in range(image_w):
                    for idx_k in range(num_kernels):
                        if patch[idx_h, idx_w, idx_k] == max_val[idx_k]:
                            dE_dk[h * self.kernel_size + idx_h, w * self.kernel_size + idx_w, idx_k] = dE_dY[h, w, idx_k]
        return dE_dk
The constructor method only assigns the kernel size value. The following methods operate similarly to the ones defined for the convolutional layer, with the main difference being that the back_prop() method doesn't update any weight values. In fact, the pooling layer doesn't rely on weights to perform the aggregation.
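Continuing the earlier demonstration, applying a 2×2 max pooling layer to the 26×26×32 convolution output halves its spatial dimensions:

pooling_layer = MaxPoolingLayer(kernel_size=2)
pooled_output = pooling_layer.forward_prop(output)
print(pooled_output.shape)  # (13, 13, 32)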
Softmax Layer Implementation
Finally, I define the Softmax layer. Its objective is to flatten the output volume of the final max pooling layer and map it, through a dense layer, to 10 output values: these can be interpreted as the probabilities that the image corresponds to each of the digits from 0 to 9.
The implementation has the same structure as the classes seen above:
class SoftmaxLayer:
    def __init__(self, input_units, output_units):
        # Initialize random weights and zero biases for the dense layer.
        self.weight = np.random.randn(input_units, output_units) / input_units
        self.bias = np.zeros(output_units)

    def forward_prop(self, image):
        # Flatten the input volume, apply the dense layer, then the softmax.
        self.original_shape = image.shape
        image_flattened = image.flatten()
        self.flattened_input = image_flattened
        first_output = np.dot(image_flattened, self.weight) + self.bias
        self.output = first_output
        softmax_output = np.exp(first_output) / np.sum(np.exp(first_output), axis=0)
        return softmax_output

    def back_prop(self, dE_dY, alpha):
        for i, gradient in enumerate(dE_dY):
            # With a cross-entropy loss and a one-hot label, only one entry
            # of dE_dY is nonzero: skip all the others.
            if gradient == 0:
                continue
            transformation_eq = np.exp(self.output)
            S_total = np.sum(transformation_eq)
            # Derivative of the softmax outputs with respect to the logits.
            dY_dZ = -transformation_eq[i] * transformation_eq / (S_total**2)
            dY_dZ[i] = transformation_eq[i] * (S_total - transformation_eq[i]) / (S_total**2)
            # Derivatives of the logits with respect to weights, bias, and input.
            dZ_dw = self.flattened_input
            dZ_db = 1
            dZ_dX = self.weight
            # Chain rule: gradient of the loss with respect to the logits...
            dE_dZ = gradient * dY_dZ
            # ...and with respect to the weights, bias, and input.
            dE_dw = dZ_dw[np.newaxis].T @ dE_dZ[np.newaxis]
            dE_db = dE_dZ * dZ_db
            dE_dX = dZ_dX @ dE_dZ
            # Update the parameters and pass the gradient back, reshaped to
            # the max pooling layer's output shape.
            self.weight -= alpha * dE_dw
            self.bias -= alpha * dE_db
            return dE_dX.reshape(self.original_shape)
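To show how the three classes fit together, here is a minimal sketch of a single training step on one image, assuming the three-layer list sketched earlier, a cross-entropy loss, and pixel values normalized to [-0.5, 0.5]; the learning rate value is also an assumption. The main.py script in the repository implements the full training loop:

def train_step(image, label, layers, alpha=0.05):
    # Forward pass: normalize the pixels, then run every layer in order.
    output = image / 255.0 - 0.5
    for layer in layers:
        output = layer.forward_prop(output)
    # Cross-entropy loss on the softmax probabilities.
    loss = -np.log(output[label])
    # Initial gradient: nonzero only at the true label, which is why
    # SoftmaxLayer.back_prop() skips the zero entries.
    gradient = np.zeros(10)
    gradient[label] = -1 / output[label]
    # Backward pass in reverse order; the pooling layer takes no learning rate.
    gradient = layers[2].back_prop(gradient, alpha)  # SoftmaxLayer
    gradient = layers[1].back_prop(gradient)         # MaxPoolingLayer
    gradient = layers[0].back_prop(gradient, alpha)  # ConvolutionLayer
    return loss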

Conclusions
In this post, we saw a theoretical introduction to the fundamental CNN architectural elements, such as convolutional and pooling layers. I am confident that the step-by-step Python implementation will provide you with a practical understanding of how these theoretical concepts translate into code.
I invite you to clone the GitHub repository containing the code and play with the main.py script. The network doesn't achieve state-of-the-art performance, as it is not built for that objective, but it nevertheless reaches 96% accuracy after a few epochs.
Finally, to expand your knowledge of CNNs and computer vision, I suggest checking out some of the resources listed below.
If you liked this story, consider following me to be notified of my upcoming projects and articles!
References
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
- "ImageNet Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
- "Very Deep Convolutional Networks for Large-Scale Image Recognition" by Karen Simonyan and Andrew Zisserman (VGGNet)
- "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili
- "Convolutional Neural Networks in Python: Master Data Science and Machine Learning with Modern Deep Learning in Python, Theano, and TensorFlow" by Jason Brownlee
- "Hands-On Convolutional Neural Networks with TensorFlow 2" by Alex Gotev