Retro Data Science: Testing the First Versions of YOLO


The world of Data Science is constantly changing. Often we cannot see these changes simply because they happen slowly, but after a while it is easy to look back and see that the landscape has become drastically different. Tools and libraries that were at the cutting edge of progress only 10 years ago can be completely forgotten today.

YOLO (You Only Look Once) is a popular object detection library. Its first version was released quite a long time ago, in 2015. YOLO was fast, it provided good results, and the pre-trained models were publicly available. The model quickly became popular, and the project is still actively developed today. This gives us the opportunity to see how data science tools and libraries have evolved over the years. In this article, I will test different YOLO versions, from the very first V1 up to the latest V8.

For further testing, I will use this image:

Test image, made by author

Let's get started.

YOLO V1..V3

The very first YOLO paper, "You Only Look Once: Unified, Real-Time Object Detection," was released in 2015. And surprisingly, YOLO v1 is still available for download. As Mr. Redmon, one of the authors of the original paper, wrote, he keeps this version "for historical purposes," which is really nice indeed. But can we run it today? The model is distributed in the form of two files. The configuration file "yolo.cfg" contains details about the neural network architecture:

[net]
batch=1
height=448
width=448
channels=3
momentum=0.9
decay=0.0005
...

[convolutional]
batch_normalize=1
filters=64
size=7
stride=2
pad=1
activation=leaky

And the second file "yolov1.weights", as the name suggests, contains the weights of the pre-trained model.

This format does not come from PyTorch or Keras. It turned out that the model was created using Darknet, an open-source neural network framework written in C. This project is still available on GitHub, but it looks abandoned. At the moment of writing this article, there are 164 pull requests and 1794 open issues; the last commits were made in 2018, and since then only README.md has been changed (well, this is probably what the death of a project looks like in the modern digital world).

The original Darknet project is abandoned; that is the bad news. The good news is that the readNetFromDarknet method is still available even in the latest OpenCV versions. So, we can easily try to load the original YOLO v1 model in a modern Python environment:

import cv2

model = cv2.dnn.readNetFromDarknet("yolo.cfg", "yolov1.weights")

Alas, it did not work; I only got an error:

darknet_io.cpp:902: error: 
(-212:Parsing error) Unknown layer type: local in function 'ReadDarknetFromCfgStream'
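
One way to see which layer types a Darknet config actually uses is to scan the file directly. Here is a quick, illustrative sketch (it only relies on the "yolo.cfg" file downloaded above):

layer_types = set()
with open("yolo.cfg") as f:
    for line in f:
        line = line.strip()
        # Section headers like [net], [convolutional] or [local] mark the layer types
        if line.startswith("[") and line.endswith("]"):
            layer_types.add(line[1:-1])
print(layer_types)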

It turned out that "yolo.cfg" has a layer named "local", which is not supported by OpenCV, and I don't know if there is a workaround for that. Anyway, the YOLO v2 config does not have this layer anymore, and this model can be successfully loaded in OpenCV:

import cv2

model = cv2.dnn.readNetFromDarknet("yolov2.cfg", "yolov2.weights")

Using the model is not as easy as we might expect. First, we need to find the output layers of the model:

# The output layers are the "unconnected" ones;
# getUnconnectedOutLayers returns their 1-based indices
ln = model.getLayerNames()
output_layers = [ln[i - 1] for i in model.getUnconnectedOutLayers()]

Then we need to load the image and convert it into binary format, which the model can understand:

# Load the test image and remember its size for scaling the boxes back
img = cv2.imread('test.jpg')
H, W = img.shape[:2]

# Scale pixels to [0, 1], resize to the 608x608 network input, and swap BGR to RGB
blob = cv2.dnn.blobFromImage(img, 1/255.0, (608, 608), swapRB=True, crop=False)

Finally, we can run forward propagation. The forward method runs the calculations and returns the outputs of the requested layers:

model.setInput(blob)
outputs = model.forward(output_layers)

Running the forward pass is straightforward, but parsing the output can be a bit tricky. The model produces 85-dimensional feature vectors as output, where the first 4 values describe the object rectangle, the 5th value is the probability that an object is present, and the remaining 80 values contain the probabilities for the 80 categories the model was trained on. Having this information, we can draw the labels over the original image:

import numpy as np

threshold = 0.5
boxes, confidences, class_ids = [], [], []

# Get all boxes and labels
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > threshold:
            center_x, center_y = int(detection[0] * W), int(detection[1] * H)
            width, height = int(detection[2] * W), int(detection[3] * H)
            left = center_x - width//2
            top = center_y - height//2
            boxes.append([left, top, width, height])
            class_ids.append(class_id)
            confidences.append(float(confidence))

# Suppress overlapping boxes using non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

# All 80 COCO class names
classes = ("person;bicycle;car;motorbike;aeroplane;bus;train;truck;boat;traffic light;fire hydrant;stop sign;parking meter;bench;bird;"
           "cat;dog;horse;sheep;cow;elephant;bear;zebra;giraffe;backpack;umbrella;handbag;tie;suitcase;frisbee;skis;snowboard;sports ball;kite;"
           "baseball bat;baseball glove;skateboard;surfboard;tennis racket;bottle;wine glass;cup;fork;knife;spoon;bowl;banana;apple;sandwich;"
           "orange;broccoli;carrot;hot dog;pizza;donut;cake;chair;sofa;pottedplant;bed;diningtable;toilet;tvmonitor;laptop;mouse;remote;keyboard;"
           "cell phone;microwave;oven;toaster;sink;refrigerator;book;clock;vase;scissors;teddy bear;hair drier;toothbrush").split(";")

# Draw rectangles on image
colors = np.random.randint(0, 255, size=(len(classes), 3), dtype='uint8')
for i in indices.flatten():
    x, y, w, h = boxes[i]
    color = [int(c) for c in colors[class_ids[i]]]
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
    text = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.putText(img, text, (x + 2, y - 6), cv2.FONT_HERSHEY_COMPLEX, 0.5, color, 1)

# Show
cv2.imshow('window', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Here I use np.argmax to find the class ID with the maximum probability. The YOLO model was trained on the COCO (Common Objects in Context, Creative Commons Attribution 4.0 License) dataset, and for simplicity, I placed all 80 label names directly in the code. I also used the OpenCV NMSBoxes method to suppress overlapping rectangles.
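
To make the role of NMSBoxes clearer, here is a tiny standalone sketch (with made-up box coordinates): two heavily overlapping boxes go in, and only the index of the higher-confidence one comes back.

import cv2

boxes = [[10, 10, 100, 100], [12, 12, 100, 100]]  # x, y, width, height
confidences = [0.9, 0.8]
# Same thresholds as above: score threshold 0.5, NMS (IoU) threshold 0.4
kept = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
print(kept)  # only the first (higher-confidence) box survives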

The final result looks like this:

YOLO v2 results, Image by author

We successfully ran a model released in 2016 in a modern environment!

The next version, YOLO v3, was released two years later, in 2018, and we can run it using the same code; only the configuration and weights files change (both are available online), as sketched below. As the authors wrote in the paper, the new model is more accurate, and we can easily verify this.
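
A minimal sketch of the change, assuming the standard "yolov3.cfg" and "yolov3.weights" file names:

import cv2

# Only the file names differ from the YOLO v2 example above;
# the rest of the pipeline (blob, forward pass, parsing) stays the same
model = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")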

YOLO v3 results, Image by author

Indeed, the V3 model was able to find more objects in the same image. Those readers who are interested in technical details can read this TDS article written in 2018.

YOLO V5..V7

As we can see, the model loaded with the readNetFromDarknet method works, but the required code is pretty "low-level" and cumbersome. OpenCV developers decided to make life easier, and in 2019, a new DetectionModel class was added to version 4.1.2. We can load the YOLO model this way; the general logic remains the same, but the required amount of code is much smaller. The model directly returns class IDs, confidence values, and rectangles in one method call:

import cv2
import numpy as np

img = cv2.imread('test.jpg')

model = cv2.dnn_DetectionModel("yolov7.cfg", "yolov7.weights")
model.setInputParams(size=(640, 640), scale=1/255, mean=(127.5, 127.5, 127.5), swapRB=True)

class_ids, confidences, boxes = model.detect(img, confThreshold=0.5)

# Suppress overlapping boxes using non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

# All 80 COCO class names
classes = ("person;bicycle;car;motorbike;aeroplane;bus;train;truck;boat;traffic light;fire hydrant;stop sign;parking meter;bench;bird;"
           "cat;dog;horse;sheep;cow;elephant;bear;zebra;giraffe;backpack;umbrella;handbag;tie;suitcase;frisbee;skis;snowboard;sports ball;kite;"
           "baseball bat;baseball glove;skateboard;surfboard;tennis racket;bottle;wine glass;cup;fork;knife;spoon;bowl;banana;apple;sandwich;"
           "orange;broccoli;carrot;hot dog;pizza;donut;cake;chair;sofa;pottedplant;bed;diningtable;toilet;tvmonitor;laptop;mouse;remote;keyboard;"
           "cell phone;microwave;oven;toaster;sink;refrigerator;book;clock;vase;scissors;teddy bear;hair drier;toothbrush").split(";")

# Draw rectangles on image
colors = np.random.randint(0, 255, size=(len(classes), 3), dtype='uint8')
for i in indices.flatten():
    x, y, w, h = boxes[i]
    color = [int(c) for c in colors[class_ids[i]]]
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
    text = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.putText(img, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)

# Show
cv2.imshow('window', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

As we can see, all the low-level code for extracting boxes and confidence values from the model output is no longer needed.

The result of running YOLO v7 is, in general, the same:

YOLO v7 results, Image by author

YOLO V8

The 8th version was released in 2023, so I cannot consider it "retro", at least at the moment of writing this text. But just to compare the results, let's see the code required nowadays to run YOLO:

from ultralytics import YOLO
import supervision as sv
import cv2

img = cv2.imread('test.jpg')

model = YOLO('yolov8m.pt')
results = model.predict(source=img, save=False, save_txt=False, verbose=False)
detections = sv.Detections.from_yolov8(results[0])

# Create list of labels
labels = []
for ind, class_id in enumerate(detections.class_id):
    labels.append(f"{model.model.names[class_id]}: {detections.confidence[ind]:.2f}")

# Draw rectangles on image
box_annotator = sv.BoxAnnotator(thickness=2, text_thickness=1, text_scale=0.4)
box_annotator.annotate(scene=img, detections=detections, labels=labels)

# Show
cv2.imshow('window', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
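
As a quick check that the label names now travel with the model, here is a minimal sketch (assuming the same "yolov8m.pt" weights as above):

from ultralytics import YOLO

model = YOLO('yolov8m.pt')   # the weights are downloaded automatically if missing
print(model.names[0])        # 'person'
print(len(model.names))      # 80 COCO classes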

As we can see, the code became even more compact. We don't need to take care of dataset label names (the model provides a "names" property) or figure out how to draw rectangles and labels on the image (there is a special BoxAnnotator class for that). We don't even need to download the model weights anymore; the library will do it automatically for us. Compared to 2016, the program from 2023 has "shrunk" from about 50 to about 5 lines of code! It is obviously a nice improvement, and modern developers don't need to know about forward propagation or the output layer format anymore. The model just works as a black box with some "magic" inside. Is it good or bad? I don't know.

Tags: Algorithms Data Science Image Processing Programming Python
