5.2 Techniques
1. Object Tracking with YOLOv8
Object tracking is the task of identifying and monitoring objects as they move across frames in a video sequence. It is essential to applications such as traffic monitoring, security surveillance, autonomous vehicles, and sports analytics. The typical object tracking pipeline consists of the following steps:
- Feature Extraction & Object Detection: Identifying and locating objects of interest in each frame of a video. Popular detection algorithms include YOLO and Faster R-CNN. Distinctive features extracted from the detected objects aid in tracking: they help differentiate between objects and ensure the same objects are tracked consistently across frames.
- State Estimation: Predicting the position of the objects in subsequent frames. Models such as the Kalman filter, particle filter, or deep learning-based methods are often employed. For example, in autonomous vehicles, these methods can predict where a pedestrian will be in the next few seconds (a minimal prediction sketch follows this list).
- Data Association: Matching detected objects in the current frame with tracked objects from previous frames. This can be based on image features, state estimation, or both.
- Track Management: Creating new tracks for newly detected objects and terminating tracks for objects that are no longer visible.
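To make the state-estimation step concrete, here is a minimal constant-velocity Kalman filter predict/correct cycle for a single track. The state layout [cx, cy, vx, vy], the noise matrices, and the numbers are illustrative assumptions for this sketch only; DeepSORT's internal filter tracks a richer bounding-box state, but the predict/correct idea is the same.

```python
# Minimal sketch: constant-velocity Kalman filter for one track's box centre.
# State x = [cx, cy, vx, vy]; all values below are made-up for illustration.
import numpy as np

dt = 1.0  # one frame between updates
F = np.array([[1, 0, dt, 0],   # state transition: position += velocity * dt
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # we only observe the box centre, not the velocity
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise (assumed)
R = np.eye(2) * 1.0            # measurement noise (assumed)

x = np.array([100.0, 200.0, 2.0, -1.0])  # current state estimate
P = np.eye(4)                            # current state covariance

# Predict where the object will be in the next frame
x_pred = F @ x
P_pred = F @ P @ F.T + Q

# Correct the prediction with the detector's measurement of the box centre
z = np.array([103.0, 198.5])
S = H @ P_pred @ H.T + R                 # innovation covariance
K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
x_new = x_pred + K @ (z - H @ x_pred)
P_new = (np.eye(4) - K @ H) @ P_pred

print("Predicted centre:", x_pred[:2], "Corrected centre:", x_new[:2])
```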
In this exercise, we’ll use YOLOv8 for object detection and DeepSORT for object tracking; together they combine these steps into an efficient real-time tracking system.
Steps:
- Read video and ground truth data:
  - Download and read the input video file.
  - Parse the ground truth object tracks.
  - Visualise the ground truth.
- Object detection with YOLOv8:
  - Load a pre-trained YOLOv8 model.
  - Run object detection on the selected frame and observe the result.
  - Count the number of different objects.
- Object tracking with DeepSORT:
  - Initialise the DeepSORT tracker with appropriate parameters.
  - Iterate through the video frames; for each frame:
    - Perform object detection using YOLOv8.
    - Update the DeepSORT tracker with the new detections.
    - Visualise the tracked objects.
  - Save the processed video with visualisations.
```python
# Write a function to parse the ground truth labels
# Write a function to draw bounding boxes on the tracked targets
# Visualise the ground truth labels
import cv2
import matplotlib.pyplot as plt

def parse_ground_truth_file(file):
    """
    Parse the ground truth bounding box file and organize as a list of lists.
    Each index of the list corresponds to a frame, and contains a list of bounding boxes.
    """
    bounding_boxes = []

    with open(file, 'r') as f:
        for line in f.readlines():
            line_data = line.strip().split(',')

            # Convert the values
            frame_id = int(line_data[0])  # Frame index (1-based)
            obj_id = int(line_data[1])    # Object ID
            x = int(line_data[2])         # X coordinate
            y = int(line_data[3])         # Y coordinate
            w = int(line_data[4])         # Width
            h = int(line_data[5])         # Height

            # Ensure the list has enough frames
            while len(bounding_boxes) < frame_id:
                bounding_boxes.append([])  # Add empty lists for missing frames

            # Append the bounding box to the corresponding frame's list
            bounding_boxes[frame_id - 1].append([obj_id, x, y, w, h])  # frame_id - 1 for zero-indexing

    return bounding_boxes

def draw_detection_on_frame(frame, boxes):
    """
    Draw bounding boxes on a single frame.
    """
    # Create a copy of the original frame to avoid modifying it
    frame = frame.copy()

    for box in boxes:
        obj_id, x, y, w, h = box  # Unpack the values
        label = f'ID: {obj_id}'

        # Draw a rectangle around the object
        color = (0, 255, 0)  # Pick a colour
        thickness = 3
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, thickness)

        # Label the object
        cv2.putText(frame, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)

    return frame

video_path = '/content/ADL-Rundle-6/ADL-Rundle-6.mp4'

# Pick a frame
frame_number = 100
cap = cv2.VideoCapture(video_path)
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)  # Set frame number
_, frame = cap.read()
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
cap.release()

# Draw the frame with ground truth detection boxes
ground_truths = parse_ground_truth_file('/content/ADL-Rundle-6/gt.txt')
gt_frame = draw_detection_on_frame(frame, ground_truths[frame_number])

# Display the frames side by side
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(frame)
plt.title('Original')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(gt_frame)
plt.title('Ground Truth')
plt.axis('off')

plt.tight_layout()
plt.show()
```
YOLOv8
YOLOv8, developed by Ultralytics (the creators of YOLOv5), is one of the computer vision models in the YOLO (You Only Look Once) family. It offers significant improvements in speed, accuracy, and versatility over its predecessors. The YOLOv8 models are capable of several computer vision tasks: object detection, image segmentation, image classification, and pose estimation. They can be used via both the command-line interface and the Python API.
Refer to https://docs.ultralytics.com/models/yolov8/ for complete documentation.
### Model Types
- YOLOv8n (Nano): Smallest and fastest, suitable for edge devices
- YOLOv8s (Small): Balances speed and accuracy for general use
- YOLOv8m (Medium): Higher accuracy, moderate speed
- YOLOv8l (Large): High accuracy, slower speed
- YOLOv8x (Extra Large): Highest accuracy, slowest speed
- Add -seg, -cls, or -pose to the model name for image segmentation, image classification, and pose estimation tasks (see the sketch after this list)
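As a quick illustration of the naming scheme, the snippet below loads one checkpoint per task using the nano variant; these are standard pre-trained Ultralytics checkpoints downloaded on first use. This is a sketch only — the exercise itself uses yolov8m.pt for detection.

```python
# Illustrative only: size variant + task suffix selects the checkpoint.
from ultralytics import YOLO

det_model = YOLO('yolov8n.pt')        # nano, object detection
seg_model = YOLO('yolov8n-seg.pt')    # nano, instance segmentation
cls_model = YOLO('yolov8n-cls.pt')    # nano, image classification
pose_model = YOLO('yolov8n-pose.pt')  # nano, pose estimation
```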
### Usage
YOLO(model_type)
: Load pre-trained model

model.train(data, epochs, img_size)
: Train a YOLOv8 model

model.predict(img)
: Perform object detection on an image

model.export(format)
: Export model weights
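The sketch below exercises these calls with placeholder arguments. Note that the current Ultralytics API names the image-size argument imgsz; the dataset YAML, epoch count, and export format here are arbitrary choices for illustration.

```python
# Hedged sketch of the API calls listed above (placeholder arguments).
from ultralytics import YOLO

model = YOLO('yolov8n.pt')                                          # load a pre-trained model
results = model.predict('https://ultralytics.com/images/bus.jpg')   # detect objects in an image
print(results[0].boxes.xyxy)                                        # bounding boxes of the detections

# Training and export are commented out because they take time / produce files:
# model.train(data='coco8.yaml', epochs=3, imgsz=640)               # fine-tune on a small sample dataset
# model.export(format='onnx')                                       # export the weights, e.g. to ONNX
```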
First, let’s load a pretrained YOLOv8 and use it to detect objects in one frame of the video.
from ultralytics import YOLOimport cv2import matplotlib.pyplot as plt
# Load a pre-trained YOLO model, choose onemodel = YOLO('yolov8m.pt')
# Use YOLOv8 to make predictions on the same frame as beforeresults = model.predict(frame)
# YOLOv8 returns a list of results, take the first result (since we are working with one frame)boxes = results[0].boxes
formatted_boxes = []
# Iterate over the detected boxesfor box in boxes: x1, y1, x2, y2 = box.xyxy[0].int().tolist() cls = box.cls.item() # Get class ID w = x2 - x1 h = y2 - y1 # Append the formatted box to the list if cls == 0: # Only track people formatted_boxes.append([cls, x1, y1, w, h])
# Draw the YOLOv8 bounding boxes on the frameprediction_frame = draw_detection_on_frame(frame, formatted_boxes)
# Display the ground truth and the predicted frameplt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)plt.imshow(gt_frame)plt.title('Ground Truth')plt.axis('off') # Hide axes
plt.subplot(1, 2, 2)plt.imshow(prediction_frame)plt.title('YOLOv8 Prediction')plt.axis('off')
plt.tight_layout()plt.show()
2. Object Tracking with DeepSORT
DeepSORT (Deep Simple Online and Realtime Tracking) is an object tracking algorithm that extends the original SORT (Simple Online and Realtime Tracking) algorithm with a deep association metric.
For each detected object, DeepSORT uses a neural network to extract appearance features. A Kalman filter predicts the new locations of existing tracks in the current frame. New detections are then matched to the existing tracks using both motion information (from the Kalman filter) and appearance information (from the feature extractor).
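To make the appearance side of this matching concrete, the sketch below computes the cosine distance between a stored track feature and a new detection feature, using random vectors as stand-ins for the CNN embeddings; the real tracker keeps a gallery of features per track, but the thresholding idea is the same.

```python
# Minimal sketch of the appearance part of the association metric:
# cosine distance between a track's stored embedding and a detection's embedding.
# The vectors here are random stand-ins for the CNN features.
import numpy as np

rng = np.random.default_rng(0)
track_feature = rng.normal(size=128)      # embedding stored for an existing track
detection_feature = rng.normal(size=128)  # embedding extracted from a new detection

def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

d = cosine_distance(track_feature, detection_feature)
# With max_cosine_distance=0.3, this pair would only be associated if d <= 0.3
print(f"cosine distance: {d:.3f}")
```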
Refer to https://pypi.org/project/deep-sort-realtime/ for complete documentation.
### Usage
DeepSort(max_age, n_init, max_cosine_distance)
: Initialise the DeepSORT tracker

max_age
: Maximum number of frames a track can be inactive before being deleted. Higher values allow for longer occlusions but may lead to ID switches

n_init
: Number of consecutive detections before the track is confirmed. Higher values increase robustness but may delay track creation

max_cosine_distance
: Threshold for feature similarity. Lower values are more strict in associating detections with existing tracks

tracker.update_tracks(detections, frame)
: Update tracks using detections for the new frame. Returns the current tracks

detections
: A list of new object detections in the current frame. Each detection includes:
  - Bounding box coordinates [x1, y1, w, h]
  - Detection confidence
  - Class ID

frame
: The current video frame. This is used for feature extraction
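A minimal, self-contained sketch of this API is shown below, with one dummy frame and one dummy detection; with n_init=3 the single track is not yet confirmed after one update, so the loop prints nothing. The full pipeline that follows does the same thing for every video frame.

```python
# Minimal sketch of the DeepSort API with a single dummy detection.
import numpy as np
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.3)

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder RGB frame
detections = [([100, 150, 50, 120], 0.9, 0)]      # ([x, y, w, h], confidence, class_id)

tracks = tracker.update_tracks(detections, frame=frame)
for track in tracks:
    if track.is_confirmed():                       # unconfirmed until seen for n_init frames
        print(track.track_id, track.to_ltrb())
```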
Next, let’s use DeepSORT to track the objects detected by YOLOv8 across multiple frames.
```python
# Use DeepSORT and the results of YOLOv8 to track objects
from deep_sort_realtime.deepsort_tracker import DeepSort
from tqdm import tqdm

# Constants
CONFIDENCE_THRESHOLD = 0.5

# Initialise DeepSORT tracker
tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.3)

def generate_predictions(video_path, tracker, model):
    """
    Run object detection and tracking, and return the predictions as a list of
    per-frame bounding boxes. Only tracks the people class.
    """
    # Load the video with cv2.VideoCapture
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    predictions = []

    # Make predictions per frame
    with tqdm(total=total_frames, desc="Processing frames for prediction") as pbar:
        frame_id = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:  # Stop when there are no more frames
                break

            frame_id += 1

            # Run YOLOv8 detection
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = model.predict(frame_rgb, verbose=False)[0]

            # Process detections for DeepSORT
            detections = []
            for data in results.boxes.data.tolist():
                confidence = data[4]

                # Pass over low confidence predictions
                if confidence < CONFIDENCE_THRESHOLD:
                    continue

                # Get bounding box coordinates and class ID
                x1, y1, x2, y2 = data[0], data[1], data[2], data[3]
                class_id = data[5]

                if class_id == 0:  # Only track people
                    bbox = [x1, y1, x2 - x1, y2 - y1]  # Convert to [x, y, w, h]
                    detections.append((bbox, confidence, class_id))

            # Update DeepSORT tracker with the new detections
            tracks = tracker.update_tracks(detections, frame=frame_rgb)

            # Save the confirmed tracks for this frame
            boxes = []
            for track in tracks:
                if not track.is_confirmed():
                    continue
                track_id = track.track_id
                ltrb = track.to_ltrb()
                x1, y1, x2, y2 = ltrb  # Left, top, right, bottom
                w, h = x2 - x1, y2 - y1
                box = [int(track_id), int(x1), int(y1), int(w), int(h)]
                boxes.append(box)

            # Append predictions for the current frame
            predictions.append(boxes)

            # Update progress bar
            pbar.update(1)

    cap.release()
    return predictions

def generate_video_with_boxes(video_path, predictions, output_video_path):
    """
    Generate a video with bounding boxes drawn on each frame based on predictions.
    """
    # Read video
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Initialise video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, 30, (width, height))

    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    current_frame = 0

    with tqdm(total=total_frames, desc="Generating video") as pbar:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Get predictions for the current frame
            boxes = predictions[current_frame]

            # Draw bounding boxes
            frame_with_boxes = draw_detection_on_frame(frame, boxes)

            # Write the frame with boxes
            out.write(frame_with_boxes)

            current_frame += 1
            pbar.update(1)

    cap.release()
    out.release()
    print(f"Video saved to {output_video_path}")

video_path = '/content/ADL-Rundle-6/ADL-Rundle-6.mp4'
output_video_path = '/content/ADL-Rundle-6/ADL-Rundle-6-tracked.mp4'

# Run predictions
predictions = generate_predictions(video_path, tracker, model)

# Generate a video with bounding boxes using the saved predictions
generate_video_with_boxes(video_path, predictions, output_video_path)
```
3. Performance Evaluation
In this exercise, we will:
- Match ground truth detections with predictions using the Hungarian algorithm.
- Compute various performance metrics including MOTA, MOTP, Precision, Recall, and F1-Score.
The evaluation process assesses the performance of a people tracking system by comparing predicted bounding boxes against ground truth data. The bounding boxes in the predictions will need to be matched to the corresponding boxes in the ground truth.
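Before implementing the full matching, here is a tiny illustration of the Hungarian matching on a made-up IoU matrix (rows are ground truths, columns are predictions); running linear_sum_assignment on the negated matrix selects the pairing with the largest total IoU.

```python
# Tiny illustration of Hungarian matching on a 3x2 IoU matrix (made-up values).
import numpy as np
from scipy.optimize import linear_sum_assignment

iou = np.array([[0.8, 0.1],
                [0.2, 0.6],
                [0.0, 0.3]])
rows, cols = linear_sum_assignment(-iou)   # maximise total IoU
print(list(zip(rows, cols)))               # [(0, 0), (1, 1)]: gt 2 is left unmatched
```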
Steps
- Match Detections:
  - Implement the match_detections(ground_truths, predictions, iou_threshold) function.
  - Use the Hungarian algorithm (linear_sum_assignment) to find the optimal matching between ground truths and predictions based on IoU.
- Compute Performance Metrics:
  - Implement the compute_metrics(...) function that calculates (a small worked example with made-up counts follows these steps):
    - MOTA (Multiple Object Tracking Accuracy): MOTA = 1 - (misses + false_positives + id_switches) / total_gt
    - MOTP (Multiple Object Tracking Precision): MOTP = total_iou / number of matches
    - Precision: Precision = number of matches / (number of matches + false positives)
    - Recall: Recall = number of matches / total_gt
    - F1 Score: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
- Evaluate Tracking Performance:
  - Implement the evaluate_people_tracking(...) function.
  - Iterate through frames to match detections and compute metrics.

Finally, adjust hyperparameters to improve your evaluation results. Download ADL-Rundle-6-tracked.mp4 and observe the results of your object tracking algorithm.
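As a quick sanity check on the metric formulas, here is a worked example with small made-up counts (these are not results from this video):

```python
# Worked example of the metric formulas with made-up counts.
matches, misses, false_positives, id_switches = 80, 15, 10, 5
total_gt, total_iou = 100, 60.0

mota = 1 - (misses + false_positives + id_switches) / total_gt   # 1 - 30/100 = 0.70
motp = total_iou / matches                                       # 60.0 / 80 = 0.75
precision = matches / (matches + false_positives)                # 80 / 90  ≈ 0.889
recall = matches / total_gt                                      # 80 / 100 = 0.80
f1 = 2 * precision * recall / (precision + recall)               # ≈ 0.842
print(mota, motp, precision, recall, f1)
```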
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def calculate_iou(box1, box2):
    _, x1, y1, w1, h1 = box1
    _, x2, y2, w2, h2 = box2
    xi1, yi1 = max(x1, x2), max(y1, y2)
    xi2, yi2 = min(x1 + w1, x2 + w2), min(y1 + h1, y2 + h2)
    intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union if union > 0 else 0

def match_detections(ground_truths, predictions, iou_threshold):
    iou_matrix = np.zeros((len(ground_truths), len(predictions)))
    for i, gt in enumerate(ground_truths):
        for j, pred in enumerate(predictions):
            iou_matrix[i, j] = calculate_iou(gt, pred)

    # Find the optimal matching between ground truths and predictions such that IoU is maximised
    matched_gt, matched_pred = linear_sum_assignment(-iou_matrix)

    matches = []
    unmatched_gt = []
    unmatched_pred = []

    for gt, pred in zip(matched_gt, matched_pred):
        if iou_matrix[gt, pred] >= iou_threshold:
            matches.append((gt, pred))
        else:
            unmatched_gt.append(gt)
            unmatched_pred.append(pred)

    return matches, unmatched_gt, unmatched_pred, iou_matrix

def compute_metrics(matches, misses, false_positives, id_switches, total_iou, total_gt):
    mota = 1 - (misses + false_positives + id_switches) / total_gt if total_gt > 0 else 0
    motp = total_iou / len(matches) if len(matches) > 0 else 0

    precision = len(matches) / (len(matches) + false_positives) if (len(matches) + false_positives) > 0 else 0
    recall = len(matches) / total_gt if total_gt > 0 else 0

    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "MOTA": mota,
        "MOTP": motp,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1_score
    }

def evaluate_people_tracking(predictions, ground_truth, iou_threshold=0.5):
    matches = []
    misses = 0
    false_positives = 0
    id_switches = 0
    total_iou = 0

    prev_matches = {}

    # Assuming both `predictions` and `ground_truth` are lists of lists, we iterate through frames
    total_frames = min(len(ground_truth), len(predictions))  # Handle the case where one is shorter

    for frame_idx in range(total_frames):
        gt_boxes = ground_truth[frame_idx]
        frame_predictions = predictions[frame_idx]

        frame_matches, unmatched_gt, unmatched_pred, iou_matrix = match_detections(
            gt_boxes, frame_predictions, iou_threshold)

        for gt, pred in frame_matches:
            matches.append((frame_idx, gt, pred))
            total_iou += iou_matrix[gt, pred]

            gt_id = gt_boxes[gt][0]
            pred_id = frame_predictions[pred][0]  # Track ID is the first element of a prediction box
            if gt_id in prev_matches and prev_matches[gt_id] != pred_id:
                id_switches += 1
            prev_matches[gt_id] = pred_id

        misses += len(unmatched_gt)
        false_positives += len(unmatched_pred)

    total_gt = sum(len(boxes) for boxes in ground_truth)

    return compute_metrics(matches, misses, false_positives, id_switches, total_iou, total_gt)

# Assuming `predictions` and `ground_truths` are both lists of lists
metrics = evaluate_people_tracking(predictions, ground_truths)

# Display evaluation results
print("\nMetrics:")
for key, value in metrics.items():
    print(f"{key}: {value}")
```
Metrics:
MOTA: -0.1599121581153924
MOTP: 0.7754386561787157
Precision: 0.7690596562184024
Recall: 0.7592333799161509
F1-Score: 0.7641149286718908