
5.2 Techniques

1. Object Tracking with YOLOv8

Object tracking is the task of identifying and monitoring objects as they move across frames in a video sequence. This task is essential to applications such as traffic monitoring, security surveillance, autonomous vehicles, and sports analytics. The typical object tracking pipeline consists of the following steps:

  1. Feature Extraction & Object Detection: Identifying and locating objects of interest in each frame of a video; popular detection algorithms include YOLO and Faster R-CNN. Extracting distinctive features from the detected objects aids tracking: these features differentiate objects from one another and help keep each object's identity consistent across frames.
  2. State Estimation: Predicting the positions of the objects in subsequent frames. Models like the Kalman filter, particle filter, or deep learning-based methods are often employed. For example, in autonomous vehicles, these methods can predict where a pedestrian will be in the next few seconds (a minimal sketch follows this list).
  3. Data Association: Matching detected objects in the current frame with tracked objects from previous frames. This can be based on image features, state estimation, or both.
  4. Track Management: Creating new tracks for newly detected objects and terminating tracks for objects that are no longer visible.
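To make the state-estimation step (step 2 above) concrete, here is a minimal sketch of a constant-velocity Kalman filter prediction, independent of the exercise code; the transition matrix, noise covariance, and example state are illustrative assumptions only.

import numpy as np

# Minimal constant-velocity Kalman prediction sketch (illustrative values only)
# State vector: [x, y, vx, vy]; dt is an assumed frame interval
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)  # State transition: position += velocity * dt
Q = np.eye(4) * 0.01                        # Assumed process noise covariance

state = np.array([100.0, 50.0, 2.0, -1.0])  # Current position (100, 50) and velocity (2, -1)
P = np.eye(4)                               # Current state covariance

# Prediction step: where do we expect the object to be in the next frame?
state_pred = F @ state
P_pred = F @ P @ F.T + Q
print(state_pred[:2])  # Predicted position: [102. 49.]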

In this exercise, we’ll use YOLOv8 for object detection and DeepSORT for object tracking; together they combine these steps into an efficient real-time tracking system.

Steps:

  1. Read video and ground truth data:

    • Download and read the input video file.
    • Parse the ground truth object tracks.
    • Visualise the ground truth.
  2. Object detection with YOLOv8:

    • Load a pre-trained YOLOv8 model.
    • Run object detection on the selected frame and observe the result.
    • Count the number of different objects.
  3. Object tracking with DeepSORT:

    • Initialise the DeepSORT tracker with appropriate parameters.
    • Iterate through the video frames; for each frame:
      • Perform object detection using YOLOv8.
      • Update the DeepSORT tracker with the new detections.
      • Visualise the tracked objects.
    • Save the processed video with visualisations.
# Write a function to parse the ground truth labels
# Write a function to draw bounding boxes on the tracked targets
# Visualise the ground truth labels.
import cv2
import matplotlib.pyplot as plt

def parse_ground_truth_file(file):
    """
    Parse the ground truth bounding box file and organise it as a list of lists.
    Each index of the list corresponds to a frame and contains a list of bounding boxes.
    """
    bounding_boxes = []
    with open(file, 'r') as f:
        for line in f.readlines():
            line_data = line.strip().split(',')
            # Convert the values
            frame_id = int(line_data[0])  # Frame index (1-based)
            obj_id = int(line_data[1])    # Object ID
            x = int(line_data[2])         # X coordinate
            y = int(line_data[3])         # Y coordinate
            w = int(line_data[4])         # Width
            h = int(line_data[5])         # Height
            # Ensure the list has enough frames
            while len(bounding_boxes) < frame_id:
                bounding_boxes.append([])  # Add empty lists for missing frames
            # Append the bounding box to the corresponding frame's list
            bounding_boxes[frame_id - 1].append([obj_id, x, y, w, h])  # frame_id - 1 for zero-indexing
    return bounding_boxes

def draw_detection_on_frame(frame, boxes):
    """
    Draw bounding boxes on a single frame.
    """
    # Create a copy of the original frame to avoid modifying it
    frame = frame.copy()
    for box in boxes:
        obj_id, x, y, w, h = box  # Unpack the values
        label = f'ID: {obj_id}'
        # Draw a rectangle around the object
        color = (0, 255, 0)  # Pick a colour
        thickness = 3
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, thickness)
        # Label the object
        cv2.putText(frame, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
    return frame

video_path = '/content/ADL-Rundle-6/ADL-Rundle-6.mp4'

# Pick a frame
frame_number = 100
cap = cv2.VideoCapture(video_path)
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)  # Set frame number
_, frame = cap.read()
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
cap.release()

# Draw the frame with ground truth detection boxes
ground_truths = parse_ground_truth_file('/content/ADL-Rundle-6/gt.txt')
gt_frame = draw_detection_on_frame(frame, ground_truths[frame_number])

# Display the frames side by side
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(frame)
plt.title('Original')
plt.axis('off')
plt.subplot(1, 2, 2)
plt.imshow(gt_frame)
plt.title('Ground Truth')
plt.axis('off')
plt.tight_layout()
plt.show()

YOLOv8

YOLOv8, developed by Ultralytics (the creators of YOLOv5), is one of the computer vision models in the YOLO (You Only Look Once) family. It offers significant improvements in speed, accuracy, and versatility over its predecessors. The YOLOv8 models are capable of several computer vision tasks: object detection, image segmentation, image classification, and pose estimation. They can be used via both the command-line interface and the Python API.

Refer to https://docs.ultralytics.com/models/yolov8/ for complete documentation.

Model Types

  • YOLOv8n (Nano): Smallest and fastest, suitable for edge devices
  • YOLOv8s (Small): Balances speed and accuracy for general use
  • YOLOv8m (Medium): Higher accuracy, moderate speed
  • YOLOv8l (Large): High accuracy, slower speed
  • YOLOv8x (Extra Large): Highest accuracy, slowest speed
  • Add -seg, -cls, or -pose to the model name for image segmentation, image classification, and pose estimation tasks

Usage

  • YOLO(model_type): Load pre-trained model
  • model.train(data, epochs, imgsz): Train a YOLOv8 model
  • model.predict(img): Perform object detection on an image
  • model.export(format): Export model weights
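These calls map onto the Ultralytics Python API roughly as in the sketch below; the dataset YAML, epoch count, image size, and image path are placeholder values, and the train/export lines are commented out because they are not needed for this exercise.

from ultralytics import YOLO

# Load a pre-trained model (the nano variant here, purely as an example)
model = YOLO('yolov8n.pt')

# Optional fine-tuning; 'coco8.yaml', epochs and imgsz are placeholder values
# model.train(data='coco8.yaml', epochs=3, imgsz=640)

# Detect objects in an image (placeholder path)
results = model.predict('example.jpg')
print(results[0].boxes.xyxy)  # Bounding boxes in (x1, y1, x2, y2) format

# Export the model weights, e.g. to ONNX
# model.export(format='onnx')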

First, let’s load a pretrained YOLOv8 and use it to detect objects in one frame of the video.

from ultralytics import YOLO
import cv2
import matplotlib.pyplot as plt

# Load a pre-trained YOLO model, choose one
model = YOLO('yolov8m.pt')

# Use YOLOv8 to make predictions on the same frame as before
results = model.predict(frame)

# YOLOv8 returns a list of results; take the first result (since we are working with one frame)
boxes = results[0].boxes
formatted_boxes = []

# Iterate over the detected boxes
for box in boxes:
    x1, y1, x2, y2 = box.xyxy[0].int().tolist()
    cls = int(box.cls.item())  # Get class ID as an integer
    w = x2 - x1
    h = y2 - y1
    # Append the formatted box to the list
    if cls == 0:  # Only track people
        formatted_boxes.append([cls, x1, y1, w, h])

# Draw the YOLOv8 bounding boxes on the frame
prediction_frame = draw_detection_on_frame(frame, formatted_boxes)

# Display the ground truth and the predicted frame
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(gt_frame)
plt.title('Ground Truth')
plt.axis('off')  # Hide axes
plt.subplot(1, 2, 2)
plt.imshow(prediction_frame)
plt.title('YOLOv8 Prediction')
plt.axis('off')
plt.tight_layout()
plt.show()

2. Object Tracking with DeepSORT

DeepSORT (Deep Simple Online and Realtime Tracking) is an object tracking algorithm that extends the original SORT (Simple Online and Realtime Tracking) algorithm with a deep association metric.

For each detected object, DeepSORT uses a neural network to extract appearance features. Using a Kalman filter, the algorithm also predicts the new locations of existing tracks in the current frame. New detections are then matched to the existing tracks using both motion information (from the Kalman filter) and appearance information (from the feature extractor).

Refer to https://pypi.org/project/deep-sort-realtime/ for complete documentation.
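The appearance side of this matching is based on the distance between feature embeddings; the snippet below is a small illustration (not DeepSORT's internal code) of a cosine distance check against a max_cosine_distance-style threshold, with made-up feature vectors.

import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two appearance feature vectors (0 means identical direction)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors standing in for appearance embeddings (made up for illustration)
track_feature = np.array([0.1, 0.8, 0.3, 0.5])
detection_feature = np.array([0.12, 0.75, 0.33, 0.48])

dist = cosine_distance(track_feature, detection_feature)
MAX_COSINE_DISTANCE = 0.3  # Plays the role of the max_cosine_distance parameter below
print(dist, dist <= MAX_COSINE_DISTANCE)  # Small distance, so the detection could join the track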

Usage

  • DeepSort(max_age, n_init, max_cosine_distance): Initialise the DeepSORT tracker
    • max_age: Maximum number of frames a track can be inactive before being deleted. Higher values allow for longer occlusions but may lead to ID switches
    • n_init: Number of consecutive detections before the track is confirmed. Higher values increase robustness but may delay track creation
    • max_cosine_distance: Threshold for feature similarity. Lower values are more strict in associating detections with existing tracks
  • tracker.update_tracks(detections, frame): Update tracks using detections for the new frame
    • detections: A list of new object detections in the current frame. Each detection includes:
      • Bounding box coordinates [x1, y1, w, h]
      • Detection confidence
      • Class ID
    • frame: The current video frame. This is used for feature extraction
    • Returns the current tracks
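As a minimal, self-contained illustration of this interface (using a blank placeholder frame and a single made-up detection, so the result is not meaningful), the tracker could be driven like this:

import numpy as np
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.3)

# One fabricated person detection on a blank frame: ([x, y, w, h], confidence, class_id)
frame = np.zeros((480, 640, 3), dtype=np.uint8)
detections = [([100, 150, 50, 120], 0.9, 0)]

tracks = tracker.update_tracks(detections, frame=frame)
for track in tracks:
    if not track.is_confirmed():
        continue  # A track needs n_init consecutive detections before it is confirmed
    print(track.track_id, track.to_ltrb())  # Track ID and (left, top, right, bottom) box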

Next, let’s use DeepSORT to track the objects detected by YOLOv8 across multiple frames.

# Use DeepSORT and results of YOLOv8 to track objects
from deep_sort_realtime.deepsort_tracker import DeepSort
from tqdm import tqdm

# Constants
CONFIDENCE_THRESHOLD = 0.5

# Initialise DeepSORT tracker
tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.3)

def generate_predictions(video_path, tracker, model):
    """
    Run object detection and tracking, and return per-frame predictions.
    Only tracks the people class.
    """
    # Load the video with cv2.VideoCapture
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    predictions = []
    # Make predictions per frame
    with tqdm(total=total_frames, desc="Processing frames for prediction") as pbar:
        frame_id = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:  # Stop when there are no more frames
                break
            frame_id += 1
            # Run YOLOv8 detection
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = model.predict(frame_rgb, verbose=False)[0]
            # Process detections for DeepSORT
            detections = []
            for data in results.boxes.data.tolist():
                confidence = data[4]
                # Pass over low-confidence predictions
                if confidence < CONFIDENCE_THRESHOLD:
                    continue
                # Get bounding box coordinates and class ID
                x1, y1, x2, y2 = data[0], data[1], data[2], data[3]
                class_id = data[5]
                if class_id == 0:  # Only track people
                    bbox = [x1, y1, x2 - x1, y2 - y1]  # Convert to [x, y, w, h]
                    detections.append((bbox, confidence, class_id))
            # Update DeepSORT tracker with the new detections
            tracks = tracker.update_tracks(detections, frame=frame_rgb)
            # Save the detection results for this frame
            boxes = []
            # Save confirmed tracks
            for track in tracks:
                if not track.is_confirmed():
                    continue
                track_id = track.track_id
                ltrb = track.to_ltrb()
                x1, y1, x2, y2 = ltrb  # Left, top, right, bottom
                w, h = x2 - x1, y2 - y1
                box = [int(track_id), int(x1), int(y1), int(w), int(h)]
                boxes.append(box)
            # Append predictions for the current frame
            predictions.append(boxes)
            # Update progress bar
            pbar.update(1)
    cap.release()
    return predictions

def generate_video_with_boxes(video_path, predictions, output_video_path):
    """
    Generate a video with bounding boxes drawn on each frame based on predictions.
    """
    # Read video
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Initialise video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, 30, (width, height))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    current_frame = 0
    with tqdm(total=total_frames, desc="Generating video") as pbar:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            # Get predictions for the current frame
            boxes = predictions[current_frame]
            # Draw bounding boxes
            frame_with_boxes = draw_detection_on_frame(frame, boxes)
            # Write the frame with boxes
            out.write(frame_with_boxes)
            current_frame += 1
            pbar.update(1)
    cap.release()
    out.release()
    print(f"Video saved to {output_video_path}")

video_path = '/content/ADL-Rundle-6/ADL-Rundle-6.mp4'
output_video_path = '/content/ADL-Rundle-6/ADL-Rundle-6-tracked.mp4'

# Run predictions
predictions = generate_predictions(video_path, tracker, model)

# Generate a video with bounding boxes using saved predictions
generate_video_with_boxes(video_path, predictions, output_video_path)

3. Performance Evaluation

In this exercise, we will:

  • Match ground truth detections with predictions using the Hungarian algorithm.
  • Compute various performance metrics including MOTA, MOTP, Precision, Recall, and F1-Score.

The evaluation process assesses the performance of a people tracking system by comparing predicted bounding boxes against ground truth data. The bounding boxes in the predictions will need to be matched to the corresponding boxes in the ground truth.
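To see what the Hungarian matching does before implementing match_detections, the toy example below (with made-up IoU values) shows linear_sum_assignment choosing the pairing that maximises total IoU once the matrix is negated.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up IoU matrix: rows are ground-truth boxes, columns are predicted boxes
iou_matrix = np.array([[0.80, 0.10, 0.05],
                       [0.15, 0.70, 0.20]])

# linear_sum_assignment minimises total cost, so negate the IoU matrix to maximise overlap
gt_idx, pred_idx = linear_sum_assignment(-iou_matrix)
print(list(zip(gt_idx, pred_idx)))  # Pairs GT 0 with prediction 0 and GT 1 with prediction 1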

Steps

  1. Match Detections:

    • Implement the match_detections(ground_truths, predictions, iou_threshold) function.
    • Use the Hungarian algorithm (linear_sum_assignment) to find the optimal matching between ground truths and predictions based on IoU.
  2. Compute Performance Metrics:

    • Implement the compute_metrics(...) function that calculates:
      • MOTA (Multiple Object Tracking Accuracy): MOTA = 1 - (misses + false_positives + id_switches) / total_gt
      • MOTP (Multiple Object Tracking Precision): MOTP = total_iou / number of matches
      • Precision: Precision = number of matches / (number of matches + false positives)
      • Recall: Recall = number of matches / total_gt
      • F1 Score: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
  3. Evaluate Tracking Performance:

    • Implement the evaluate_people_tracking(...) function.
    • Iterate through frames to match detections and compute metrics.

Finally, adjust hyperparameters to improve your evaluation results (a small tuning sketch follows the evaluation output at the end of this section). Download ADL-Rundle-6-tracked.mp4 and observe the results of your object tracking algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def calculate_iou(box1, box2):
    # Boxes are [id, x, y, w, h]; the ID is ignored for the overlap computation
    _, x1, y1, w1, h1 = box1
    _, x2, y2, w2, h2 = box2
    xi1, yi1 = max(x1, x2), max(y1, y2)
    xi2, yi2 = min(x1 + w1, x2 + w2), min(y1 + h1, y2 + h2)
    intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union if union > 0 else 0

def match_detections(ground_truths, predictions, iou_threshold):
    iou_matrix = np.zeros((len(ground_truths), len(predictions)))
    for i, gt in enumerate(ground_truths):
        for j, pred in enumerate(predictions):
            iou_matrix[i, j] = calculate_iou(gt, pred)
    # Find the optimal matching between ground truths and predictions such that IoU is maximised
    matched_gt, matched_pred = linear_sum_assignment(-iou_matrix)
    matches = []
    unmatched_gt = []
    unmatched_pred = []
    for gt, pred in zip(matched_gt, matched_pred):
        if iou_matrix[gt, pred] >= iou_threshold:
            matches.append((gt, pred))
        else:
            unmatched_gt.append(gt)
            unmatched_pred.append(pred)
    # Ground truths and predictions left out of the assignment are also unmatched
    unmatched_gt += [i for i in range(len(ground_truths)) if i not in matched_gt]
    unmatched_pred += [j for j in range(len(predictions)) if j not in matched_pred]
    return matches, unmatched_gt, unmatched_pred, iou_matrix

def compute_metrics(matches, misses, false_positives, id_switches, total_iou, total_gt):
    mota = 1 - (misses + false_positives + id_switches) / total_gt if total_gt > 0 else 0
    motp = total_iou / len(matches) if len(matches) > 0 else 0
    precision = len(matches) / (len(matches) + false_positives) if (len(matches) + false_positives) > 0 else 0
    recall = len(matches) / total_gt if total_gt > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return {
        "MOTA": mota,
        "MOTP": motp,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1_score
    }

def evaluate_people_tracking(predictions, ground_truth, iou_threshold=0.5):
    matches = []
    misses = 0
    false_positives = 0
    id_switches = 0
    total_iou = 0
    prev_matches = {}
    # Assuming both `predictions` and `ground_truth` are lists of lists, we iterate through frames
    total_frames = min(len(ground_truth), len(predictions))  # Handle the case where one is shorter
    for frame_idx in range(total_frames):
        gt_boxes = ground_truth[frame_idx]
        frame_predictions = predictions[frame_idx]
        frame_matches, unmatched_gt, unmatched_pred, iou_matrix = match_detections(gt_boxes, frame_predictions, iou_threshold)
        for gt, pred in frame_matches:
            matches.append((frame_idx, gt, pred))
            total_iou += iou_matrix[gt, pred]
            gt_id = gt_boxes[gt][0]
            pred_id = frame_predictions[pred][0]  # Track ID is the first element of each prediction box
            if gt_id in prev_matches and prev_matches[gt_id] != pred_id:
                id_switches += 1
            prev_matches[gt_id] = pred_id
        misses += len(unmatched_gt)
        false_positives += len(unmatched_pred)
    total_gt = sum(len(boxes) for boxes in ground_truth)
    return compute_metrics(matches, misses, false_positives, id_switches, total_iou, total_gt)

# Assuming `predictions` and `ground_truths` are both lists of lists
metrics = evaluate_people_tracking(predictions, ground_truths)

# Display evaluation results
print("\nMetrics:")
for key, value in metrics.items():
    print(f"{key}: {value}")
Output:
Metrics:
MOTA: -0.1599121581153924
MOTP: 0.7754386561787157
Precision: 0.7690596562184024
Recall: 0.7592333799161509
F1-Score: 0.7641149286718908
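As a starting point for the tuning suggested above, the sketch below sweeps a few DeepSORT settings and compares MOTA by reusing the functions defined earlier; the grid values are arbitrary and each run re-detects the whole video, so this is slow.

# Hedged sketch: sweep a few DeepSORT settings and compare MOTA (grid values are arbitrary)
best_mota, best_params = float('-inf'), None
for max_age in (15, 30, 60):
    for max_cos in (0.2, 0.3, 0.4):
        tracker = DeepSort(max_age=max_age, n_init=3, max_cosine_distance=max_cos)
        preds = generate_predictions(video_path, tracker, model)
        run_metrics = evaluate_people_tracking(preds, ground_truths)
        print(f"max_age={max_age}, max_cosine_distance={max_cos}: MOTA={run_metrics['MOTA']:.3f}")
        if run_metrics['MOTA'] > best_mota:
            best_mota, best_params = run_metrics['MOTA'], (max_age, max_cos)
print("Best parameters (max_age, max_cosine_distance):", best_params, "MOTA:", best_mota)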