5.2 Techniques
1. Object Tracking with YOLOv8
Object tracking is the task of identifying and monitoring objects as they move across frames in a video sequence. It is essential to applications such as traffic monitoring, security surveillance, autonomous vehicles, and sports analytics. The typical object tracking pipeline consists of the following steps:
- Feature Extraction & Object Detection: Identifying and locating objects of interest in each frame of a video. Popular detection algorithms include YOLO and Faster R-CNN. Distinctive features extracted from the detected objects aid in tracking: they help differentiate between objects and ensure the same objects are tracked consistently across frames.
- State Estimation: Predicting the position of the objects in subsequent frames. Models such as the Kalman filter, particle filter, or deep learning-based methods are often employed. For example, in autonomous vehicles, these methods can predict where a pedestrian will be in the next few seconds (a minimal prediction sketch follows this list).
- Data Association: Matching detected objects in the current frame with tracked objects from previous frames. This can be based on image features, state estimation, or both.
- Track Management: Creating new tracks for newly detected objects and terminating tracks for objects that are no longer visible.
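To make the state-estimation step concrete, here is a minimal constant-velocity Kalman filter predict/correct cycle for a single track. The state layout [cx, cy, vx, vy], the noise matrices, and the numbers are illustrative assumptions for this sketch only; DeepSORT's internal filter tracks a richer bounding-box state, but the predict/correct idea is the same.

```python
# Minimal sketch: constant-velocity Kalman filter for one track's box centre.
# State x = [cx, cy, vx, vy]; all values below are made-up for illustration.
import numpy as np

dt = 1.0  # one frame between updates
F = np.array([[1, 0, dt, 0],   # state transition: position += velocity * dt
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # we only observe the box centre, not the velocity
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise (assumed)
R = np.eye(2) * 1.0            # measurement noise (assumed)

x = np.array([100.0, 200.0, 2.0, -1.0])  # current state estimate
P = np.eye(4)                            # current state covariance

# Predict where the object will be in the next frame
x_pred = F @ x
P_pred = F @ P @ F.T + Q

# Correct the prediction with the detector's measurement of the box centre
z = np.array([103.0, 198.5])
S = H @ P_pred @ H.T + R                 # innovation covariance
K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
x_new = x_pred + K @ (z - H @ x_pred)
P_new = (np.eye(4) - K @ H) @ P_pred

print("Predicted centre:", x_pred[:2], "Corrected centre:", x_new[:2])
```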
In this exercise, we’ll use YOLOv8 for object detection and DeepSORT for object tracking; together they combine these steps into an efficient real-time tracking system.
Steps:
- Read video and ground truth data:
  - Download and read the input video file.
  - Parse the ground truth object tracks.
  - Visualise the ground truth.
- Object detection with YOLOv8:
  - Load a pre-trained YOLOv8 model.
  - Run object detection on the selected frame and observe the result.
  - Count the number of different objects.
- Object tracking with DeepSORT:
  - Initialise the DeepSORT tracker with appropriate parameters.
  - Iterate through the video frames; for each frame:
    - Perform object detection using YOLOv8.
    - Update the DeepSORT tracker with the new detections.
    - Visualise the tracked objects.
  - Save the processed video with visualisations.
```python
# Write a function to parse the ground truth labels
# Write a function to draw bounding boxes on the tracked targets
# Visualise the ground truth labels
import cv2
import matplotlib.pyplot as plt

def parse_ground_truth_file(file):
    """
    Parse the ground truth bounding box file and organize as a list of lists.
    Each index of the list corresponds to a frame, and contains a list of bounding boxes.
    """
    bounding_boxes = []

    with open(file, 'r') as f:
        for line in f.readlines():
            line_data = line.strip().split(',')

            # Convert the values
            frame_id = int(line_data[0])  # Frame index (1-based)
            obj_id = int(line_data[1])    # Object ID
            x = int(line_data[2])         # X coordinate
            y = int(line_data[3])         # Y coordinate
            w = int(line_data[4])         # Width
            h = int(line_data[5])         # Height

            # Ensure the list has enough frames
            while len(bounding_boxes) < frame_id:
                bounding_boxes.append([])  # Add empty lists for missing frames

            # Append the bounding box to the corresponding frame's list
            bounding_boxes[frame_id - 1].append([obj_id, x, y, w, h])  # frame_id - 1 for zero-indexing

    return bounding_boxes

def draw_detection_on_frame(frame, boxes):
    """
    Draw bounding boxes on a single frame.
    """
    # Create a copy of the original frame to avoid modifying it
    frame = frame.copy()

    for box in boxes:
        obj_id, x, y, w, h = box  # Unpack the values
        label = f'ID: {obj_id}'

        # Draw a rectangle around the object
        color = (0, 255, 0)  # Pick a colour
        thickness = 3
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, thickness)

        # Label the object
        cv2.putText(frame, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)

    return frame

video_path = '/content/ADL-Rundle-6/ADL-Rundle-6.mp4'

# Pick a frame
frame_number = 100
cap = cv2.VideoCapture(video_path)
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)  # Set frame number
_, frame = cap.read()
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
cap.release()

# Draw the frame with ground truth detection boxes
ground_truths = parse_ground_truth_file('/content/ADL-Rundle-6/gt.txt')
gt_frame = draw_detection_on_frame(frame, ground_truths[frame_number])

# Display the frames side by side
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(frame)
plt.title('Original')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(gt_frame)
plt.title('Ground Truth')
plt.axis('off')

plt.tight_layout()
plt.show()
```
YOLOv8
YOLOv8, developed by Ultralytics (the creators of YOLOv5), is one of the computer vision models in the YOLO (You Only Look Once) family. It offers significant improvements in speed, accuracy, and versatility over its predecessors. The YOLOv8 models are capable of several computer vision tasks: object detection, image segmentation, image classification, and pose estimation. They can be used via both the command-line interface and the Python API.
Refer to https://docs.ultralytics.com/models/yolov8/ for complete documentation.
### Model Types
- YOLOv8n (Nano): Smallest and fastest, suitable for edge devices
- YOLOv8s (Small): Balances speed and accuracy for general use
- YOLOv8m (Medium): Higher accuracy, moderate speed
- YOLOv8l (Large): High accuracy, slower speed
- YOLOv8x (Extra Large): Highest accuracy, slowest speed
- Add -seg, -cls, or -pose to the model name for image segmentation, image classification, and pose estimation tasks (see the sketch after this list)
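As a quick illustration of the naming scheme, the snippet below loads one checkpoint per task using the nano variant; these are standard pre-trained Ultralytics checkpoints downloaded on first use. This is a sketch only — the exercise itself uses yolov8m.pt for detection.

```python
# Illustrative only: size variant + task suffix selects the checkpoint.
from ultralytics import YOLO

det_model = YOLO('yolov8n.pt')        # nano, object detection
seg_model = YOLO('yolov8n-seg.pt')    # nano, instance segmentation
cls_model = YOLO('yolov8n-cls.pt')    # nano, image classification
pose_model = YOLO('yolov8n-pose.pt')  # nano, pose estimation
```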
### Usage
YOLO(model_type)
: Load pre-trained model

model.train(data, epochs, img_size)
: Train a YOLOv8 model

model.predict(img)
: Perform object detection on an image

model.export(format)
: Export model weights
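The sketch below exercises these calls with placeholder arguments. Note that the current Ultralytics API names the image-size argument imgsz; the dataset YAML, epoch count, and export format here are arbitrary choices for illustration.

```python
# Hedged sketch of the API calls listed above (placeholder arguments).
from ultralytics import YOLO

model = YOLO('yolov8n.pt')                                          # load a pre-trained model
results = model.predict('https://ultralytics.com/images/bus.jpg')   # detect objects in an image
print(results[0].boxes.xyxy)                                        # bounding boxes of the detections

# Training and export are commented out because they take time / produce files:
# model.train(data='coco8.yaml', epochs=3, imgsz=640)               # fine-tune on a small sample dataset
# model.export(format='onnx')                                       # export the weights, e.g. to ONNX
```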
First, let’s load a pretrained YOLOv8 and use it to detect objects in one frame of the video.
from ultralytics import YOLOimport cv2import matplotlib.pyplot as plt
# Load a pre-trained YOLO model, choose onemodel = YOLO('yolov8m.pt')
# Use YOLOv8 to make predictions on the same frame as beforeresults = model.predict(frame)
# YOLOv8 returns a list of results, take the first result (since we are working with one frame)boxes = results[0].boxes
formatted_boxes = []
# Iterate over the detected boxesfor box in boxes: x1, y1, x2, y2 = box.xyxy[0].int().tolist() cls = box.cls.item() # Get class ID w = x2 - x1 h = y2 - y1 # Append the formatted box to the list if cls == 0: # Only track people formatted_boxes.append([cls, x1, y1, w, h])
# Draw the YOLOv8 bounding boxes on the frameprediction_frame = draw_detection_on_frame(frame, formatted_boxes)
# Display the ground truth and the predicted frameplt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)plt.imshow(gt_frame)plt.title('Ground Truth')plt.axis('off') # Hide axes
plt.subplot(1, 2, 2)plt.imshow(prediction_frame)plt.title('YOLOv8 Prediction')plt.axis('off')
plt.tight_layout()plt.show()
2. Object Tracking with DeepSORT
DeepSORT (Deep Simple Online and Realtime Tracking) is an object tracking algorithm that extends the original SORT (Simple Online and Realtime Tracking) algorithm with a deep association metric.
For each detected object, DeepSORT uses a neural network to extract appearance features. A Kalman filter predicts the new locations of existing tracks in the current frame. New detections are then matched to the existing tracks using both motion information (from the Kalman filter) and appearance information (from the feature extractor).
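To make the appearance side of this matching concrete, the sketch below computes the cosine distance between a stored track feature and a new detection feature, using random vectors as stand-ins for the CNN embeddings; the real tracker keeps a gallery of features per track, but the thresholding idea is the same.

```python
# Minimal sketch of the appearance part of the association metric:
# cosine distance between a track's stored embedding and a detection's embedding.
# The vectors here are random stand-ins for the CNN features.
import numpy as np

rng = np.random.default_rng(0)
track_feature = rng.normal(size=128)      # embedding stored for an existing track
detection_feature = rng.normal(size=128)  # embedding extracted from a new detection

def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

d = cosine_distance(track_feature, detection_feature)
# With max_cosine_distance=0.3, this pair would only be associated if d <= 0.3
print(f"cosine distance: {d:.3f}")
```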
Refer to https://pypi.org/project/deep-sort-realtime/ for complete documentation.
### Usage
DeepSort(max_age, n_init, max_cosine_distance)
: Initialise the DeepSORT tracker

max_age
: Maximum number of frames a track can be inactive before being deleted. Higher values allow for longer occlusions but may lead to ID switches

n_init
: Number of consecutive detections before the track is confirmed. Higher values increase robustness but may delay track creation

max_cosine_distance
: Threshold for feature similarity. Lower values are more strict in associating detections with existing tracks

tracker.update_tracks(detections, frame)
: Update tracks using detections for the new frame. Returns the current tracks

detections
: A list of new object detections in the current frame. Each detection includes:
  - Bounding box coordinates [x1, y1, w, h]
  - Detection confidence
  - Class ID

frame
: The current video frame. This is used for feature extraction
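A minimal, self-contained sketch of this API is shown below, with one dummy frame and one dummy detection; with n_init=3 the single track is not yet confirmed after one update, so the loop prints nothing. The full pipeline that follows does the same thing for every video frame.

```python
# Minimal sketch of the DeepSort API with a single dummy detection.
import numpy as np
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.3)

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder RGB frame
detections = [([100, 150, 50, 120], 0.9, 0)]      # ([x, y, w, h], confidence, class_id)

tracks = tracker.update_tracks(detections, frame=frame)
for track in tracks:
    if track.is_confirmed():                       # unconfirmed until seen for n_init frames
        print(track.track_id, track.to_ltrb())
```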
Next, let’s use DeepSORT to track the objects detected by YOLOv8 across multiple frames.
```python
# Use DeepSORT and the results of YOLOv8 to track objects
from deep_sort_realtime.deepsort_tracker import DeepSort
from tqdm import tqdm

# Constants
CONFIDENCE_THRESHOLD = 0.5

# Initialise DeepSORT tracker
tracker = DeepSort(max_age=30, n_init=3, max_cosine_distance=0.3)

def generate_predictions(video_path, tracker, model):
    """
    Run object detection and tracking, and return the predictions as a list of
    per-frame bounding boxes. Only tracks the people class.
    """
    # Load the video with cv2.VideoCapture
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    predictions = []

    # Make predictions per frame
    with tqdm(total=total_frames, desc="Processing frames for prediction") as pbar:
        frame_id = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:  # Stop when there are no more frames
                break

            frame_id += 1

            # Run YOLOv8 detection
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = model.predict(frame_rgb, verbose=False)[0]

            # Process detections for DeepSORT
            detections = []
            for data in results.boxes.data.tolist():
                confidence = data[4]

                # Pass over low confidence predictions
                if confidence < CONFIDENCE_THRESHOLD:
                    continue

                # Get bounding box coordinates and class ID
                x1, y1, x2, y2 = data[0], data[1], data[2], data[3]
                class_id = data[5]

                if class_id == 0:  # Only track people
                    bbox = [x1, y1, x2 - x1, y2 - y1]  # Convert to [x, y, w, h]
                    detections.append((bbox, confidence, class_id))

            # Update DeepSORT tracker with the new detections
            tracks = tracker.update_tracks(detections, frame=frame_rgb)

            # Save the confirmed tracks for this frame
            boxes = []
            for track in tracks:
                if not track.is_confirmed():
                    continue
                track_id = track.track_id
                ltrb = track.to_ltrb()
                x1, y1, x2, y2 = ltrb  # Left, top, right, bottom
                w, h = x2 - x1, y2 - y1
                box = [int(track_id), int(x1), int(y1), int(w), int(h)]
                boxes.append(box)

            # Append predictions for the current frame
            predictions.append(boxes)

            # Update progress bar
            pbar.update(1)

    cap.release()
    return predictions

def generate_video_with_boxes(video_path, predictions, output_video_path):
    """
    Generate a video with bounding boxes drawn on each frame based on predictions.
    """
    # Read video
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Initialise video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, 30, (width, height))

    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    current_frame = 0

    with tqdm(total=total_frames, desc="Generating video") as pbar:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Get predictions for the current frame
            boxes = predictions[current_frame]

            # Draw bounding boxes
            frame_with_boxes = draw_detection_on_frame(frame, boxes)

            # Write the frame with boxes
            out.write(frame_with_boxes)

            current_frame += 1
            pbar.update(1)

    cap.release()
    out.release()
    print(f"Video saved to {output_video_path}")

video_path = '/content/ADL-Rundle-6/ADL-Rundle-6.mp4'
output_video_path = '/content/ADL-Rundle-6/ADL-Rundle-6-tracked.mp4'

# Run predictions
predictions = generate_predictions(video_path, tracker, model)

# Generate a video with bounding boxes using the saved predictions
generate_video_with_boxes(video_path, predictions, output_video_path)
```
3. Performance Evaluation
In this exercise, we will:
- Match ground truth detections with predictions using the Hungarian algorithm.
- Compute various performance metrics including MOTA, MOTP, Precision, Recall, and F1-Score.
The evaluation process assesses the performance of a people tracking system by comparing predicted bounding boxes against ground truth data. The bounding boxes in the predictions will need to be matched to the corresponding boxes in the ground truth.
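Before implementing the full matching, here is a tiny illustration of the Hungarian matching on a made-up IoU matrix (rows are ground truths, columns are predictions); running linear_sum_assignment on the negated matrix selects the pairing with the largest total IoU.

```python
# Tiny illustration of Hungarian matching on a 3x2 IoU matrix (made-up values).
import numpy as np
from scipy.optimize import linear_sum_assignment

iou = np.array([[0.8, 0.1],
                [0.2, 0.6],
                [0.0, 0.3]])
rows, cols = linear_sum_assignment(-iou)   # maximise total IoU
print(list(zip(rows, cols)))               # [(0, 0), (1, 1)]: gt 2 is left unmatched
```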
Steps
- Match Detections:
  - Implement the match_detections(ground_truths, predictions, iou_threshold) function.
  - Use the Hungarian algorithm (linear_sum_assignment) to find the optimal matching between ground truths and predictions based on IoU.
- Compute Performance Metrics:
  - Implement the compute_metrics(...) function that calculates (a small worked example with made-up counts follows these steps):
    - MOTA (Multiple Object Tracking Accuracy): MOTA = 1 - (misses + false_positives + id_switches) / total_gt
    - MOTP (Multiple Object Tracking Precision): MOTP = total_iou / number of matches
    - Precision: Precision = number of matches / (number of matches + false positives)
    - Recall: Recall = number of matches / total_gt
    - F1 Score: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
- Evaluate Tracking Performance:
  - Implement the evaluate_people_tracking(...) function.
  - Iterate through frames to match detections and compute metrics.

Finally, adjust hyperparameters to improve your evaluation results. Download ADL-Rundle-6-tracked.mp4 and observe the results of your object tracking algorithm.
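As a quick sanity check on the metric formulas, here is a worked example with small made-up counts (these are not results from this video):

```python
# Worked example of the metric formulas with made-up counts.
matches, misses, false_positives, id_switches = 80, 15, 10, 5
total_gt, total_iou = 100, 60.0

mota = 1 - (misses + false_positives + id_switches) / total_gt   # 1 - 30/100 = 0.70
motp = total_iou / matches                                       # 60.0 / 80 = 0.75
precision = matches / (matches + false_positives)                # 80 / 90  ≈ 0.889
recall = matches / total_gt                                      # 80 / 100 = 0.80
f1 = 2 * precision * recall / (precision + recall)               # ≈ 0.842
print(mota, motp, precision, recall, f1)
```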
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def calculate_iou(box1, box2):
    _, x1, y1, w1, h1 = box1
    _, x2, y2, w2, h2 = box2
    xi1, yi1 = max(x1, x2), max(y1, y2)
    xi2, yi2 = min(x1 + w1, x2 + w2), min(y1 + h1, y2 + h2)
    intersection = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union if union > 0 else 0

def match_detections(ground_truths, predictions, iou_threshold):
    iou_matrix = np.zeros((len(ground_truths), len(predictions)))
    for i, gt in enumerate(ground_truths):
        for j, pred in enumerate(predictions):
            iou_matrix[i, j] = calculate_iou(gt, pred)

    # Find the optimal matching between ground truths and predictions such that IoU is maximised
    matched_gt, matched_pred = linear_sum_assignment(-iou_matrix)

    matches = []
    unmatched_gt = []
    unmatched_pred = []

    for gt, pred in zip(matched_gt, matched_pred):
        if iou_matrix[gt, pred] >= iou_threshold:
            matches.append((gt, pred))
        else:
            unmatched_gt.append(gt)
            unmatched_pred.append(pred)

    return matches, unmatched_gt, unmatched_pred, iou_matrix

def compute_metrics(matches, misses, false_positives, id_switches, total_iou, total_gt):
    mota = 1 - (misses + false_positives + id_switches) / total_gt if total_gt > 0 else 0
    motp = total_iou / len(matches) if len(matches) > 0 else 0

    precision = len(matches) / (len(matches) + false_positives) if (len(matches) + false_positives) > 0 else 0
    recall = len(matches) / total_gt if total_gt > 0 else 0

    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "MOTA": mota,
        "MOTP": motp,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1_score
    }

def evaluate_people_tracking(predictions, ground_truth, iou_threshold=0.5):
    matches = []
    misses = 0
    false_positives = 0
    id_switches = 0
    total_iou = 0

    prev_matches = {}

    # Assuming both `predictions` and `ground_truth` are lists of lists, we iterate through frames
    total_frames = min(len(ground_truth), len(predictions))  # Handle the case where one is shorter

    for frame_idx in range(total_frames):
        gt_boxes = ground_truth[frame_idx]
        frame_predictions = predictions[frame_idx]

        frame_matches, unmatched_gt, unmatched_pred, iou_matrix = match_detections(
            gt_boxes, frame_predictions, iou_threshold)

        for gt, pred in frame_matches:
            matches.append((frame_idx, gt, pred))
            total_iou += iou_matrix[gt, pred]

            gt_id = gt_boxes[gt][0]
            pred_id = frame_predictions[pred][0]  # Track ID is the first element of a prediction box
            if gt_id in prev_matches and prev_matches[gt_id] != pred_id:
                id_switches += 1
            prev_matches[gt_id] = pred_id

        misses += len(unmatched_gt)
        false_positives += len(unmatched_pred)

    total_gt = sum(len(boxes) for boxes in ground_truth)

    return compute_metrics(matches, misses, false_positives, id_switches, total_iou, total_gt)

# Assuming `predictions` and `ground_truths` are both lists of lists
metrics = evaluate_people_tracking(predictions, ground_truths)

# Display evaluation results
print("\nMetrics:")
for key, value in metrics.items():
    print(f"{key}: {value}")
```
Metrics:
MOTA: -0.1599121581153924
MOTP: 0.7754386561787157
Precision: 0.7690596562184024
Recall: 0.7592333799161509
F1-Score: 0.7641149286718908