3.1 Object Detection Concepts

1. Object Detection Introduction

1.1 Non-Max Suppression

Non-Max Suppression (NMS) is a technique used in computer vision to remove redundant or overlapping bounding boxes in tasks such as object detection. When multiple bounding boxes overlap and predict the same object, NMS helps select the best one and discard the rest.

Here’s how it works:

  1. Score Ranking: Sort all detected bounding boxes by their confidence scores (likelihood that they contain an object).
  2. Select Box: Take the box with the highest score as the “best” box.
  3. Remove Overlapping Boxes: Compare this box to the rest. For each remaining box, if its overlap (measured by Intersection over Union, IoU) with the “best” box exceeds a threshold, discard it.
  4. Repeat: Continue the process with the next highest-scoring box, repeating until all boxes are processed.
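
The overlap test in step 3 relies on Intersection over Union; a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) tuples:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Identical boxes give an IoU of 1.0, disjoint boxes give 0.0, and partial overlaps fall in between.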

Example

Let’s go through an example of Non-Max Suppression (NMS) in object detection:

Input:

Suppose you detect the following 4 bounding boxes for an object (e.g., a car) with the confidence scores and IoU:

| Box | Coordinates (x1, y1, x2, y2) | Confidence Score | IoU with Box A |
|-----|------------------------------|------------------|----------------|
| A   | (50, 50, 200, 200)           | 0.95             | -              |
| B   | (60, 60, 210, 210)           | 0.90             | 0.80           |
| C   | (55, 55, 205, 205)           | 0.85             | 0.85           |
| D   | (300, 300, 450, 450)         | 0.80             | 0.05           |

Step-by-step NMS:

  1. Rank by Confidence Score: Sort boxes based on confidence score:

    • A (0.95)
    • B (0.90)
    • C (0.85)
    • D (0.80)
  2. Select Highest Score (Box A): Choose Box A as the “best” box.

  3. Remove Overlapping Boxes: Compare the IoU of Box A with the other boxes.

    • Box B has an IoU of 0.80 (over the threshold, say 0.5) → discard Box B.
    • Box C has an IoU of 0.85 (over the threshold) → discard Box C.
    • Box D has an IoU of 0.05 (below the threshold) → keep Box D.
  4. Result: The remaining boxes are:

    • Box A: (50, 50, 200, 200) with a score of 0.95
    • Box D: (300, 300, 450, 450) with a score of 0.80

Final Output:

NMS keeps Box A and Box D, effectively removing the redundant boxes (B and C) that overlap too much with A.
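
The walkthrough above can be reproduced with a small greedy NMS sketch (pure Python; IoU values recomputed from the raw coordinates land slightly below the rounded numbers in the table, but the same boxes survive):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: boxes are (x1, y1, x2, y2), scores are confidences.
    Returns the indices of the boxes that survive suppression."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # drop every remaining box that overlaps "best" too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(50, 50, 200, 200), (60, 60, 210, 210),
         (55, 55, 205, 205), (300, 300, 450, 450)]   # A, B, C, D
scores = [0.95, 0.90, 0.85, 0.80]
print(nms(boxes, scores))  # A and D survive: [0, 3]
```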

1.2 Selective Search

Selective Search is a region proposal algorithm used in object detection to generate potential object bounding boxes from an image. It is commonly used in two-stage object detection frameworks like R-CNN (Region-based Convolutional Neural Networks) to provide object proposals for further classification and refinement. The goal of Selective Search is to efficiently generate a set of candidate regions that are likely to contain objects.

How Selective Search Works:

  1. Initial Segmentation:

    • The algorithm starts with an over-segmentation of the image into many small, homogeneous regions or segments using a method like Felzenszwalb’s segmentation algorithm. Each segment is a region with relatively uniform color and texture.
  2. Region Merging:

    • These segments are then merged into larger regions based on similarity criteria. The merging process is guided by multiple similarity measures, including color similarity, texture similarity, size, and location. Regions that are similar are combined to form larger regions.
  3. Bounding Box Generation:

    • For each merged region, a bounding box is created that encompasses the entire region. This process generates a large number of candidate bounding boxes that might contain objects.
  4. Hierarchical Grouping:

    • Selective Search uses a hierarchical approach to grouping regions. Initially, small segments are merged to form larger regions, and this process continues iteratively. This hierarchical grouping helps capture objects of varying sizes and shapes.
  5. Diverse Proposals:

    • The algorithm generates a diverse set of proposals by varying the merging criteria and similarity measures. This diversity helps in capturing a wide range of possible object shapes and locations.

Key Features:

  • Multi-scale and Multi-layer: Selective Search operates at multiple scales and layers to generate proposals of different sizes and aspect ratios.
  • Similarity Measures: Combines various similarity metrics, including color, texture, and size, to improve the quality of proposed regions.
  • Efficient: While it generates a large number of proposals, it is designed to be computationally feasible by focusing on merging segments based on similarity rather than exhaustive search.

Advantages:

  • Accuracy: Provides high-quality region proposals that are useful for object classification and bounding box regression.
  • Flexibility: Works well with objects of varying sizes and shapes.

Disadvantages:

  • Computational Cost: Can be slower than modern methods, especially in large images or when many proposals are required.
  • Redundancy: May generate overlapping or redundant proposals, which can require additional post-processing steps like Non-Max Suppression (NMS) to eliminate.

Example Workflow:

  1. Image Segmentation: The image is segmented into many small regions.
  2. Region Merging: Similar regions are merged to form larger candidate regions.
  3. Bounding Box Creation: Each merged region is bounded by a box.
  4. Proposal Generation: A diverse set of bounding box proposals is generated.

1.2.1 Selective Search Algorithm

Selective Search is a region proposal algorithm that combines segmentation and merging techniques to generate candidate object regions. The algorithm involves several steps, each with its own mathematical and algorithmic components. Here’s a detailed breakdown of the algorithm:

1. Image Segmentation

Segmentation Algorithm: Selective Search starts with segmenting the image into many small regions using a segmentation algorithm like Felzenszwalb’s segmentation.

  • Input: An image I.
  • Output: A set of initial regions R = {r_1, r_2, ..., r_n}.

Segmentation Objective:

  • Segment the image into regions where each region is a contiguous area of the image with similar properties (color, texture).

2. Region Merging

Similarity Measures: Regions are merged based on a combination of similarity metrics. The similarity s(r_i, r_j) between two regions r_i and r_j is computed using several criteria:

  • Color Similarity: Difference in color histograms between regions.
  • Texture Similarity: Difference in texture features (e.g., gradient histograms).
  • Size and Location: Relative sizes and positions of the regions.

Similarity Function:

  s(r_i, r_j) = a_1 · s_color(r_i, r_j) + a_2 · s_texture(r_i, r_j) + a_3 · s_size(r_i, r_j)

where a_1, a_2, and a_3 are weights for each similarity measure.

Merging Algorithm:

  1. Compute similarity scores for all pairs of regions.
  2. Merge the pair of regions with the highest similarity score.
  3. Update the similarity scores after each merge.
  4. Repeat until no more regions can be merged or a predefined stopping criterion is met.
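
The merging loop above can be sketched as follows. This is a simplified illustration: the full algorithm combines several similarity measures and only compares neighbouring regions, while this sketch uses a single colour-histogram similarity over all pairs. The region representation (pixel set plus normalised histogram) is an assumption made for the example:

```python
def hist_similarity(h1, h2):
    """Colour similarity as histogram intersection (histograms normalised)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def greedy_merge(regions, stop_at=1):
    """Repeatedly merge the most similar pair of regions.
    Each region is (pixel_set, histogram); merged histograms are the
    pixel-weighted average, mirroring Selective Search's propagation rule."""
    regions = list(regions)
    while len(regions) > stop_at:
        # find the most similar pair (exhaustive search, for the sketch only)
        best = max(
            ((i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))),
            key=lambda p: hist_similarity(regions[p[0]][1], regions[p[1]][1]),
        )
        px_i, h_i = regions[best[0]]
        px_j, h_j = regions[best[1]]
        n_i, n_j = len(px_i), len(px_j)
        merged_hist = [(a * n_i + b * n_j) / (n_i + n_j) for a, b in zip(h_i, h_j)]
        merged = (px_i | px_j, merged_hist)
        regions = [r for k, r in enumerate(regions) if k not in best] + [merged]
    return regions

regions = [
    ({(0, 0)}, [1.0, 0.0]),   # bright region
    ({(0, 1)}, [0.9, 0.1]),   # very similar neighbour -> merged first
    ({(5, 5)}, [0.0, 1.0]),   # dissimilar dark region
]
merged = greedy_merge(regions, stop_at=2)
print(len(merged))  # 2
```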

3. Bounding Box Generation

Bounding Box Calculation: For each merged region r, compute the bounding box that encompasses the entire region.

  • Bounding Box Coordinates:
    • Let r be a region with pixels {(x_k, y_k)}.
    • Compute the bounding box coordinates as:

      x_min = min_k x_k,  y_min = min_k y_k,  x_max = max_k x_k,  y_max = max_k y_k

Output: A set of bounding boxes, one per merged region.
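
The bounding box computation reduces to four min/max operations; pixels are assumed to be (x, y) pairs:

```python
def bounding_box(pixels):
    """Tight (x_min, y_min, x_max, y_max) box around a set of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))

print(bounding_box({(3, 7), (10, 2), (5, 5)}))  # (3, 2, 10, 7)
```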

4. Hierarchical Grouping

Hierarchical Merging: The merging process is hierarchical, involving multiple levels of region merging.

  • Initial Segments: Start with the smallest segments.
  • Iterative Merging: In each iteration, merge similar segments or regions to form larger regions.

Hierarchical Grouping Process:

  1. Start merging at the finest level (many small regions).
  2. Iteratively merge similar regions to form progressively larger ones.
  3. Continue until the whole image is a single region or the desired number of regions or bounding boxes is reached.

Example Algorithmic Steps

  1. Segment the image into initial regions.
  2. Compute similarity between all pairs of regions.
  3. Merge regions with the highest similarity score iteratively.
  4. Generate bounding boxes for the final regions.

2. Object Detection Categorization

In object detection, various methods are employed to detect and localize objects in images. These methods can be broadly categorized into Anchor-Based, Anchor-Free, and Transformer-Based detectors. Here’s a brief overview of these types:

2.1 Anchor-Based Detectors

Anchor-based detectors use predefined “anchors” (reference boxes) to detect objects. These anchors are spread across the image grid at various scales and aspect ratios, and the network predicts the offset from these anchors to create the final bounding box.
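
The anchor layout described above can be sketched as a grid of (cx, cy, w, h) boxes. The stride, scales, and aspect ratios below are illustrative values, not the settings of any particular detector:

```python
def generate_anchors(grid_w, grid_h, stride, scales, aspect_ratios):
    """Place one anchor per (scale, ratio) at every grid-cell centre.
    Returns (cx, cy, w, h) boxes with w/h = ratio and w*h = scale**2."""
    anchors = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for r in aspect_ratios:
                    w = s * r ** 0.5      # keeps area s*s while w/h == r
                    h = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

# e.g. an 8x8 grid, stride 16, 3 scales x 3 ratios -> 8*8*9 = 576 anchors
anchors = generate_anchors(8, 8, 16, scales=(32, 64, 128),
                           aspect_ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 576
```

The network then predicts, for each anchor, a class score plus offsets that deform the anchor into the final box.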

Examples:

  • Faster R-CNN: Uses a Region Proposal Network (RPN) to generate anchor boxes, then refines these proposals for final detection.
  • YOLO (You Only Look Once): From YOLOv2 onward, uses anchor boxes to predict bounding boxes and class probabilities.
  • SSD (Single Shot MultiBox Detector): Applies multiple anchors at different locations and scales across feature maps.

Pros:

  • Efficient in detecting objects with varying scales and aspect ratios.
  • Generally faster due to using pre-defined anchors.

Cons:

  • Anchor design is complex and requires careful tuning of scales and aspect ratios.
  • May generate redundant bounding boxes (requires Non-Max Suppression).

2.1.1 One-Stage Anchor-Based Detectors

One-stage detectors directly predict bounding boxes and class labels in a single step, using predefined anchors (bounding box templates) at different scales and aspect ratios across the image. These models are designed for high-speed detection, making them suitable for real-time applications.

Examples:

  • YOLO (You Only Look Once): Divides the image into a grid and assigns predefined anchor boxes to each grid cell. It predicts the class and bounding box adjustments for each anchor.
  • SSD (Single Shot MultiBox Detector): Uses anchor boxes of different sizes and aspect ratios at each feature map location, making it capable of detecting objects at different scales.
  • RetinaNet: Combines one-stage speed with anchor-based bounding box prediction, using focal loss to address the class imbalance issue.

Characteristics:

  • Speed: Fast and efficient, capable of real-time detection.
  • Architecture: Single forward pass through the network to predict bounding boxes and classes directly from image features.
  • Use cases: Suitable for real-time systems like autonomous driving, video surveillance, and robotics.

2.1.2 Two-Stage Anchor-Based Detectors

Two-stage detectors involve two steps: First, they generate object region proposals using predefined anchor boxes, and then they refine these proposals by classifying the object and adjusting the bounding boxes in the second stage. This approach tends to be more accurate but slower compared to one-stage detectors.

Examples:

  • Faster R-CNN: The first stage uses a Region Proposal Network (RPN) to generate anchor-based region proposals. In the second stage, these proposals are refined and classified.
  • R-FCN (Region-based Fully Convolutional Networks): Similar to Faster R-CNN, R-FCN generates proposals in the first stage and applies position-sensitive score maps to classify and refine them in the second stage.
  • Mask R-CNN: An extension of Faster R-CNN, it not only detects objects but also predicts segmentation masks. It follows the same two-stage anchor-based detection approach.

Characteristics:

  • Accuracy: Typically more accurate due to the refinement process in the second stage.
  • Speed: Slower than one-stage detectors due to the region proposal generation and refinement stages.
  • Architecture: First stage generates region proposals using anchor boxes, and the second stage refines and classifies these proposals.
  • Use cases: Suitable for applications requiring high accuracy, such as medical imaging, autonomous systems, and complex object detection tasks.

Comparison:

| Type | Examples | Pros | Cons |
|------|----------|------|------|
| One-Stage Anchor-Based | YOLO, SSD, RetinaNet | Fast, efficient, real-time detection | Lower accuracy, especially for small objects |
| Two-Stage Anchor-Based | Faster R-CNN, R-FCN, Mask R-CNN | High accuracy, effective for complex scenes | Slower, more computationally expensive |

Key Differences:

  • One-Stage detectors aim for real-time performance by predicting bounding boxes and object classes in a single step.
  • Two-Stage detectors prioritize accuracy by first generating object proposals and then refining them, but this comes at the cost of speed.

Both types use anchor boxes but differ in how they handle object proposal and classification.

2.2 Anchor-Free Detectors

Anchor-free detectors do not rely on predefined anchor boxes. Instead, they directly predict the center points of objects, along with the height and width of the bounding boxes. This approach is simpler since it avoids the overhead of generating anchors.
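
A much-simplified, CenterNet-style decoding step can illustrate the idea: peaks in an objectness heatmap become box centres, and a per-cell (w, h) prediction supplies the box size. The threshold and local-maximum test below stand in for the 3x3 max-pooling used in the actual model:

```python
def decode_centers(heatmap, wh, score_threshold=0.5):
    """Simplified centre-based decoding: every heatmap cell above the
    threshold that is also a local maximum becomes one detection.
    heatmap[y][x] is an objectness score; wh[y][x] = (w, h) for that centre."""
    h, w = len(heatmap), len(heatmap[0])
    detections = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y][x]
            if score < score_threshold:
                continue
            # keep only local maxima in the 3x3 neighbourhood
            neighbours = [heatmap[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))
                          if (ny, nx) != (y, x)]
            if any(n > score for n in neighbours):
                continue
            bw, bh = wh[y][x]
            detections.append((x - bw / 2, y - bh / 2,
                               x + bw / 2, y + bh / 2, score))
    return detections

heatmap = [[0.1, 0.2, 0.1],
           [0.2, 0.9, 0.2],
           [0.1, 0.2, 0.1]]
wh = [[(2, 2)] * 3 for _ in range(3)]
print(decode_centers(heatmap, wh))  # one box centred on the peak cell
```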

Examples:

  • CenterNet: Detects object centers and predicts the size of the bounding box from the center point.
  • FCOS (Fully Convolutional One-Stage Object Detection): Predicts bounding boxes by directly regressing the object’s center points and boundaries.

Pros:

  • Simpler and faster due to the absence of anchor generation.
  • No need for manually designing anchor boxes or setting scales.

Cons:

  • May struggle with detecting small or overlapping objects.
  • Sometimes less accurate than anchor-based detectors for complex object shapes.

2.2.1 One-Stage Anchor-Free Detectors

One-stage detectors perform object detection in a single step, where the model simultaneously predicts object locations (bounding boxes) and classifies the objects. These detectors are faster and more efficient because they skip the region proposal stage, directly predicting bounding boxes from image features.

Examples:

  • FCOS (Fully Convolutional One-Stage Object Detection): A pure anchor-free detector that directly predicts the center points and bounding box coordinates without the need for predefined anchors.
  • CenterNet: Detects objects by predicting the center of an object and then estimating the size of the bounding box.
  • ExtremeNet: Predicts the extreme points (top, bottom, left, right) of an object and constructs the bounding box around these points.
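
Given the four extreme points, assembling the box is a one-liner; the point coordinates below are made-up values for illustration:

```python
def box_from_extremes(top, bottom, left, right):
    """ExtremeNet-style grouping: the four extreme points (x, y) define the
    box as (left.x, top.y, right.x, bottom.y)."""
    return (left[0], top[1], right[0], bottom[1])

print(box_from_extremes(top=(120, 40), bottom=(110, 200),
                        left=(60, 130), right=(180, 120)))
# (60, 40, 180, 200)
```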

Characteristics:

  • Speed: Generally faster since they combine both detection and classification in a single pass.
  • Architecture: No region proposal step, direct bounding box regression.
  • Use cases: Ideal for real-time object detection scenarios.

2.2.2 Two-Stage Anchor-Free Detectors

Two-stage detectors, even in the anchor-free category, follow a two-step approach. In the first stage, potential object regions (proposals) are generated. In the second stage, these proposals are refined and classified. However, instead of relying on anchors for region proposals, anchor-free two-stage detectors use other strategies, such as keypoints or center-based mechanisms.

Examples:

  • RepPoints: In the first stage, it predicts a set of points that represent the object shape, and in the second stage, these points are refined to form bounding boxes.
  • FSAF (Feature Selective Anchor-Free Module): Although typically paired with anchor-based methods like RetinaNet, it has an anchor-free head that refines detections in a second stage.

Characteristics:

  • Accuracy: Typically more accurate than one-stage detectors, especially for complex scenes or small objects.
  • Speed: Slower than one-stage detectors due to the two-step process.
  • Architecture: Involves region proposal and refinement stages, but without predefined anchors.

Comparison:

| Type | Examples | Pros | Cons |
|------|----------|------|------|
| One-Stage Anchor-Free | FCOS, CenterNet, ExtremeNet | Fast, simpler architecture | Generally less accurate for small/dense objects |
| Two-Stage Anchor-Free | RepPoints, FSAF | More accurate, better for complex scenes | Slower, more complex architecture |

While most anchor-free detectors focus on being one-stage for efficiency, some two-stage approaches exist to improve detection accuracy, especially for objects that are small, overlapping, or in cluttered scenes.

2.3 Transformer-Based Detectors

Transformer-based detectors leverage transformer architectures, which have achieved strong results in natural language processing (NLP) and, more recently, computer vision. These detectors treat object detection as a set prediction task and do not rely on predefined anchors or sliding windows.
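
The set-prediction view pairs each predicted box with at most one ground-truth box via a cost matrix. DETR solves this assignment optimally with the Hungarian algorithm; the greedy sketch below is a simplified stand-in to illustrate the matching step:

```python
def greedy_match(costs):
    """Match each prediction to at most one ground-truth box, cheapest first.
    costs[i][j] is the matching cost between prediction i and ground truth j
    (in DETR, a weighted sum of classification and box-regression costs).
    Real DETR uses optimal Hungarian matching; greedy is a simplification."""
    pairs = sorted(
        (costs[i][j], i, j)
        for i in range(len(costs)) for j in range(len(costs[0]))
    )
    used_pred, used_gt, matches = set(), set(), []
    for cost, i, j in pairs:
        if i not in used_pred and j not in used_gt:
            matches.append((i, j))
            used_pred.add(i)
            used_gt.add(j)
    return sorted(matches)

print(greedy_match([[0.9, 0.1],
                    [0.2, 0.8]]))  # [(0, 1), (1, 0)]
```

Because each object is matched to exactly one prediction during training, duplicate detections are penalised directly and NMS becomes unnecessary at inference time.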

Examples:

  • DETR (Detection Transformer): Uses a transformer to model object detection as a direct set prediction problem, removing the need for NMS and anchor boxes.
  • Deformable DETR: An improvement on DETR that converges much faster during training and performs better, especially on small objects.

Pros:

  • Simpler architecture without post-processing like Non-Max Suppression.
  • Powerful in modeling global context, improving performance on complex and cluttered scenes.

Cons:

  • Slower convergence during training compared to CNN-based models.
  • Transformer models are often more computationally expensive.
