unbalanced data for object detection

Christos on 17 Jul 2023
Answered: Sahaj on 17 Jul 2023
Hi,
I have collected data (bounding boxes) for multiclass object detection (2 classes, let's say 'cats' and 'dogs') with Image Labeler, and I am going to collect more. I labeled the majority of object instances in my images, and in the end there are twice as many 'cats' instances as 'dogs' instances. Some images contain only 'cats', some contain only 'dogs', and some contain both. So my question is how to treat this imbalance problem, practically, in MATLAB before CNN training.
The most obvious (and easiest) option to me is the Hard Sampling approach, i.e. a) removing images with only 'cats', b) deleting 'cats' boxes from images, and c) labeling only 'dogs' in new images. I can do all of these by hand in Image Labeler.
Another approach could be the same as above, but with random selection/deletion. How can this be accomplished in MATLAB, given a groundTruth object of the above imbalanced dataset?
Finally, I am thinking of applying the augmentation only to the 'cats' class in order to end up with a balanced dataset. Again, how can this be accomplished in MATLAB, given a groundTruth object of the above imbalanced dataset?
Any help would be appreciated. I have R2022b, but I can install the newest release if needed.
C.

Answers (1)

Sahaj on 17 Jul 2023
Hi Christos.
To address the class imbalance problem in your object detection dataset, here are a few approaches you can consider:
1) Hard Sampling Approach: As you mentioned, you can remove some of the images that contain only 'cats' and delete 'cats' bounding boxes from other images. Removing images that contain only 'dogs' would make the imbalance worse, so leave those in. To achieve this in MATLAB, you can use the following steps:
  • Load your groundTruth object.
  • Go through its LabelData table, count the 'cats' and 'dogs' boxes in each image, and identify the images that contain only 'cats'.
  • Drop those rows. As far as I know, groundTruth has no function for deleting images in place, so either create a new groundTruth object from the remaining image files and LabelData rows, or convert the data to a training table with objectDetectorTrainingData and delete the rows there.
  • To remove 'cats' bounding boxes from an image that you keep, clear the 'cats' entry of that row. Note that bboxerase does not select boxes by class: __ = bboxerase(__,EraseThreshold=threshold) removes a bounding box when its overlap with a specified region of interest is equal to or greater than the threshold, so it fits the case where you erase a region of the image together with the boxes inside it. Also keep in mind that a cat that stays visible in the image without a box will be treated as background during training.
  • Save the modified data for further processing.
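The counting and removal steps above can be sketched in base MATLAB. This is only a sketch: the toy table T stands in for the table that objectDetectorTrainingData(gTruth) returns (assumed layout: one row per image, one cell column per class holding M-by-4 [x y w h] boxes), and the file names and box values are made up.

```matlab
% Toy stand-in for the training table (hypothetical files and boxes).
imageFilename = {'a.png'; 'b.png'; 'c.png'; 'd.png'; 'e.png'};
cats = {[10 10 50 50; 80 20 40 40]; [5 5 30 30]; [15 25 60 60]; ...
        [12 12 20 20; 50 50 20 20]; zeros(0,4)};
dogs = {zeros(0,4); zeros(0,4); zeros(0,4); [60 60 30 30]; ...
        [20 20 40 40; 70 70 25 25]};
T = table(imageFilename, cats, dogs);

% Count the boxes of each class in every image.
catsPerImage = cellfun(@(b) size(b,1), T.cats);
dogsPerImage = cellfun(@(b) size(b,1), T.dogs);
fprintf('Before: %d cats, %d dogs\n', sum(catsPerImage), sum(dogsPerImage));

% Images that contain cats and no dogs are the removal candidates.
catsOnly = catsPerImage > 0 & dogsPerImage == 0;

% Keep every other row (dogs-only images and mixed images).
Tkept = T(~catsOnly, :);
catsKept = sum(cellfun(@(b) size(b,1), Tkept.cats));
dogsKept = sum(cellfun(@(b) size(b,1), Tkept.dogs));
fprintf('After:  %d cats, %d dogs\n', catsKept, dogsKept);
```

Dropping every cats-only image is the bluntest version of this idea and can overshoot, as it does on the toy data, so in practice you would probably remove only some of the candidates.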
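2) Random Selection/Deletion: Instead of manually selecting images to remove, you can randomly select which 'cats' images to drop until the classes are roughly balanced. Here's how you can do it:
  • Load your groundTruth object.
  • Count the 'cats' and 'dogs' boxes per image and list the images that contain only 'cats'.
  • Shuffle that list with the randperm function and mark images for removal one at a time until the number of 'cats' boxes is no larger than the number of 'dogs' boxes.
  • Keep only the unmarked rows. As far as I know, groundTruth has no function for removing images in place, so build a new groundTruth object (or a training table from objectDetectorTrainingData) from the rows you keep.
  • Save the modified data.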
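A minimal sketch of the random deletion, again on a made-up table that mimics the assumed layout of the objectDetectorTrainingData output (one row per image, one cell column of [x y w h] boxes per class). The stopping rule, which stops as soon as 'cats' no longer outnumber 'dogs', is my own choice for illustration.

```matlab
rng(0);  % fix the seed so the random selection is repeatable

% Toy stand-in for the training table (hypothetical files and boxes).
imageFilename = {'a.png'; 'b.png'; 'c.png'; 'd.png'; 'e.png'};
cats = {[10 10 50 50; 80 20 40 40]; [5 5 30 30]; [15 25 60 60]; ...
        [12 12 20 20; 50 50 20 20]; zeros(0,4)};
dogs = {zeros(0,4); zeros(0,4); zeros(0,4); [60 60 30 30]; ...
        [20 20 40 40; 70 70 25 25]};
T = table(imageFilename, cats, dogs);

catsPerImage = cellfun(@(b) size(b,1), T.cats);
dogsPerImage = cellfun(@(b) size(b,1), T.dogs);

% Shuffle the cats-only images so they are removed in random order.
catsOnlyIdx = find(catsPerImage > 0 & dogsPerImage == 0);
order = catsOnlyIdx(randperm(numel(catsOnlyIdx)));

keep  = true(height(T), 1);
nCats = sum(catsPerImage);
nDogs = sum(dogsPerImage);
for k = 1:numel(order)
    if nCats <= nDogs
        break;  % balanced enough, stop removing
    end
    keep(order(k)) = false;
    nCats = nCats - catsPerImage(order(k));
end

Tsub = T(keep, :);
fprintf('Kept %d of %d images: %d cats, %d dogs\n', ...
        height(Tsub), height(T), nCats, nDogs);
```

If you need a groundTruth object rather than a table, the same logical index could in principle be applied to the LabelData rows and to the image file list before calling the groundTruth constructor. I have not tested that part here.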
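3) Augmentation: Applying augmentation to one class only can help balance the dataset. Note that augmentation adds instances, so it should target the under-represented class, which from your description is 'dogs' rather than 'cats'. MATLAB provides imageDataAugmenter, but its augment function transforms images only and does not update bounding boxes, so for object detection it is usually easier to warp the image and its boxes with the same transform. Here's an outline:
  • Load your groundTruth object and convert it to training data with objectDetectorTrainingData.
  • Count the boxes per class and select the images that contain only 'dogs' (augmenting an image that also contains 'cats' adds 'cats' instances too).
  • For each selected image, create a random transform (for example with randomAffine2d for rotation, scaling, or flipping), apply it to the image with imwarp, and apply the same transform to the boxes with bboxwarp. The augmentData function that appears in the documentation examples is, as far as I know, a helper defined inside those examples and not a built-in function.
  • Write the augmented images to disk and append them, with their warped 'dogs' boxes, to the original data.
  • Save the modified data.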
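The selection part of the outline above can be sketched in base MATLAB: it works out how many extra 'dogs' instances are needed and draws dogs-only images (with repetition) until the gap is closed. The table is a made-up stand-in with the assumed objectDetectorTrainingData layout, and the appended rows are only placeholders: in a real pipeline each one would be replaced by a transformed copy of the image with its boxes warped by the same transform.

```matlab
rng(0);  % fix the seed so the random draws are repeatable

% Toy stand-in for the training table (hypothetical files and boxes).
imageFilename = {'a.png'; 'b.png'; 'c.png'; 'd.png'; 'e.png'};
cats = {[10 10 50 50; 80 20 40 40]; [5 5 30 30]; [15 25 60 60]; ...
        [12 12 20 20; 50 50 20 20]; zeros(0,4)};
dogs = {zeros(0,4); zeros(0,4); zeros(0,4); [60 60 30 30]; ...
        [20 20 40 40; 70 70 25 25]};
T = table(imageFilename, cats, dogs);

catsPerImage = cellfun(@(b) size(b,1), T.cats);
dogsPerImage = cellfun(@(b) size(b,1), T.dogs);

% Only dogs-only images are augmented, so no cats are added by accident.
dogsOnlyIdx = find(dogsPerImage > 0 & catsPerImage == 0);

% Number of extra dog boxes needed to match the cat count.
deficit = sum(catsPerImage) - sum(dogsPerImage);

% Draw source images at random until the deficit is covered.
toAugment = zeros(0, 1);
while deficit > 0 && ~isempty(dogsOnlyIdx)
    pick = dogsOnlyIdx(randi(numel(dogsOnlyIdx)));
    toAugment(end+1, 1) = pick; %#ok<SAGROW>
    deficit = deficit - dogsPerImage(pick);
end

% Placeholder rows: each would become an augmented image plus warped boxes.
Taug = [T; T(toAugment, :)];
fprintf('Augmenting %d image copies: %d cats, %d dogs in total\n', ...
        numel(toAugment), sum(cellfun(@(b) size(b,1), Taug.cats)), ...
        sum(cellfun(@(b) size(b,1), Taug.dogs)));
```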
Hope this helps.

Release

R2022b
