Weakly-supervised semantic segmentation
Main Author: | |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
Subjects: | |
Online Access: | https://ink.library.smu.edu.sg/etd_coll/544 https://ink.library.smu.edu.sg/context/etd_coll/article/1542/viewcontent/GPIS_AY2019_PhD_CHEN_Zhaozheng.pdf |
Institution: | Singapore Management University |
Summary: | Semantic segmentation is a fundamental task in computer vision that assigns a label to every pixel in an image according to the semantic class of the object it belongs to. Training deep models for this task demands a large number of images with pixel-level labels. Weakly-supervised semantic segmentation (WSSS) is a more feasible approach that learns the segmentation task from weak annotations only. WSSS based on image-level labels, where the only supervision is the class label of the entire image, is the most challenging and most popular setting. To address this challenge, the Class Activation Map (CAM) has emerged as a powerful technique in WSSS. CAM visualizes the areas of an image that are most relevant to a particular class without requiring pixel-level annotations. However, because CAM is generated from a classification model, which is discriminative by nature, it often highlights only the most discriminative parts of an object.
This dissertation examines the key issues behind conventional CAM and proposes corresponding solutions. Our two completed works address two crucial steps in CAM generation: training a classification model and computing CAM from the trained model. The first work discusses the disadvantage of a key component in training a good classification model: the binary cross-entropy (BCE) loss function. We introduce a simple method: reactivating the CAM that has converged under BCE by using the softmax cross-entropy (SCE) loss. Thanks to the contrastive nature of SCE, the pixel response is disentangled into different classes, and hence less mask ambiguity is expected. In the second work, we aim to improve the quality of CAM given a trained classification model. Specifically, we introduce a new computation method for CAM that captures non-discriminative features, expanding CAM coverage to whole objects. This is achieved by clustering all local features of an object class to derive local prototypes that represent local semantics, such as the “head”, “leg”, and “body” of a “sheep”. Our CAM captures all local features of the class without discrimination.
Although the two completed works bring significant improvements over conventional CAM, the improved CAM may still face a bottleneck due to limited training data and the co-occurrence of objects and backgrounds. In this dissertation, we therefore also investigate the applicability of recent visual foundation models, such as the Segment Anything Model (SAM), to WSSS. SAM is a recent image segmentation model that exhibits superior performance across various segmentation tasks. It is remarkable for its capability to interpret diverse prompts and successively generate various object masks. We scrutinize SAM in two scenarios, text prompting and zero-shot learning, and propose corresponding pipelines for its application in WSSS. We provide insights into the potential and challenges of deploying visual foundation models for WSSS, facilitating future developments in this research area. |
---|
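
The conventional CAM computation described in the summary can be illustrated with a minimal pure-Python sketch. It assumes a classifier whose final layer applies one weight per feature channel after global average pooling, so the map for a class is the weighted sum of the channel maps, followed by ReLU and max-normalization. The function names (`class_activation_map`, `seed_mask`), the toy feature values, and the 0.5 threshold are hypothetical choices for illustration and are not taken from the dissertation; a real pipeline would use an array library and a trained network.

```python
def class_activation_map(features, class_weights):
    """Compute a conventional CAM for one class.

    features: list of K channel maps, each an H x W nested list of floats.
    class_weights: list of K classifier weights for the target class.
    Returns an H x W nested list scaled to [0, 1].
    """
    height = len(features[0])
    width = len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]

    # Weighted sum of the channel maps at every spatial location.
    for weight, channel in zip(class_weights, features):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]

    # ReLU: keep only positive evidence for the class.
    cam = [[max(value, 0.0) for value in row] for row in cam]

    # Normalize by the peak response (skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


def seed_mask(cam, threshold=0.5):
    """Binarize a CAM into a seed mask (threshold is an assumed value)."""
    return [[1 if value >= threshold else 0 for value in row] for row in cam]


# Toy 2 x 2 example with two hypothetical feature channels.
features = [
    [[1.0, 0.2], [0.0, 0.0]],  # channel 0: strong response at one location
    [[0.0, 0.5], [0.4, 0.0]],  # channel 1: weaker, more spread-out response
]
sheep_weights = [2.0, 0.5]     # the classifier relies mostly on channel 0

cam = class_activation_map(features, sheep_weights)
mask = seed_mask(cam)
print(mask)  # [[1, 0], [0, 0]]
```

In this toy case only the single location with the strongest response survives thresholding, which mirrors the limitation noted in the summary: a map derived from a discriminative classifier tends to cover only the most discriminative part of the object.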