Image segmentation with less manual labeling effort
Format: Thesis - Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2022
Online Access: https://hdl.handle.net/10356/162800
Institution: Nanyang Technological University
Summary:

Semantic segmentation is the task of classifying each pixel of an image into a particular class. With the help of deep learning, fully supervised segmentation has achieved remarkable performance. However, fully supervised learning has a critical intrinsic limitation: it often requires a prohibitively large number of pixel-level annotated images for model training. Collecting labeled data can be notoriously expensive in dense prediction tasks such as semantic segmentation, instance segmentation, and video segmentation. To alleviate, or even free researchers from, the high cost of laborious annotation, this thesis tackles the problem from two aspects: few-shot segmentation and weakly supervised segmentation. Few-shot segmentation learns a network that predicts segmentation masks for novel classes given only a few newly annotated training samples. Weakly supervised segmentation, on the other hand, learns a pixel-level network from weaker annotations, such as bounding boxes, scribbles, image-level labels, and points, which can be obtained far more cheaply than labeling every pixel in an image. In the first aspect, we aim to improve the few-shot segmentation performance with the following innovations:
Firstly, we propose a Cross-Reference and Local-Global Condition Network (CRCNet) that concurrently makes predictions for both the support image and the query image, mining the objects of their common category for few-shot segmentation. To further improve the object feature representation, we develop a local-global condition module that captures both global and local relations.
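As a rough illustration of the cross-reference idea, the sketch below is a minimal PyTorch approximation, not the thesis implementation; all names are hypothetical. It re-weights the query and support features by their mutual affinity, so that locations belonging to the common object are reinforced in both maps:

```python
import torch
import torch.nn.functional as F

def cross_reference(feat_q, feat_s):
    """Hypothetical cross-reference step: reinforce features of the
    co-occurring object in both the query and support feature maps.

    feat_q, feat_s: (B, C, H, W) feature maps from a shared backbone.
    """
    B, C, H, W = feat_q.shape
    q = feat_q.flatten(2)                                   # (B, C, HW)
    s = feat_s.flatten(2)                                   # (B, C, HW)
    # Pairwise cosine affinity between all query/support locations.
    affinity = torch.einsum('bci,bcj->bij',
                            F.normalize(q, dim=1),
                            F.normalize(s, dim=1))          # (B, HW_q, HW_s)
    # Score each query location by its best-matching support location,
    # and vice versa; common-object regions score high in both maps.
    attn_q = affinity.max(dim=2).values.softmax(dim=1).view(B, 1, H, W)
    attn_s = affinity.max(dim=1).values.softmax(dim=1).view(B, 1, H, W)
    return feat_q * attn_q, feat_s * attn_s
```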
Since there is massive variance in object appearances, mining the foreground regions in images may take multiple steps. We therefore also develop a mask refinement module that recurrently refines the prediction of the target object regions.
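A minimal sketch of such recurrent refinement, assuming a fixed number of steps and a small convolutional head (the architecture and names are illustrative, not the thesis design): the current mask estimate is fed back alongside the features at each step.

```python
import torch
import torch.nn as nn

class MaskRefiner(nn.Module):
    """Hypothetical recurrent refinement: the predicted mask is fed back
    with the features for a fixed number of refinement steps."""
    def __init__(self, channels=256, steps=3):
        super().__init__()
        self.steps = steps
        self.head = nn.Sequential(
            nn.Conv2d(channels + 1, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
        )

    def forward(self, feat, init_mask):
        mask = init_mask                                    # (B, 1, H, W) logits
        for _ in range(self.steps):
            # Concatenate features with the current mask estimate and re-predict.
            mask = self.head(torch.cat([feat, mask.sigmoid()], dim=1))
        return mask
```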
Secondly, we propose a Query Guided Network (QGNet) that extracts information from the query itself, independently of the support set, to benefit the few-shot segmentation task. We propose a prior extractor that learns query information from unlabeled images with our proposed global-local contrastive learning. With the prior extractor, the extraction of query information is detached from the support branch, overcoming the limitation imposed by the support set and yielding more informative query clues for better interaction.
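The abstract does not spell out the contrastive objective; the following is one plausible InfoNCE-style reading, stated as an assumption rather than the thesis formulation. Each local feature is pulled toward the global feature of its own image and pushed away from the global features of other images in the batch:

```python
import torch
import torch.nn.functional as F

def global_local_contrastive_loss(local_feats, global_feats, tau=0.07):
    """A minimal InfoNCE-style sketch of global-local contrastive learning
    (the exact formulation in the thesis may differ).

    local_feats:  (B, C, H, W) local features of unlabeled images.
    global_feats: (B, C) globally pooled features of the same images.
    """
    B, C, H, W = local_feats.shape
    local = F.normalize(local_feats.flatten(2).permute(0, 2, 1), dim=-1)  # (B, HW, C)
    glob = F.normalize(global_feats, dim=-1)                              # (B, C)
    # Similarity of every local feature to every image's global feature.
    logits = torch.einsum('bnc,kc->bnk', local, glob) / tau               # (B, HW, B)
    # The positive for each local feature is its own image's global feature.
    target = torch.arange(B, device=local_feats.device)
    target = target.view(B, 1).expand(B, H * W).reshape(-1)               # (B*HW,)
    return F.cross_entropy(logits.reshape(-1, B), target)
```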
In the second aspect, we focus on weakly supervised segmentation, aiming to predict pixel-level masks by training a network supervised only with image-level annotations.
The quality of the Class Activation Maps (CAMs) has a crucial impact on the performance of a weakly supervised segmentation model. Weakly supervised segmentation trained with image-level labels usually suffers from inaccurate coverage of object areas when generating the pseudo ground truth, because the CAMs are trained with a classification objective and tend to highlight only the most discriminative object parts rather than generalize to whole objects.
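For reference, a CAM is conventionally computed by weighting the last convolutional feature maps with the classifier weights of the target class (Zhou et al., 2016); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def compute_cam(feat, fc_weight, class_idx):
    """Standard CAM computation: weight the last convolutional feature
    maps by the linear classifier's weights for one class.

    feat:      (B, C, H, W) features before global average pooling.
    fc_weight: (num_classes, C) weights of the final linear classifier.
    """
    cam = torch.einsum('bchw,c->bhw', feat, fc_weight[class_idx])
    cam = F.relu(cam)
    # Normalize per image to [0, 1] for thresholding into pseudo labels.
    cam = cam / (cam.flatten(1).max(dim=1).values.view(-1, 1, 1) + 1e-5)
    return cam
```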
We aim to improve the quality of the CAMs, and hence the weakly supervised segmentation performance, from different aspects.
Firstly, we use a bipartite graph to locate the object-activated areas in two images that contain common classes; the matched areas are then used to refine the predicted object regions in the CAMs. In particular, we propose the Maximum Bipartite Matching Network (MBMNet), which models the paired images as a bipartite graph and applies a maximum matching algorithm to locate their corresponding areas. The matched areas are used to enhance the corresponding feature representations, from which we can generate better CAMs that cover more object regions.
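The abstract does not specify the matching procedure; one standard way to realize maximum-weight bipartite matching between region features is the Hungarian algorithm, sketched below with SciPy. The region-feature extraction and all names here are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions(feat_a, feat_b):
    """Sketch of the maximum-matching idea: regions from two images sharing
    a class form the two sides of a bipartite graph, and a maximum-weight
    matching pairs up the most similar regions.

    feat_a: (Na, C) region features from image A.
    feat_b: (Nb, C) region features from image B.
    Returns index pairs (i, j) of matched regions and their similarities.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T                       # cosine similarity matrix, (Na, Nb)
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    rows, cols = linear_sum_assignment(-sim)
    return rows, cols, sim[rows, cols]
```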
Finally, we propose a Region Prototypical Network (RPNet) that explores the cross-image object diversity of the training set to enhance the object activation maps for weakly supervised segmentation. Similar object parts across images are identified via region feature comparison, and object confidence is propagated between regions to discover and re-activate new object areas, while background regions are suppressed. Based on the re-activated feature maps, we obtain a more complete pseudo ground truth for weakly supervised segmentation.
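As a hedged sketch of the propagation step (the actual RPNet formulation may differ; all names are illustrative), confidence can be spread between regions in proportion to their feature similarity:

```python
import torch
import torch.nn.functional as F

def propagate_confidence(region_feats, region_conf, tau=0.1):
    """Hypothetical cross-region confidence propagation: each region's
    object confidence is updated from similar regions (possibly in other
    images), re-activating object parts the initial CAMs missed.

    region_feats: (N, C) pooled features of N regions across images.
    region_conf:  (N,) initial object confidence from the CAMs.
    """
    normed = F.normalize(region_feats, dim=1)
    sim = normed @ normed.T                     # region-to-region similarity
    weights = F.softmax(sim / tau, dim=1)       # similarity-weighted neighbors
    return weights @ region_conf                # propagated confidence, (N,)
```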
In summary, with CRCNet and QGNet, we improve the few-shot segmentation performance through the cross-reference mechanism and global-local contrastive learning. With the proposed MBMNet and RPNet, we enhance the object activation maps and improve weakly supervised segmentation by discovering new object areas. Both lines of work achieve new state-of-the-art segmentation performance on public benchmarks.