Co-saliency based visual object co-segmentation and co-localization

Bibliographic Details
Main Author: Jerripothula, Koteswar Rao
Other Authors: Yuan Junsong
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access: http://hdl.handle.net/10356/72465
Institution: Nanyang Technological University
Description
Summary: Automatic foreground segmentation and localization in images and videos are important, fundamental problems in computer vision. Because a single image or video rarely carries sufficient information about the foreground object, these tasks are usually very difficult. However, if a set of similar images (where the foreground objects are of the same category) is provided for joint processing, the job becomes a little easier because we can exploit the available commonness cue. Thus, by jointly processing similar images, we provide a kind of weak supervision to the system. The task of segmenting out the common foreground visual objects through joint processing of similar images is known as co-segmentation. Similarly, the task of localizing (providing a bounding box for) the common foreground visual objects is known as co-localization. Co-segmentation and co-localization have applications in image retrieval, image synthesis, dataset generation, object recognition, video surveillance, action recognition, etc. However, such joint processing brings new challenges: (i) variation in poses, sub-categories, viewpoints, etc.; (ii) complexity in design; (iii) difficulty in parameter setting due to the increased number of variables; (iv) speed; and (v) futility in some cases compared to single processing. Many existing joint processing methods extend single processing methods and resort to complicated co-labelling of pixels or bounding box proposals. However, the idea of using co-saliency to carry out these tasks effectively has not been well explored, especially co-saliency generated by fusing raw saliency maps. Co-saliency basically means jointly processed saliency. In this thesis, we present four co-saliency based works: saliency fusion, saliency co-fusion, video co-localization, and object co-skeletonization.
In our saliency fusion idea, we propose to fuse the saliency maps of different images using a dense correspondence technique. More importantly, this co-saliency estimation is guided by our proposed quality measurement, which helps decide whether saliency fusion improves the quality of the saliency map or not. This in turn tells us which is better for a particular case: joint or single processing. The idea is that a high-quality saliency map should have well-separated foreground and background, as well as a concentrated foreground. In our saliency co-fusion idea, to make the system more robust and to avoid heavy dependence on a single saliency extraction method, we propose to apply multiple existing saliency extraction methods to each image to obtain diverse saliency maps and to fuse them by exploiting the inter-image information. Note that while the saliency fusion idea above fuses saliency maps of different images, saliency co-fusion fuses diverse saliency maps of the same image. This results in much cleaner co-saliency maps. In our video co-localization idea, in contrast to previous joint frameworks that use bounding box proposals at every frame to attack the problem, we propose to leverage co-saliency activated tracklets to address the challenges of speed and variation. We develop co-saliency maps for only a few key frames (which we call activators) through inter-video commonness, intra-video commonness, and motion saliency. Again, the saliency fusion approach is employed. Object proposals with high objectness and co-saliency scores are then tracked across the short video intervals between key frames to build tracklets. The best tube for a video is obtained by selecting one tracklet from each of these intervals, based on confidence and on smoothness between adjacent tracklets.
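The tracklet selection step described above can be illustrated with a small dynamic-programming sketch. This is not the thesis's implementation: the function names `iou` and `select_tube`, the `smooth_weight` parameter, the representation of a tracklet as `(confidence, first_box, last_box)`, and the use of box overlap (IoU) as the smoothness term are all assumptions made for illustration. The sketch also assumes every interval has at least one candidate tracklet.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tracklet = Tuple[float, Box, Box]        # (confidence, first_box, last_box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_tube(intervals: List[List[Tracklet]], smooth_weight: float = 1.0) -> List[int]:
    """Pick one tracklet per interval, maximizing confidence plus smoothness.

    Smoothness (an assumption here) is the IoU between the last box of one
    tracklet and the first box of the next. Returns the chosen index per interval.
    """
    if not intervals:
        return []
    # best[t][j]: best total score of a path ending at candidate j of interval t
    best = [[cand[0] for cand in intervals[0]]]
    back = []  # back[t-1][j]: best predecessor of candidate j in interval t
    for t in range(1, len(intervals)):
        scores, pointers = [], []
        for conf, first_box, _ in intervals[t]:
            options = [
                best[t - 1][i] + smooth_weight * iou(prev[2], first_box)
                for i, prev in enumerate(intervals[t - 1])
            ]
            i_best = max(range(len(options)), key=options.__getitem__)
            scores.append(options[i_best] + conf)
            pointers.append(i_best)
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final candidate to recover the full path.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]
```

With `smooth_weight=0` the selection reduces to picking the most confident tracklet in each interval independently; larger values favour tubes whose adjacent tracklets overlap spatially.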
Different from object co-segmentation and co-localization, we also explore a new joint processing idea called object co-skeletonization, which is defined as joint skeleton extraction of common objects in a set of semantically similar images. Noting that a skeleton can provide good scribbles for segmentation and that skeletonization, in turn, needs good segmentation, we propose to couple the co-skeletonization and co-segmentation tasks so that they are well informed of each other and benefit each other synergistically. This coupled framework also greatly benefits from our co-saliency and fusion ideas.