Exploring effective data representation for saliency detection in image and video

Bibliographic Details
Main Author: Ren, Zhixiang
Other Authors: Chia Liang Tien
Format: Theses and Dissertations
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access:http://hdl.handle.net/10356/55428
Institution: Nanyang Technological University, School of Computer Engineering, Centre for Multimedia and Network Technology
Thesis: Doctor of Philosophy (SCE), 2013
Physical Description: 168 p.

Description:
Visual saliency plays an important role in many applications, such as image/video retargeting, automatic photo composition, and vision-based navigation. Saliency can guide these applications to focus only on the important regions of a scene, thus reducing the complexity of scene analysis. However, current saliency detection methods generate saliency maps of low resolution or quality, which may not satisfy the requirements of some applications. Moreover, compared with the large body of research on static images, saliency models for videos are less well established. In this thesis, we study and propose several models to detect salient objects or regions in images and videos.

To address the low resolution of saliency maps, we improve the current clustering framework by introducing a two-level clustering strategy based on image complexity. We first use the adaptive mean shift algorithm to extract superpixels from the input image, then employ a Gaussian Mixture Model (GMM) to group the superpixels by appearance similarity. A saliency value is finally calculated for each cluster using a compactness metric together with a modified PageRank propagation. With the superpixel representation and saliency refinement, this region-based method represents the input image in a perceptually meaningful way and highlights salient regions at full resolution with well-defined boundaries. The application of our saliency maps to object recognition demonstrates the potential of the proposed method.
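A minimal sketch of this two-level clustering pipeline is shown below, built from off-the-shelf scikit-learn components. It is an illustration rather than the thesis implementation: plain mean shift over pixel position and color stands in for the adaptive mean shift superpixels, and the component count, similarity kernel, and damping factor are assumed values.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.mixture import GaussianMixture

def clustering_saliency(image, n_components=8, damping=0.85, iters=50):
    """Toy two-level clustering saliency for a small float image (H, W, 3)
    with values in [0, 1].

    Level 1: mean shift over (x, y, color) yields superpixels.
    Level 2: a GMM groups superpixels by appearance; each cluster gets a
    compactness-based saliency, refined by a PageRank-style propagation
    over an appearance-similarity graph.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    colors = image.reshape(-1, 3)
    feats = np.column_stack([xs.ravel() / w, ys.ravel() / h, colors])

    # Level 1: superpixels (plain mean shift stands in for the adaptive variant).
    sp = MeanShift(bin_seeding=True).fit_predict(feats)
    n_sp = sp.max() + 1
    sp_color = np.array([colors[sp == i].mean(axis=0) for i in range(n_sp)])
    sp_pos = np.array([feats[sp == i, :2].mean(axis=0) for i in range(n_sp)])

    # Level 2: group superpixels by appearance similarity.
    k = min(n_components, n_sp)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(sp_color)
    cl = gmm.predict(sp_color)

    # Compactness prior: spatially concentrated clusters are more salient.
    comp = np.array([sp_pos[cl == c].var() if np.any(cl == c) else np.inf
                     for c in range(k)])
    s0 = 1.0 / (comp + 1e-6)
    s0 /= s0.sum()

    # PageRank-style refinement over cluster appearance similarity.
    W = np.exp(-np.linalg.norm(gmm.means_[:, None] - gmm.means_[None], axis=2))
    np.fill_diagonal(W, 0.0)
    W /= W.sum(axis=1, keepdims=True) + 1e-12
    s = s0.copy()
    for _ in range(iters):
        s = damping * W.T @ s + (1 - damping) * s0

    out = s[cl][sp].reshape(h, w)          # per-pixel cluster saliency
    return out / (out.max() + 1e-12)
```

On a small input, for example `clustering_saliency(img.astype(float) / 255)` on a downsampled frame, the returned map is full resolution, with spatially compact, visually distinctive clusters scoring highest; mean shift over every pixel is slow, so keep the image small.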
For video saliency detection, motivated by the psychological finding that the human visual system is extremely sensitive to isolated abrupt stimuli and relative movement, we formulate saliency detection as a unified feature reconstruction problem. For temporal saliency, we use patches in neighboring frames to sparsely reconstruct the target patch in the current frame, and measure the temporal saliency of a patch by its abruptness, estimated from the reconstruction error and the regularizer, and by its motion contrast, calculated as the difference of reconstruction coefficients. For spatial saliency, we use the surrounding patches in the same frame to sparsely reconstruct the center patch; the reconstruction error and regularizer then measure the local center-surround contrast. The strong performance of our models in both image and video evaluations supports the plausibility of feature reconstruction as an explanation for visual saliency.

Sparse and low-rank representation has demonstrated great potential in subspace learning, and we build on it to develop video saliency detection models for different degrees of camera motion. For moderate camera motion, we jointly estimate the salient foreground motion and the camera motion via robust alignment with sparse and low-rank decomposition. Consecutive frames are transformed and aligned, then decomposed into a low-rank matrix representing the background and a sparse matrix indicating objects with salient motion. We also incorporate useful spatial information, including global rarity, local center-surround contrast, and location priority, to detect spatiotemporal saliency comprehensively.

For large camera motion, the alignment-based model may fail to detect moving objects, so we instead use a trajectory representation in the sparse and low-rank decomposition. Under the assumption of orthographic projection, background trajectories lie in a subspace spanned by three basis trajectories, i.e., the rank of the background matrix is 3, and we estimate a compact background model based on this rank constraint. Furthermore, to enforce spatial connectivity and motion coherency, a Markov Random Field (MRF) is built for foreground estimation. This model is evaluated on a set of challenging sequences and shows superior performance compared with several state-of-the-art methods.
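The core of the feature reconstruction model reduces to a sparse coding problem per patch. The sketch below uses scikit-learn's Lasso as the sparse solver to show the assumed shape of that computation; the dictionary layout, the regularization weight, and the l1 coefficient-contrast measure are illustrative assumptions, not the thesis formulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def patch_abruptness(target, dictionary, lam=0.05):
    """Sparsely reconstruct a flattened target patch from a dictionary whose
    columns are candidate patches (neighbouring frames for temporal saliency,
    surrounding same-frame patches for spatial saliency).

    Returns a saliency score (reconstruction error plus regulariser cost:
    patches that are hard to explain by their context are salient) and the
    reconstruction coefficients.
    """
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(dictionary, target)           # min ||t - D a||^2 / (2n) + lam ||a||_1
    alpha = lasso.coef_
    err = np.sum((target - dictionary @ alpha) ** 2)
    return err + lam * np.abs(alpha).sum(), alpha

def motion_contrast(alpha, neighbour_alphas):
    """Contrast of a patch's reconstruction coefficients against those of its
    spatial neighbours (an l1 distance is one simple choice)."""
    return float(np.mean([np.abs(alpha - a).sum() for a in neighbour_alphas]))

# Tiny smoke test on random data: 64-dim patches, 20 candidate patches.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 20))
score, a = patch_abruptness(rng.standard_normal(64), D)
```

The same routine covers both cases: fill the dictionary with patches from neighboring frames for temporal abruptness, or with surrounding patches from the current frame for center-surround spatial contrast.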
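For the moderate-camera-motion model, the low-rank plus sparse decomposition itself can be sketched with a textbook principal component pursuit solved by the inexact augmented Lagrange multiplier method. The robust alignment step (jointly estimating frame transformations) and the spatial priors are omitted here, so this sketch assumes the frames are already roughly registered.

```python
import numpy as np

def soft(X, tau):
    """Elementwise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: soft threshold on the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft(s, tau)) @ Vt

def rpca(D, lam=None, tol=1e-7, max_iter=500):
    """Principal component pursuit via inexact ALM: D ~ L + S, with L low
    rank (background) and S sparse (salient motion). Columns of D are
    vectorised, roughly aligned frames."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    fro = np.linalg.norm(D)
    spec = np.linalg.norm(D, 2)
    Y = D / max(spec, np.abs(D).max() / lam)   # dual variable initialisation
    mu, rho = 1.25 / spec, 1.5
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        L = svt(D - S + Y / mu, 1.0 / mu)      # low-rank update
        S = soft(D - L + Y / mu, lam / mu)     # sparse update
        R = D - L - S
        Y = Y + mu * R
        mu *= rho
        if np.linalg.norm(R) / fro < tol:
            break
    return L, S
```

With frames stacked as the columns of D, the magnitude of S reshaped frame by frame gives raw temporal saliency maps, which the thesis then combines with global rarity, local center-surround contrast, and location priors.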
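Finally, the rank-3 constraint of the trajectory model admits a very short illustration. A plain truncated SVD stands in for the thesis's sparse and low-rank estimation, and the MRF for spatial connectivity and motion coherency is omitted; the trajectory-matrix layout, score normalization, and threshold are assumptions.

```python
import numpy as np

def trajectory_foreground_score(W):
    """W is a (2F, P) matrix of P point trajectories over F frames (the x
    coordinates for all frames stacked above the y coordinates). Under
    orthographic projection, background trajectories lie in a rank-3
    subspace, so the residual of a rank-3 fit scores how likely each
    trajectory is to be foreground. The thesis estimates the rank-3
    background inside a robust sparse/low-rank decomposition and smooths
    the labels with an MRF; plain truncated SVD is shown for brevity."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = (U[:, :3] * s[:3]) @ Vt[:3]          # rank-3 background estimate
    resid = np.linalg.norm(W - B, axis=0)    # per-trajectory residual
    return resid / (resid.max() + 1e-12)     # 0 = background-like, 1 = most salient

# Illustrative labelling: trajectories far from the rank-3 subspace are
# flagged as salient foreground (threshold chosen arbitrarily here).
# fg = trajectory_foreground_score(W) > 0.5
```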