Tools for visual scene recognition

Bibliographic Details
Main Author: Elahe Farahzadeh
Other Authors: Andrzej Sluzek, Cham Tat Jen
Format: Theses and Dissertations (Doctoral thesis, 119 p.)
Language: English
Published: 2014
School: School of Computer Engineering (Centre for Computational Intelligence)
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access: https://hdl.handle.net/10356/59540
DOI: 10.32657/10356/59540
Institution: Nanyang Technological University
Full description

Scene recognition is an important step towards a full understanding of an image. This thesis presents novel ideas related to semantic-spatial content capture and local-global feature fusion, and applies them to scene recognition. It shows how the proper use of these approaches, without attempting to recognize the objects in scene images, can improve recognition accuracy for scene classification.

First, we propose a method to build a semantic visual vocabulary. Features are extracted from image patches, and an initial vocabulary is constructed by performing k-means clustering on the extracted features and choosing the cluster centers as the visual words. The feature vectors are quantized against this initial vocabulary to form a word-image matrix that records the occurrence of words in the images. The codebook is then embedded into a concept space by latent semantic models; we demonstrate this embedding using both Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (pLSA). In the proposed space, distances between words represent semantic distances, and these are used to construct a discriminative and semantically meaningful vocabulary. The main contributions of the first chapter are:
1. Using the semantic word space to co-cluster similar words into a semantic visual vocabulary, which improves on methods that use the document space directly after pLSA embedding.
2. Investigating the effect of varying the number of latent variables.
3. Using LSA embedding, whereas vision systems to date have used only pLSA.
This method has shown promising results on the 15-Scene categories even though the model extracts only one type of visual feature.
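As a concrete illustration of this pipeline, the following is a minimal sketch in Python using scikit-learn. It is not the thesis implementation; the descriptor input and all sizes (a vocabulary of 200 words, 25 latent topics, 100 semantic words) are hypothetical placeholders.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

def build_initial_vocabulary(all_descriptors, n_words=200):
    # Cluster local patch descriptors; the cluster centers act as visual words.
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def word_image_matrix(vocab, per_image_descriptors, n_words=200):
    # Quantize each image's descriptors against the vocabulary and count
    # how often each word occurs in each image (rows: words, columns: images).
    counts = np.zeros((n_words, len(per_image_descriptors)))
    for j, descriptors in enumerate(per_image_descriptors):
        for w in vocab.predict(descriptors):
            counts[w, j] += 1
    return counts

def lsa_word_embedding(counts, n_topics=25):
    # Truncated SVD of the word-image matrix; each row places one visual
    # word in the latent concept space, where distances approximate
    # semantic distances between words.
    return TruncatedSVD(n_components=n_topics, random_state=0).fit_transform(counts)

def semantic_vocabulary(word_embedding, n_semantic_words=100):
    # Co-cluster words that lie close together in concept space; each label
    # maps an initial visual word to a semantic visual word.
    return KMeans(n_clusters=n_semantic_words, n_init=10,
                  random_state=0).fit_predict(word_embedding)

A pLSA variant would replace the SVD step with EM-fitted topic distributions, but the co-clustering of words in the resulting latent space proceeds the same way.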
Second, since fusing local and global features is beneficial to the performance of scene categorization systems [7], we propose a novel Local-Global Feature Fusion (LGFF) method capable of fusing latent semantic patches adaptively. The local feature space of the image is embedded into a latent semantic space by pLSA modeling; this semantic space and the global contextual feature space are then mapped into a kernel space. To perform this mapping, the global features and latent variables are weighted relative to one another for each scene category. The main contributions of the second chapter are:
1. Weighting latent semantic topics based on their discriminative power.
2. Defining a novel exemplar-based distance learning.
3. Defining a category-dependent map function.
The experimental evaluations indicate substantial improvements on the 15-Scene and 67-Indoor Scenes datasets.

Third, every scene image carries a large amount of spatial information. Capturing the spatial positions of visual features plays an essential role in scene categorization; however, the methods proposed in the previous chapters disregard this position information. Inspired by methods that construct pyramid levels over image primitives [8, 9], pyramid matching kernels are employed to score the matches between image pyramids (a minimal sketch of this machinery follows the description). We improve the semantic vocabulary framework by considering the positions of semantic local patches and their surrounding neighborhood properties, using either a global or a region-based spatial pyramid method. In the global method, the image is first projected into the concept space and then divided into finer sub-scenes, and the pyramid match kernels are applied over the proposed space to co-cluster the semantic visual words. The region-based method first divides the image into sub-scenes and then projects each sub-scene into the concept space. In the Local-Global Feature Fusion framework, the global features gist and CENsus TRansform hISTogram (CENTRIST) already capture the global spatial layout of the scene; to further improve the results, the image is divided into sub-regions at different levels of resolution, and the pyramid matching kernels are applied over these sub-regions. Each sub-region is represented either by CENTRIST or by a bag of Scale-Invariant Feature Transform (SIFT) visual features. The experimental results outperform most of the best published results on both the 15-Scene and 67-Indoor Scenes datasets. The thesis concludes with a discussion of future directions for extending the proposed work on scene recognition.
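The following is a minimal sketch of a spatial pyramid match score in the style of the pyramid methods cited above [8, 9]. It is an illustration under assumptions, not the thesis code: the per-pixel word map input, the 2^l x 2^l grids, and the level weights follow the standard spatial pyramid formulation.

import numpy as np

def cell_histograms(word_map, n_words, level):
    # Per-cell histograms of visual-word indices over a 2^l x 2^l grid;
    # word_map is a 2-D array of quantized local features (hypothetical input,
    # with word indices assumed to lie in [0, n_words)).
    g = 2 ** level
    h, w = word_map.shape
    hists = []
    for i in range(g):
        for j in range(g):
            cell = word_map[i * h // g:(i + 1) * h // g,
                            j * w // g:(j + 1) * w // g]
            hists.append(np.bincount(cell.ravel(), minlength=n_words)[:n_words])
    return np.concatenate(hists).astype(float)

def spatial_pyramid_match(map_a, map_b, n_words, levels=3):
    # Histogram intersection at each level, weighted so that matches found
    # in finer cells contribute more than matches found only in coarse cells.
    L = levels - 1
    score = 0.0
    for l in range(levels):
        weight = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        ha = cell_histograms(map_a, n_words, l)
        hb = cell_histograms(map_b, n_words, l)
        score += weight * np.minimum(ha, hb).sum()
    return score

Such a score can be plugged into a kernel classifier (e.g., an SVM) as a similarity between two images, whether the cells are represented by a bag of SIFT words, by CENTRIST, or by semantic visual words.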