ImageSpirit: Verbal guided image parsing

Bibliographic Details
Main Authors: CHENG, Ming-Ming; ZHENG, Shuai; LIN, Wen-yan; VINEET, Vibhav; STURGESS, Paul; CROOK, Nigel; MITRA, Niloy J.; TORR, Philip
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2014
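Subjects: Design; Human Factors; Languages; Image parsing; natural language control; speech interface; object class segmentation; visual attributes; multilabel CRF; Graphics and Human Computer Interfaces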
Online Access: https://ink.library.smu.edu.sg/sis_research/4854
https://ink.library.smu.edu.sg/context/sis_research/article/5857/viewcontent/ImageSpirit__Verbal_Guided_Image_Parsing__AV.pdf
Institution: Singapore Management University
Date: 2014-11-01
DOI: 10.1145/2682628
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems

Abstract: Humans describe images in terms of nouns and adjectives, while algorithms operate on images represented as sets of pixels. Bridging this gap between how humans would like to access images and how images are typically represented is the goal of image parsing, which involves assigning object and attribute labels to pixels. In this article we propose treating nouns as object labels and adjectives as visual attribute labels. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive time) solution. Using the extracted labels as handles, our system empowers a user to verbally refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that could be used to interact with new-generation devices (e.g., smartphones, Google Glass, living-room devices). We demonstrate our system on a large number of real-world images of varying complexity. To help understand the trade-offs compared to traditional mouse-based interactions, results are reported for both a large-scale quantitative evaluation and a user study.