VIREO @ TRECVID 2016: Multimedia event detection, ad-hoc video search, video-to-text description

Multimedia events such as “birthday party” usually involve complex interactions between humans and objects. Unlike actions and sports, these events rarely contain unique motion patterns that can be directly exploited for recognition. To encode the rich objects in these events, a common practice is to tag each video frame with object labels, represented as a vector of object-appearance probabilities. These vectors are then pooled across frames to obtain a video-level representation. Current practices suffer from two deficiencies that stem from the direct use of a deep convolutional neural network (DCNN) and standard feature-pooling techniques. First, the max-pooling and softmax layers in a DCNN overemphasize the primary object or scene in a frame, producing a sparse vector that overlooks secondary or small-size objects. Second, pooling sparse vectors with a max or average operator makes the video-level feature unreliable for modeling the object composition of an event. To address these problems, this paper proposes a new video representation, named Object-VLAD, which treats each object equally and encodes them into a single vector for multimedia event detection. Furthermore, the vector can be flexibly decoded to identify evidence, such as key objects, to recount why a video was retrieved for an event of interest. Experiments conducted on the MED13 and MED14 datasets verify the merit of Object-VLAD, which consistently outperforms several state-of-the-art methods in both event detection and recounting.
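The record itself contains no code, but as a rough illustration of the VLAD-style aggregation the abstract describes, the sketch below assigns per-frame object-score vectors to their nearest codebook centers, sums the residuals per cluster, and normalizes the result into a single video-level vector. The codebook size, the normalization steps, and all names here are illustrative assumptions, not the authors' Object-VLAD implementation.

# Minimal VLAD sketch (illustrative assumption, not the paper's code):
# frame_vectors is an (n_frames, d) array of per-frame DCNN object scores;
# the codebook is a k-means model fit offline on such vectors.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(frame_vectors, codebook):
    """Aggregate frame-level object vectors into one video-level VLAD vector."""
    centers = codebook.cluster_centers_              # (k, d) cluster centers
    labels = codebook.predict(frame_vectors)         # nearest center per frame
    k, d = centers.shape
    vlad = np.zeros((k, d))
    for x, c in zip(frame_vectors, labels):
        vlad[c] += x - centers[c]                    # accumulate residuals
    # Standard VLAD post-processing: per-cluster (intra) normalization,
    # then global L2 normalization.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = np.divide(vlad, norms, out=np.zeros_like(vlad), where=norms > 0)
    vlad = vlad.ravel()
    return vlad / max(np.linalg.norm(vlad), 1e-12)

# Toy usage: 100 frames, 1000 object classes, 8 clusters (all made up).
rng = np.random.default_rng(0)
frames = rng.random((100, 1000))
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(frames)
video_vector = vlad_encode(frames, codebook)         # shape (8 * 1000,)

Unlike max or average pooling, every frame contributes a residual to its cluster, so secondary objects that appear consistently still leave a trace in the final vector.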


Bibliographic Details
Main Authors: ZHANG, Hao; LU, Yi-Jie; NGO, Chong-wah
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2016
Subjects: Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/6578
https://ink.library.smu.edu.sg/context/sis_research/article/7581/viewcontent/vireo_2016_ngo.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems (InK@SMU)
License: http://creativecommons.org/licenses/by-nc-nd/4.0/