A fine granularity object-level representation for event detection and recounting

Multimedia events such as "birthday party" usually involve the complex interaction between humans and objects. Unlike actions and sports, these events rarely contain unique motion patterns to be vividly explored for recognition. To encode rich objects in the events, a common practice is to...

Bibliographic Details
Main Authors:	ZHANG, Hao; NGO, Chong-wah
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2019
Subjects:	Multimedia event detection and recounting; object encoding; search result reasoning; Computer Sciences; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/6419
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-7422
record_format	dspace
spelling	sg-smu-ink.sis_research-7422 2021-11-23T01:36:02Z 2019-06-01T07:00:00Z text https://ink.library.smu.edu.sg/sis_research/6419 info:doi/10.1109/TMM.2018.2884478 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Multimedia event detection and recounting; object encoding; search result reasoning; Computer Sciences; Graphics and Human Computer Interfaces
spellingShingle	Multimedia event detection and recounting; object encoding; search result reasoning; Computer Sciences; Graphics and Human Computer Interfaces; ZHANG, Hao; NGO, Chong-wah; A fine granularity object-level representation for event detection and recounting
description	Multimedia events such as "birthday party" usually involve complex interactions between humans and objects. Unlike actions and sports, these events rarely contain unique motion patterns that can be exploited for recognition. To encode the rich objects in the events, a common practice is to tag each individual video frame with object labels, represented as a vector signifying the probabilities of object appearances. These vectors are then pooled across frames to obtain a video-level representation. The current practices suffer from two deficiencies due to the direct employment of deep convolutional neural networks (DCNNs) and standard feature pooling techniques. First, the use of max-pooling and softmax layers in DCNNs overemphasizes the primary object or scene in a frame, producing a sparse vector that overlooks the existence of secondary or small-size objects. Second, feature pooling by the max or average operator over sparse vectors makes the video-level feature unpredictable in modeling the object composition of an event. To address these problems, this paper proposes a new video representation, named Object-VLAD, which treats each object equally and encodes them into a vector for multimedia event detection. Furthermore, the vector can be flexibly decoded to identify evidence, such as key objects, to recount the reason why a video is retrieved for an event of interest. Experiments conducted on the MED13 and MED14 datasets verify the merit of Object-VLAD, which consistently outperforms several state-of-the-art methods in both event detection and recounting.
format	text
author	ZHANG, Hao; NGO, Chong-wah
author_facet	ZHANG, Hao; NGO, Chong-wah
author_sort	ZHANG, Hao
title	A fine granularity object-level representation for event detection and recounting
title_short	A fine granularity object-level representation for event detection and recounting
title_full	A fine granularity object-level representation for event detection and recounting
title_fullStr	A fine granularity object-level representation for event detection and recounting
title_full_unstemmed	A fine granularity object-level representation for event detection and recounting
title_sort	fine granularity object-level representation for event detection and recounting
publisher	Institutional Knowledge at Singapore Management University
publishDate	2019
url	https://ink.library.smu.edu.sg/sis_research/6419
_version_	1770575957125496832
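
Illustrative note (not part of the catalog record): the description above contrasts max or average pooling of per-frame object vectors with a VLAD-style encoding. The sketch below is a minimal, generic illustration of those two ideas in plain Python. It is not the authors' Object-VLAD implementation, whose details are not given in this record. The function names, the three object labels, and all numbers are made up for illustration.

```python
from math import sqrt


def max_pool(frames):
    """Element-wise maximum over per-frame object score vectors."""
    return [max(column) for column in zip(*frames)]


def average_pool(frames):
    """Element-wise mean over per-frame object score vectors."""
    return [sum(column) / len(column) for column in zip(*frames)]


def vlad_encode(descriptors, centers):
    """Generic VLAD: assign each descriptor to its nearest center,
    accumulate the residuals per center, then L2-normalize the result.
    This is the textbook form, assumed here for illustration only."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Nearest center by squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])),
        )
        for j in range(dim):
            residuals[nearest][j] += d[j] - centers[nearest][j]
    flat = [value for row in residuals for value in row]
    norm = sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


if __name__ == "__main__":
    # Hypothetical per-frame probabilities for (cake, candle, person).
    frames = [
        [0.90, 0.05, 0.05],
        [0.10, 0.10, 0.80],
        [0.85, 0.10, 0.05],
    ]
    # The secondary object ("candle") stays small under both poolings.
    print("max pooling:    ", max_pool(frames))
    print("average pooling:", average_pool(frames))

    # Hypothetical 2-D object descriptors and two cluster centers.
    descriptors = [[0.2, 0.0], [1.0, 0.8]]
    centers = [[0.0, 0.0], [1.0, 1.0]]
    print("VLAD encoding:  ", vlad_encode(descriptors, centers))
```

In this toy example each descriptor contributes a residual to its own cluster regardless of how dominant it is in a frame, which loosely mirrors the stated goal of treating each object equally. How Object-VLAD actually extracts and encodes objects should be taken from the paper itself.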

Similar Items