Visual event recognition in videos by learning from web data

We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web video domain and consumer video domain). For each pyramid level and each type of local features, we first train a set of SVM classifiers based on the combined training set from two domains by using multiple base kernels from different kernel types and parameters, which are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or all the event classes by minimizing both the structural risk functional and the mismatch between data distributions of two domains. Extensive experiments demonstrate the effectiveness of our proposed framework that requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation on various aspects of the proposed method A-MKL, such as the analysis of the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance when compared with A-MKL using the prelearned classifiers only from each individual event class.
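
To make the ASTPM step concrete, here is a minimal sketch of the distance between two clips at one pyramid level, assuming each clip is represented by per-volume bag-of-words histograms (e.g., the eight space-time volumes of level 1) compared with a chi-square distance; the bipartite assignment used for alignment below is an illustrative stand-in for the paper's own alignment procedure, not the authors' exact implementation:

```python
# Hedged sketch of Aligned Space-Time Pyramid Matching at one pyramid
# level. The volume partitioning, chi-square distance, and use of a
# minimum-cost bipartite matching for alignment are assumptions made
# for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def astpm_distance(vols_a, vols_b):
    """Distance between two clips, each given as a list of per-volume
    bag-of-words histograms. Volumes are aligned by solving a
    minimum-cost bipartite matching over pairwise chi-square costs."""
    cost = np.array([[chi2_distance(a, b) for b in vols_b]
                     for a in vols_a])
    rows, cols = linear_sum_assignment(cost)  # optimal volume alignment
    return cost[rows, cols].mean()
```

At finer pyramid levels the clips would be split into more volumes, and the resulting per-level distances feed the kernels used in the classifiers below.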
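For context, a minimal sketch of the prelearned average classifier described above: one SVM per base kernel is trained on the combined web + consumer training set, and the decision values are fused with equal weights. The sketch assumes scikit-learn's SVC with precomputed kernels; the kernel types and parameters are illustrative choices, not the paper's exact configuration:

```python
# Hedged sketch of the "prelearned average classifier": one SVM per
# base kernel, fused with equal weights. Kernel families and
# parameters below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def make_base_kernels(gammas=(0.01, 0.1, 1.0), degrees=(2, 3)):
    """Base kernels of different types and parameters, each as a
    function k(X, Y) returning a Gram matrix."""
    kernels = []
    for g in gammas:  # RBF kernels with different bandwidths
        kernels.append(lambda X, Y, g=g: np.exp(-g * (
            (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :]
            - 2.0 * X @ Y.T)))
    for d in degrees:  # polynomial kernels with different degrees
        kernels.append(lambda X, Y, d=d: (X @ Y.T + 1.0) ** d)
    return kernels

def prelearned_average_classifier(X_train, y_train, kernels):
    """Fit one precomputed-kernel SVM per base kernel on the combined
    training set from both domains; return a decision function that
    averages their outputs with equal weights."""
    svms = [(SVC(kernel="precomputed").fit(k(X_train, X_train), y_train), k)
            for k in kernels]
    def decision(X_test):
        scores = [clf.decision_function(k(X_test, X_train))
                  for clf, k in svms]
        return np.mean(scores, axis=0)  # equal-weight fusion
    return decision
```

A-MKL then adapts these equal-weight fusions per event class, as sketched next.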
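Schematically, the adapted target classifier and the objective described in the abstract can be written as below; the notation here is a reconstruction, and the precise loss, regularizers, and constraints are those given in the paper rather than this sketch:

```latex
% Schematic reconstruction from the abstract; symbols are assumptions.
% \bar{f}_p: prelearned average classifiers; \phi_m: feature map of the
% m-th base kernel; \mathcal{D}^{A} / \mathcal{D}^{T}: web (auxiliary)
% and consumer (target) domains.
\[
  f(\mathbf{x}) \;=\; \sum_{p} \beta_p\, \bar{f}_p(\mathbf{x})
  \;+\; \sum_{m} d_m\, \mathbf{w}_m^{\top} \phi_m(\mathbf{x}) + b,
\]
\[
  \min_{\mathbf{d},\,\mathbf{w},\,\boldsymbol{\beta},\,b}\;
  \Omega\!\left(\mathrm{dist}_{\mathbf{d}}\!\left(\mathcal{D}^{A},
  \mathcal{D}^{T}\right)\right)
  + \lambda \left( \tfrac{1}{2}\sum_{m} d_m \|\mathbf{w}_m\|^{2}
  + C \sum_{i} \ell\!\left(y_i, f(\mathbf{x}_i)\right) \right),
\]
```

where the first term penalizes the mismatch between the two domains' data distributions and the parenthesized term is the structural risk functional.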

Bibliographic Details
Main Authors: Duan, Lixin, Xu, Dong, Tsang, Ivor Wai-Hung, Luo, Jiebo
Other Authors: School of Computer Engineering
Format: Article
Language: English
Published: 2013
Citation: Duan, L., Xu, D., Tsang, I. W., & Luo, J. (2012). Visual event recognition in videos by learning from web data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1667-1680.
ISSN: 0162-8828
DOI: 10.1109/TPAMI.2011.265
Rights: © 2012 IEEE
Subjects: DRNTU::Engineering::Computer science and engineering::Data
Online Access: https://hdl.handle.net/10356/99186
http://hdl.handle.net/10220/13518
Institution: Nanyang Technological University