Multimodal distillation for egocentric video understanding
Advances in smart devices, especially head-mounted wearables, are creating new egocentric video applications and, with them, a rapidly growing range of multimodal egocentric scenarios. Multimodal egocentric video understanding now has wide applications in augmented reality, education, and industry. Knowledge distillation transfers knowledge from a complex "teacher" model to a smaller "student" model...
Main Author: Peng, Han
Other Authors: Alex Chichung Kot
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Subjects: Engineering; Knowledge distillation; Egocentric video understanding; Transformed teacher matching
Online Access: https://hdl.handle.net/10356/177296
Institution: Nanyang Technological University
id
sg-ntu-dr.10356-177296
record_format
dspace
spelling
sg-ntu-dr.10356-177296 (2024-06-07T15:41:16Z)
Title: Multimodal distillation for egocentric video understanding
Author: Peng, Han
Supervisor: Alex Chichung Kot, School of Electrical and Electronic Engineering (EACKOT@ntu.edu.sg)
Subjects: Engineering; Knowledge distillation; Egocentric video understanding; Transformed teacher matching
Degree: Bachelor's degree
Dates: accessioned/available 2024-06-04T07:14:59Z; issued 2024
Type: Final Year Project (FYP)
Citation: Peng, H. (2024). Multimodal distillation for egocentric video understanding. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/177296
Online access: https://hdl.handle.net/10356/177296
Language: en
Identifier: J3339-232
Format: application/pdf
Publisher: Nanyang Technological University
institution
Nanyang Technological University
building
NTU Library
continent
Asia
country
Singapore
content_provider
NTU Library
collection
DR-NTU
language
English
topic
Engineering; Knowledge distillation; Egocentric video understanding; Transformed teacher matching
description
Advances in smart devices, especially head-mounted wearables, are creating new egocentric video applications and, with them, a rapidly growing range of multimodal egocentric scenarios. Multimodal egocentric video understanding now has wide applications in augmented reality, education, and industry.
Knowledge distillation (KD) transfers knowledge from a complex "teacher" model to a smaller "student" model; it is valuable for model compression and extends naturally to multimodal settings. Recent work applies the traditional KD scheme, assigning weights to the knowledge contributed by each modality, but accelerating training and introducing additional modalities remain underexplored, and research on multimodal egocentric video understanding is still limited.
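To make the weighting scheme concrete, here is a minimal PyTorch-style sketch of logit distillation with per-modality weights. It is an illustrative sketch, not the project's implementation: the helper names, the fixed weights, and the cross-entropy mixing coefficient alpha are all assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Hinton-style distillation: KL divergence between temperature-softened
    teacher and student class distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def multimodal_kd_loss(student_logits, teacher_logits_per_modality, labels,
                       modality_weights, T=4.0, alpha=0.5):
    """Weighted sum of per-modality distillation terms (e.g. one teacher fed
    RGB frames, one fed optical flow) plus cross-entropy on the labels."""
    distill = sum(w * kd_loss(student_logits, t, T)
                  for t, w in zip(teacher_logits_per_modality, modality_weights))
    return alpha * distill + (1.0 - alpha) * F.cross_entropy(student_logits, labels)

# Hypothetical usage with an RGB teacher and an optical-flow teacher:
# loss = multimodal_kd_loss(s_logits, [t_rgb_logits, t_flow_logits],
#                           labels, modality_weights=[0.6, 0.4])
```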
This project reviews knowledge classification and distillation strategies, along with improved KD methods. We use Swin-T as the teacher model and consider Swin-T and ResNet3D with depths of 18 and 50 as student models, applying two optimized distillation strategies, transformed teacher matching (TTM) and weighted TTM (WTTM), to multimodal KD. Experiments use the FPHA and H2O datasets, from both of which RGB and optical-flow frames were extracted and packaged.
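The defining change in TTM, as we read it, is that the temperature acts only on the teacher side: the teacher's distribution is power-transformed (exponent gamma, playing the role of 1/T) and the student is matched without any temperature. The sketch below follows that reading and is an assumption about the method, not the project's code; WTTM additionally applies a sample-adaptive weight to each example's loss term and is omitted here.

```python
import torch
import torch.nn.functional as F

def ttm_loss(student_logits, teacher_logits, gamma=0.5):
    """Transformed teacher matching (sketch): power-transform and renormalize
    the teacher distribution, then take the KL divergence to the student's
    untempered distribution. Normalizing p_teacher**gamma is equivalent to
    softening the teacher logits with temperature 1/gamma."""
    p_teacher = F.softmax(teacher_logits, dim=-1)
    p_transformed = p_teacher.pow(gamma)
    p_transformed = p_transformed / p_transformed.sum(dim=-1, keepdim=True)
    log_p_student = F.log_softmax(student_logits, dim=-1)  # no student temperature
    return F.kl_div(log_p_student, p_transformed, reduction="batchmean")
```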
We conducted several experiments to compare these training methods across the different networks, measuring performance with top-1 and top-5 accuracy. We conclude that Swin-T outperforms ResNet3D as a student model for distillation, and that the TTM strategy outperforms standard KD across datasets and models. Finally, we summarize the project and suggest further work.
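For reference, the two metrics count a prediction as correct when the ground-truth class is the single highest-scoring class (top-1) or among the five highest-scoring classes (top-5). A standard implementation, not taken from the project:

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Return {k: accuracy} where a sample counts as correct if its true
    label is among the k highest-scoring classes."""
    _, top_pred = logits.topk(max(ks), dim=-1)   # (batch, max_k) class indices
    correct = top_pred.eq(labels.unsqueeze(-1))  # (batch, max_k) booleans
    return {k: correct[:, :k].any(dim=-1).float().mean().item() for k in ks}
```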
author2
Alex Chichung Kot
format
Final Year Project
author
Peng, Han
title
Multimodal distillation for egocentric video understanding
publisher
Nanyang Technological University
publishDate
2024
url
https://hdl.handle.net/10356/177296