Multi modal video analysis with LLM for descriptive emotion and expression annotation
This project presents a novel approach to multi-modal emotion and action annotation by integrating facial expression recognition, action recognition, and audio-based emotion analysis into a unified framework. The system utilizes TimesFormer, OpenFace, and SpeechBrain to extract relevant features from video, audio, and facial expression data. These features are then fed into a Large Language Model (LLM) to generate descriptive annotations that provide a deeper understanding of emotions and actions in conversations, moving beyond traditional emotion labels like "happy" or "angry." This approach offers more contextually rich and human-like insights, which are especially valuable for applications in education and communication. The framework aims to highlight the potential of combining multiple state-of-the-art models to produce comprehensive descriptions, contributing to both the research community and real-world applications. Evaluation methods such as ROUGE and BERTScore are employed to assess the quality of the generated text, and visualizations like heatmaps and radar charts are used to provide insights into the effectiveness of the proposed approach.
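The record carries no code, but the abstract outlines a concrete pipeline: per-modality feature extraction (TimesFormer for actions, OpenFace for facial action units, SpeechBrain for vocal emotion) fused into a prompt for an LLM. A minimal sketch of how such a front end could be wired up is shown below, assuming the public facebook/timesformer-base-finetuned-k400 and speechbrain/emotion-recognition-wav2vec2-IEMOCAP checkpoints; the helper names (recognise_action, build_prompt) and the prompt wording are illustrative assumptions, not the thesis's implementation.

```python
# Sketch of the multi-modal front end the abstract describes. The checkpoint
# names are real public models; helper names and prompt text are illustrative
# assumptions, not code from the thesis.
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification
from speechbrain.inference.interfaces import foreign_class  # speechbrain >= 1.0

# Action recognition: TimesFormer fine-tuned on Kinetics-400 (expects 8 frames).
processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
action_model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)

def recognise_action(frames):
    """frames: list of 8 H x W x 3 uint8 numpy arrays sampled from the clip."""
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        logits = action_model(**inputs).logits
    return action_model.config.id2label[int(logits.argmax(-1))]

# Vocal emotion: SpeechBrain wav2vec2 classifier fine-tuned on IEMOCAP.
emotion_model = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier",
)

def recognise_vocal_emotion(wav_path):
    _, _, _, labels = emotion_model.classify_file(wav_path)  # e.g. ['ang']
    return labels[0]

# Facial action units: OpenFace is a C++ toolkit typically driven from the
# command line; its CSV output includes per-frame action-unit columns.

def build_prompt(action, vocal_emotion, action_units):
    """Fuse per-modality outputs into a descriptive-annotation prompt for an LLM."""
    return (
        "In one paragraph, describe this person's emotional state and behaviour, "
        "going beyond a single emotion label.\n"
        f"Body action detected: {action}\n"
        f"Vocal emotion detected: {vocal_emotion}\n"
        f"Active facial action units: {action_units}\n"
    )
```

The resulting prompt would then be sent to any chat-style LLM; the abstract does not name the specific model used.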
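The evaluation side of the abstract (ROUGE and BERTScore over the generated annotations) maps directly onto the standard rouge-score and bert-score Python packages. A minimal sketch follows, with placeholder strings standing in for the thesis's actual generated and reference annotations:

```python
# Sketch of the text-quality evaluation the abstract mentions: ROUGE (n-gram
# overlap) and BERTScore (contextual-embedding similarity) against a human
# reference. The two strings are placeholders, not data from the thesis.
from rouge_score import rouge_scorer        # pip install rouge-score
from bert_score import score as bert_score  # pip install bert-score

reference = "The speaker smiles warmly and leans forward, signalling engagement."
generated = "She leans in with a warm smile, showing clear interest in the talk."

# ROUGE-1 / ROUGE-L F-measures (the scorer's signature is score(target, prediction)).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore P/R/F1; downloads an English RoBERTa backbone on first use.
P, R, F1 = bert_score([generated], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```

ROUGE rewards exact n-gram overlap and so penalises valid paraphrases, while BERTScore compares contextual embeddings and tolerates rewording; for free-form descriptive annotations the two are complementary, which is presumably why the abstract reports both.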
Saved in: DR-NTU (Nanyang Technological University digital repository)
Main Author: Fan, Yupei
Other Authors: Zheng Jianmin; College of Computing and Data Science (contact: ASJMZheng@ntu.edu.sg)
Format: Final Year Project (FYP)
Degree: Bachelor's degree
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Video understanding; Large language model (LLM); Multimodal analysis; Feature extraction; Deep learning; Emotion annotation
Online Access: https://hdl.handle.net/10356/180715
Institution: Nanyang Technological University
File Format: application/pdf
Record ID: sg-ntu-dr.10356-180715
Citation: Fan, Y. (2024). Multi modal video analysis with LLM for descriptive emotion and expression annotation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/180715