Multimodal audio-visual emotion detection

Bibliographic Details
Main Author: Chaudhary, Nitesh Kumar
Other Authors: Jagath C Rajapakse
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/153490
Description
Summary: Audio and visual utterances in video are temporally and semantically dependent on each other, so modelling temporal and contextual characteristics plays a vital role in understanding conflicting or supporting emotional cues in audio-visual emotion recognition (AVER). We introduce a novel temporal modelling approach with contextual features for audio and video hierarchies in AVER. To extract abstract temporal information, we first build temporal audio and visual sequences that are then fed into large Convolutional Neural Network (CNN) embeddings. Using this abstract temporal information, we train a recurrent network to capture contextual semantics from the temporal interdependencies of the audio and video streams. The encapsulated AVER approach is end-to-end trainable and improves on state-of-the-art accuracies by a large margin.
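
The pipeline described in the summary (temporal audio/visual sequences, CNN embeddings per time step, a recurrent network over the fused streams, and an end-to-end emotion classifier) can be illustrated with a minimal PyTorch sketch. The backbone choice (ResNet-18), the layer sizes, spectrogram-style audio input, concatenation-based fusion, and mean pooling over time are illustrative assumptions for the sketch, not details taken from the thesis itself.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class AVERSketch(nn.Module):
    """Illustrative CNN + recurrent pipeline for audio-visual emotion recognition."""

    def __init__(self, num_emotions=7, hidden_size=256):
        super().__init__()
        # CNN embedding for each frame of the visual sequence (512-d per step).
        self.visual_cnn = resnet18(weights=None)
        self.visual_cnn.fc = nn.Identity()
        # CNN embedding for each audio segment, treated as a 1-channel spectrogram.
        self.audio_cnn = resnet18(weights=None)
        self.audio_cnn.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                         padding=3, bias=False)
        self.audio_cnn.fc = nn.Identity()
        # Recurrent network capturing contextual semantics across time steps.
        self.rnn = nn.GRU(input_size=1024, hidden_size=hidden_size,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_emotions)

    def forward(self, frames, specs):
        # frames: (B, T, 3, H, W) video frames; specs: (B, T, 1, F, W) spectrograms.
        B, T = frames.shape[:2]
        v = self.visual_cnn(frames.flatten(0, 1)).view(B, T, -1)  # (B, T, 512)
        a = self.audio_cnn(specs.flatten(0, 1)).view(B, T, -1)    # (B, T, 512)
        fused = torch.cat([v, a], dim=-1)                         # (B, T, 1024)
        context, _ = self.rnn(fused)       # temporal interdependencies of streams
        return self.classifier(context.mean(dim=1))  # utterance-level emotion logits

# Usage on dummy data: 2 clips, 8 time steps each; the whole model is
# trainable end-to-end with a standard cross-entropy loss on the logits.
model = AVERSketch()
logits = model(torch.randn(2, 8, 3, 112, 112), torch.randn(2, 8, 1, 64, 64))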