Multimodal continuous emotion analysis
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: |
Nanyang Technological University
2023
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/166783 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-166783 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
spellingShingle |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Zhang, Su Multimodal continuous emotion analysis |
description |
Emotion recognition is an increasingly popular research topic in various fields, including human-computer interaction and affective computing. Continuous emotion recognition (CER), a sub-task in this area, focuses on performing sequence-to-sequence regression on the provided emotion cues, as opposed to other research topics such as sequence-to-category emotion classification.
A trustworthy deep learning model for CER must learn long-range temporal dynamics and preserve cross-subject generality. Emotion is a continuous process that depends on past emotional states, so modeling the dynamics over a longer time frame yields more accurate predictions. Emotion is also subject to individual differences, because it is linked to personal characteristics such as experience, mood, and personality. To address these challenges, we developed four approaches that exploit long-range temporal learning and multimodality in different ways.
Our first method, which serves as the foundation for the other three, focuses on long-range temporal modeling for CER using unimodal emotion information. Experiments on the MAHNOB-HCI database show that it outperforms the state-of-the-art method. We also explore the contribution of different brain regions and EEG frequency bands to the emotion process using a saliency-map-based visualization method.
The second method uses the temporal and visual information carried by the continuous labels to enhance EEG-based emotion classification. The standard configuration assigns a single categorical label to each trial and ignores temporal variation within the trial, which may reduce the classifier's effectiveness. To overcome this limitation, a thresholding scheme converts the continuous emotional trace into a sequence of discretized labels, so that training proceeds in an N-to-N manner (one label per time step). With the trace discretized into three classes, the classifier can fit the features to their corresponding three-class labels more flexibly. Experimental results show a statistically significant 3% increase in EEG-based emotion classification accuracy.
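The thresholding step described above can be illustrated with a minimal sketch. The abstract does not state the threshold values or the class names, so the function name `discretize_trace`, the thresholds `low=-0.1` and `high=0.1`, and the mapping of class indices to low/neutral/high are all assumptions made for illustration only.

```python
import numpy as np


def discretize_trace(trace, low=-0.1, high=0.1):
    """Map a continuous emotion trace to one of three classes per time step.

    `low` and `high` are hypothetical thresholds, not values from the thesis.
    Returns 0 where trace < low, 1 where low <= trace < high, 2 where trace >= high.
    """
    trace = np.asarray(trace, dtype=float)
    # np.digitize returns, for each value, the index of the bin it falls into.
    return np.digitize(trace, bins=[low, high])


# One label per time step (N-to-N), instead of one label for the whole trial.
labels = discretize_trace([-0.5, -0.05, 0.0, 0.3, 0.7])
```

The output has the same length as the input trace, which is what allows a sequence model to be trained against a label at every time step.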
The third method trains a teacher model on the visual modality and a student model on the EEG modality, where the teacher's temporal embeddings serve as dark knowledge for the student. Using an L1 loss and a concordance correlation coefficient (CCC) loss, the student model learns to fit the teacher's knowledge and to predict the continuous labels. Experimental results show that this distillation method (CKD) outperforms the student model trained without distillation on root mean square error (RMSE), Pearson correlation coefficient (PCC), and CCC. This approach provides a promising way to leverage the complementarity of different modalities for CER.
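As a rough illustration of how an L1 term and a CCC term might be combined, the sketch below computes the standard CCC and a weighted sum of the two losses. The abstract does not specify which tensors each loss is applied to or how the terms are weighted, so applying L1 to the embeddings and CCC to the predictions, the function name `distillation_loss`, and the weight `alpha` are assumptions, not the thesis's formulation.

```python
import numpy as np


def ccc(pred, target):
    """Concordance correlation coefficient between two 1-D sequences."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    # Population covariance, matching np.var's default (ddof=0).
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2.0 * cov / (pred.var() + target.var() + (mean_p - mean_t) ** 2)


def distillation_loss(student_emb, teacher_emb, student_pred, label, alpha=1.0):
    """Hypothetical combined loss: L1 on embeddings plus (1 - CCC) on predictions.

    `alpha` is an assumed weighting factor, not taken from the thesis.
    """
    student_emb = np.asarray(student_emb, dtype=float)
    teacher_emb = np.asarray(teacher_emb, dtype=float)
    l1 = np.abs(student_emb - teacher_emb).mean()  # fit the teacher's embeddings
    ccc_loss = 1.0 - ccc(student_pred, label)      # fit the continuous labels
    return alpha * l1 + ccc_loss
```

CCC equals 1 only when the prediction matches the label in both correlation and scale, so minimizing `1 - CCC` penalizes predictions that follow the right trend but have the wrong mean or variance.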
The final method proposed in this thesis performs multimodal feature fusion for CER. Using multiple modalities can resolve ambiguity and preserve recognition robustness, improving accuracy in cases such as a crying face accompanied by joyful vocal expressions, which should be recognized as happiness rather than sadness. The leader-follower attentive network (LFAN) combines the learned encodings of the visual and EEG modalities using a cross-modality co-attention mechanism. The LFAN emphasizes the dominant visual modality, which is believed to have the strongest correlation with the label. Experiments on the AVEC2019, MAHNOB-HCI, and AffWild2 databases demonstrate that the proposed LFAN achieves promising results compared to state-of-the-art methods. |
author2 |
Guan Cuntai |
author_facet |
Guan Cuntai Zhang, Su |
format |
Thesis-Doctor of Philosophy |
author |
Zhang, Su |
author_sort |
Zhang, Su |
title |
Multimodal continuous emotion analysis |
title_short |
Multimodal continuous emotion analysis |
title_full |
Multimodal continuous emotion analysis |
title_fullStr |
Multimodal continuous emotion analysis |
title_full_unstemmed |
Multimodal continuous emotion analysis |
title_sort |
multimodal continuous emotion analysis |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/166783 |
_version_ |
1772827675120893952 |
spelling |
sg-ntu-dr.10356-166783 2023-06-01T08:00:47Z Multimodal continuous emotion analysis Zhang, Su Guan Cuntai School of Computer Science and Engineering CTGuan@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Doctor of Philosophy 2023-05-11T02:32:54Z 2023-05-11T02:32:54Z 2023 Thesis-Doctor of Philosophy Zhang, S. (2023). Multimodal continuous emotion analysis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166783 https://hdl.handle.net/10356/166783 10.32657/10356/166783 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |