Audio-visual adapter for multi-modal deception detection
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Master by Coursework |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/171383 |
Institution: | Nanyang Technological University |
Summary: | Deception detection based on human behavior is important in fields such as customs security and multimedia anti-fraud. However, progress in deception detection research is hindered by two main challenges: the scarcity of high-quality deception data and the complexity of learning from multimodal data. Deception data featuring Asian subjects is particularly lacking. These limitations underline the need for further exploration and development in this area. To address the scarcity of high-quality deception data with Asian subjects, this project collects a multi-modal dataset tailored to deception detection. The dataset covers four distinct conversational scenarios, each containing a substantial amount of deceptive content. Its speakers are Asian and diverse in language, gender, ethnicity, and age. Audio-visual deception detection has recently attracted growing interest because it outperforms approaches that rely on a single modality. In real-world multi-modal settings, however, data integrity issues can arise; for example, only some of the modalities may be available. A missing modality can degrade performance, even though the model is capable of capturing features from that modality.
To address the challenge of missing modalities and further improve performance, a framework called Audio-Visual Adapter (AVA) is proposed. The framework efficiently fuses temporal features across the two modalities to overcome the missing-modality problem. AVA combines the visual feature and the audio feature from the same time slot into a new temporal feature. If one modality is missing, the available modality can still obtain information from the missing one. By leveraging the capabilities of AVA, we aim to significantly improve performance in multi-modal deception detection.
The experiments are conducted on two benchmark datasets, and the results demonstrate that the proposed AVA outperforms other multi-modal fusion techniques, particularly in flexible-modal settings involving multiple and missing modalities. These results showcase the potential of the AVA framework for audio-visual deception detection. |
---|
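The summary describes combining the visual and audio features of the same time slot into one temporal feature, and coping with a missing modality. The sketch below is not the thesis's AVA, whose architecture is not given in this record. It is a deliberately simple stand-in (per-slot concatenation with zero-filling) that only illustrates the data layout such a fusion step might operate on. The function name `fuse_time_slots`, the `audio_dim` and `visual_dim` parameters, and the presence flags are all hypothetical.

```python
from typing import List, Optional, Tuple

# One feature vector per time slot (assumed layout, not taken from the thesis).
Frames = List[List[float]]


def fuse_time_slots(
    audio: Optional[Frames],
    visual: Optional[Frames],
    audio_dim: int,
    visual_dim: int,
) -> Tuple[Frames, Tuple[bool, bool]]:
    """Concatenate audio and visual features slot by slot.

    A missing modality (passed as None) is zero-filled so the fused
    feature keeps a fixed width. The returned flags record which
    modalities were actually present: (audio_present, visual_present).
    """
    if audio is None and visual is None:
        raise ValueError("at least one modality must be present")
    if audio is not None and visual is not None and len(audio) != len(visual):
        raise ValueError("audio and visual must cover the same time slots")

    steps = len(audio) if audio is not None else len(visual)

    # Hypothetical placeholder: zeros stand in for the absent stream.
    audio_frames = audio if audio is not None else [[0.0] * audio_dim for _ in range(steps)]
    visual_frames = visual if visual is not None else [[0.0] * visual_dim for _ in range(steps)]

    # One fused vector per time slot: audio features followed by visual features.
    fused = [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]
    return fused, (audio is not None, visual is not None)
```

A learned adapter would presumably replace the plain concatenation with trainable cross-modal layers. The presence flags are included only so that a downstream model could tell real features from zero padding.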