Monitoring of dyadic conversations : a social signal processing approach
Main Author: Yasir Tahir
Other Authors: Justin Dauwels
Format: Theses and Dissertations
Language: English
Published: 2017
Subjects: DRNTU::Engineering::Electrical and electronic engineering
Online Access: http://hdl.handle.net/10356/70621
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-70621
record_format: dspace
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: DRNTU::Engineering::Electrical and electronic engineering
spellingShingle: DRNTU::Engineering::Electrical and electronic engineering Yasir Tahir Monitoring of dyadic conversations : a social signal processing approach
description:
This work presents a real-time system that analyzes non-verbal audio and visual cues to quantitatively assess sociometries from ongoing two-person conversations. The system non-invasively captures audio and video/depth data from lapel microphones and Microsoft Kinect devices, respectively, to extract non-verbal speech and visual cues. It leverages these cues to quantitatively assess the speaking mannerisms of each participant. The speech and visual cues are incorporated as features in machine learning algorithms to quantify various aspects of social behavior, including Interest, Dominance, Politeness, Friendliness, Frustration, Empathy, Respect, Confusion, Hostility, and Agreement. The most relevant speech and visual cues are selected by forward feature selection. The system is trained and tested on two carefully annotated corpora, an Audio Corpus (AC) and an Audio-Visual Corpus (AVC), each comprising brief two-person dialogs in English. Numerical tests with leave-one-person-out cross-validation indicate that the accuracy of the algorithms for inferring the sociometries is in the range of 50%-86% for the AC and 62%-92% for the AVC. To test the robustness of the proposed approach, the audio data from both corpora are combined and a classifier is trained on this mixed data set. Despite the significant differences in the recording conditions of the AC and AVC, the accuracy for inferring sociometries from the mixed data set is in the range of 51%-81%, which is reasonably high, implying that the algorithms are robust to changes in recording conditions. The proposed algorithms have low computational complexity; they can operate continuously, yield sociometries in real time, and can therefore be implemented on real-life platforms. The term sociofeedback has been coined to describe such systems, which analyze conversations and provide feedback to the speakers based on their speaking patterns.
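As a concrete illustration of the selection and evaluation procedure described above, the sketch below pairs greedy forward feature selection with leave-one-person-out cross-validation. The data layout, feature count, and linear-SVM classifier are assumptions made for illustration; the thesis does not prescribe this exact implementation.

```python
# Minimal sketch (not the thesis implementation): greedy forward
# feature selection scored by leave-one-person-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, groups, max_features=5):
    """Greedily add the feature that most improves accuracy under
    leave-one-person-out CV (groups = speaker identities)."""
    logo = LeaveOneGroupOut()
    selected, remaining, best = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        scores = {
            f: cross_val_score(SVC(kernel="linear"), X[:, selected + [f]],
                               y, groups=groups, cv=logo).mean()
            for f in remaining
        }
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break  # adding more features no longer helps
        best = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best

# Hypothetical data: rows are conversation segments, columns are
# non-verbal cues (e.g., speaking rate, loudness, pitch variation).
X = np.random.rand(60, 8)
y = np.random.randint(0, 2, 60)          # e.g., low/high Interest label
speakers = np.repeat(np.arange(10), 6)   # 10 speakers, 6 segments each
features, accuracy = forward_select(X, y, speakers)
print("selected features:", features, "LOPO accuracy: %.2f" % accuracy)
```

Holding out one person at a time, rather than random segments, prevents a classifier from scoring well merely by recognizing a speaker's idiosyncrasies.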
To obtain user feedback on the practical implementation of the sociofeedback system in realistic scenarios, the system was interfaced with a humanoid robot (Nao), enabling the robot to provide real-time sociofeedback to participants in two-person dialogs. The sociofeedback system quantifies the speech mannerisms and social behavior of the participants in an ongoing conversation, determines whether feedback is required, and delivers that feedback through Nao. For example, Nao alerts the speaker(s) when the voice is too low or too loud, or when the conversation is not proceeding well due to disagreements or numerous interruptions.
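A hypothetical sketch of this kind of trigger logic follows; the thresholds, cue names, and message wording are invented for illustration and are not taken from the thesis.

```python
# Hypothetical sketch of rule-based sociofeedback triggers; the
# thresholds, cue names, and messages are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CueWindow:
    mean_loudness_db: float  # average vocal loudness over the window
    interruptions: int       # number of interruptions in the window
    agreement: float         # 0.0 (disagreement) .. 1.0 (agreement)

def feedback_message(w: CueWindow) -> Optional[str]:
    """Return a feedback string if any rule fires, else None."""
    if w.mean_loudness_db < 45:
        return "Please speak a little louder."
    if w.mean_loudness_db > 75:
        return "Please lower your voice slightly."
    if w.interruptions >= 3:
        return "Try to let each other finish speaking."
    if w.agreement < 0.2:
        return "There seems to be disagreement; try restating each other's points."
    return None  # conversation is proceeding well: stay silent

# Example: frequent interruptions trigger feedback, which the robot
# would then deliver by speech (optionally combined with gestures).
msg = feedback_message(CueWindow(mean_loudness_db=60.0,
                                 interruptions=4, agreement=0.6))
if msg:
    print("Nao:", msg)
```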
The user study with the Nao robot comprises two sets of experiments. In the first set, the participants rate their understanding of feedback messages delivered via the humanoid robot. They also assess two modalities for delivering the feedback: audio only, and audio combined with gestures. In the majority of cases, there is an improvement of 10% or more when the audio and gesture modalities are combined. In the second set of experiments, the sociofeedback system is integrated with the Nao robot: the participants engage in two-person, scenario-based conversations while the system analyzes the conversations and delivers its feedback via Nao. Subsequently, the participants assess the received sociofeedback with respect to various aspects, including its content, appropriateness, and timing. They also evaluate their overall perception of Nao via the Godspeed questionnaire. The results indicate that the sociofeedback system detects the social scenario with 93.8% accuracy, and that Nao can be used effectively to provide sociofeedback in discussions. These results pave the way to natural human-robot interaction in multi-party dialog systems.
Another real-world application explored for such a system is non-verbal speech analysis to facilitate schizophrenia treatment. Negative symptoms in schizophrenia are associated with significant burden and functional impairment, especially in speech production. In clinical practice today, there are no robust treatments for negative symptoms, and one obstacle to research on them is the lack of objective measures. To this end, non-verbal speech cues are explored as objective measures. The cues are extracted from interviews of schizophrenic patients conducted by psychologists; the interviews analyzed are those of patients enrolled in an observational study on the effectiveness of Cognitive Remediation Therapy (CRT). The subjects are schizophrenic patients undergoing CRT, and the control group consists of schizophrenic patients not undergoing CRT. Audio recordings of the patients are made during three sessions in which they are evaluated for negative symptoms over a 12-week follow-up period. To validate the non-verbal speech cues, their correlations with the Negative Symptom Assessment (NSA-16) ratings are computed. The results suggest a strong correlation between certain measures in the two rating sets. Supervised prediction of the subjective ratings from the non-verbal speech features with leave-one-person-out cross-validation achieves a reasonable accuracy of 75-80%. Furthermore, the non-verbal cues can be used to reliably distinguish the subjects from the controls, as supervised learning methods classify the two groups with 69-80% accuracy.
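The validation step could, for instance, take the form sketched below: rank correlations between each extracted cue and the NSA-16 totals. The cue names and data are placeholders, not measurements from the study.

```python
# Illustrative sketch: Spearman correlations between non-verbal
# speech cues and NSA-16 totals; cue names and data are placeholders.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 30  # e.g., recorded evaluation sessions
cues = {
    "speaking_rate": rng.normal(4.0, 0.8, n),     # syllables per second
    "pause_fraction": rng.uniform(0.1, 0.5, n),   # share of silence
    "pitch_variation": rng.normal(25.0, 6.0, n),  # std. dev. in Hz
}
nsa16_total = rng.integers(16, 80, n)  # placeholder NSA-16 totals

# Spearman's rho suits ordinal clinical rating scales.
for name, values in cues.items():
    rho, p = spearmanr(values, nsa16_total)
    print(f"{name:16s} rho={rho:+.2f}  p={p:.3f}")
```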
author2: Justin Dauwels
author_facet: Justin Dauwels Yasir Tahir
format: Theses and Dissertations
author: Yasir Tahir
author_sort: Yasir Tahir
title: Monitoring of dyadic conversations : a social signal processing approach
title_short: Monitoring of dyadic conversations : a social signal processing approach
title_full: Monitoring of dyadic conversations : a social signal processing approach
title_fullStr: Monitoring of dyadic conversations : a social signal processing approach
title_full_unstemmed: Monitoring of dyadic conversations : a social signal processing approach
title_sort: monitoring of dyadic conversations : a social signal processing approach
publishDate: 2017
url: http://hdl.handle.net/10356/70621
_version_: 1772826329689882624
spelling: sg-ntu-dr.10356-70621 2023-07-04T17:23:25Z Monitoring of dyadic conversations : a social signal processing approach Yasir Tahir Justin Dauwels School of Electrical and Electronic Engineering Daniel Thalmann DRNTU::Engineering::Electrical and electronic engineering [abstract duplicated from the description field above] Doctor of Philosophy (EEE) 2017-05-05T06:29:44Z 2017-05-05T06:29:44Z 2017 Thesis Yasir Tahir. (2017). Monitoring of dyadic conversations: a social signal processing approach. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/70621 10.32657/10356/70621 en 155 p. application/pdf