MULTIMODAL FUSION ALGORITHM AND REINFORCEMENT LEARNING-BASED DIALOG SYSTEM IN HUMAN-MACHINE INTERACTION
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/54233 |
Institution: | Institut Teknologi Bandung |
Summary: The Industrial Revolution 4.0 has tremendous potential to overhaul industry and even to change various aspects of human life. The Revolution refers to increasing automation, machine-to-machine communication and human-machine interaction, artificial intelligence, and the development of sustainable technology. One of the driving factors behind the implementation of Industry 4.0 is a new form of human-machine interaction using various modalities, such as speech, gestures, face detection, and skeleton tracking, or via smart devices.
Previous studies on human-machine interaction systems point in an encouraging direction in terms of both accuracy and system development. However, several problems are still frequently encountered in human-machine interaction systems, especially those that use several input modalities simultaneously. These include how to design the interface system so that the machine can understand the ongoing dialogue context. Other problems are the ability to activate the system using various modalities, the ability of machines to understand human intent through Indonesian speech recognition and gestures so that a good dialogue between humans and machines can take place, and the ability of machines to develop their knowledge through machine learning.
This research proposes a multimodal fusion algorithm that applies the concept of logic gates at the decision level to solve the interaction system's problem of integrating several input modalities, so that the machine can understand messages better than with a single modality and can recognize the dialogue context, such as whether the conversation is between humans or between a human and a machine. There are two types of multimodal input: modalities that come from the human body and modalities from tools commonly used by humans, such as a smartphone. The input modalities captured by the Kinect sensor are (1) face detection, which reads three conditions: face engagement, looking at the camera, and an open mouth; (2) skeleton tracking, to determine the number of humans captured by the Kinect camera; (3) speech recognition; and (4) hand gesture recognition. The input modalities captured through an Android-based smartphone application are screen touch (tap) and speech recorded via the microphone, so humans can still interact with machines from anywhere.
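The following is a minimal sketch of how such decision-level fusion with logic gates could be realized. The gate layout (an AND over the three face conditions, ORs across command modalities and across the two input channels) and all names are illustrative assumptions, not the dissertation's exact design.

```python
# Hypothetical sketch of decision-level multimodal fusion with logic gates.
# The gate layout (AND over face conditions, OR across modalities and
# channels) is an illustrative assumption, not the dissertation's design.
from dataclasses import dataclass

@dataclass
class ModalityDecisions:
    # Kinect face detection: the three conditions read by the sensor
    face_engaged: bool
    looking_at_camera: bool
    mouth_open: bool
    # Other Kinect modalities, already reduced to binary decisions
    human_present: bool        # skeleton tracking found at least one person
    speech_detected: bool      # speech recognition produced a command
    gesture_detected: bool     # hand gesture classifier fired
    # Android smartphone modalities
    screen_tap: bool
    phone_speech: bool

def fuse(d: ModalityDecisions) -> bool:
    """Return True when the input should be treated as addressed to the machine."""
    # AND gate: all three face conditions must hold to infer attention,
    # which helps separate human-human talk from human-machine talk.
    face_attention = d.face_engaged and d.looking_at_camera and d.mouth_open

    # Kinect channel: a person must be tracked AND show attention,
    # AND issue a command via speech or gesture (OR gate).
    kinect_active = d.human_present and face_attention and (
        d.speech_detected or d.gesture_detected)

    # Smartphone channel: tap OR recorded speech (OR gate), so the user
    # can still interact when outside the Kinect's field of view.
    phone_active = d.screen_tap or d.phone_speech

    # Final OR gate across the two input channels.
    return kinect_active or phone_active

if __name__ == "__main__":
    d = ModalityDecisions(True, True, True, True, True, False, False, False)
    print(fuse(d))  # True: the Kinect channel activates the system
```

The AND over the face conditions captures the idea that attention toward the machine distinguishes human-machine dialogue from human-human conversation, while the final OR lets either the Kinect channel or the smartphone channel activate the system.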
This research also proposes a dialogue system based on reinforcement learning so that the machine can understand the ongoing dialogue context in the most appropriate way and can develop its knowledge. Before entering the dialogue system, human speech and gestures are converted into text using Google Cloud Speech and the support vector machine method, respectively. The natural language understanding (NLU) method is then used to understand the text through three stages: (1) stemming, (2) labeling word classes and filling dialogue slots, and (3) understanding the intent using a rule-based intent classification algorithm. The resulting intent is trained using the reinforcement learning method with the Q-learning algorithm and categorized as the user's desire to turn electronic devices in the smart home system on or off. The learning process is carried out through rewards and punishments based on the users' responses.
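Below is a minimal sketch of the Q-learning loop described above, assuming a toy state space of classified intents and actions that switch devices on or off; the intent names, action set, reward values, and hyperparameters are hypothetical, not the study's configuration.

```python
# Hypothetical Q-learning sketch for mapping a classified intent to a smart
# home action. States, actions, and rewards are illustrative only.
import random
from collections import defaultdict

ACTIONS = ["turn_on_lamp", "turn_off_lamp", "turn_on_tv", "turn_off_tv"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration

Q = defaultdict(float)  # Q[(intent, action)] -> estimated value

def choose_action(intent: str) -> str:
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(intent, a)])

def update(intent: str, action: str, reward: float, next_intent: str) -> None:
    # Standard Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_intent, a)] for a in ACTIONS)
    Q[(intent, action)] += ALPHA * (reward + GAMMA * best_next - Q[(intent, action)])

# Reward (+1) or punishment (-1) comes from the user's answer to the
# machine's response, e.g. confirming or correcting the executed action.
intent = "lamp_on"                      # output of the rule-based NLU stage
action = choose_action(intent)
user_confirmed = (action == "turn_on_lamp")
update(intent, action, 1.0 if user_confirmed else -1.0, intent)
```

Epsilon-greedy action selection is one common way to balance exploiting learned responses against exploring new ones; the dissertation may use a different exploration strategy.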
The multimodal fusion algorithm and the reinforcement learning-based dialogue system produced in this study are then implemented in a smart home system. The results showed that the average accuracy rates of multimodal activation and dialogue context recognition are 87.42% and 88.75%, respectively, and the multimodal fusion accuracy rate is 93%. These results indicate that the multimodal fusion algorithm can understand messages better than a single modality and can recognize the dialogue context, such as whether the conversation is between humans or between a human and a machine. Validation of the reinforcement learning-based dialogue system is performed using a confusion matrix; the average accuracy, precision, sensitivity (recall), and F1-score were 83%, 95%, 78%, and 84%, respectively, and the accuracy level in testing the dialogue system is 92.11%. These results indicate that the reinforcement learning-based dialogue system can understand the ongoing dialogue context in the most appropriate way and can develop its knowledge. The satisfaction level of 63 surveyed users with the multimodal fusion-based human-machine interaction system is 95%: 76.2% of users agree that the interaction is natural, and 79.4% agree that the machine responds well to the user's intention.
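As a brief illustration of how the four reported metrics follow from a confusion matrix, the sketch below uses made-up counts for a binary on/off decision; the numbers are not the study's data.

```python
# How the four reported metrics are derived from a binary confusion matrix.
# The counts below are made-up illustrative numbers, not the study's data.
tp, fp, fn, tn = 78, 4, 22, 62   # hypothetical counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)            # also called sensitivity
f1_score  = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1_score:.2f}")
```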