MULTIMODAL FUSION ALGORITHM AND REINFORCEMENT LEARNING-BASED DIALOG SYSTEM IN HUMAN-MACHINE INTERACTION

Bibliographic Details
Main Author: Fakhrurroja, Hanif
Format: Dissertation
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/54233
Institution: Institut Teknologi Bandung
Description
Summary: The Industrial Revolution 4.0 has tremendous potential to overhaul industry and even to change many aspects of human life. The Revolution refers to increasing automation, machine-to-machine communication and human-machine interaction, artificial intelligence, and the development of sustainable technology. One of the driving factors behind the implementation of Industry 4.0 is a new form of human-machine interaction using various modalities, such as speech, gesture, face detection, and skeleton tracking, or via smart devices. Previous studies on human-machine interaction systems point in an encouraging direction in terms of both accuracy and system development. However, several problems are still frequently encountered, especially in systems that use several input modalities simultaneously: how to design the interface so that the machine understands the ongoing dialogue context; how to activate the system through various modalities; how the machine can understand human intent through Indonesian speech recognition and gesture recognition so that a good dialogue takes place between human and machine; and how the machine's knowledge can be developed through machine learning.

This research proposes a multimodal fusion algorithm that applies the concept of logic gates at the decision level to integrate several input modalities, so that the machine understands messages better than with a single modality and recognizes the dialogue context, for example whether a conversation is between humans or between a human and a machine (a sketch of this gate logic appears after this summary). There are two types of multimodal input: modalities of the human body itself and modalities of tools commonly used by humans, such as a smartphone. The input modalities captured by the Kinect sensor are (1) face detection, which reads three conditions (face engagement, looking at the camera, and mouth open); (2) skeleton tracking, which determines the number of people visible to the Kinect camera; (3) speech recognition; and (4) hand gesture recognition. The input modalities captured through an Android-based smartphone application are screen touch (tap) and speech recorded via the microphone, so humans can interact with the machine from anywhere.

This research also proposes a dialogue system based on reinforcement learning, so that the machine can understand the ongoing dialogue context appropriately and develop its knowledge. Before entering the dialogue system, human speech and gestures are converted into text using Google Cloud Speech and a support vector machine. A natural language understanding (NLU) stage then interprets the text in three steps: (1) stemming; (2) labeling word classes and filling dialogue slots; and (3) classifying the intent with a rule-based intent classification algorithm (sketched below). The resulting intent is trained using the reinforcement learning method with the Q-learning algorithm (also sketched below) and categorized as the user's desire to turn electronic devices in the smart home system on or off. Learning takes place through rewards and punishments based on the users' responses.
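The following minimal sketch illustrates the decision-level, logic-gate style of fusion described above. All function names, cue names, and gating rules are illustrative assumptions, not the dissertation's actual rules.

    # Decision-level multimodal fusion with logic gates (illustrative sketch).
    # Names and rules are assumptions, not the dissertation's implementation.

    def fuse_decisions(face_engaged: bool, looking_at_camera: bool,
                       mouth_open: bool, person_count: int,
                       speech_detected: bool, gesture_detected: bool,
                       smartphone_input: bool) -> bool:
        # AND gate: the Kinect face cues must agree that the user is
        # addressing the machine (engaged, facing the camera, speaking).
        addressing_machine = face_engaged and looking_at_camera and mouth_open

        # Context gate: with more than one person visible, speech alone is
        # ambiguous (human-human vs. human-machine conversation), so require
        # the face cues as confirmation.
        if person_count > 1:
            command = addressing_machine and (speech_detected or gesture_detected)
        else:
            command = speech_detected or gesture_detected

        # OR gate: a smartphone tap or recorded speech can activate the
        # system even when the user is out of the Kinect's view.
        return command or smartphone_input

    # Example: two people in the room, one looks at the camera and speaks.
    print(fuse_decisions(True, True, True, 2, True, False, False))  # True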
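The rule-based intent classification step of the NLU stage can be pictured with a keyword-matching sketch like the one below. The intents, keywords (including the assumed Indonesian examples "nyalakan" and "matikan"), and slot handling are illustrative only; the dissertation's stemming and word-class labeling steps are omitted.

    # Rule-based intent classification (illustrative sketch).
    RULES = {
        "lamp_on":  ["turn on", "nyalakan"],   # assumed example keywords
        "lamp_off": ["turn off", "matikan"],
    }

    def classify_intent(text: str) -> str:
        # Return the first intent whose keywords appear in the utterance.
        text = text.lower()
        for intent, keywords in RULES.items():
            if any(kw in text for kw in keywords):
                return intent
        return "unknown"

    print(classify_intent("Please turn on the living room lamp"))  # lamp_on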
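The Q-learning step can likewise be pictured as a minimal tabular update, again under assumed names: states are recognized intents, actions are device commands, and the reward reflects the user's confirmation or correction, matching the reward-and-punishment scheme described above.

    # Minimal tabular Q-learning sketch for the intent-to-action mapping.
    # States, actions, rewards, and hyperparameters are assumptions.
    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount factor
    q_table = defaultdict(float)  # Q[(state, action)] -> value

    def update(state, action, reward, next_state, actions):
        # Standard Q-learning update:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(q_table[(next_state, a)] for a in actions)
        q_table[(state, action)] += ALPHA * (
            reward + GAMMA * best_next - q_table[(state, action)])

    actions = ["turn_on_lamp", "turn_off_lamp", "ask_clarification"]
    # Reward +1 when the user confirms the action, -1 when the user corrects it.
    update("intent:lamp_on", "turn_on_lamp", +1, "dialogue_done", actions)
    update("intent:lamp_on", "ask_clarification", -1, "intent:lamp_on", actions)
    print(max(actions, key=lambda a: q_table[("intent:lamp_on", a)]))  # turn_on_lamp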
The multimodal fusion algorithm and the reinforcement learning-based dialogue system produced in this study were then implemented in a smart home system. The results show average accuracy rates of 87.42% for multimodal activation and 88.75% for dialogue-context recognition, and a multimodal fusion accuracy rate of 93%. The test results show that the multimodal fusion algorithm understands messages better than a single modality and recognizes the dialogue context, such as whether a conversation is between humans or between a human and a machine. Validation of the reinforcement learning-based dialogue system was performed with a confusion matrix: the average accuracy, precision, sensitivity (recall), and F1-score were 83%, 95%, 78%, and 84%, respectively, and the accuracy of the dialogue-system test is 92.11% (the standard metric definitions are sketched below). These results show that the reinforcement learning-based dialogue system understands the ongoing dialogue context appropriately and can develop its knowledge. The satisfaction level of the 63 surveyed users with the multimodal-fusion-based human-machine interaction system is 95%; 76.2% of users agree that the interaction is natural, and 79.4% agree that the machine responds well to the user's intention.
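For reference, the reported precision, recall, and F1-score follow the standard confusion-matrix definitions. The sketch below computes them for an assumed binary confusion matrix; the counts are illustrative (chosen so the ratios land near the reported rates), not the dissertation's data.

    # Standard confusion-matrix metrics; counts are illustrative only.
    tp, fp, fn, tn = 78, 4, 22, 96   # assumed example counts

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)       # sensitivity
    f1        = 2 * precision * recall / (precision + recall)

    print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")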