Deep learning based monocular visual-inertial odometry

Bibliographic Details
Main Author: Mahdi Abolfazli Esfahani
Other Authors: Wang Han
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/152280
Institution: Nanyang Technological University
Description
Summary: Autonomous vehicles must know their state in the environment to make decisions and achieve their goals. The Visual-Inertial Odometry (VIO) system is one of the most significant modules of an autonomous vehicle: it enables a robot to know its position and orientation in the environment relative to its starting point. Despite decades of research with different sensors, current odometry systems suffer from various problems that lead to odometry failure. Different types of sensors and their combinations have been studied and examined over the past decades, and novel methods have been proposed. Among all sensor combinations, pairing a monocular camera with an Inertial Measurement Unit (IMU) has been a focus of research in the past three years because it is a lightweight, low-cost, and compact sensor combination, usable in applications ranging from smartphones to micro aerial vehicles. Traditional methods extract features from the visual information, track them across consecutive frames, and tightly optimize the relation between feature points in consecutive camera frames together with the associated IMU information. However, they perform poorly in textureless regions and fail when features cannot be tracked; they also cannot handle challenging scenarios in which an object moves in front of the camera. Deep learning based VIO systems have been investigated to tackle these challenges. While they achieve satisfactory results, they suffer from erroneous motion estimation and scale extraction in various scenarios due to their low generalization capability. This research investigates the pros and cons of existing deep learning based monocular VIO systems and then proposes a new approach to train Convolutional Neural Networks (CNNs) for the regression problem of 6-DOF camera re-localization, making the CNNs more robust.
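To illustrate the kind of regression objective used for 6-DOF camera re-localization, the sketch below shows a PoseNet-style loss that combines position error with a weighted orientation error. This is a generic example, not the thesis's actual training objective; the quaternion representation and the weighting hyperparameter `beta` are assumptions for illustration.

```python
import numpy as np

def pose_loss(p_true, q_true, p_pred, q_pred, beta=100.0):
    """PoseNet-style 6-DOF re-localization loss (illustrative sketch).

    p_true, p_pred: (3,) position vectors.
    q_true, q_pred: (4,) quaternions; q_pred is normalized before
    comparison, since a network's raw output need not be unit-length.
    beta: hypothetical weight balancing orientation vs. position error.
    """
    q_hat = q_pred / np.linalg.norm(q_pred)          # normalize prediction
    pos_err = np.linalg.norm(p_true - p_pred)        # Euclidean position error
    ori_err = np.linalg.norm(q_true - q_hat)         # quaternion distance
    return pos_err + beta * ori_err
```

For a perfect prediction (same position, same orientation up to quaternion scale), the loss is zero; `beta` is typically tuned per dataset to balance the two error terms.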
The proposed approach is further extended to solve the problem of Visual Odometry (VO) in an end-to-end manner. The effects of moving objects and motion blur are also studied, and approaches that make the visual odometry and camera re-localization modules robust to them are proposed. Furthermore, since VO's primary step is to extract robust features that can be tracked across consecutive frames, a bio-inspired approach is proposed that learns, from a single training image, to extract distinctive handcrafted features with high generalization ability. The proposed method runs in real time and surpasses both traditional and deep learning based methods in robust handcrafted feature extraction. The IMU suffers from bias and measurement noise, which makes the problem of Inertial Odometry (IO) considerably harder: because errors propagate over time during robot position estimation, an inaccurate estimate or a small error can render the odometry and localization system unreliable and unusable within a fraction of a second. This research presents a novel triple-channel deep IO network, based on the IMU's physical and mathematical model, that outputs the magnitude of orientation and position changes over time. The proposed model simulates the noise model in the training phase and thus becomes robust to noise at test time. Besides, the network architecture takes the time interval between two consecutive IMU readings as an input, making it robust to changes in the IMU frame rate and to missing IMU readings. The proposed architecture outperforms all existing solutions on the IMU readings of a challenging Micro Aerial Vehicle (MAV) dataset, improving accuracy by about 25 percent. The approach is also extended to accurately extract the full 3D orientation of the MAV.
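The two architectural ideas above (simulating IMU noise during training, and feeding the inter-sample time interval as an input channel) can be sketched as a data-augmentation step. This is a minimal illustration under assumed noise and bias magnitudes, not the thesis's actual pipeline or parameter values.

```python
import numpy as np

def augment_imu_window(gyro, accel, dt, rng,
                       noise_std=(0.01, 0.05), bias_std=(0.005, 0.02)):
    """Simulate IMU noise/bias on a clean window and append dt as a channel.

    gyro, accel: (N, 3) angular-rate and acceleration samples.
    dt: (N,) time intervals between consecutive readings; appending them
        as an input channel lets a model tolerate frame-rate changes and
        dropped readings.
    noise_std, bias_std: illustrative (gyro, accel) magnitudes, not the
        thesis's values.
    Returns an (N, 7) network-input array: [gyro, accel, dt] per step.
    """
    g_bias = rng.normal(0.0, bias_std[0], size=3)    # constant over the window
    a_bias = rng.normal(0.0, bias_std[1], size=3)
    g_noisy = gyro + g_bias + rng.normal(0.0, noise_std[0], gyro.shape)
    a_noisy = accel + a_bias + rng.normal(0.0, noise_std[1], accel.shape)
    return np.concatenate([g_noisy, a_noisy, dt[:, None]], axis=1)
```

Training on windows augmented this way exposes the network to bias and measurement noise it will meet at test time, while the dt channel decouples the learned model from any fixed IMU sampling rate.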