Depth map generation : depth estimation from images


Bibliographic Details
Main Author: Zhao, Yukai
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/150273
Institution: Nanyang Technological University
Description
Summary: Depth information is an important part of the 3D structure of a scene. Accurate depth information helps us better understand a scene and is useful in many applications, such as semantic segmentation, simultaneous localization and mapping (SLAM), autonomous driving, and 3D modeling. Traditional methods mostly use binocular or multi-view images for depth estimation. The most common technique is stereo matching, which uses triangulation to estimate scene depth from image pairs (sketched below), but it is easily affected by the diversity of scenes and is computationally expensive. Monocular images impose fewer requirements on equipment and environment, so depth estimation from monocular images is closer to practical conditions and covers a wider range of application scenarios. With the breakthrough of deep learning, convolutional neural networks (CNNs) have demonstrated outstanding performance on depth estimation. State-of-the-art results indicate that monocular depth estimation methods require less memory and computation time while achieving relatively good accuracy, so monocular depth estimation has attracted growing attention.

Our objective is to build a deep-learning-based monocular depth estimation model. In this dissertation, three published models, BTSNet, BANet, and ViP-DeepLab, are investigated and implemented. Our BTSNet implementation achieves almost the same results as the original paper; its local plane guidance (LPG) layers retain local detail and restore object boundaries (see the sketch below). Because the depth-to-space (D2S) layer used to obtain the full-resolution feature map in BANet lacks fine detail, we propose a new model that replaces the D2S layer with the outputs of the LPG layers. The model is trained on the KITTI dataset and achieves better performance than state-of-the-art methods. Inspired by the LPG layer in BTSNet, we also propose a new upsampling layer, the LPG upsampling layer, which aims to provide detailed information when enlarging the resolution. Introducing the LPG upsampling layer into ViP-DeepLab improves root mean square error (RMSE) by 5%.
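As background for the stereo matching mentioned in the summary: for a rectified stereo pair, triangulation recovers depth from disparity as Z = f·B/d. The sketch below is a minimal illustration; the function and parameter names are our own, and the sample values are only loosely based on the KITTI rig, not exact calibration data.

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Triangulate depth from a rectified-stereo disparity map.

    Depth follows Z = f * B / d: f is the focal length in pixels,
    B the stereo baseline in metres, d the disparity in pixels.
    Pixels with no match (d <= 0) are mapped to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative values loosely based on the KITTI setup
# (focal length ~721 px, baseline ~0.54 m).
print(depth_from_disparity([[40.0, 10.0], [0.0, 5.0]],
                           focal_px=721.0, baseline_m=0.54))
```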
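The LPG layer referenced above, as described in the BTS paper, fits a local plane to each coarse feature cell and converts it to dense depth by ray–plane intersection: depth = n4 / (n1·u + n2·v + n3), where (n1, n2, n3) is a unit plane normal, n4 the plane distance, and (u, v) the pixel's normalized coordinates within the cell. The following NumPy sketch shows only that expansion step under our reading of the formulation; the shapes, names, and epsilon guard are our assumptions, not the thesis code.

```python
import numpy as np

def lpg_to_depth(plane_params, k):
    """Expand per-cell plane estimates into a dense depth map.

    plane_params has shape (H, W, 4): a unit plane normal
    (n1, n2, n3) plus the plane distance n4 for each coarse cell.
    Each k x k output block is filled by intersecting every pixel's
    viewing ray with its cell's plane.
    """
    H, W, _ = plane_params.shape
    # Normalized in-cell pixel coordinates (u, v) in [0, 1).
    v, u = np.meshgrid(np.arange(k) / k, np.arange(k) / k, indexing="ij")
    depth = np.zeros((H * k, W * k))
    for i in range(H):
        for j in range(W):
            n1, n2, n3, n4 = plane_params[i, j]
            denom = n1 * u + n2 * v + n3
            # Epsilon guard against near-zero denominators (our addition).
            depth[i * k:(i + 1) * k, j * k:(j + 1) * k] = n4 / np.maximum(denom, 1e-6)
    return depth
```

In the BTS design, LPG outputs computed at several strides are combined before the final full-resolution depth estimate, which is what allows the layer to restore object boundaries.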
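RMSE, the metric behind the 5% figure, is the standard error measure for KITTI depth evaluation. A minimal sketch follows; masking out unmeasured pixels (encoded as 0) is a common convention for KITTI's sparse LiDAR ground truth and is assumed here.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean square error over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = gt > 0  # exclude pixels with no LiDAR measurement
    return np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2))
```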