Visual place recognition for unmanned vehicles in city-scale challenging environments
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/165647 |
Institution: | Nanyang Technological University |
Summary: | Large-scale visual place recognition (VPR) involves retrieving reference images that depict the same location as a given query image. It can be applied to loop closure detection within Simultaneous Localization and Mapping (SLAM) for robot systems. With the rapid development of mobile robots, long-term navigation poses greater challenges to the VPR task. For example, the same place may undergo extreme appearance changes due to different illumination, weather, and seasonal conditions. Partial occlusion and dynamic objects make the task harder still. Additionally, a robot may revisit the same place from different viewpoints. Therefore, large-scale visual place recognition under challenging conditions has attracted widespread attention in both the Computer Vision and Robotics communities.
To handle the task effectively, researchers have proposed solutions from a variety of perspectives, which can be broadly divided into two categories. The first type of method is devoted to developing powerful global image descriptors for fast and accurate retrieval. A nearest-neighbor search on the query image descriptor returns the candidate reference images that lie closest to the query image in feature space. The second type of method builds on the global approach and additionally exploits local details. It takes advantage of the spatial consistency of pixels or patches to geometrically validate the candidate reference images obtained through global retrieval. In general, both types of method still have inherent flaws that must be addressed.
The global approaches are centered on developing compact and discriminative image descriptors. Early methods encode all local features into the feature embedding indiscriminately, which may result in misleading information being encoded into the image representation. To highlight the task-relevant visual cues in the feature embedding, existing attention mechanisms are either based on hand-crafted rules or trained in a purely data-driven manner. To fill the gap between the two types, a novel Semantic Reinforced Attention Learning (SRAL) model is first proposed, in which the inferred attention can benefit from both semantic priors and data-driven training. The contribution is twofold. (1) An interpretable local weighting scheme based on hierarchical feature distribution is proposed to suppress misleading local features. (2) By exploiting the interpretability of the local weighting scheme, a semantic constrained initialization is proposed so that the local attention can be reinforced by semantic priors. On city-scale benchmark datasets, experiments show that SRALNet outperforms previous state-of-the-art (SOTA) global image descriptors for VPR.
Secondly, the task relevance of visual cues is heavily influenced by their context in the scene. With this in mind, a novel encoding strategy called Attentional Pyramid Pooling of Salient Visual Residuals (APPSVR) is proposed on top of SRALNet. It incorporates three types of attention modules to model the saliency of local features in the individual, spatial, and cluster dimensions respectively. (1) A semantic-reinforced local weighting scheme is used for local feature refinement to inhibit task-irrelevant local features. (2) To leverage the spatial context, an attentional pyramid structure is constructed to adaptively encode regional features according to their relative spatial saliency. (3) To distinguish the different importance of visual clusters to the task, a parametric normalization is proposed to adjust their contribution to image descriptor generation. Experiments demonstrate that APPSVR outperforms existing techniques and achieves new state-of-the-art performance on VPR benchmark datasets. Visualizations show that the saliency maps learned in a weakly supervised manner are generally consistent with human cognition.
Thirdly, global approaches rely heavily on aggregation to produce compact image descriptors, at the expense of discarding spatial information and ignoring local details. This may cause confusion when retrieving among multiple scenes with similar appearances. A geometric consistency check of local pixels or patches can validate the candidate reference images obtained by the global retrieval. Accordingly, a Co-Attentive Hierarchical Image Representations (CAHIR) framework is proposed for VPR, which unifies attention-sharing global and local descriptor generation into a single encoding pipeline. The hierarchical descriptors are applied to a coarse-to-fine VPR system with global retrieval and local geometric verification. To find high-quality local matches between task-relevant visual elements, a cross-attention mutual enhancement layer is introduced to strengthen the information interaction between the local descriptors. For the mutual enhancement layer to perform optimally, a distillation pipeline with a novel selective matching loss is proposed, through which the parametric model can be fine-tuned by distillation learning. After cross-matching the enhanced local descriptors, only local correspondences with high task relevance are preserved for subsequent geometric consistency assessment. Experiments demonstrate that CAHIR outperforms existing global and local representations for VPR and achieves new state-of-the-art results on city-scale benchmark datasets. Visualizations also show that the learned CAHIR places high importance on task-relevant visual elements and excels at locating local correspondences that are discriminative for the VPR task. |
---|
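The coarse-to-fine idea described in the summary (global retrieval followed by local verification) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes descriptors are already available as plain lists of floats, and it substitutes a count of mutual nearest-neighbor matches for the geometric consistency check. The function names (`retrieve_top_k`, `mutual_matches`, `rerank`) and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_top_k(query_global, ref_globals, k):
    """Coarse stage: indices of the k reference images whose global
    descriptors are closest to the query's global descriptor."""
    ranked = sorted(range(len(ref_globals)),
                    key=lambda i: l2_distance(query_global, ref_globals[i]))
    return ranked[:k]

def nearest(src, dst):
    """For each descriptor in src, the index of its nearest neighbor in dst."""
    return [min(range(len(dst)), key=lambda j: l2_distance(s, dst[j]))
            for s in src]

def mutual_matches(query_locals, ref_locals):
    """Pairs (i, j) where query descriptor i and reference descriptor j
    are each other's nearest neighbor."""
    q_to_r = nearest(query_locals, ref_locals)
    r_to_q = nearest(ref_locals, query_locals)
    return [(i, j) for i, j in enumerate(q_to_r) if r_to_q[j] == i]

def rerank(query_locals, ref_locals_all, shortlist):
    """Fine stage: reorder the shortlist by mutual-match count (descending).
    Assumption: match count stands in for a real geometric check."""
    return sorted(shortlist,
                  key=lambda i: len(mutual_matches(query_locals, ref_locals_all[i])),
                  reverse=True)

# Toy data (hypothetical): three reference images, each with one global
# descriptor and two local descriptors.
ref_globals = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
ref_locals_all = [
    [[0.1, 0.0], [5.0, 5.1]],
    [[9.0, 9.0], [8.0, 8.0]],
    [[0.0, 0.1], [0.2, 0.0]],
]
query_global = [0.92, 0.08]
query_locals = [[0.0, 0.0], [5.0, 5.0]]

shortlist = retrieve_top_k(query_global, ref_globals, k=2)
reranked = rerank(query_locals, ref_locals_all, shortlist)
print(shortlist, reranked)
```

In this toy case the global stage ranks reference 2 ahead of reference 0, while the local stage reverses the order because reference 0 has two mutual matches and reference 2 has one. A real system would typically replace the match count with a geometric model fit over the matched keypoint locations.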