Hierarchical visual localization for visually impaired people using multimodal images

Ruiqi Cheng, Weijian Hu, Hao Chen, Yicheng Fang, Kaiwei Wang, Zhijie Xu, Jian Bai

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Localization is one of the crucial issues in assistive technology for visually impaired people. In this paper, we propose a novel hierarchical visual localization pipeline built around a wearable assistive navigation device for visually impaired people. The proposed pipeline comprises a deep descriptor network, 2D-3D geometric verification, and online sequence matching. Images in different modalities (RGB, infrared, and depth) are fed into the Dual Desc network to generate robust attentive global descriptors and local features. The global descriptors are leveraged to retrieve coarse candidates for each query image. The 2D local features, together with a sparse 3D point cloud, are used in geometric verification to select the optimal result from the retrieved candidates. Finally, sequence matching robustifies the localization results by fusing the verified results of successive frames. The proposed unified descriptor network, Dual Desc, surpasses the state-of-the-art NetVLAD and its variant on the task of image description. Validated on a real-world dataset captured by the wearable assistive device, the proposed method exploits multimodal images to overcome the shortcomings of RGB images and strengthens localization performance through the deep descriptor network and the hierarchical pipeline. In the challenging scenarios of the Yuquan dataset, the proposed method achieves an F1 score of 0.77 and a mean localization error of 2.75, which is satisfactory for practical use.
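For readers who want a concrete picture of the coarse-to-fine scheme the abstract describes, below is a minimal Python sketch of the general pattern: global-descriptor retrieval, 2D-3D geometric verification via PnP with RANSAC, and sliding-window sequence matching. All function names, parameters, and data layouts here are hypothetical illustrations; the paper's actual Dual Desc network, matching thresholds, and sequence-matching algorithm are not reproduced.

```python
# Hypothetical sketch of a hierarchical localization pipeline:
# (1) coarse retrieval with global descriptors,
# (2) 2D-3D geometric verification with PnP + RANSAC,
# (3) sequence matching over successive frames.
import numpy as np
import cv2


def retrieve_candidates(query_desc, db_descs, k=10):
    """Coarse step: rank database images by cosine similarity of
    L2-normalized global descriptors and return the top-k indices."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]


def verify_candidate(pts3d, pts2d, camera_matrix):
    """Fine step: 2D-3D geometric verification. pts3d are map points
    matched (via local features) to pts2d in the query image; the
    candidate's score is its RANSAC inlier count."""
    if len(pts3d) < 4:  # PnP needs at least 4 correspondences
        return 0, None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64),
        camera_matrix, distCoeffs=None, reprojectionError=8.0)
    if not ok or inliers is None:
        return 0, None
    return len(inliers), (rvec, tvec)


def sequence_match(frame_votes, window=5):
    """Sequence step: smooth per-frame best candidates by majority
    vote over a sliding window of successive frames, suppressing
    isolated retrieval or verification failures."""
    smoothed = []
    for i in range(len(frame_votes)):
        lo = max(0, i - window + 1)
        window_votes = frame_votes[lo:i + 1]
        smoothed.append(max(set(window_votes), key=window_votes.count))
    return smoothed
```

In a fuller implementation, the verification score (inlier count) of each candidate would also feed the sequence-matching stage, so that frames with weak geometric support contribute less to the final decision.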
Original language: English
Article number: 113743
Number of pages: 12
Journal: Expert Systems with Applications
Volume: 165
Early online date: 25 Jul 2020
DOIs
Publication status: Published - 1 Mar 2021
