Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images


Haolin Liu1,2      Anran Lin1      Xiaoguang Han1,2,*      Lei Yang4      Yizhou Yu3,4      Shuguang Cui1,2     

1Shenzhen Research Institute of Big Data, CUHK-Shenzhen       2The Future Network of Intelligence Institute, CUHK-Shenzhen       3Deepwise AI Lab       4The University of Hong Kong

Corresponding Email:     haolinliu@link.cuhk.edu.cn    hanxiaoguang@cuhk.edu.cn


Introduction

We present a novel task of 3D visual grounding in single-view RGB-D images, where the referred objects are often only partially scanned. In contrast to previous works that directly generate object proposals for grounding in the 3D scene, we propose a bottom-up approach that gradually aggregates information, effectively addressing the challenge posed by partial scans. Our approach first fuses the language and visual features at the bottom level to generate a heatmap that coarsely localizes the relevant regions in the RGB-D image. It then performs an adaptive search guided by the heatmap and carries out object-level matching with a second visio-linguistic fusion to finally ground the referred object. We evaluate the proposed method against state-of-the-art methods on both the RGB-D images extracted from the ScanRefer dataset and our newly collected SUN-Refer dataset. Experiments show that our method outperforms previous methods by a large margin (11.2% and 15.6% Acc@0.5, respectively) on the two datasets.
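The two-stage, bottom-up pipeline described above can be sketched in a few lines of code. Everything below is an illustrative toy, not the authors' implementation: the function names, the word-overlap "fusion," and the dictionary-based scene representation are all placeholders standing in for the learned network components.

```python
# Toy sketch of the bottom-up grounding pipeline (illustrative only).

def fuse_language_visual(words, scene):
    # Bottom-level fusion: score each object region by naive word overlap
    # with its (toy) semantic label, producing a coarse relevance heatmap.
    return [sum(1 for w in words if w in obj["label"]) for obj in scene]

def adaptive_search(heatmap, scene, threshold=1):
    # Keep only the regions whose heatmap response exceeds a threshold,
    # mimicking the heatmap-guided adaptive search.
    return [obj for h, obj in zip(heatmap, scene) if h >= threshold]

def object_level_matching(words, candidates):
    # Second visio-linguistic fusion: pick the candidate whose label
    # best matches the referring expression.
    return max(candidates, key=lambda obj: sum(w in obj["label"] for w in words))

# Toy scene: three partially scanned objects with labels and 2D boxes.
scene = [
    {"label": "brown chair near window", "box": (0, 0, 1, 1)},
    {"label": "white table",             "box": (2, 0, 3, 1)},
    {"label": "brown table near door",   "box": (4, 0, 5, 1)},
]
query = ["brown", "chair"]

heatmap = fuse_language_visual(query, scene)
candidates = adaptive_search(heatmap, scene)
referred = object_level_matching(query, candidates)
print(referred["box"])  # (0, 0, 1, 1)
```

In the actual method, the heatmap is predicted over points of the partial scan and the matching is done with learned features; the sketch only conveys the coarse-to-fine control flow.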


Figure 1. We present a novel task of 3D visual grounding in single-view RGB-D images
given a referring expression, and propose a bottom-up neural approach to address it.
Predicted bounding boxes of the referred objects are in green.

Video

Dataset

Download the SUNREFER_v2 dataset

The SUNREFER dataset is a large-scale referring-expression dataset dedicated to vision-language research on RGBD images. It contains 38,495 referring-expression annotations over 7,699 RGBD images. Below is an example from the SUNREFER dataset:
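A minimal loader for annotations of this kind might look as follows. The field names (`image_id`, `sentence`, `bbox`) are assumptions made for illustration, not the dataset's documented schema; consult the released files for the actual layout.

```python
import io
import json

# Hypothetical SUNREFER-style annotations: several referring expressions
# can describe the same object in the same RGBD image (cf. Figure 2).
sample_json = """
[
  {"image_id": "000001", "sentence": "the chair next to the window",
   "bbox": [0.5, 0.2, 1.0, 0.4, 0.6, 0.9]},
  {"image_id": "000001", "sentence": "the brown chair in the corner",
   "bbox": [0.5, 0.2, 1.0, 0.4, 0.6, 0.9]}
]
"""

annotations = json.load(io.StringIO(sample_json))

# Group the referring expressions by image for inspection.
by_image = {}
for ann in annotations:
    by_image.setdefault(ann["image_id"], []).append(ann["sentence"])

print(len(by_image["000001"]))  # 2
```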


Figure 2. An example of the SUNREFER dataset with five different language descriptions referring to the chair enclosed by the green bounding box.

Publication

Accepted by CVPR 2021
Paper: arXiv (pdf | abs) | Code: GitHub
If you find our work useful, please consider citing it:

        
        @inproceedings{liu2021refer,
          title={Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images},
          author={Liu, Haolin and Lin, Anran and Han, Xiaoguang and Yang, Lei and Yu, Yizhou and Cui, Shuguang},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          pages={6032--6041},
          year={2021}
        }