Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images


Haolin Liu1,2      Anran Lin1      Xiaoguang Han1,2,*      Lei Yang4      Yizhou Yu3,4      Shuguang Cui1,2     

1Shenzhen Research Institute of Big Data, CUHK-Shenzhen       2The Future Network of Intelligence Institute, CUHK-Shenzhen       3Deepwise AI Lab       4The University of Hong Kong

Corresponding Email:     haolinliu@link.cuhk.edu.cn    hanxiaoguang@cuhk.edu.cn


Introduction

We present a novel task of 3D visual grounding in single-view RGB-D images, where the referred objects are often only partially scanned. In contrast to previous works that directly generate object proposals for grounding in the 3D scene, we propose a bottom-up approach that gradually aggregates information, effectively addressing the challenge posed by partial scans. Our approach first fuses the language and visual features at the bottom level to generate a heatmap that coarsely localizes the relevant regions in the RGB-D image. It then performs an adaptive search guided by the heatmap and carries out object-level matching with a second visio-linguistic fusion to finally ground the referred object. We evaluate the proposed method against state-of-the-art methods on both the RGB-D images extracted from the ScanRefer dataset and our newly collected SUN-Refer dataset. Experiments show that our method outperforms previous methods by a large margin (11.2% and 15.6% Acc@0.5, respectively) on the two datasets.
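The two-stage, bottom-up pipeline described above can be sketched in a few lines of code. Everything below is an illustrative toy, not the authors' implementation: the function names, the word-overlap "fusion," and the dictionary-based scene representation are all placeholders standing in for the learned network components.

```python
# Toy sketch of the bottom-up grounding pipeline (illustrative only).

def fuse_language_visual(words, scene):
    # Bottom-level fusion: score each object region by naive word overlap
    # with its (toy) semantic label, producing a coarse relevance heatmap.
    return [sum(1 for w in words if w in obj["label"]) for obj in scene]

def adaptive_search(heatmap, scene, threshold=1):
    # Keep only the regions whose heatmap response exceeds a threshold,
    # mimicking the heatmap-guided adaptive search.
    return [obj for h, obj in zip(heatmap, scene) if h >= threshold]

def object_level_matching(words, candidates):
    # Second visio-linguistic fusion: pick the candidate whose label
    # best matches the referring expression.
    return max(candidates, key=lambda obj: sum(w in obj["label"] for w in words))

# Toy scene: three partially scanned objects with labels and 2D boxes.
scene = [
    {"label": "brown chair near window", "box": (0, 0, 1, 1)},
    {"label": "white table",             "box": (2, 0, 3, 1)},
    {"label": "brown table near door",   "box": (4, 0, 5, 1)},
]
query = ["brown", "chair"]

heatmap = fuse_language_visual(query, scene)
candidates = adaptive_search(heatmap, scene)
referred = object_level_matching(query, candidates)
print(referred["box"])  # (0, 0, 1, 1)
```

In the actual method, the heatmap is predicted over points of the partial scan and the matching is done with learned features; the sketch only conveys the coarse-to-fine control flow.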


Figure 1. We present a novel task of 3D visual grounding in single-view RGB-D images
given a referring expression, and propose a bottom-up neural approach to address it.
Predicted bounding boxes of the referred objects are in green.

Video

Dataset

Download the SUNREFER_v2 dataset

The SUNREFER dataset is a large-scale referring-expression dataset dedicated to vision-language research on RGBD images. It contains 38,495 referring-expression annotations over 7,699 RGBD images. Below is an example from the SUNREFER dataset:
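A minimal loader for annotations of this kind might look as follows. The field names (`image_id`, `sentence`, `bbox`) are assumptions made for illustration, not the dataset's documented schema; consult the released files for the actual layout.

```python
import io
import json

# Hypothetical SUNREFER-style annotations: several referring expressions
# can describe the same object in the same RGBD image (cf. Figure 2).
sample_json = """
[
  {"image_id": "000001", "sentence": "the chair next to the window",
   "bbox": [0.5, 0.2, 1.0, 0.4, 0.6, 0.9]},
  {"image_id": "000001", "sentence": "the brown chair in the corner",
   "bbox": [0.5, 0.2, 1.0, 0.4, 0.6, 0.9]}
]
"""

annotations = json.load(io.StringIO(sample_json))

# Group the referring expressions by image for inspection.
by_image = {}
for ann in annotations:
    by_image.setdefault(ann["image_id"], []).append(ann["sentence"])

print(len(by_image["000001"]))  # 2
```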


Figure 2. An example of the SUNREFER dataset with five different language descriptions referring to the chair enclosed by the green bounding box.

Publication

Accepted by CVPR 2021
Paper: arXiv (pdf | abs) | Code: GitHub
If you find our work useful, please consider citing it:

        
        @inproceedings{liu2021refer,
          title={Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images},
          author={Liu, Haolin and Lin, Anran and Han, Xiaoguang and Yang, Lei and Yu, Yizhou and Cui, Shuguang},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          pages={6032--6041},
          year={2021}
        }