Haolin Liu1,2 Anran Lin1 Xiaoguang Han1,2,* Lei Yang4 Yizhou Yu3,4 Shuguang Cui1,2
1Shenzhen Research Institute of Big Data, CUHK-Shenzhen 2The Future Network of Intelligence Institute, CUHK-Shenzhen 3Deepwise AI Lab 4The University of Hong Kong
Corresponding Email: haolinliu@link.cuhk.edu.cn hanxiaoguang@cuhk.edu.cn
We present a novel task of 3D visual grounding in single-view RGB-D images, where the referred objects are often only partially scanned. In contrast to previous works that directly generate object proposals for grounding in 3D scenes, we propose a bottom-up approach that gradually aggregates information, effectively addressing the challenge posed by partial scans. Our approach first fuses the language and visual features at the bottom level to generate a heatmap that coarsely localizes the relevant regions in the RGB-D image. It then performs an adaptive search based on the heatmap and carries out object-level matching with a second visio-linguistic fusion to ground the referred object. We evaluate the proposed method by comparing it with state-of-the-art methods on both the RGB-D images extracted from the ScanRefer dataset and our newly collected SUN-Refer dataset. Experiments show that our method outperforms previous methods by a large margin (by 11.2% and 15.6% Acc@0.5) on both datasets.
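For readers unfamiliar with the metric, Acc@0.5 is commonly computed as the fraction of predictions whose 3D bounding box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The following is a minimal sketch of that computation, not the released evaluation code. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and an inclusive threshold; the function names `iou_3d_axis_aligned` and `acc_at_iou` are hypothetical, and oriented boxes would need a different intersection routine.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Assumption: boxes are axis-aligned; oriented boxes are not handled here.
    """
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        inter *= max(0.0, hi - lo)  # zero overlap along any axis gives zero volume

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold.

    Assumption: one predicted box per referring expression, paired by index.
    """
    pairs = list(zip(pred_boxes, gt_boxes))
    if not pairs:
        return 0.0
    hits = sum(1 for p, g in pairs if iou_3d_axis_aligned(p, g) >= threshold)
    return hits / len(pairs)
```

With this sketch, a prediction that covers only one third of the union volume (IoU of about 0.33) counts as a miss at the 0.5 threshold, while a perfectly aligned box counts as a hit.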

The SUNREFER dataset is a large-scale dataset of referring expressions dedicated to vision-language research on RGBD images. It contains 38,495 referring-expression annotations on 7,699 RGBD images, an average of five per image. Below is an example from the SUNREFER dataset:

        
        @inproceedings{liu2021refer,
          title={Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images},
          author={Liu, Haolin and Lin, Anran and Han, Xiaoguang and Yang, Lei and Yu, Yizhou and Cui, Shuguang},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          pages={6032--6041},
          year={2021}
        }