Abstract:Traditional three-dimensional (3D) dense captioning methods have problems such as insufficient consideration of point-cloud context information, loss of feature information, and thin hidden state information. Therefore, a multi-level context voting network is proposed. It uses the self-attention mechanism to capture the context information of point clouds in the voting process and utilizes it at multiple levels to improve the accuracy of object detection. Meanwhile, the temporal fusion of hidden state and attention module is designed to fuse the hidden state of the current moment with the attention result of the previous moment to enrich the information of the hidden state and thus improve the expressiveness of the model. In addition, a “two-stage” training method is adopted in the model, which can effectively filter out the generated low-quality object proposals and enhance the description effect. Extensive experiments on official datasets ScanNet and ScanRefer show that this method achieves more competitive results compared to baseline methods.