I didn't have much time to do a careful translation, so here's a rough summary of the main points~
Previous work of this kind has always handled the two tasks (scene completion and semantic labeling) separately, but the authors design an end-to-end 3D ConvNet, SSCNet, that performs semantic scene completion jointly, and found this works better. Input: a single depth image. Output: occupancy and semantic labels.
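As a rough illustration of that input/output contract, here is a minimal PyTorch sketch (not the paper's actual architecture; the grid size, channel widths, and `NUM_CLASSES` are placeholder assumptions) of a 3D ConvNet mapping an encoded depth volume to per-voxel class scores, where one class plays the role of "empty" so occupancy and semantics come out together:

```python
# A minimal sketch in the spirit of SSCNet's end-to-end 3D ConvNet;
# layer widths and NUM_CLASSES are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 12  # e.g. 11 object categories + 1 "empty" class

class TinySSC(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=7, stride=2, padding=3),  # encode the volume
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, NUM_CLASSES, kernel_size=1),  # per-voxel class scores
        )

    def forward(self, x):
        # x: (B, 1, D, H, W) volume encoded from a single depth image
        # (SSCNet uses a flipped-TSDF encoding; a raw occupancy grid is a stand-in here)
        return self.net(x)  # occupancy + semantics predicted jointly per voxel

vol = torch.randn(1, 1, 64, 32, 64)  # toy input volume
print(TinySSC()(vol).shape)          # torch.Size([1, 12, 32, 16, 32])
```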
How do we effectively capture contextual information from 3D volumetric data, where the signal is sparse and lacks high-frequency detail? *Solution:* a dilation-based 3D context module that enlarges the receptive field, letting the network capture context efficiently despite the sparse, low-frequency signal.
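A hedged sketch of such a dilation-based context module follows: stacking 3x3x3 convolutions with growing dilation rates (1, 2, 4 here, chosen for illustration rather than taken from the paper) expands the receptive field to 15^3 voxels while keeping resolution and parameter count fixed:

```python
# Sketch of a dilated 3D context block: each 3x3x3 conv with dilation d
# adds 2*d voxels to the receptive field span (1 -> 3 -> 7 -> 15).
import torch
import torch.nn as nn

class DilatedContext3D(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        layers = []
        for d in (1, 2, 4):  # illustrative dilation rates
            layers += [
                nn.Conv3d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

x = torch.randn(1, 32, 16, 16, 16)
print(DilatedContext3D()(x).shape)  # torch.Size([1, 32, 16, 16, 16]); resolution preserved
```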
RGB-D datasets only provide annotations on visible surfaces, so how do we obtain training data with complete volumetric annotations at the scene level? *Solution:* the SUNCG dataset (we can compute 3D scene volumes with dense object labels through voxelization).
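To make the voxelization idea concrete, here is a toy NumPy sketch (the grid size, voxel size, and "last write wins" label-conflict rule are my assumptions, not SUNCG's actual pipeline) that scatters labeled surface samples into a dense volumetric label grid:

```python
# Toy voxelization: scatter labeled points (as one might sample from SUNCG
# geometry) into a dense grid so every occupied voxel carries a class label.
import numpy as np

def voxelize_labels(points, labels, grid=(64, 32, 64), voxel_size=0.1):
    """points: (N, 3) float coords in meters; labels: (N,) int class ids."""
    vol = np.zeros(grid, dtype=np.int64)          # 0 = empty / unannotated
    idx = np.floor(points / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(grid)), axis=1)
    idx, lab = idx[keep], labels[keep]
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = lab    # dense volumetric annotation
    return vol

pts = np.random.rand(1000, 3) * [6.4, 3.2, 6.4]   # fake surface samples
vol = voxelize_labels(pts, np.random.randint(1, 12, 1000))
print(vol.shape, np.count_nonzero(vol))           # (64, 32, 64) and #occupied voxels
```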
Related algorithms are all built on a camera model and divide into monocular (single-view) and binocular (i.e. multi-view) settings; this paper works in the single-view setting.
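Since single-view methods rest on the camera model, a minimal pinhole back-projection shows how one depth image yields the 3D evidence everything else builds on (the intrinsics `fx`, `fy`, `cx`, `cy` below are placeholder values, not taken from the paper):

```python
# Pinhole back-projection: lift a metric depth map to camera-frame 3D points.
import numpy as np

def depth_to_points(depth, fx=518.8, fy=519.4, cx=325.6, cy=253.7):
    """depth: (H, W) metric depth map -> (H*W, 3) camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx   # inverts the pinhole model u = fx * x / z + cx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.ones((480, 640)) * 2.0    # fake flat scene at 2 m depth
print(depth_to_points(depth).shape)  # (307200, 3)
```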