6D object pose recovery is an essential ingredient for many practical applications related to augmented reality/virtual reality (AR/VR), robot control and navigation, and scene understanding. Substantial progress has been made in the last decade, both with depth information from RGB-D sensors and with pose estimation from a single RGB image. However, these improved results have been reported for instance-level recognition, where the source data from which a classifier is learnt share the same statistical distribution as the target data on which the classifier is tested. Instance-based methods cannot easily be generalized to category-level tasks, which inherently involve challenges such as distribution shift between source and target domains, high intra-class variation, and shape discrepancies between objects.
In this project, we engineer a dedicated architecture that directly tackles the challenges of categories while estimating objects’ 6D pose. To this end, we utilize 3D skeleton structures, which are frequently used to handle shape discrepancies: we derive them as shape-invariant features and use those features as privileged information during the training phase of our architecture. We introduce the “Intrinsic Structure Adaptor (ISA)”, a part-based random forest architecture for full 6D object pose estimation at the level of categories in depth images. ISA works in the 6D space. It requires neither a segmented/cropped image nor 3D bounding box proposals. Instead of running sliding windows, ISA extracts parts from the input depth image, feeds them all down the forest, and directly votes for the 6D pose of objects. Its training phase is designed so that the challenges of categories can be tackled successfully. 3D skeleton structures are used to represent the parts extracted from the instances of a given category, and privileged learning is employed based on these parts. Graph matching is performed during the splitting processes of the random forest in such a way that the adaptation/generalization capability of the proposed architecture is improved across unseen instances. This is one-class learning: a single classifier is learnt for all instances of the given category.
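To make the idea of part-based voting more concrete, the following is a minimal sketch of how per-part votes could be combined into one 6D pose hypothesis. It is an illustration under stated assumptions, not the ISA procedure itself: the function name `aggregate_pose_votes`, the `bandwidth` parameter, the use of Euler angles for rotation, and the neighbour-count consensus step are all hypothetical choices made for this example. It assumes that each part reaching a leaf of the forest yields an offset to the object centre and a rotation vote.

```python
import numpy as np


def aggregate_pose_votes(part_centers, offsets, rotations, bandwidth=0.05):
    """Combine per-part votes into one 6D pose hypothesis (illustrative only).

    part_centers: (N, 3) 3D centres of the parts extracted from the depth image.
    offsets:      (N, 3) offsets from each part to the object centre, assumed
                  here to be what a forest leaf would predict.
    rotations:    (N, 3) rotation votes as Euler angles in radians (an
                  assumption of this sketch).
    bandwidth:    hypothetical radius within which two centre votes agree.
    """
    part_centers = np.asarray(part_centers, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    rotations = np.asarray(rotations, dtype=float)

    # Each part votes for the object centre: its own position plus its offset.
    center_votes = part_centers + offsets

    # Pairwise distances between votes; a vote's support is the number of
    # votes (itself included) that fall within the bandwidth.
    diffs = center_votes[:, None, :] - center_votes[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    support = (dists < bandwidth).sum(axis=1)

    # Keep the votes that agree with the best-supported one.
    best = int(np.argmax(support))
    inliers = dists[best] < bandwidth

    # Translation: mean of the agreeing centre votes.
    translation = center_votes[inliers].mean(axis=0)

    # Rotation: per-axis circular mean, which handles angle wrap-around.
    # Averaging Euler angles per axis is a simplification of this sketch.
    sin_mean = np.sin(rotations[inliers]).mean(axis=0)
    cos_mean = np.cos(rotations[inliers]).mean(axis=0)
    rotation = np.arctan2(sin_mean, cos_mean)

    return translation, rotation, int(inliers.sum())
```

Called with one row per extracted part, the function returns the estimated translation, the estimated rotation, and the number of parts that supported the winning hypothesis. Votes from parts that disagree with the consensus are simply left out of the averages, which is one simple way to keep a few bad parts from skewing the estimate.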