Loading...
RGB-DEPTH IMAGE SEGMENTATION AND OBJECT RECOGNITION FOR INDOOR SCENES
Deng, Zhuo
Deng, Zhuo
Citations
Altmetric:
Genre
Thesis/Dissertation
Date
2016
Advisor
Committee member
Group
Department
Computer and Information Science
Subject
Permanent link to this record
Collections
Research Projects
Organizational Units
Journal Issue
DOI
http://dx.doi.org/10.34944/dspace/1058
Abstract
With the advent of Microsoft Kinect, the landscape of various vision-related tasks has been changed. Firstly, using an active infrared structured light sensor, the Kinect can provide directly the depth information that is hard to infer from traditional RGB images. Secondly, RGB and depth information are generated synchronously and can be easily aligned, which makes their direct integration possible. In this thesis, I propose several algorithms or systems that focus on how to integrate depth information with traditional visual appearances for addressing different computer vision applications. Those applications cover both low level (image segmentation, class agnostic object proposals) and high level (object detection, semantic segmentation) computer vision tasks. To firstly understand whether and how depth information is helpful for improving computer vision performances, I start research on the image segmentation field, which is a fundamental problem and has been studied extensively in natural color images. We propose an unsupervised segmentation algorithm that is carefully crafted to balance the contribution of color and depth features in RGB-D images. The segmentation problem is then formulated as solving the Maximum Weight Independence Set (MWIS) problem. Given superpixels obtained from different layers of a hierarchical segmentation, the saliency of each superpixel is estimated based on balanced combination of features originating from depth, gray level intensity, and texture information. We evaluate the segmentation quality based on five standard measures on the commonly used NYU-v2 RGB-Depth dataset. A surprising message indicated from experiments is that unsupervised image segmentation of RGB-D images yields comparable results to supervised segmentation. In image segmentation, an image is partitioned into several groups of pixels (or super-pixels). We take one step further to investigate on the problem of assigning class labels to every pixel, i.e., semantic scene segmentation. We propose a novel image region labeling method which augments CRF formulation with hard mutual exclusion (mutex) constraints. This way our approach can make use of rich and accurate 3D geometric structure coming from Kinect in a principled manner. The final labeling result must satisfy all mutex constraints, which allows us to eliminate configurations that violate common sense physics laws like placing a floor above a night stand. Three classes of mutex constraints are proposed: global object co-occurrence constraint, relative height relationship constraint, and local support relationship constraint. Segments obtained from image segmentation can be either too fine or too coarse. A full object region not only conveys global features but also arguably enriches contextual features as confusing background is separated. We propose a novel unsupervised framework for automatically generating bottom up class independent object candidates for detection and recognition in cluttered indoor environments. Utilizing raw depth map, we propose a novel plane segmentation algorithm for dividing an indoor scene into predominant planar regions and non-planar regions. Based on this partition, we are able to effectively predict object locations and their spatial extensions. Our approach automatically generates object proposals considering five different aspects: Non-planar Regions (NPR), Planar Regions (PR), Detected Planes (DP), Merged Detected Planes (MDP) and Hierarchical Clustering (HC) of 3D point clouds. Object region proposals include both bounding boxes and instance segments. Although 2D computer vision tasks can roughly identify where objects are placed on image planes, their true locations and poses in the physical 3D world are difficult to determine due to multiple factors such as occlusions and the uncertainty arising from perspective projections. However, it is very natural for human beings to understand how far objects are from viewers, object poses and their full extents from still images. These kind of features are extremely desirable for many applications such as robotics navigation, grasp estimation, and Augmented Reality (AR) etc. In order to fill the gap, we addresses the problem of amodal perception of 3D object detection. The task is to not only find object localizations in the 3D world, but also estimate their physical sizes and poses, even if only parts of them are visible in the RGB-D image. Recent approaches have attempted to harness point cloud from depth channel to exploit 3D features directly in the 3D space and demonstrated the superiority over traditional 2D representation approaches. We revisit the amodal 3D detection problem by sticking to the 2D representation framework, and directly relate 2D visual appearance to 3D objects. We propose a novel 3D object detection system that simultaneously predicts objects' 3D locations, physical sizes, and orientations in indoor scenes.
Description
Citation
Citation to related work
Has part
ADA compliance
For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu