
Exploration of 3D Images to Understand 3D Real World

Li, Peiyi
DOI
http://dx.doi.org/10.34944/dspace/3176
Abstract
Our world is composed of three-dimensional objects: each of us lives in a world with X, Y, and Z axes. Although we usually record that world by taking photographs, reducing it from three dimensions to two, the most natural and vivid way to understand the world and to interact with it is to sense it in 3D. Human beings sense the 3D world every day with a built-in stereo system, our two eyes; in other words, the raw data we use to recognize the real world carries depth information. A natural question follows: does it help to give machines a depth map of a scene when they interpret the 3D world with computer vision techniques? The answer is yes, and following this idea my research focuses on 3D topics in computer vision.

In the past it was very costly to acquire raw 3D data, but this has changed with the release of many 3D sensors in recent decades, which motivated me to choose research topics in this direction. 3D sensors are now used across industries. In gaming, inexpensive commercial indoor 3D sensors generate 3D point clouds of indoor environments; they supply depth information to traditional computer vision algorithms, enable state-of-the-art detection of the human body skeleton, and open new ways to interact with computers. In medicine, cone beam computed tomography (CBCT) gives doctors a volumetric view of the internal structure of soft and hard tissue; by extending pattern recognition algorithms from 2D to 3D, computer vision can supply 3D texture features that support diagnosis.

My research follows these two lines. In medical imaging, I examine 3D trabecular bone structures and use computer vision tools to detect very small density changes. In human-computer interaction, I study 3D point clouds to estimate human hand pose.

First, in medical imaging, I seek an algorithm that distinguishes bone texture patterns, a task that matters in clinical diagnosis because variations in trabecular bone texture are known to correlate with bone diseases such as osteoporosis. We propose a multi-feature multi-ROI (MFMR) approach for analyzing trabecular patterns inside the oral cavity using CBCT volumes. For each dental CBCT volume, a set of features, including fractal dimension, multi-fractal spectrum, and gradient-based features, is extracted from eight regions of interest (ROIs) to compensate for the low image quality of trabecular patterns. Generalized multi-kernel learning (GMKL) then fuses these features to distinguish trabecular patterns from different groups. To validate the proposed method, we apply it to trabecular patterns from different gender-age groups: on a dataset of dental CBCT volumes from 96 subjects divided into gender-age subgroups, the approach achieves a 96.1% average classification rate, greatly outperforming approaches without feature fusion.
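As a rough illustration of one feature family in the MFMR pipeline, the sketch below estimates a box-counting fractal dimension for a 3D binary region of interest using NumPy. The ROI, box sizes, and the random test volume are assumptions made only for illustration; the multi-fractal spectrum, gradient-based features, eight-ROI extraction, and GMKL fusion described above are not shown.

# Minimal sketch: box-counting estimate of the fractal dimension of a 3D
# trabecular ROI. Assumes `roi` is a binary NumPy volume (bone voxels = 1)
# already cropped from a CBCT scan; segmentation and the remaining MFMR
# features and fusion steps are out of scope here.
import numpy as np

def box_counting_dimension(roi, box_sizes=(2, 4, 8, 16)):
    """Estimate fractal dimension as the slope of log N(s) vs. log(1/s)."""
    counts = []
    for s in box_sizes:
        # Pad with zeros so the volume divides evenly into s x s x s boxes.
        pad = [(0, (-d) % s) for d in roi.shape]
        v = np.pad(roi, pad)
        # Count boxes that contain at least one bone voxel.
        blocks = v.reshape(v.shape[0] // s, s,
                           v.shape[1] // s, s,
                           v.shape[2] // s, s)
        counts.append(blocks.any(axis=(1, 3, 5)).sum())
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)),
                          np.log(np.array(counts)), 1)
    return slope

# Example: a random porous volume stands in for a real trabecular ROI.
roi = (np.random.rand(64, 64, 64) > 0.6).astype(np.uint8)
print("estimated fractal dimension:", box_counting_dimension(roi))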
Second, in human-computer interaction, the most natural way to interact is to point at things with your hand or to express an idea with a gesture. I therefore aim to estimate the locations of all skeletal joints of the hand in 3D space, which is the foundation of gesture understanding: reasoning over these joint locations yields the semantics behind a hand gesture. The task is thus to estimate hand pose in 3D space by locating all skeletal joints. I propose a real-time 3D hand pose estimation algorithm within the randomized decision forest framework; it takes a depth image as input and produces a set of skeletal joint locations as output. Previous decision-forest-based methods often label every point in the point cloud at a very early stage and then vote for joint locations. By contrast, this algorithm tracks a set of more flexible virtual landmark points, named segmentation index points (SIPs), before reaching the final decision at a leaf node. Roughly speaking, an SIP represents the centroid of a subset of skeletal joints, the joints that will be located at the leaves of the branch expanded from that SIP (a minimal sketch of this idea appears below). Inspired by a latent regression-forest-based hand pose estimation framework, we integrate SIPs into that framework with several important improvements. Experimental results on public benchmark datasets clearly show the advantage of the proposed algorithm over previous state-of-the-art methods, and it runs at 55.5 fps on an ordinary CPU without parallelism.

After this work on RGB-D (RGB plus depth) images, another issue arose: when we tried to turn these algorithms into applications, it proved difficult, because the majority of devices today carry only RGB cameras, and recent smart devices rarely include RGB-D cameras. Facing this dilemma, and unable to apply our algorithms in more general scenarios, I changed perspective and tried 3D reconstruction algorithms on ordinary RGB cameras, shifting our attention to human face analysis in RGB images. Detecting faces in photos is essential for intelligent applications, but it is far from sufficient for modern scenarios, many of which require accurate localization of facial landmarks. Face alignment (FA) is therefore critical for face analysis and has been studied extensively in recent years. For academia, work along this line is challenging when face images exhibit extreme poses, lighting, expressions, or occlusions; FA is also a fundamental component of all face analysis algorithms. For industry, once facial key point locations are available, many previously impractical applications become reachable, so a robust FA algorithm is in great demand. We developed our proposed convolutional neural network (CNN) in the deep learning framework Caffe on a GPU server with eight NVIDIA TITAN X GPUs. Once the CNN structure was finalized, thousands of human-labeled face images were used to train it on a GPU cluster of two nodes connected by InfiniBand, each node with four NVIDIA K40 GPUs. Our framework outperforms state-of-the-art deep learning algorithms.
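Relating to the hand pose estimation work above, the following is a minimal sketch of the segmentation index point (SIP) notion: each node of a binary topology over the skeletal joints carries an SIP, the centroid of the joint subset under that node, and the leaves are individual joints. The joint names, positions, and fixed topology below are hypothetical and chosen only for illustration; in the actual algorithm the regression forest predicts, from depth features, how each SIP moves and splits, which is not reproduced here.

# Minimal sketch of SIPs: centroids of joint subsets in a fixed binary
# topology whose leaves are single joints. Joint positions are hypothetical.
import numpy as np

joints = {                      # hypothetical 3D joint positions (x, y, z)
    "wrist":     (0.0, 0.0, 0.0),
    "thumb_tip": (-3.0, 4.0, 1.0),
    "index_tip": (-1.0, 7.0, 0.5),
    "pinky_tip": (3.0, 6.0, 0.0),
}

def sip(joint_names):
    """SIP of a subset of joints: the centroid of their 3D positions."""
    pts = np.array([joints[name] for name in joint_names])
    return pts.mean(axis=0)

def expand(node, depth=0):
    """Walk the fixed topology, printing the SIP tracked at each node."""
    names, children = node
    print("  " * depth, sorted(names), "SIP =", np.round(sip(names), 2))
    for child in children:
        expand(child, depth + 1)

# Hypothetical topology: all joints -> {wrist} vs. {fingertips} -> single joints.
topology = (set(joints), [
    ({"wrist"}, []),
    ({"thumb_tip", "index_tip", "pinky_tip"}, [
        ({"thumb_tip"}, []), ({"index_tip"}, []), ({"pinky_tip"}, [])]),
])
expand(topology)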