Immersive Stereoscopic 3D System with Hand Tracking in Depth Sensor
Int. J. Fuzzy Log. Intell. Syst. 2018;18(2):146-153
Published online June 25, 2018
© 2018 Korean Institute of Intelligent Systems.

Sunjin Yu¹, Eungyeol Song², and Changyong Yoon³

¹School of Digital Media Engineering, Tongmyong University, Busan, Korea; ²Department of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea; ³Department of Electrical Engineering, Suwon Science College, Hwaseong, Korea
Correspondence to: Changyong Yoon
Received May 22, 2018; Revised June 1, 2018; Accepted June 12, 2018.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Interest in immersive teleconferencing has been increasing. Conventional 2D images cannot provide a stereoscopic effect, whereas stereoscopic 3D (S3D) images give viewers a more realistic impression. Hand tracking is also an active research area in human-computer interaction: by tracking and recognizing human hand movements, it enables a new kind of natural user interface. A depth sensor such as Kinect provides color and depth information at the same time. In this paper, we propose an immersive S3D system that generates stereoscopic images using depth image based rendering and applies 3D hand tracking and augmented reality.

Keywords: 3D hand tracking, Stereoscopic 3D, Depth sensor, Augmented reality
1. Introduction

With the development of information and communication technology (ICT), interest in communication through real-time video calls and video conferences is increasing. Technological advances have solved many problems with video calls, but limitations remain for immersive video calls. A realistic video call requires specific equipment or camera rigs for stereoscopic 3D image generation, together with calibration between them. To address this problem, stereoscopic image production using a depth sensor is being studied so that stereoscopic 3D technology and real-time viewing can be applied in video call applications.

Depth sensors [1, 2] such as Microsoft Kinect and time-of-flight (ToF) cameras generate depth maps in real time, which is expected to enable the real-time operation that was a problem for conventional stereoscopic image generation. A depth sensor (PrimeSensor, Kinect, ToF, etc.) that acquires 3D information in real time is applicable to real-time applications such as video calls and costs less than existing 3D acquisition devices. With the commercialization of ToF cameras and PrimeSensor, various applications based on 2D images are being extended to 3D. Hand recognition, previously based on 2D images, can now recognize the 3D motion of the hand and therefore the natural movement of the user. The development of 3D hand tracking has made it possible to recognize a variety of human motions.

The conventional 2D image based natural user interface [3–6] is sensitive to changes in ambient illumination, which degrades performance, and a 2D image alone is not sufficient to perceive the movement of a human body in 3D space. Therefore, fusion of 2D and 3D information is required [7–9]. Depth sensor based 3D hand tracking addresses these problems and provides the information needed to recognize 3D hand movements in real time. In this paper, we apply augmented reality to a stereoscopic 3D system with 3D hand tracking. The proposed system obtains a depth map from Microsoft Kinect to generate stereoscopic 3D images and uses depth based 3D tracking [10, 11]. Next, 3D objects are augmented on the hand tracked in 3D space. In summary, the proposed system is an augmented reality system that generates stereoscopic 3D images from depth images, tracks human hand movements in 3D space, creates 3D virtual objects on the hand, and displays stereoscopic 3D images to the user.

We use depth image based rendering (DIBR) to generate stereoscopic 3D images from depth information [12–14]. Depth image based 3D hand tracking is applied to estimate hand motion in 3D space. After a 3D virtual object is placed on the tracked hand and a depth map is obtained from the 3D information of that object, the virtual object is also displayed as a stereoscopic 3D image by applying the DIBR technique. The rest of this paper is organized as follows. Section 2 explains the depth image based technologies, and Section 3 describes the proposed stereoscopic 3D system with hand tracking. Section 4 discusses the experimental results, and Section 5 concludes.

2. Background

2.1 Depth Image Based Rendering

A depth sensor such as Kinect provides both a color map and a depth map. Because each pixel of the depth image encodes a distance, the depth image can be converted into 3D coordinates. In addition, colored 3D data can be generated by attaching the information obtained from the color image. The DIBR technique creates virtual viewpoints from the depth image and generates a stereoscopic image consisting of separate left-eye and right-eye images.

Stereoscopic camera setups can be divided into the toed-in setup and the shift-sensor setup [15]. The toed-in setup is similar to the structure of human eyes, but it introduces distortions such as depth non-linearity and depth plane curvature. In the shift-sensor setup, the left and right CCD sensors are placed in parallel and convergence is established by shifting the sensors, so the depth plane curvature of the toed-in method does not occur. In DIBR, assuming a shift-sensor camera setup, a simplified parallax shift formula is applied to reduce computational complexity in a real implementation [15, 16]. The non-linear formulation is as follows:


where w is screen width, and m is depth.

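The parallax $p_{pix}$ in units of pixels is expressed as follows:

$$p_{pix} = p \cdot \frac{N_{pix}}{w}$$

where $p$ is the parallax in the same length unit as the screen width $w$, and $N_{pix}$ is the number of horizontal pixels of the display.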

The linear approximation is expressed as follows:


2.2 3D Hand Tracking

An adaptive local binary pattern (ALBP) method is applied for hand tracking in depth images [10]. Conventional local binary pattern (LBP) is robust to rotation, but ALBP is robust to rotation and distance. In feature extraction for hand tracking, ALBP uses only depth information without color information. A texture T is first defined with respect to a pixel with local neighborhoods with radius size r as the joint distribution of the gray levels of I (I > 0) image pixels as shown below:

$$T = t(g_c, g_0, g_1, \ldots, g_{I-1})$$


where $g_i$ ($i = 0, 1, \ldots, I-1$) are the pixel intensities of the circular neighborhood around the center pixel, and $I$ is the number of circular neighborhood points of ALBP. Figure 1 shows an example of ALBP [10].

In this paper, the fast version of ALBP [10] is applied in order to perform real-time processing as shown below:


where $s_i$ denotes $s(g_i + g_c)$ and $p(x) = \begin{cases} 0, & x = 0 \\ 1, & \text{otherwise.} \end{cases}$
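To make the idea of a circular binary pattern on a depth image concrete, the following minimal Python sketch samples neighbors on a circle, thresholds them against the center pixel, and counts the 0/1 transitions around the circle, which does not change when the pattern is rotated. It is an illustration only and not the fast ALBP formula of [10]. The sign test s(g_i - g_c) follows the standard LBP convention, and the inverse-proportional radius rule together with the names `circular_codes`, `transitions`, and `adaptive_radius` are assumptions introduced for this example.

```python
import math


def s(x):
    """Standard LBP sign function: 1 for non-negative differences, else 0."""
    return 1 if x >= 0 else 0


def circular_codes(depth, cy, cx, radius, points=8):
    """Threshold `points` samples on a circle against the center depth."""
    height, width = len(depth), len(depth[0])
    g_c = depth[cy][cx]
    codes = []
    for i in range(points):
        angle = 2 * math.pi * i / points
        # Nearest-neighbor sampling, clamped to the image borders.
        y = min(max(round(cy + radius * math.sin(angle)), 0), height - 1)
        x = min(max(round(cx + radius * math.cos(angle)), 0), width - 1)
        codes.append(s(depth[y][x] - g_c))
    return codes


def transitions(codes):
    """Number of 0/1 changes around the circle (unchanged by rotation)."""
    return sum(codes[i] != codes[i - 1] for i in range(len(codes)))


def adaptive_radius(center_depth_cm, ref_radius_px=20, ref_depth_cm=100):
    """Hypothetical rule: shrink the radius as the center moves away."""
    return max(1, round(ref_radius_px * ref_depth_cm / center_depth_cm))
```

Scaling the radius with the depth of the center pixel keeps the sampled circle at roughly the same physical size, which is one way to obtain the robustness to distance mentioned above.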

3. System Overview

3.1 Pre-processing

In the proposed system, smoothing filters are applied to remove noise from the depth image. As the mask size increases, the filtering time becomes longer and the distortion of the information becomes larger. Based on experiments with various morphological operations, erosion, median filtering, and dilation are applied to remove noise from the image. Figure 2 shows the morphological operations used, together with the original and filtered depth maps.
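The filter chain can be sketched in a few lines of dependency-free Python. This is a minimal illustration, assuming a square k × k mask, border clamping, and the order erosion, median, dilation as listed in the text; the helper names are hypothetical, and a real-time system would more likely call an optimized library.

```python
def _window(img, y, x, k):
    """Collect the k x k neighborhood around (y, x), clamping at the borders."""
    height, width = len(img), len(img[0])
    r = k // 2
    return [img[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]


def _filter(img, k, reduce_fn):
    """Apply reduce_fn to the neighborhood of every pixel."""
    return [[reduce_fn(_window(img, y, x, k)) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img, k=3):
    return _filter(img, k, min)


def dilate(img, k=3):
    return _filter(img, k, max)


def median(img, k=3):
    return _filter(img, k, lambda values: sorted(values)[len(values) // 2])


def denoise_depth(img, k=3):
    """Erode, then median-filter, then dilate the depth map."""
    return dilate(median(erode(img, k), k), k)
```

Each pass visits k × k values per pixel, which is why a larger mask increases the filtering time as noted above.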

3.2 Object Segmentation

Depth information obtained from Kinect is used to separate the foreground from the background. The foreground is assumed to be a person, and all other objects are assumed to be background. In the proposed system, the original depth map is 640 × 480 pixels and consists of x and y coordinates with brightness values that encode depth. We first transform the depth map from x and y coordinates to x and z coordinates, where the z coordinate is the depth normalized to the range 0 to 255, so that the transformed image is 640 × 255 pixels. This process converts the viewpoint of the acquired image into a bird's-eye view seen from the top. Figure 3 shows the depth map converted to x and z coordinates.
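The coordinate change can be written as a short projection loop. The sketch below is one possible reading of the transformation, assuming an 8-bit depth map stored as rows of integers, with 0 marking pixels that have no measurement; the function name `to_bird_view` and the choice to count pixels per cell are assumptions.

```python
def to_bird_view(depth, z_levels=256):
    """Project an (x, y) depth map onto an (x, z) top-view image.

    Cell [z][x] of the result counts how many pixels in column x of the
    input have depth value z. Pixels with value 0 are treated as holes
    and skipped, so row 0 of the result stays empty.
    """
    width = len(depth[0])
    top_view = [[0] * width for _ in range(z_levels)]
    for row in depth:
        for x, z in enumerate(row):
            if z > 0:
                top_view[z][x] += 1
    return top_view
```

Thresholding the counts (for example, `count > 0`) gives a binary xz image that can be passed to the morphological and labeling steps.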

In the next step, a morphological operation is applied to remove noise and to group pixel clusters. Finally, the positions of the individual objects are obtained by labeling each component in the xz image with the connected component labeling method. Figure 4 shows the result of object segmentation by connected component labeling.
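Connected component labeling on the binary xz image can be sketched with a breadth-first flood fill. The version below is a minimal example that assumes 4-connectivity and a mask given as rows of truthy or falsy values; the paper does not state which connectivity or labeling algorithm was used.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected regions of truthy cells; returns (labels, count)."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for sy in range(height):
        for sx in range(width):
            if not mask[sy][sx] or labels[sy][sx]:
                continue
            # Start a new component and flood-fill it.
            count += 1
            labels[sy][sx] = count
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Label numbers follow scan order, so the same object can receive a different number from one frame to the next.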

Since the label numbers produced by connected component labeling change every frame, the system keeps track of the object selected in the first frame. In the first frame, a face detection algorithm [17] is used to identify which component corresponds to the human foreground object. Figure 5 shows the original depth map and the segmented result in the color map. Figure 6 shows the procedure of object segmentation.

4. Experimental Results

4.1 Depth Image Histogram Equalization

To maximize the 3D viewing effect of the object and to make depth differences within the object clearer, we perform a histogram equalization experiment on the depth image. To convert the 16-bit source depth information into 8-bit depth information, we normalize the 50–3,500 cm range from Kinect to 0–255 and then perform histogram equalization on the 8-bit depth information using an OpenCV function and MATLAB. Since the histogram of the depth image depends on the objects in view of the Kinect, we add an object to create a typical video call situation and apply depth histogram equalization to each of three different samples. Figure 7 shows examples of depth images and histogram equalization results for the original depth, OpenCV, and MATLAB, respectively. Figure 8 shows the change of pixel value with distance for the original depth, OpenCV, and MATLAB, respectively.

Both the OpenCV and MATLAB methods produced homogeneous histograms. However, because hole regions in the original depth image have the value 0, the histogram is heavily concentrated at zero, and the desired equalization result is not obtained. Therefore, histogram equalization is applied using only the remaining intensity values, excluding 0 and the range not actually output in the experiment.
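The normalization and the zero-excluding equalization can be illustrated with the following minimal Python sketch. It is not the OpenCV or MATLAB routine used in the experiment. The function names, the decision to map valid depth to 1–255 so that 0 stays reserved for holes, and the treatment of out-of-range values as holes are all assumptions made for the example.

```python
def normalize_depth(raw, near, far):
    """Map raw depth in [near, far] to 1..255; other values become 0 (hole)."""
    return [[0 if d < near or d > far
             else 1 + round((d - near) * 254 / (far - near))
             for d in row]
            for row in raw]


def equalize_nonzero(img, levels=256):
    """Histogram-equalize an 8-bit image while ignoring zero-valued pixels."""
    hist = [0] * levels
    for row in img:
        for v in row:
            if v > 0:
                hist[v] += 1
    total = sum(hist)
    if total == 0:
        return [row[:] for row in img]
    # Cumulative distribution over valid (nonzero) pixels only.
    cdf, running = [0] * levels, 0
    for v in range(levels):
        running += hist[v]
        cdf[v] = running
    cdf_min = next(c for c in cdf if c > 0)
    span = total - cdf_min

    def remap(v):
        if v == 0 or span == 0:
            return v
        return 1 + round((cdf[v] - cdf_min) * (levels - 2) / span)

    return [[remap(v) for v in row] for row in img]
```

For example, an image whose valid pixels take only the values 10, 20, and 30 in equal numbers is stretched to 1, 128, and 255, while hole pixels remain 0.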

4.2 Stereoscopic 3D with 3D Hand Tracking

We assume a shift-sensor camera setup and reduce the computational complexity by applying the parallax shift equation for real-time display. To check the stereoscopic 3D images in real time, the left and right (LR) images are combined by the interlaced method. In the proposed system, the depth range is limited to 50–200 cm in order to enhance the stereoscopic 3D effect of objects that are closer to the camera than the background. Restricting the depth range increases the variation of the values within that range in the depth image, which gives a more accurate 3D viewing effect on the object. However, the stereoscopic 3D effect disappears outside the depth range. Figure 9 shows the generated stereoscopic images in the color map.
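A simplified view of the rendering step is sketched below: each pixel is shifted horizontally by an amount derived from its depth, once per eye, and the two views are combined row by row. The linear depth-to-shift mapping, the parameters `max_shift_px` and `m_screen`, the convention that larger 8-bit depth values are nearer to the camera, and the even/odd row assignment are assumptions for illustration; they stand in for the parallax shift equation of the paper rather than reproduce it.

```python
def per_eye_shift(m, max_shift_px=4, m_screen=128):
    """Linear depth-to-shift map (assumed): m_screen lies on the screen plane."""
    return round((m - m_screen) * max_shift_px / 255)


def render_view(color, depth, direction, max_shift_px=4):
    """Warp one view by shifting pixels horizontally; direction is +1 or -1.

    Larger depth values are assumed to be nearer to the camera, so a nearer
    pixel overwrites a farther one that lands on the same target column.
    Targets that receive no pixel stay None (disocclusion holes).
    """
    height, width = len(color), len(color[0])
    out = [[None] * width for _ in range(height)]
    nearest = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            m = depth[y][x]
            nx = x + direction * per_eye_shift(m, max_shift_px)
            if 0 <= nx < width and m > nearest[y][nx]:
                out[y][nx] = color[y][x]
                nearest[y][nx] = m
    return out


def interlace(left, right):
    """Row interlacing: even rows from the left view, odd rows from the right."""
    return [left[y] if y % 2 == 0 else right[y] for y in range(len(left))]
```

Raising `max_shift_px` plays a role similar to widening the distance between the two virtual cameras, and the `None` entries mark where hole filling would be needed.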

When the DIBR technique is applied to the depth image obtained from the Kinect sensor, occlusion holes appear when the image is rendered at the left and right virtual viewpoints, similar to the holes in the depth map itself. To solve this problem, we apply the exemplar-based image in-painting algorithm [18] to the depth map. Exemplar-based in-painting assumes that the area surrounding a lost region contains enough texture information to fill it, and computes a priority that determines the order in which parts of the lost region are filled. The lost area is then recovered, starting from the highest-priority area, by finding a similar area in the surrounding region through template matching and copying it.

In the DIBR algorithm, a track bar is created to control the distance between the two virtual cameras (that is, the distance between the eyes), and a depth control experiment is performed. To maximize the 3D viewing effect of the object only, the range of depth values to be processed is limited to 50–180 cm. Limiting the range increases the variation of the values within that range of the depth image, so that a clearer view of the object is obtained. The disadvantage is that the 3D effect disappears outside the range, where the image may look awkward. Figure 10 shows an example of a stereoscopic image generated by the interlaced method.

4.3 3D Hand Tracking System with AR

The position of the hand is detected in the depth image when the user performs a focus gesture. The focus gesture, which can be defined in various ways, indicates the user's intention to start a gesture. In the experiment, the focus gesture is defined as the action of reaching toward the depth camera. Figure 11 shows an example of the focus gesture.
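One simple way to detect a reaching motion is to watch how quickly the nearest point in the scene approaches the camera. The sketch below is a hypothetical detector, not the one used in the experiment: it takes the per-frame distance of the nearest valid depth point in centimeters and reports the first frame at which that point has advanced by at least `push_cm` within the last `window` frames. Both thresholds are illustrative values.

```python
def detect_focus_gesture(nearest_cm, push_cm=15, window=10):
    """Return the index of the first frame that completes a reach, else None.

    nearest_cm: distance of the nearest valid depth point for each frame.
    A reach is declared when the current distance is at least push_cm
    smaller than the largest distance seen in the preceding window.
    """
    for i in range(len(nearest_cm)):
        start = max(0, i - window)
        if max(nearest_cm[start:i + 1]) - nearest_cm[i] >= push_cm:
            return i
    return None
```

The location of that nearest point in the triggering frame could then serve as the initial hand position for tracking.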

First, a search window is set using the position of the hand detected in the depth image of the previous frame, and the hand is tracked within this window. Next, the unscented Kalman filter [19–21] is applied to the consecutive tracking points. Noise in the depth image causes noise in the tracking points, and the unscented Kalman filter minimizes its effect on the tracked path. The system must also handle the cases where the hand is not in the search window and is not detected in another area. To do this, it checks whether the hand enters the current search window while operating in tracking mode for a predetermined time, and it simultaneously runs the detection mode so that the hand position can be re-initialized from another area.
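The smoothing effect of filtering the tracking points can be illustrated with a linear constant-velocity Kalman filter applied to one coordinate at a time. This is a deliberately simpler stand-in for the unscented Kalman filter of [19–21], chosen so that the example stays short and dependency-free. The class name `AxisKalman` and the noise parameters `q` and `r` are assumptions.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for a single coordinate (x, y, or z)."""

    def __init__(self, pos, q=1e-2, r=4.0):
        self.p, self.v = float(pos), 0.0       # position and velocity estimates
        self.P = [[1.0, 0.0], [0.0, 1.0]]      # state covariance
        self.q, self.r = q, r                  # process and measurement noise

    def step(self, z, dt=1.0):
        """Predict one frame ahead, then correct with the measured position z."""
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        a, b, c, d = (a + dt * (b + c) + dt * dt * d + self.q,
                      b + dt * d,
                      c + dt * d,
                      d + self.q)
        # Update: only the position is observed, so H = [1, 0].
        s = a + self.r
        k0, k1 = a / s, c / s
        residual = z - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.p
```

One filter instance per axis, for example `fx, fy, fz = AxisKalman(x0), AxisKalman(y0), AxisKalman(z0)`, yields a smoothed 3D hand path when `step` is called with each new tracking point.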

To place a virtual 3D object on the user's hand, we use OpenGL to create a cube-shaped object and generate a depth map and a texture for it. Since the object placed on the hand is virtual, its depth is set arbitrarily according to its size and position, and the cube is rotated by an arbitrary angle every frame so that the 3D image is generated according to the resulting depth change [22]. Figure 12 shows an example of hand tracking with AR.
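Once the virtual object has its own texture and depth map, it can be merged with the scene before the stereoscopic views are rendered. The following sketch shows one way to do this with a per-pixel depth test. It assumes that larger 8-bit depth values are nearer to the camera and that `None` in the object depth map marks transparent pixels; the function name `overlay_object` and these conventions are hypothetical.

```python
def overlay_object(color, depth, obj_color, obj_depth, top, left):
    """Paste an object into the scene wherever it is at least as near.

    Returns new (color, depth) maps; the inputs are left unchanged.
    """
    out_color = [row[:] for row in color]
    out_depth = [row[:] for row in depth]
    height, width = len(color), len(color[0])
    for dy, (color_row, depth_row) in enumerate(zip(obj_color, obj_depth)):
        for dx, (c, d) in enumerate(zip(color_row, depth_row)):
            y, x = top + dy, left + dx
            if d is None or not (0 <= y < height and 0 <= x < width):
                continue
            # Depth test: keep whichever surface is nearer to the camera.
            if d >= out_depth[y][x]:
                out_color[y][x], out_depth[y][x] = c, d
    return out_color, out_depth
```

Placing the object at the tracked hand position, with `top` and `left` taken from the tracker, gives a combined color and depth pair from which the left and right views can be generated.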

5. Conclusions

In this paper, we proposed an immersive S3D system based on a depth camera. The proposed system generates stereoscopic images from depth images and applies hand tracking at the same time. DIBR is applied to generate the stereoscopic images from depth images. Morphological operations are applied to the depth images as pre-processing, and object segmentation is performed using the XZ transformation and connected component labeling. To provide stereoscopic images efficiently, face detection is applied to color images to separate the foreground from the background. The proposed system is designed so that the stereoscopic effect can be controlled. The ALBP method is applied for hand tracking. The depth-based hand tracker starts tracking the position of the hand after the focus gesture. When the position of the hand is tracked, the virtual object is augmented on the hand, and the depth of the virtual object is designed to change according to the position of the hand. Experimental results show that stereoscopic image generation and hand tracking are possible at the same time.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.


This research was supported by the Tongmyong University Research Grants 2018 (No. 2018F017). It was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2017R1C1B5017751).


Fig. 1.

Example of ALBP.

Fig. 2.

Used morphology operation and original and filtered images in depth map.

Fig. 3.

Converted depth map by x and z coordinates.

Fig. 4.

Result of object segmentation by connected component labeling.

Fig. 5.

Original depth map and segmented result in color map.

Fig. 6.

Procedure of object segmentation.

Fig. 7.

Example of depth images and histogram equalization results by original depth, OpenCV and MATLAB, respectively.

Fig. 8.

Change of pixel value according to distance by original depth, OpenCV and MATLAB, respectively.

Fig. 9.

Generated stereoscopic images with maximum stereoscopic 3D effect.

Fig. 10.

Example of generated stereoscopic image by interlaced method.

Fig. 11.

Example of focus gesture.

Fig. 12.

Example of hand tracking with AR.

  1. Kolb, A, Barth, E, Koch, R, and Larsen, R (2009). Time-of-flight sensors in computer graphics. Proceedings of Eurographics 2009: State of the Art Report, Munich, Germany.
  2. Microsoft Kinect. Available
  3. Zhang, QY, Zhang, MY, and Hu, JQ (2008). A method of hand gesture segmentation and tracking with appearance based on probability model. Proceedings of the 2nd International Symposium on Intelligent Information Technology Application, Shanghai, China, pp. 380-383.
  4. Kim, HS, Kurillo, G, and Bajcsy, R (2008). Hand tracking and motion detection from the sequence of stereo color image frames. Proceedings of IEEE International Conference on Industrial Technology, Chengdu, China, pp. 1-6.
  5. Zhong, S, and Hao, F (2008). Hand tracking by particle filtering with elite particles mean shift. Proceedings of Japan-China Joint Workshop on Frontier of Computer Science and Technology, Nagasaki, Japan, pp. 163-167.
  6. Zhang, QY, Zhang, MY, and Hu, JQ (2009). Hand gesture contour tracking based on skin color probability and state estimation model. Journal of Multimedia. 4, 349-355.
  7. Kim, YH, Lee, H, and Kim, S (2012). 3D radar objects tracking and reflectivity profiling. International Journal of Fuzzy Logic and Intelligent Systems. 12, 263-269.
  8. Choi, H (2017). CNN output optimization for more balanced classification. International Journal of Fuzzy Logic and Intelligent Systems. 17, 98-106.
  9. Kim, B, Park, S, and Kim, E (2017). Hough transform-based road boundary localization. International Journal of Fuzzy Logic and Intelligent Systems. 17, 162-169.
  10. Kim, J, Yu, S, Kim, D, Toh, KA, and Lee, S (2017). An adaptive local binary pattern for 3D hand tracking. Pattern Recognition. 61, 139-152.
  11. Kim, J, and Yoon, C (2016). Three-dimensional head tracking using adaptive local binary pattern in depth images. International Journal of Fuzzy Logic and Intelligent Systems. 16, 131-139.
  12. Fehn, C (2004). Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV. Proceedings of SPIE: Stereoscopic Displays and Virtual Reality Systems XI. 5291, 93-104.
  13. Chen, W, Giger, ML, and Bick, U (2006). A fuzzy C-means (FCM)-based approach for computerized segmentation of breast lesions in dynamic contrast-enhanced MR images. Academic Radiology. 13, 63-72.
  14. Chan, T, and Shen, J (2002). Mathematical models for local nontexture inpaintings. SIAM Journal on Applied Mathematics. 62, 1019-1043.
  15. Fehn, C (2003). A 3D-TV approach using depth-image-based rendering (DIBR). Proceedings of the IASTED International Conference on Visualization, Imaging and Image Processing (VIIP), Benalmadena, Spain, pp. 482-487.
  16. ISO/IEC JTC1/SC29/WG11 (2007). Text of ISO/IEC FDIS 23002–3 Representation of Auxiliary Video and Supplemental Information. Document N8768.
  17. Viola, P, and Jones, M (2001). Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI.
  18. Criminisi, A, Perez, P, and Toyama, K (2003). Object removal by exemplar-based inpainting. Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI.
  19. Park, S, Yu, S, Kim, J, Kim, S, and Lee, S (2012). 3D hand tracking using Kalman filter in depth space. EURASIP Journal on Advances in Signal Processing. 2012.
  20. Wan, EA, and Van Der Merwe, R (2000). The unscented Kalman filter for nonlinear estimation. Proceedings of the IEEE Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control, Lake Louise, Canada, pp. 153-158.
  21. Julier, SJ, and Uhlmann, JK (1997). New extension of the Kalman filter to nonlinear systems. Proceedings of SPIE: Signal Processing, Sensor Fusion, and Target Recognition VI. 3068, 182-193.
  22. Kim, YB, Lee, HC, and Rhee, SY (2010). Augmented reality of robust tracking with realistic illumination. International Journal of Fuzzy Logic and Intelligent Systems. 10, 178-183.

Sunjin Yu received a B.S. degree from the Department of Electronics and Information Engineering of Korea University in 2003 and received his M.S. degree in Graduate Program in Biometrics and Ph.D. degree in Electrical and Electronic Engineering from Yonsei University in 2003 and 2011, respectively. He was a senior research engineer in LGE Advanced Research Institute from 2011 to 2012. From 2012 to 2013, he was a research professor in the Department of Electrical Engineering in Yonsei University and from 2013 to 2016, he was a professor in the Department of Broadcasting and Image, Cheju Halla University. He is currently a professor in the School of Digital Media Engineering, Tongmyong University. His research interests include 3D computer vision, human computer interaction, and augmented/virtual reality.


Eungyeol Song is a Ph.D. student in Electrical and Electronic Engineering at Yonsei University, Seoul, Korea. He received his B.S. degree in Computer Science from Dankook University in 2010 and his M.S. degree from the Department of Electronic and Electrical Engineering at Dankook University in 2012. His current research interests include hand tracking, 3D reconstruction, and machine learning.


Changyong Yoon received B.S., M.S., and Ph.D. degrees in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 1997, 1999, and 2010, respectively. He was a senior research engineer in LG Electronics, Inc., and LG-Nortel, and he developed system software for the DVR and WCDMA from 1999 to 2006. From 2010 to February 2012, he was a chief research engineer in LG Display and developed the circuit and algorithms in touch systems. Since 2012, he has been a Professor in the Department of Electrical Engineering, Suwon Science College, Gyeonggi-do, Korea. His main research interests include intelligent transportation systems, pattern recognition, robot vision, and fuzzy application systems.