Adaptive Mean Shift Based Face Tracking by Coupled Support Map

Yongwon Hwang^{1,2}, Mun-Ho Jeong^{3}, Sang-Rok Oh^{2}, and Changyong Yoon^{4}

Received May 19, 2017; Revised June 23, 2017; Accepted June 23, 2017.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

- Abstract
The mean-shift algorithm is a local search technique that uses the similarity between the color distribution of the target model and that of local candidate regions in the image. It is simple and stable and has been widely used for face tracking. However, a major problem of face tracking based on color distribution is its vulnerability to backgrounds with a similar color distribution, to occlusion, and to illumination changes. In this paper, we propose a Coupled Support Map (CSM) to resolve this problem and show the effectiveness of the resulting robust adaptive mean-shift (AMS) face tracking method. Through a series of experiments, we confirm the robustness of the proposed algorithm against changes in face size and sudden lighting changes.

**Keywords**: Mean shift, Face tracking, Occlusion, Depth images, Coupled Support Map

- 1. Introduction
Many studies have been conducted on effective tracking methods that use visual features in complex environments. Among them, object tracking is one of the most challenging problems because of its complexity. Object tracking detects the location and size of objects of interest in consecutive images, and it has many applications that require the ability to track moving objects, such as monitoring systems [1–4], intelligent user interfaces [5], intelligent sites [6], and real-time processing applications such as video compression [7].

In object tracking, the major challenge is to recognize a specific image region or object robustly across various environments. Another important aspect is to reduce computational processing time. Mean-shift algorithms have recently attracted attention as a method for real-time tracking of non-rigid bodies based on visual features such as color or texture.

The mean-shift algorithm is a local search technique that uses the similarity of the color distribution between the target model and local candidate regions in the image. It has proved superior to other methods in simplicity and stability and has been widely used for real-time object tracking.

However, when applied to face tracking using color distribution, it is vulnerable to objects or backgrounds whose color distribution is similar to that of the target model. In addition, tracking fails if the color distribution of the face changes, for example because of a change in illumination. Lee et al. [8] proposed a mean-shift-based face tracking technique that uses distance information to discriminate efficiently between faces and backgrounds and color information to discriminate between objects. It is robust to similarly colored backgrounds and to occlusion but does not address how to cope with lighting changes.

In this paper, building on the method of [8], we revise the Coupled Support Map (CSM) to make the tracking algorithm more robust to lighting conditions. The CSM combines a Depth Support Map (DSM), which uses distance information, with a Likelihood Support Map, which is based on color histograms and radial distance. Updating the color histograms and the bandwidth in the tracking area with the CSM enables robust face tracking even during sudden lighting changes.

The next section briefly reviews the existing mean-shift technique, and Section 3 describes the DSM and the Likelihood Support Map. Section 4 presents a modified mean-shift scheme based on the support maps. Section 5 shows the results of applying it to face tracking, and Section 6 concludes the paper.

- 2. Mean-Shift Tracking
The mean-shift algorithm finds the position that maximizes the similarity of the color histogram between the target model and the candidate model, and is robust to the rotation and shape change of the object. For this reason, it is often used for face tracking.

### 2.1 Representation of Color Models

The color model is represented using a kernel profile. In order to increase the reliability of the color distribution when the boundary pixels belong to or overlap the background, the kernel profile is a function that assigns a small weight to the pixels located far from the center of the region. When the center position is (0, 0) and the radius of the kernel profile is 1, the target model is expressed as

$$q_u = C\sum_{i=1}^{N} k\left(\Vert \mathbf{x}_i^{*}\Vert^{2}\right)\delta[b(\mathbf{x}_i^{*})-u], \tag{1}$$

where *N* is the number of pixels belonging to the face region, *k*(*x*) is the kernel function, *δ* is the Kronecker delta function, and $b(\mathbf{x}_i^{*})$ is the color value at the normalized position $\mathbf{x}_i^{*}$ of pixel *i*. The normalization constant *C* is determined from the condition $\sum_{u=1}^{m} q_u = 1$. In the same way, the color distribution of the candidate region is given by

$$\begin{array}{l}\widehat{p}_u(\mathbf{y}) = C_h\sum_{i=1}^{N_{\mathbf{H}}} k\left((\mathbf{y}-\mathbf{x}_i)^{\text{T}}\mathbf{H}^{-1}(\mathbf{y}-\mathbf{x}_i)\right)\delta[b(\mathbf{x}_i)-u],\\ \mathbf{H}=\left[\begin{array}{cc}h_x^2 & 0\\ 0 & h_y^2\end{array}\right],\end{array} \tag{2}$$

where $h_x$ and $h_y$ are the bandwidths of the kernel. The color value *u* is an 8-bit value consisting of 3 bits each of red (R) and green (G) and 2 bits of blue (B), as shown in Figure 1.

### 2.2 Mean-Shift

Object tracking with the standard mean-shift method uses the Epanechnikov kernel, which assigns a smaller weight as the distance from the center $\mathbf{y}$ of the face area increases [9]. The position of the object is then estimated by iterating

$$\mathbf{y}_{t+1}=\frac{\sum_{i=1}^{N_{\mathbf{H}}}\mathbf{x}_i w_i\, g\left((\mathbf{y}_t-\mathbf{x}_i)^{\text{T}}\mathbf{H}^{-1}(\mathbf{y}_t-\mathbf{x}_i)\right)}{\sum_{i=1}^{N_{\mathbf{H}}} w_i\, g\left((\mathbf{y}_t-\mathbf{x}_i)^{\text{T}}\mathbf{H}^{-1}(\mathbf{y}_t-\mathbf{x}_i)\right)}. \tag{3}$$

Here, $N_{\mathbf{H}}$ is the number of pixels belonging to the face region at time *t*, and *g*(*x*) is the derivative of the Epanechnikov kernel. $w_i$ is the weight of pixel *i*, given by

$$w_i=\sum_{u=1}^{m}\sqrt{\frac{\widehat{q}_u}{\widehat{p}_u(\widehat{\mathbf{y}}_0)}}\,\delta[b(\widehat{\mathbf{x}}_i)-u]. \tag{4}$$
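As a rough illustration of Eqs. (1)–(4), the following pure-Python sketch builds a kernel-weighted color histogram and performs one mean-shift update. It is a minimal sketch under stated assumptions, not the authors' implementation: the bit layout in `color_bin` (red in the high bits), all function names, and the representation of pixels as `((x, y), (r, g, b))` tuples are hypothetical. The Epanechnikov profile is used up to a constant factor, so *g* reduces to an indicator of the kernel support.

```python
from math import sqrt

M_BINS = 256  # 8-bit color index: 3 bits R, 3 bits G, 2 bits B


def color_bin(r, g, b):
    """Quantize 8-bit R, G, B values into one 3-3-2 index (bit order assumed)."""
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)


def epanechnikov_profile(x):
    """Kernel profile k(x) up to a constant; zero outside the unit radius."""
    return 1.0 - x if x <= 1.0 else 0.0


def normalized_sq_dist(pos, center, hx, hy):
    """(y - x)^T H^{-1} (y - x) with H = diag(hx^2, hy^2)."""
    dx = (pos[0] - center[0]) / hx
    dy = (pos[1] - center[1]) / hy
    return dx * dx + dy * dy


def color_histogram(pixels, center, hx, hy):
    """Kernel-weighted, normalized color histogram (Eqs. 1 and 2)."""
    hist = [0.0] * M_BINS
    for pos, rgb in pixels:
        weight = epanechnikov_profile(normalized_sq_dist(pos, center, hx, hy))
        hist[color_bin(*rgb)] += weight
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def pixel_weight(rgb, q, p):
    """Weight w_i of Eq. (4): sqrt(q_u / p_u) for the pixel's own color bin."""
    u = color_bin(*rgb)
    return sqrt(q[u] / p[u]) if p[u] > 0 else 0.0


def mean_shift_step(pixels, center, hx, hy, q):
    """One position update of Eq. (3) for target histogram q."""
    p = color_histogram(pixels, center, hx, hy)
    num_x = num_y = den = 0.0
    for pos, rgb in pixels:
        if normalized_sq_dist(pos, center, hx, hy) > 1.0:
            continue  # g = 0 outside the kernel support
        w = pixel_weight(rgb, q, p)
        num_x += w * pos[0]
        num_y += w * pos[1]
        den += w
    return (num_x / den, num_y / den) if den > 0 else center
```

In practice `mean_shift_step` would be repeated until the center moves by less than a small threshold; the stopping rule is left to the caller in this sketch.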

- 3. Support Map
Conventional mean-shift algorithms require the color distribution of the tracked object to stay constant, which limits their use in environments with changing lighting. In a dynamic environment where the illumination changes, the similarity between the target model and the candidate model falls outside the allowable range, which increases the likelihood of tracking failure.

Tracking may also fail because of backgrounds or other objects whose color distributions are similar to the target model. To address these problems, we devise a DSM that uses distance information and a Likelihood Support Map that is based on color histograms and radial distance. The mean-shift algorithm that uses them is introduced in Section 4.

### 3.1 Depth Support Map

Face tracking is more likely to fail when an object or background with a color distribution similar to that of the candidate model lies near it, or when the face is covered by another object. The DSM is introduced to solve this problem. It helps track the face more robustly, particularly when objects and backgrounds with similar color distributions are present.

Let $\mathbf{x}_i$ be the *i*-th pixel among the pixels constituting the face image in one frame. The DSM value $s_d(\mathbf{x}_i)$ for $\mathbf{x}_i$ in the current frame is defined as follows:

$$s_d(\mathbf{x}_i)=\begin{cases}-1 & \text{if } \overline{d}-d(\mathbf{x}_i)<-3\sigma,\\ 1 & \text{if } \mid\overline{d}-d(\mathbf{x}_i)\mid\le 3\sigma,\\ 0 & \text{if } \overline{d}-d(\mathbf{x}_i)>3\sigma.\end{cases} \tag{5}$$

Here, $\overline{d}$ is the average of the distance values of the face area determined at time *t* − 1, i.e., the face area in the previous frame, and $d(\mathbf{x}_i)$ is the distance value at time *t*, i.e., at position $\mathbf{x}_i$ of the candidate model in the current frame. The threshold $D_t$ is usually set to at most three times the standard deviation $\sigma$ of the distribution of distance values; Eq. (5) uses $3\sigma$. The value 1 ($D_{face}$) means that pixel $\mathbf{x}_i$ lies at a distance similar to the face region, the value −1 ($D_{far}$) distinguishes the background from the face, and the value 0 ($D_{close}$) distinguishes an object that covers the face.

The DSM is effective for distinguishing faces by distance information, but it is weak against occluding objects at the same distance as the face. When such an object occludes the face, the center position and size of the face region change under conventional mean-shift face tracking. This can cause problems in tasks performed on the tracked area, such as face recognition and face direction estimation. For example, if the area around the mouth is covered and the center position and size of the face area change as a result, treating the tracked area as the entire face area may lead to errors in face recognition or face direction calculation.
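The thresholding in the DSM definition can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, depth values are assumed to be in consistent units such as meters, and estimating $\sigma$ as the population standard deviation of the previous frame's face depths is an assumption, since the text does not state how $\sigma$ is obtained.

```python
from statistics import mean, pstdev


def depth_support(d, d_mean, sigma):
    """Depth Support Map value s_d for one pixel.

    d: depth at the pixel in the current frame.
    d_mean: mean depth of the face area in the previous frame.
    sigma: standard deviation of the face depth values.
    """
    diff = d_mean - d
    if diff < -3.0 * sigma:
        return -1  # farther than the face: background
    if diff > 3.0 * sigma:
        return 0   # closer than the face: occluding object
    return 1       # at a distance similar to the face


def depth_support_map(depths, prev_face_depths):
    """Apply the DSM rule to every depth value in the tracking window."""
    d_mean = mean(prev_face_depths)
    sigma = pstdev(prev_face_depths)  # assumed estimator for sigma
    return [depth_support(d, d_mean, sigma) for d in depths]
```

For example, with previous face depths around 1.0, a pixel at depth 2.5 is labeled −1 (background) and a pixel at depth 0.4 is labeled 0 (occluder).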

### 3.2 Likelihood Support Map

The tracking area of mean-shift can be expressed as an ellipse defined by the bandwidths, as shown in Eq. (6):

$$d(\mathbf{x}_i)=\frac{(x_i-c_x)^2}{h_x^2}+\frac{(y_i-c_y)^2}{h_y^2}\le 1,\quad (x_i,y_i)\in W_{c_x,c_y}, \tag{6}$$

where $\mathbf{x}_i=(x_i,y_i)$ is the coordinate of pixel *i* and $(c_x,c_y)$ is the center of the tracking area $W_{c_x,c_y}$. Within this ellipse, the probability that a pixel belongs to the real face region increases as the pixel gets closer to the center, and the probability that it does not belong to the face increases as the pixel moves away from the center. From this fact, we define the probability based on radial distance as follows:

$$P_r(\mathbf{x}_i\mid A=face)=1-d(\mathbf{x}_i), \tag{7}$$

$$P_r(\mathbf{x}_i\mid A=bkg)=d(\mathbf{x}_i). \tag{8}$$

Here, *A* is a random variable representing the region to which a pixel belongs. In addition, the color histograms of the face area and the background area obtained from the previous frame give the probability of a pixel for each area:

$$P_h(\mathbf{x}_i\mid A=face)=q_{face}(u(\mathbf{x}_i)), \tag{9}$$

$$P_h(\mathbf{x}_i\mid A=bkg)=q_{bkg}(u(\mathbf{x}_i)). \tag{10}$$

Here, $u(\mathbf{x}_i)$ is the color value at $\mathbf{x}_i$. From Eqs. (7)–(10), the likelihoods that the pixel belongs to the face area and to the background area are

$$P(\mathbf{x}_i\mid A=face)=P_r(\mathbf{x}_i\mid A=face)\,P_h(\mathbf{x}_i\mid A=face), \tag{11}$$

$$P(\mathbf{x}_i\mid A=bkg)=P_r(\mathbf{x}_i\mid A=bkg)\,P_h(\mathbf{x}_i\mid A=bkg), \tag{12}$$

respectively. The Likelihood Support Map can then be defined from the color histograms and the ellipse equation as follows:

$$s_l(\mathbf{x}_i)=\begin{cases}1 & \text{if } P(\mathbf{x}_i\mid A=face)>P(\mathbf{x}_i\mid A=bkg),\\ 0 & \text{otherwise.}\end{cases} \tag{13}$$

A pixel can have a significantly low probability value in both color histograms of the previous frame because of a sudden illumination change. To cope with such illumination changes, the probability value from the color histograms is ignored for stable tracking. In other words,

$$P_h(\mathbf{x}_i\mid A=face)=P_h(\mathbf{x}_i\mid A=bkg)=1, \tag{14}$$

if $\frac{q_{face}(u(\mathbf{x}_i))}{p_h(u(\mathbf{x}_i))}<0.5$ and $\frac{q_{bkg}(u(\mathbf{x}_i))}{p_h(u(\mathbf{x}_i))}<0.5$.
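The decision rule of the Likelihood Support Map can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, histograms are plain lists indexed by color bin, and `p_ref` stands for the histogram $p_h$ in the ratio test, whose exact definition is not given in the text. Clamping the radial distance to 1 for pixels outside the ellipse is also an assumption made to keep the probabilities in [0, 1].

```python
def radial_distance(pos, center, hx, hy):
    """Normalized elliptical distance d(x_i) from the tracking center."""
    dx = (pos[0] - center[0]) / hx
    dy = (pos[1] - center[1]) / hy
    return dx * dx + dy * dy


def likelihood_support(pos, u, center, hx, hy, q_face, q_bkg, p_ref):
    """Likelihood Support Map value s_l for one pixel with color bin u."""
    d = min(radial_distance(pos, center, hx, hy), 1.0)  # clamp (assumed)
    pr_face, pr_bkg = 1.0 - d, d           # radial probabilities
    ph_face, ph_bkg = q_face[u], q_bkg[u]  # histogram probabilities
    # If neither histogram explains the color (sudden illumination
    # change), ignore color and decide on radial distance alone.
    if p_ref[u] > 0 and ph_face / p_ref[u] < 0.5 and ph_bkg / p_ref[u] < 0.5:
        ph_face = ph_bkg = 1.0
    return 1 if pr_face * ph_face > pr_bkg * ph_bkg else 0
```

With the color term neutralized, a pixel near the center is still labeled as face and a pixel near the ellipse boundary as background, which is what keeps the map usable right after a lighting change.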

- 4. Support Map Based Mean-Shift Algorithm
### 4.1 Coupled Support Map

The CSM is defined as in Eq. (15) from the two support maps described in the previous section. With it, we redefine the pixel weights of Eq. (4) as in Eq. (16):

$$C=\{c_i\mid c_i=s_d(\mathbf{x}_i)\times s_l(\mathbf{x}_i),\ i=1,\dots,N_{\mathbf{H}}\}, \tag{15}$$

$$w_i^{\prime}=\begin{cases}w_i & \text{if } c_i=1,\\ 1 & \text{if } c_i=0,\\ 0 & \text{if } c_i=-1.\end{cases} \tag{16}$$

If $c_i=1$, the pixel belongs to the face area, so the original weight is used as it is. If $c_i=0$, the pixel belongs to the face area but is occluded, and $w_i^{\prime}$ is set to 1 to prevent the position estimation error that this would cause. If the pixel has a color distribution similar to the target model but belongs to the background, then $c_i=-1$, and $w_i^{\prime}=0$ removes its effect from the mean-shift calculation. In the update of the target model described in the next section, only pixels with $c_i=1$ are used; occluded pixels are not considered even though they belong to the face area.

### 4.2 Update of Target Model

As the target model for the mean-shift calculation, the color histogram of the face region must be updated to cope with illumination changes. The problem with such an adaptive target model is that accumulated errors easily produce a target model that drifts away from the current face region. This problem can be solved by using the CSM to exclude pixels in the non-face region, as in Eq. (17). The color histogram of the non-face area in the tracking area is also updated every frame, as in Eq. (18), to calculate the Likelihood Support Map.

$$q_{face}(u)=C\sum_{i=1}^{N_{\mathbf{H}}}(1-d(\mathbf{x}_i))\,\delta[b(\mathbf{x}_i)-u]\,\delta(c_i-1), \tag{17}$$

$$q_{bkg}(u)=C\sum_{i=1}^{N_{\mathbf{H}}}(1-d(\mathbf{x}_i))\,\delta[b(\mathbf{x}_i)-u]\cdot\delta[\delta(c_i-1)]. \tag{18}$$

Updating the bandwidth is essential for updating the target and candidate models [10]. The CSM is applied to the maximum-likelihood-based bandwidth calculation as follows:

$$\mathbf{H}_{t+1}=\frac{\sum_{i=1}^{N_{\mathbf{H}}}\max[\delta(c_i-1),\delta(c_i)]\,(\mathbf{x}_{t+1}-\mathbf{x}_i)(\mathbf{x}_{t+1}-\mathbf{x}_i)^{\text{T}}}{\sum_{i=1}^{N_{\mathbf{H}}}\max[\delta(c_i-1),\delta(c_i)]}. \tag{19}$$

Pixels with $c_i=1$ or $c_i=0$ are included in the bandwidth calculation because they correspond to the face region.

### 4.3 Support Map Based Mean-Shift Algorithm

The initialization of the support map based mean-shift algorithm relies on face detection [11]. That is, from the result of face detection, the tracking area is divided into the face area and the non-face area as shown in Figure 2, and the color histogram of each is calculated. The subsequent process is as follows:

① Create the candidate model in the tracking region

② Create the Coupled Support Map

③ Mean-shift calculation

$$\mathbf{y}_{t+1}=\frac{\sum_{i=1}^{N_{\mathbf{H}}}\mathbf{x}_i w_i^{\prime}\, g\left((\mathbf{y}_t-\mathbf{x}_i)^{\text{T}}\mathbf{H}^{-1}(\mathbf{y}_t-\mathbf{x}_i)\right)}{\sum_{i=1}^{N_{\mathbf{H}}} w_i^{\prime}\, g\left((\mathbf{y}_t-\mathbf{x}_i)^{\text{T}}\mathbf{H}^{-1}(\mathbf{y}_t-\mathbf{x}_i)\right)} \tag{20}$$

④ Update the color histograms of the face area and the non-face area (Eqs. (17) and (18))

⑤ Update the bandwidth (Eq. (19))

⑥ Go to ①
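Steps ② to ④ above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and list-based data layout are hypothetical, the per-pixel weights $w_i$ and the two support maps are assumed to be computed beforehand, and the Epanechnikov *g* is treated as an indicator of the kernel support.

```python
def coupled_support(s_d, s_l):
    """CSM value c_i as the product of the depth and likelihood maps."""
    return s_d * s_l


def modified_weight(w, c):
    """Modified pixel weight w'_i selected by the CSM value c."""
    if c == 1:
        return w    # visible face pixel: keep the original weight
    if c == 0:
        return 1.0  # occluded face pixel: neutral weight
    return 0.0      # c == -1, face-colored background: suppressed


def csm_mean_shift_step(positions, weights, s_d, s_l, center, hx, hy):
    """One position update using the CSM-modified weights."""
    num_x = num_y = den = 0.0
    for (x, y), w, sd, sl in zip(positions, weights, s_d, s_l):
        dx, dy = (x - center[0]) / hx, (y - center[1]) / hy
        if dx * dx + dy * dy > 1.0:
            continue  # g = 0 outside the kernel support
        wp = modified_weight(w, coupled_support(sd, sl))
        num_x += wp * x
        num_y += wp * y
        den += wp
    return (num_x / den, num_y / den) if den > 0 else center


def update_histograms(positions, bins, csm, center, hx, hy, n_bins):
    """Face (c_i = 1) and non-face (c_i != 1) histograms, weighted by 1 - d."""
    q_face, q_bkg = [0.0] * n_bins, [0.0] * n_bins
    for (x, y), u, c in zip(positions, bins, csm):
        dx, dy = (x - center[0]) / hx, (y - center[1]) / hy
        k = max(0.0, 1.0 - (dx * dx + dy * dy))
        (q_face if c == 1 else q_bkg)[u] += k
    for q in (q_face, q_bkg):
        total = sum(q)
        if total > 0:
            q[:] = [v / total for v in q]  # normalize to sum to 1
    return q_face, q_bkg
```

Keeping the weight selection in its own small function makes the three cases of the CSM easy to check in isolation before they are combined with the position update.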

- 5. Experimental Results
The proposed algorithm was implemented on a laptop PC with an Intel Core i5 750 CPU @ 2.67 GHz and an ASUS Xtion PRO LIVE sensor. The real-time input images were 640 × 480 at 30 frames per second. The first row of Figure 3 shows two cases with similar color but different distance information and one case with similar distance information but a different color distribution. The second row shows the Coupled Support Map for each case. White corresponds to 1, gray to 0, and black to −1. The Coupled Support Map separates the pixels that belong to the face area from those of occluding objects and the background.

In color-based face tracking, mean-shift finds the next location from the color distributions of the target and candidate models. Because objects are tracked only through color distribution, there is a risk that an object with a similar color distribution enters the tracking area or that the target is determined incorrectly under dynamically changing lighting conditions.

Figure 4 shows that the Coupled Support Map-based mean-shift tracker is robust to illumination changes. The first row in Figure 4 shows the face moving under rapidly changing illumination, and the second row shows the face tracking results. As the distance between the camera and the face changes, the bandwidth of the tracker is adjusted appropriately to the size of the face. Pixels with value 1 in the Coupled Support Map are shown in white.
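The bandwidth adaptation described above follows Eq. (19). A rough Python sketch is given below under two assumptions that the text does not confirm: the reference point in Eq. (19) is taken to be the updated tracking center, and the result is returned as a full 2 × 2 matrix without converting it back to $h_x$ and $h_y$. The function name and data layout are hypothetical.

```python
def update_bandwidth(positions, csm, center):
    """Bandwidth matrix H_{t+1} as a 2x2 nested list.

    Only pixels with c_i in {0, 1} (face pixels, visible or occluded)
    contribute; pixels with c_i = -1 are left out.
    """
    sxx = sxy = syy = 0.0
    count = 0
    for (x, y), c in zip(positions, csm):
        if c not in (0, 1):
            continue  # max[delta(c - 1), delta(c)] = 0
        dx, dy = center[0] - x, center[1] - y
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
        count += 1
    if count == 0:
        return None  # no face pixels: caller keeps the previous bandwidth
    return [[sxx / count, sxy / count], [sxy / count, syy / count]]
```

Because background pixels with $c_i=-1$ are skipped, a distant face-colored region does not inflate the estimated spread of the face.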

- 6. Conclusion
In this paper, we proposed a robust adaptive mean-shift tracking method based on the Coupled Support Map for situations in which the face size and the illumination change. Existing skin-color-based face tracking methods are difficult to apply in environments where the skin color model differs from the conditions under which it was learned, and they cannot track the face when the illumination changes during tracking.

An adaptive skin color model can cope with ordinary illumination changes to some degree, but it suffers from error accumulation over time and has difficulty with extreme illumination changes.

The proposed method resembles previous studies in adopting an adaptive color model. However, it divides the face tracking area into the face area and the non-face area and secures robustness through the Coupled Support Map, which uses both color information and distance information.

- Conflict of Interest
No potential conflict of interest relevant to this article was reported.

- References
- [1] Cui, Y, Samarasekera, S, Huang, Q, and Greiffenhagen, M (1998). Indoor monitoring via the collaboration between a peripheral sensor and a foveal sensor. Proceedings of 1998 IEEE Workshop on Visual Surveillance, Bombay, India.
- [2] Park, JS, Kim, HT, and Yu, YS (2011). Video based fire detection algorithm using Gaussian mixture model. The Journal of the Korea Institute of Electronic Communication Sciences. *6*, 206-211.
- [3] Kim, IS, and Shin, HS (2011). A study on development of intelligent CCTV security system based on BIM. The Journal of the Korea Institute of Electronic Communication Sciences. *6*, 789-795.
- [4] Kim, IS, Yoo, JD, and Kim, BH (2008). A monitoring way and installation of monitoring system using intelligent CCTV under the U-city environment. The Journal of the Korea Institute of Electronic Communication Sciences. *3*, 295-303.
- [5] Bradski, GR (1998). Real time face and object tracking as a component of a perceptual user interface. Proceedings of 4th IEEE Workshop on Applications of Computer Vision, Princeton, NJ, pp. 214-219.
- [6] Chellappa, R, Zhou, SK, and Li, B (2002). Bayesian methods for face recognition from video. Proceedings of 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL.
- [7] Eleftheriadis, A, and Jacquin, A (1995). Automatic face location detection and tracking for model-assisted coding of video teleconferencing sequences at low bit-rates. Signal Processing: Image Communication. *7*, 231-248.
- [8] Lee, YH, Jeong, MH, Lee, JJ, and You, BJ (2008). Robust face tracking using bilateral filtering. Advanced Intelligent Computing Theories and Applications With Aspects of Theoretical and Methodological Issues, Huang, DS, Wunsch, DC, Levine, DS, and Jo, KH, ed. Berlin, Germany: Springer, pp. 1181-1189.
- [9] Comaniciu, D, Ramesh, V, and Meer, P (2003). Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence. *25*, 564-577.
- [10] Jeong, MH, You, BJ, and Lee, WH (2006). Color region tracking against brightness changes. AI 2006: Advances in Artificial Intelligence, Sattar, A, and Kang, B, ed. Berlin, Germany: Springer, pp. 536-545.
- [11] Viola, P, and Jones, M (2001). Rapid object detection using a boosted cascade of simple features. Proceedings of 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, pp. 511-518.
- [12] Wren, CR, Azarbayejani, A, Darrell, T, and Pentland, AP (1997). Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence. *19*, 780-785.

- Biographies
Yongwon Hwang received the M.S. degree from the Department of Electronic Engineering for Control and Biomedical, Konkuk University, and is pursuing a Ph.D. in the Department of Electronic Engineering for Intelligence Control System, Yonsei University, Korea. He has been a researcher at Korea Institute of Science and Technology since 2002. His major interests are robot vision, intelligence, cognitive science, and human-robot interaction. E-mail: dryongs@gmail.com

Mun-Ho Jeong received the B.S. and M.S. degrees from Korea Advanced Institute of Science and Technology (KAIST), Korea, and his Ph.D. from the Department of Mechanical Engineering for Computer Controlled Systems, Osaka University, Japan. He has been a professor at the School of Robotics, Kwangwoon University, since 2010. His major interests are robot vision and intelligence, image processing, and human-robot interaction. E-mail: mhjeong@kw.ac.kr

Sang-Rok Oh joined Korea Institute of Science and Technology (KIST) in 1988 after receiving his Ph.D. degree from Korea Advanced Institute of Science and Technology (KAIST), and has been working as a principal research engineer at the Center for Robotics Research, KIST. His research interests include information and communication technology, biorobotics for quality of life, and network based intelligent service robots. E-mail: sroh@kist.re.kr

Changyong Yoon received B.S., M.S., and Ph.D. degrees in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 1997, 1999, and 2010, respectively. He was a senior research engineer at LG Electronics, Inc., and LG-Nortel, where he developed system software for DVR and WCDMA from 1999 to 2006. From 2010 to February 2012, he was a chief research engineer at LG Display and developed circuits and algorithms for touch systems. Since 2012, he has been a Professor in the Department of Electrical Engineering, Suwon Science College, Gyeonggi-do, Korea. His main research interests include intelligent transportation systems, pattern recognition, robot vision, and fuzzy application systems. E-mail: cyyoon@ssc.ac.kr