International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(2): 117-127
Published online June 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.2.117
© The Korean Institute of Intelligent Systems
Minh-Thanh Vo and Seong G. Kong
Department of Computer Engineering, Sejong University, Seoul, Korea
Correspondence to:
Seong G. Kong (skong@sejong.edu)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper presents convolutional long short-term memory (C-LSTM) networks for improving the accuracy of gaze estimation. C-LSTM networks learn temporal variations in facial features while a human subject looks at objects displayed on a monitor screen equipped with a live camera. Given a sequence of input video frames, a set of convolutional layers individually extracts facial features from regions of interest such as the left eye, right eye, face, and face grid of the subject. Subsequently, an LSTM network encodes the relationships between changes in facial features over time and the position of the gaze point. C-LSTM networks are trained on a set of input-output data pairs of facial features and corresponding positions of the gaze point, and the spatial coordinates of the gaze point are determined based on the facial features of the current frame and information from previous frames to improve the accuracy of gaze estimation. Experimental results demonstrate that the proposed scheme achieves a significant improvement in gaze tracking performance, with average gaze estimation errors of 0.82 and 0.92 cm in the horizontal and vertical axes, respectively, on the GazeCapture dataset and an average angular error of 6.1° on the MPIIGaze dataset.
Keywords: Gaze tracking, Gaze estimation, Long short-term memory, Convolutional neural networks
Eye gaze is an important nonverbal cue in human communication. Gaze estimation and tracking refer to the process of measuring and tracing the location or direction of the gaze point at which a human subject is looking. The gaze point reflects the focus of the user’s attention, making it a significant observable indicator for designing human-computer interaction systems. Vision-based gaze tracking devices often use a camera mounted on a computer monitor or mobile device to capture a video stream of the user’s eye movements while they use the device. An internal computing unit then processes the image data in each frame of the video stream to estimate the spatial coordinates of the gaze point on the monitor screen. Estimating the gaze point can be useful in a wide variety of applications, including human-computer interfaces, marketing surveys, and recommendation systems based on customer preferences. Gaze estimation helps designers create smart marketing analysis systems because the position and duration of the gaze point can reveal interest in a particular product when a customer browses a webpage in an online shopping mall. Major use cases range from gaze-controlled interfaces for assisting people with disabilities [1] and driver support in automotive platforms [2] to visual behavior analysis [3], student attention analysis, targeted delivery of advertisements, and cognitive studies [4].
Many successful approaches for estimating gaze points have been studied over the past few decades. Gaze estimation methods can be classified into three categories: geometric, feature-based, and appearance-based methods. Geometric methods use information on eye shape, including eye open-close states, eye size, and eye color, to determine gaze points [5–7]. These methods have achieved promising results, but they often require multiple cameras, and changes in illumination and light source type tend to reduce accuracy. Feature-based approaches [8–10] attempt to extract distinctive features such as contours, eye corners, and corneal reflections or glints. Such methods work well in controlled indoor environments. However, small eyes or facial attachments such as glasses, hats, and eyelashes can degrade gaze tracking performance. Appearance-based approaches have attracted significant attention because they have paved the way for gaze estimation in everyday settings and uncontrolled environments [11,12].
The recent successes of deep learning in object recognition have inspired its extension to gaze tracking. The latest appearance-based techniques use convolutional neural networks (CNNs) for gaze estimation [13]. Although mutual connections exist between eye or face images and gaze directions, such methods are typically trained on datasets of static images and do not consider the temporal information available when people change their gaze direction. Variations in facial features contain vital information that helps improve gaze estimation accuracy. By modeling the history of these variations, the current gaze point can be estimated more precisely. Long short-term memory (LSTM) is a recurrent model whose short-term memory can persist over long periods of time, making it well suited to classification and prediction problems involving time-series data. The research question addressed in this study is how the concept of LSTM can be applied to improve the accuracy of gaze estimation.
This paper presents a gaze estimation and tracking method using convolutional long short-term memory (C-LSTM) networks to encode temporal changes in facial features when a subject is naturally engaged in the act of looking through a set of information over a certain period of time using a computer or mobile device. The proposed scheme performs end-to-end training for a deep learning model to estimate gaze points. Given a sequence of input video frames from the camera mounted on a computer, the iTracker model [13] detects and locates regions of interest (ROIs) such as the left eye, right eye, face, and face grid. The face grid denotes a binary mask that is used to indicate the location and size of the face within an image. A set of convolutional layers in the iTracker model separately extracts facial features from the ROIs. Popular CNN models pre-trained on large image datasets, such as AlexNet, VGG, and GoogleNet, can be used as alternative feature extractors [14] because meaningful visual information such as the edges, contours, and shapes of an object in an image remains unchanged across various problems. The proposed scheme uses an iTracker CNN trained on 2.5 million image frames from the GazeCapture dataset [13] to extract feature vectors. A sequence of facial feature vectors with known spatial coordinates for the corresponding gaze points was used to train the LSTM network to learn the relationships between changes in facial features reflecting eye movements, head rotations, and facial expressions, and the corresponding positions of the gaze points. After training, the current gaze point is estimated using the facial features of the current frame, as well as changes in the gaze point coordinates of previous frames. To compensate for the differences associated with different subjects, a calibration process based on support vector regression is performed to fine-tune the coordinates of the predicted gaze points. To evaluate the performance of gaze tracking, the proposed method is compared to three state-of-the-art appearance-based gaze tracking methods: the iTracker model [13], multimodal CNN model [15], and spatial weight CNN model [16]. Two standard databases are used for benchmarking: GazeCapture [13] and MPIIGaze [16]. The GazeCapture dataset is a large-scale dataset for eye tracking that contains data from over 1,450 people with almost 2.5 million frames. The proposed method outperforms the state-of-the-art approaches after calibration, with estimation errors of 0.82 and 0.92 cm on the horizontal and vertical axes, respectively. The proposed method was also tested on the MPIIGaze dataset, where it achieved a low gaze angle error of 6.1°.
General procedures for gaze tracking include preprocessing, feature extraction, and estimation of the gaze point. The preprocessing step includes face detection to detect and crop the face regions in the scene. Then, various ROIs are selected, including the eyes [17] and pupils [18]. Head pose angles are used to compensate for the disparity caused by non-frontal head poses [19]. In the next step, features are extracted from an input image using different methods. The processed features are used in various regression functions such as support vector regression and linear regression to map features to predicted gaze points. Based on the advantages of deep learning and convolutional methods, CNN models can be trained and generalized to extract meaningful features from images for both low-level representation (e.g., edges, contours, and blobs) and high-level representation (e.g., heads and eyes) without human intervention. The fully connected layers in CNN models then serve as regression functions to map learned features to the predicted spatial coordinates of gaze points.
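As an illustration of this generic pipeline, the following minimal sketch maps per-frame hand-crafted features to gaze coordinates with a simple regressor. It is a hypothetical example, not the paper's implementation; the feature dimensionality, screen size, and use of scikit-learn's linear regression are assumptions, and all data are random placeholders.

```python
# Hypothetical sketch of the generic pipeline above: per-frame hand-crafted
# features (e.g., pupil center, eye corners, head-pose angles) mapped to gaze
# coordinates with a simple regressor. All data here are random placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

features = np.random.rand(500, 8)                      # 8-D feature vectors
gaze_xy = np.random.rand(500, 2) * [62.6, 40.0]        # gaze targets in cm

regressor = LinearRegression().fit(features, gaze_xy)  # multi-output regression
predicted = regressor.predict(features[:3])            # three (x, y) gaze points
print(predicted.shape)                                 # -> (3, 2)
```

In practice, support vector regression can be substituted for the linear model when the feature-to-gaze mapping is nonlinear.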
Geometric gaze tracking approaches attempt to infer gaze directions from the elliptical shape of the observed limbus or a complex shape incorporating the eyelids. Funes Mora et al. [5] proposed a generative process to generate eye images and set up an understandable 3D gaze point. Lee et al. [7] proposed a calibration method utilizing the 3D geometrical relationship between the light source position and images reflected in various positions and poses. However, such methods require an external illumination source such as an infrared light, as well as a calibration process. Feature-based gaze tracking approaches find distinctive features such as contours, eye corners, corneal reflections, or glint reflections. Valenti and Gevers [6] enhanced gaze estimation using prior knowledge regarding a scene to construct a probability distribution between the scene and gaze points. Manolova et al. [10] added an additional fixed camera into a multi-camera system to determine the position of the human face and components such as the eyes, and used this information to correct the eye landmarks and head pose. Lai et al. [20] used glints and contour features in a 3D gaze tracking method that directly estimates the line of sight in a 3D space. However, these approaches are affected by illumination changes, so they only work in controlled environments such as indoor settings. Appearance-based methods directly infer gaze locations from eye or face images. Early works required a fixed head pose and specific user for training. Several later approaches improved performance by focusing on pose-independent gaze estimation, but they still required specific people for training. Schneider et al. [11] exploited a manifold alignment scheme for person independence to improve gaze estimation. Yucel et al. [12] used Gaussian process regression and neural networks to interpolate gaze directions.
With recent advances in deep learning and large databases for gaze estimation, appearance-based approaches have attracted significant attention from researchers using deep learning schemes for gaze tracking. Based on the availability of large-scale datasets for gaze tracking collected in various settings in unconstrained environments, Krafka et al. [13] presented iTracker, which is a multi-region CNN model that captures both eye and face images as inputs to estimate gaze points. Zhang et al. [15] proposed a multimodal CNN model to take advantage of both eye images and head pose information. They further improved performance by introducing a full-face CNN model that integrates spatial weights to encode information from multiple regions into a standard CNN. However, these methods only process still images without taking advantage of the temporal information in a sequence of video frames. Recurrent neural networks (RNNs) can learn temporal information from a time sequence of data. However, when motion changes over a long period of time, an RNN has difficulty backpropagating error signals through a long-range temporal interval owing to the vanishing gradient problem. LSTM [21] was developed as an extension of the RNN that uses memory cells to store, modify, and access internal states to mitigate the vanishing gradient problem. LSTM has achieved promising results in human activity recognition. Wu et al. [22] proposed a hybrid deep learning model for video classification in which spatial and short-term motion features were extracted by a CNN model and used to train an LSTM model. Sun et al. [23] proposed Lattice-LSTM, which extends LSTM by learning the independent hidden state transitions of memory cells for individual spatial locations. Donahue et al. [24] introduced a recurrent model that can be trained jointly to learn temporal dynamics and convolutional perceptual representations.
The proposed gaze tracking model consists of convolutional layers and an LSTM network for learning the temporal information of facial features from a sequence of video frames. LSTM is an RNN model designed specifically to resolve the problems of vanishing and exploding gradients that occur when computing backpropagation over time. Each cell in an LSTM network is composed of input, output, and forget gates. The cell remembers values over an arbitrary time interval using these three gates to regulate the flow of information in and out of the cell. Figure 1(a) presents general approaches to gaze tracking, where conventional image processing techniques are used to detect the face and ROIs such as eyes, pupil centers, and head poses for feature extraction. Subsequently, a regression model determines the location of the gaze point based on the extracted features. In CNN-based gaze tracking methods, a CNN serves as a feature extractor for extracting meaningful features. These features are evaluated through fully connected layers to determine the gaze point. Figure 1(b) presents the structure of the proposed gaze tracking method.
The proposed gaze tracking method uses a pre-trained CNN from the iTracker model to extract features and an LSTM to capture the temporal information of those features. Given an input video frame at time step $t$, the convolutional feature extractor (Figure 2) produces a facial feature vector $x_t$, which is fed to the LSTM-based gaze estimator (Figure 3).

An LSTM network consists of an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$, which are updated at each time step as

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f),$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o).$$

An input modulation gate is updated as

$$g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g),$$

where $\sigma(\cdot)$ denotes the sigmoid function, $x_t$ is the input feature vector, and $h_{t-1}$ is the hidden state at the previous time step. The memory cell state is then updated as

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t.$$

Here, $\odot$ is the element-wise multiplication operator. The cell output of the LSTM is obtained as

$$h_t = o_t \odot \tanh(c_t).$$

The weight matrices $W$ and bias vectors $b$ are the parameters learned during training.

Two identical LSTM networks are concatenated to learn variations in facial features over time. The output feature vector $h_t$ of the second LSTM network at the current time step is passed to a fully connected layer,

$$\hat{p}_t = (\hat{x}_t, \hat{y}_t) = W_p h_t + b_p,$$

where $W_p$ and $b_p$ denote the weights and bias of the output layer.

The C-LSTM network performs the task of predicting the coordinates $(\hat{x}_t, \hat{y}_t)$ of the gaze point on the screen. The network is trained by minimizing the error between the predicted and ground-truth gaze points,

$$L = \frac{1}{N} \sum_{n=1}^{N} \left[ (\hat{x}_n - x_n)^2 + (\hat{y}_n - y_n)^2 \right],$$

where $N$ is the number of training samples and $(x_n, y_n)$ are the ground-truth screen coordinates of the gaze point.
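As an illustration of this architecture, the following Keras sketch assembles per-ROI convolutional branches, two stacked LSTM layers, and a two-unit output for the gaze coordinates. It is a minimal sketch, not the authors' implementation: the branch layer sizes, the 128-dimensional feature width, the 25 × 25 face-grid resolution, and the exact way the two LSTM networks are combined are illustrative assumptions; only the six-frame sequence length and the four ROI inputs follow the text.

```python
# Minimal Keras sketch of a C-LSTM gaze estimator: per-ROI CNN branches applied
# to every frame of a 6-frame sequence, stacked LSTMs, and a 2-D gaze output.
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

SEQ_LEN, IMG = 6, (224, 224, 3)

def roi_branch(name):
    """Small CNN that maps one ROI image to a feature vector (sizes assumed)."""
    inp = Input(shape=IMG)
    x = layers.Conv2D(32, 5, strides=2, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(128, activation="relu")(x)
    return Model(inp, x, name=name)

left = Input(shape=(SEQ_LEN,) + IMG, name="left_eye")
right = Input(shape=(SEQ_LEN,) + IMG, name="right_eye")
face = Input(shape=(SEQ_LEN,) + IMG, name="face")
grid = Input(shape=(SEQ_LEN, 25, 25, 1), name="face_grid")

# Apply each convolutional branch to every frame in the sequence.
fl = layers.TimeDistributed(roi_branch("left_cnn"))(left)
fr = layers.TimeDistributed(roi_branch("right_cnn"))(right)
ff = layers.TimeDistributed(roi_branch("face_cnn"))(face)
fg = layers.TimeDistributed(layers.Flatten())(grid)
fg = layers.TimeDistributed(layers.Dense(64, activation="relu"))(fg)

feats = layers.Concatenate()([fl, fr, ff, fg])      # (batch, 6, feature_dim)
x = layers.LSTM(128, return_sequences=True)(feats)  # first LSTM network
x = layers.LSTM(128)(x)                             # second LSTM network
gaze = layers.Dense(2, name="gaze_xy")(x)           # predicted (x, y) in cm

c_lstm = Model([left, right, face, grid], gaze)
```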
The model is trained using a stochastic gradient descent optimizer over 80 epochs with a batch size of 256. Figure 4 presents the training and validation errors for the C-LSTM network. Training was monitored by checking the validation error after each epoch and saving the model with the lowest validation error. The model was executed on an NVIDIA DIGITS deep learning server with four TITAN X GPUs (12 GB of memory per GPU), 64 GB of DDR4 RAM, and an Intel Core i7-5930K 6-core 3.5 GHz desktop CPU. The model was implemented in Python using the TensorFlow framework and executed on the Ubuntu 16.04 operating system. To enable GPU support, the following NVIDIA software must be installed: the NVIDIA GPU driver (version 384.x or higher for CUDA 9.0), the CUDA Toolkit 9.0 (which includes CUPTI), and the cuDNN SDK (version 7.2 or higher).
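A sketch of this training setup is shown below; `c_lstm` refers to the model sketch above, while `train_ds` and `val_ds` stand for placeholder tf.data pipelines yielding (inputs, gaze) pairs. The learning rate, momentum, and mean-squared-error loss are assumptions, since the text only specifies the optimizer, epoch count, batch size, and checkpointing strategy.

```python
# Training sketch: SGD optimizer, 80 epochs, batch size 256, and saving the
# model with the lowest validation error, as described in the text.
import tensorflow as tf

c_lstm.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),  # assumed
    loss="mse",
    metrics=[tf.keras.metrics.MeanAbsoluteError(name="mae")])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_c_lstm.h5", monitor="val_loss", save_best_only=True)

history = c_lstm.fit(train_ds.batch(256),
                     validation_data=val_ds.batch(256),
                     epochs=80,
                     callbacks=[checkpoint])
```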
The GazeCapture dataset consists of 1,474 video clips from 1,474 volunteers. Each volunteer was asked to use an iOS app called GazeCapture naturally on a mobile device (iPhone or iPad). In each session, the application sequentially displayed 60 random points and required the user to focus on each point. Each point was displayed for 2 seconds before being randomly moved to another location. The task duration was approximately 5 minutes. This resulted in 1,474 video clips with the corresponding locations of gaze points on the device screen. Users were encouraged to move their heads continuously and change the distances to their mobile devices under a variety of pose, appearance, and illumination conditions.
The GazeCapture dataset was used to train the C-LSTM network. Among the 1,474 video clips in the dataset, 900 were selected for training, 100 for validation, and 474 for testing. Each video was split into a set of sequences with six frames per sequence, resulting in 1,490,959 frames in total.
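A minimal sketch of this sequence-splitting step is given below. Whether consecutive windows overlap is not specified in the text, so a non-overlapping stride equal to the sequence length is assumed; the function and variable names are illustrative.

```python
# Split one video into fixed-length training sequences of six frames.
def split_into_sequences(video_frames, seq_len=6, stride=6):
    """Return consecutive windows of `seq_len` frames from one video."""
    sequences = []
    for start in range(0, len(video_frames) - seq_len + 1, stride):
        sequences.append(video_frames[start:start + seq_len])
    return sequences

# Example: a 100-frame clip yields 16 non-overlapping 6-frame sequences.
dummy_video = list(range(100))
print(len(split_into_sequences(dummy_video)))  # -> 16
```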
To analyze how different regions of the face contribute to the estimation of the gaze point, we visualized the internal outputs of the C-LSTM network to determine how an input is decomposed into the different filters learned by the network. Figure 5 presents the activation outputs of selected convolution layers in the C-LSTM network for four sample image frames from the GazeCapture dataset. From top to bottom, Figure 5 shows the input image frames (224 × 224 pixels), the activation outputs of the C-F1 and C-F3 layers, and a 2D representation of the extracted features.
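A sketch of how such activation maps can be extracted is shown below. A small stand-in CNN is used so the example is self-contained; in practice the layer of interest would be taken from the trained gaze model, and the layer names, input size, and random input frame are illustrative assumptions.

```python
# Visualize convolution-layer activations by building a sub-model from the
# input image to a chosen convolution layer and plotting a few feature maps.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, Model

cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 5, strides=2, activation="relu", name="conv1"),
    layers.Conv2D(64, 3, activation="relu", name="conv2"),
])
activation_model = Model(cnn.input, cnn.get_layer("conv1").output)

frame = np.random.rand(1, 224, 224, 3).astype("float32")  # placeholder frame
fmaps = activation_model.predict(frame)                   # (1, H, W, 32)

for i in range(8):                                        # first 8 feature maps
    plt.subplot(2, 4, i + 1)
    plt.imshow(fmaps[0, :, :, i], cmap="viridis")
    plt.axis("off")
plt.show()
```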
To avoid overfitting and generalize the gaze estimator to additional datasets, we employed 10-fold stratified cross-validation. We split 1,000 videos into 10 folds and used nine folds for training and the remaining fold for validation. The average validation error was 0.946 cm for the predicted gaze point. Next, we compared the performance of our model to that of other deep-learning-based gaze tracking methods. The iTracker model uses a multilayer CNN to capture features extracted from multiple regions of the face, namely the left eye, right eye, face, and face grid. Zhang et al. [15] performed gaze estimation using multimodal CNNs combining eye images and head-pose information. They expanded this concept to a model that can perform both 2D and 3D gaze estimation [16] by encoding full-face images in a CNN with spatial weights applied to the feature maps. Table 1 presents the horizontal and vertical errors of the four methods evaluated on a testing dataset of 474 videos in terms of the mean absolute error (MAE), which is calculated per axis as

$$\mathrm{MAE}_x = \frac{1}{N} \sum_{n=1}^{N} \left| \hat{x}_n - x_n \right|, \qquad \mathrm{MAE}_y = \frac{1}{N} \sum_{n=1}^{N} \left| \hat{y}_n - y_n \right|,$$

where $N$ is the number of test samples, $(\hat{x}_n, \hat{y}_n)$ are the predicted gaze coordinates, and $(x_n, y_n)$ are the corresponding ground-truth coordinates. As shown in Table 1, the proposed method achieves the lowest errors of 0.82 and 0.92 cm along the horizontal and vertical axes, respectively.
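The MAE metric defined above can be computed directly, as in the following short sketch (placeholder data):

```python
# Per-axis mean absolute error between predicted and ground-truth gaze points.
import numpy as np

def mae_per_axis(pred_xy, true_xy):
    """Mean absolute error (cm) along the horizontal and vertical axes."""
    err = np.abs(np.asarray(pred_xy) - np.asarray(true_xy))
    return err[:, 0].mean(), err[:, 1].mean()

pred = np.array([[1.0, 2.0], [3.5, 4.0]])
true = np.array([[1.5, 2.5], [3.0, 5.0]])
print(mae_per_axis(pred, true))  # -> (0.5, 0.75)
```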
The next experiment considered the MPIIGaze dataset [16], which contains face images collected from 15 laptop users via built-in webcams over several months of daily life. Each volunteer was asked to look at a marker in the form of a small grey circle with a white dot in the center, which was displayed at a random position on a blank screen every 10 minutes. The volunteers were instructed to fixate on the marker and press the spacebar to confirm the focus of their gaze on the marker. When the spacebar was pressed, the eye region was cropped to a fixed-size rectangular image. Figure 6 presents sample eye images from the MPIIGaze dataset and the corresponding positions of the target markers on the monitor screen.
We adopted the same approach as that proposed in [16] to prepare the input data for the proposed method. Specifically, we normalized the images in MPIIGaze by cropping the eye images to a fixed resolution of 60 × 36 pixels.
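A minimal sketch of this normalization step is shown below, assuming OpenCV and a landmark-derived crop box; the histogram equalization step follows the normalization procedure commonly used with MPIIGaze and is an assumption here, as are all coordinates.

```python
# Crop the eye region from a frame and resize it to a fixed 60 x 36 grayscale
# patch, roughly following the MPIIGaze-style normalization described above.
import cv2
import numpy as np

def normalize_eye(frame_bgr, eye_box, out_size=(60, 36)):
    """Crop `eye_box` = (x, y, w, h) from the frame and resize to out_size."""
    x, y, w, h = eye_box
    eye = frame_bgr[y:y + h, x:x + w]
    eye = cv2.cvtColor(eye, cv2.COLOR_BGR2GRAY)
    eye = cv2.resize(eye, out_size)          # OpenCV expects (width, height)
    return cv2.equalizeHist(eye)             # contrast normalization (assumed)

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)  # placeholder
patch = normalize_eye(frame, (300, 200, 90, 54))
print(patch.shape)  # -> (36, 60)
```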
All other settings were the same as in the GazeCapture experiments. The normalized training dataset consisted of 630,000 eye images. Table 2 compares the mean-squared gaze angle errors of the proposed method and state-of-the-art techniques on 45,000 test eye images; the proposed method achieves the lowest error of 6.1°.
We performed gaze tracking experiments in real time using a personal computer. The experimental setup included a 27-inch Samsung LCD monitor with a screen width of 62.62 cm and a camera mounted on the monitor to capture live video of the subject. Ten volunteers, with and without glasses, participated in the experiments (Table 3).
Because the proposed model was trained on the GazeCapture dataset, the estimated gaze points must be calibrated to the monitor screen being used. In each experiment session, a user sat in front of the monitor and was asked to look at calibration markers displayed on the monitor screen (Figure 7). Given a set of training points consisting of the known marker locations and the corresponding gaze points predicted by the model, a support vector regression model was fitted to map the initial predictions to calibrated screen coordinates.
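A sketch of this calibration step is shown below. The RBF kernel, regularization constant, nine-marker layout, and noise model are illustrative assumptions; only the use of support vector regression to refine the predicted coordinates comes from the text.

```python
# SVR-based calibration: map raw gaze predictions collected while the user
# fixates known markers to corrected screen coordinates (one SVR per axis).
import numpy as np
from sklearn.svm import SVR

def fit_calibration(pred_points, marker_points):
    """pred_points, marker_points: arrays of shape (n_markers, 2) in cm."""
    cal_x = SVR(kernel="rbf", C=100.0).fit(pred_points, marker_points[:, 0])
    cal_y = SVR(kernel="rbf", C=100.0).fit(pred_points, marker_points[:, 1])
    return cal_x, cal_y

def apply_calibration(cal_x, cal_y, pred_points):
    pred_points = np.asarray(pred_points)
    return np.stack([cal_x.predict(pred_points),
                     cal_y.predict(pred_points)], axis=1)

# Example: nine calibration markers and the raw predictions recorded for them.
markers = np.array([[x, y] for x in (5, 31, 57) for y in (5, 20, 35)], dtype=float)
raw_pred = markers + np.random.normal(0, 2.0, markers.shape)  # placeholder noise
cx, cy = fit_calibration(raw_pred, markers)
corrected = apply_calibration(cx, cy, raw_pred)
```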
After the calibration process, we evaluated the accuracy of gaze tracking. The video frames captured for each user were fed into the proposed model to compute the spatial coordinates of the initial gaze point. The initial gaze points were then calibrated to obtain the final estimated gaze points on the monitor screen. Next, a volunteer was asked to gaze at a target marker on the screen. Ten target marker points were displayed on the monitor screen at random locations, and each point was presented for 5 seconds. The predicted gaze points were recorded in this second session. Table 4 presents the gaze estimation errors in terms of MAE in centimeters between the estimated gaze points and actual locations for the male and female subjects. Compared to the iTracker model without LSTM, the average gaze estimation error improved significantly from 8.84 to 2.31 cm along the horizontal axis and from 10.91 to 3.41 cm along the vertical axis. Table 5 presents the corresponding errors for subjects with and without glasses, which decreased on average from 8.00 to 2.24 cm and from 10.31 to 3.45 cm along the horizontal and vertical axes, respectively.
A gaze tracking experiment session was conducted to generate a heat map representing the density of gaze points over different objects on the monitor screen. The gaze points estimated using the proposed method were aggregated based on location, and each gaze point was assigned a radius of influence of 50 pixels. As the density of the gaze points in an area increases, the heat map displays a color of higher intensity. The subjects were presented with an image containing three rabbits and three turtles on the monitor screen. They were asked to look at only the rabbits for 3 minutes. Figure 8 presents the images and heat maps of the movement of the predicted gaze points on different rabbits before and after the calibration process. As the density of the gaze points on the rabbits increases, a more saturated red color appears on the rabbits in the heat map, indicating regions in which the subjects were interested. Before calibration, the predicted gaze points fluctuated around the objects that the subjects were looking at, so the corresponding heat map regions are spread out. After calibration, the estimated gaze points become more stable and focused on the objects, so the heat map regions are highly concentrated.
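A sketch of this heat-map construction is given below, assuming a Gaussian kernel whose spread approximates the 50-pixel radius of influence and a placeholder screen resolution; the gaze points are illustrative.

```python
# Accumulate each gaze point with a soft 50-pixel radius of influence and
# render the resulting density as a heat map.
import numpy as np
import matplotlib.pyplot as plt

def gaze_heatmap(points_px, screen_wh=(2560, 1440), radius=50):
    """points_px: iterable of (x, y) gaze points in pixels."""
    w, h = screen_wh
    heat = np.zeros((h, w))
    yy, xx = np.mgrid[0:h, 0:w]
    for gx, gy in points_px:
        # Gaussian kernel whose spread matches the radius of influence.
        heat += np.exp(-((xx - gx) ** 2 + (yy - gy) ** 2) / (2 * (radius / 2) ** 2))
    return heat

points = [(800, 600), (820, 610), (1900, 400)]  # placeholder gaze points
plt.imshow(gaze_heatmap(points), cmap="jet")
plt.axis("off")
plt.show()
```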
This paper presented a C-LSTM network for enhancing gaze estimation and tracking. The proposed C-LSTM network learns temporal changes in facial features to improve the accuracy of gaze point estimation. Changes in facial features when a subject is engaged in looking at an object over a period of time can be used to estimate the position of the gaze point precisely. Given a sequence of input video frames, a set of convolutional layers extracts facial features from ROIs such as the left eye, right eye, face, and face grid of the subject. The proposed scheme uses an iTracker CNN trained on 2.5 million image frames from the GazeCapture dataset to extract feature vectors. After training, the current gaze point is estimated using the facial features of the current frame, as well as the changes in gaze point coordinates of previous frames. Experimental results demonstrated that the proposed C-LSTM network achieves a significant improvement in the accuracy of gaze tracking. The proposed method was compared to state-of-the-art deep learning techniques for gaze estimation, achieving MAEs of 0.82 and 0.92 cm along the horizontal and vertical axes, respectively, on the GazeCapture dataset and a mean gaze angle error of 6.1° on the MPIIGaze dataset.
There are no potential conflicts of interest relevant to this article.
Figure 1. Schematic of gaze tracking techniques: (a) general approaches and (b) the proposed C-LSTM gaze tracking scheme.
Figure 2. Feature extractor adopted from the iTracker model.
Figure 3. Signal flow in the gaze estimator based on an LSTM network.
Figure 4. Training and validation errors for the C-LSTM network.
Figure 5. Sample subjects in the GazeCapture dataset and visualization of the activation outputs of convolution layers. From top to bottom: images, activation outputs of C-F1 and C-F3, and a 2D representation of the extracted features.
Figure 6. Sample images in the MPIIGaze dataset and the positions of target markers on the monitor screen. Five eye images were captured for five corresponding positions of the target marker. From top to bottom: eye images and activation outputs of C-F1, C-E3, and FC-E1.
Figure 7. Locations of target markers for calibration on a monitor screen with a width of 62.6 cm.
Figure 8. Heat maps of gaze tracking when a subject was asked to look at only the rabbits: (a) illustration of the test screen, (b) heat map of predicted gaze points recorded before calibration, (c) heat map of predicted gaze points recorded after calibration for a subject with glasses, and (d) heat map of predicted gaze points recorded after calibration for a subject without glasses.
Table 1. Performance comparison based on the MAE (cm) of the proposed and state-of-the-art methods.
Method | Horizontal error, x (cm) | Vertical error, y (cm) |
---|---|---|
iTracker [13] | 1.45 | 1.67 |
Multimodal CNN [15] | 2.53 | 2.38 |
CNN with spatial weights [16] | 1.53 | 1.71 |
Proposed | 0.82 | 0.92 |
Table 2. Mean-squared gaze angle error (Eθ) of state-of-the-art gaze estimation techniques on the MPIIGaze dataset with 45,000 test eye images.
Method | Mean-squared gaze angle error, Eθ (°) |
---|---|
SVR | 16.5 |
ALR | 16.4 |
kNN | 16.2 |
RF | 15.4 |
MnistNet | 13.9 |
Spatial weight CNN | 10.8 |
Multimodal CNN | 9.8 |
Proposed | 6.1 |
Table 3. Number of volunteers participating in the gaze tracking experiments.
Sex | With glasses | Without glasses |
---|---|---|
Male | 3 | 2 |
Female | 1 | 4 |
Table 4. Gaze estimation errors (cm) of the proposed method for male and female subjects.
Subject | w/o LSTM, x | w/o LSTM, y | w/ LSTM, x | w/ LSTM, y |
---|---|---|---|---|
Male | 9.52 | 11.35 | 2.62 | 3.52 |
Female | 8.16 | 10.47 | 2.00 | 3.29 |
Average | 8.84 | 10.91 | 2.31 | 3.41 |
Table 5. Gaze estimation errors (cm) of the proposed method for subjects with and without glasses.
Condition | w/o LSTM, x | w/o LSTM, y | w/ LSTM, x | w/ LSTM, y |
---|---|---|---|---|
Glasses | 8.76 | 11.10 | 2.94 | 3.83 |
No glasses | 7.24 | 9.53 | 1.54 | 3.07 |
Average | 8.00 | 10.31 | 2.24 | 3.45 |
E-mail: vmthanh@sejong.ac.kr
E-mail: skong@sejong.edu