Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(2): 117-127

Published online June 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.2.117

© The Korean Institute of Intelligent Systems

Enhanced Gaze Tracking Using Convolutional Long Short-Term Memory Networks

Minh-Thanh Vo and Seong G. Kong

Department of Computer Engineering, Sejong University, Seoul, Korea

Correspondence to :
Seong G. Kong (skong@sejong.edu)

Received: June 16, 2022; Revised: March 6, 2022; Accepted: March 29, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents convolutional long short-term memory (C-LSTM) networks for improving the accuracy of gaze estimation. C-LSTM networks learn temporal variations in facial features while a human subject looks at objects displayed on a monitor screen equipped with a live camera. Given a sequence of input video frames, a set of convolutional layers individually extracts facial features from regions of interest such as the left eye, right eye, face, and face grid of the subject. Subsequently, an LSTM network encodes the relationships between changes in facial features over time and the position of the gaze point. C-LSTM networks are trained on a set of input-output data pairs of facial features and corresponding positions of the gaze point, and the spatial coordinates of the gaze point are determined based on the facial features of the current frame and information from previous frames to improve the accuracy of gaze estimation. Experimental results demonstrate that the proposed scheme achieves significant improvement of gaze tracking performance with average gaze estimation errors of 0.82 and 0.92 cm in the horizontal and vertical axes, respectively, on the GazeCapture dataset and an average angular error of 6.1° on the MPIIGaze dataset.

Keywords: Gaze tracking, Gaze estimation, Long short-term memory, Convolutional neural networks

Eye gaze is an important nonverbal cue in human communication. Gaze estimation and tracking refer to the process of measuring and tracing the location or direction of the gaze point at which a human subject is looking. The gaze point reflects the focus of the user’s attention, making it a significant observable indicator for designing human-computer interaction systems. Vision-based gaze tracking devices often use a camera mounted on a computer monitor screen or mobile device to capture a video stream of the user’s eye movements while they use a device. An internal computing unit then processes the image data in each frame of the video stream to estimate the spatial coordinates of the gaze point on the monitor screen. Estimating the gaze point can be useful in a wide variety of applications, including human-computer interfaces, marketing surveys, and recommendation systems based on customer preferences. Gaze estimation helps designers create smart marketing analysis systems because the position and duration of the gaze point can reveal interest in a particular product when a customer browses a webpage in an online shopping mall. Major use cases range from controlled gaze functions for assisting people with disabilities [1] and driver support in automotive platforms [2] to visual behavior analysis [3], student attention analysis, targeted delivery of advertisements, and cognitive studies [4].

Many successful approaches for estimating gaze points have been studied over the past few decades. Gaze estimation methods can be classified into three categories: geometric, feature-based, and appearance-based methods. Geometric methods use information on eye shape, including eye open-close states, eye size, and eye color, to determine gaze points [5–7]. These methods have achieved promising results, but they often require multiple cameras, and changes in illumination and types of light sources tend to reduce accuracy. Feature-based approaches [8–10] attempt to extract distinctive features such as contours, eye corners, and corneal reflections or glints. Such methods work well in controlled indoor environments. However, small eyes or facial attachments such as glasses, hats, and eyelashes can degrade gaze tracking performance. Appearance-based approaches have attracted significant attention because they have paved the way for gaze estimation in everyday settings and uncontrolled environments [11,12].

The recent successes of deep learning in object recognition have inspired its extension to gaze tracking domains. The latest appearance-based techniques use convolutional neural networks (CNNs) for gaze estimation [13]. Although mutual connections exist between eye or face images and gaze directions, such methods may only consider training datasets of static images and often do not consider temporal information when people change their gaze direction. Variations in facial features contain vital information that helps improve gaze estimation accuracy. By understanding the history of variations, the current gaze point can be estimated more precisely. Long short-term memory (LSTM) is a short-term memory model that persists for a long period of time, making it well-suited to classification and prediction problems with a time series of constraints. The research question addressed in this study is how we can apply the concept of LSTM to improve the accuracy of gaze estimation.

This paper presents a gaze estimation and tracking method using convolutional long short-term memory (C-LSTM) networks to encode temporal changes in facial features when a subject is naturally engaged in the act of looking through a set of information over a certain period of time using a computer or mobile device. The proposed scheme performs end-to-end training for a deep learning model to estimate gaze points. Given a sequence of input video frames from the camera mounted on a computer, the iTracker model [13] detects and locates regions of interest (ROIs) such as the left eye, right eye, face, and face grid. The face grid denotes a binary mask that is used to indicate the location and size of the face within an image. A set of convolutional layers in the iTracker model separately extracts facial features from the ROIs. Popular CNN models pre-trained on large image datasets such as AlexNet, VGG, and GoogleNet can be used as alternative feature extractors [14] because meaningful visual information such as the edges, contours, and shapes of an object in an image remains unchanged across various problems. The proposed scheme uses an iTracker CNN trained on 2.5 million image frames from the GazeCapture dataset [13] to extract feature vectors. A sequence of facial feature vectors with known spatial coordinates for the corresponding gaze points is used to train the LSTM network to learn the relationships between changes in facial features reflecting eye movements, head rotations, and facial expressions, and the corresponding positions of the gaze points. After training, the current gaze point is estimated using the facial features of the current frame, as well as changes in the gaze point coordinates of previous frames. To compensate for the differences associated with different subjects, a calibration process based on support vector regression is performed to fine-tune the coordinates of the predicted gaze points.
To evaluate the performance of gaze tracking, the proposed method is compared to three state-of-the-art appearance-based gaze tracking methods: the iTracker model [13], multimodal CNN model [15], and spatial weight CNN model [16]. Two standard databases are used for benchmarking: GazeCapture [13] and MPIIGaze [16]. The GazeCapture dataset is a large-scale dataset for eye tracking that contains data from over 1,450 people with almost 2.5 million frames. The proposed method outperforms the state-of-the-art approaches after calibration with estimation errors of 0.82 and 0.92 cm on the horizontal and vertical axes, respectively. The proposed method was also tested on the MPIIGaze dataset and it achieved a low gaze angle error of 6.1°.

General procedures for gaze tracking include preprocessing, feature extraction, and estimation of the gaze point. The preprocessing step includes face detection to detect and crop the face regions in the scene. Then, various ROIs are selected, including the eyes [17] and pupils [18]. Head pose angles are used to compensate for the disparity caused by non-frontal head poses [19]. In the next step, features are extracted from an input image using different methods. The processed features are used in various regression functions such as support vector regression and linear regression to map features to predicted gaze points. Based on the advantages of deep learning and convolutional methods, CNN models can be trained and generalized to extract meaningful features from images for both low-level representation (e.g., edges, contours, and blobs) and high-level representation (e.g., heads and eyes) without human intervention. The fully connected layers in CNN models then serve as regression functions to map learned features to the predicted spatial coordinates of gaze points.

Geometric gaze tracking approaches attempt to infer gaze directions from the elliptical shape of the observed limbus or complex shape incorporating the eyelids. Funes Mora et al. [5] proposed a generative process to generate eye images and set up an understandable 3D gaze point. Lee et al. [7] proposed a calibration method utilizing the 3D geometrical relationship between the light source position and images reflected in various positions and poses. However, such methods require an external illumination source such as an infrared light, as well as a calibration process. Feature-based gaze tracking approaches find distinctive features such as contours, eye corners, corneal reflections, or glint reflections. Valenti and Gevers [6] enhanced gaze estimation using prior knowledge regarding a scene to construct a probability distribution between the scene and gaze points. Manolova et al. [10] added an additional fixed camera into a multi-camera system to determine the position of the human face and components such as the eyes, and used this information to correct the eye landmarks and head pose. Lai et al. [20] used glints and contour features in a 3D gaze tracking method that directly estimates the line of sight in a 3D space. However, these approaches are affected by illumination changes, so they only work in controlled environments such as indoor settings. Appearance-based methods directly infer gaze locations from eye or face images. Early works required a fixed head pose and specific user for training. Several later approaches improved performance by focusing on pose-independent gaze estimation, but they still required specific people for training. Schneider et al. [11] exploited a manifold alignment scheme for person independence to improve gaze estimation. Yucel et al. [12] used Gaussian process regression and neural networks to interpolate gaze directions.

With recent advances in deep learning and large databases for gaze estimation, appearance-based approaches have attracted significant attention from researchers using deep learning schemes for gaze tracking. Based on the availability of large-scale datasets for gaze tracking collected in various settings in unconstrained environments, Krafka et al. [13] presented iTracker, which is a multi-region CNN model that captures both eye and face images as inputs to estimate gaze points. Zhang et al. [15] proposed a multimodal CNN model to take advantage of both eye images and head pose information. They further improved performance by introducing a full-face CNN model that integrates spatial weights to encode information from multiple regions into a standard CNN. However, these methods only process still images without taking advantage of the temporal information in a sequence of video frames. Recurrent neural networks (RNNs) can learn temporal information from a time sequence of data. However, when motion changes over a long period of time, an RNN has difficulties backpropagating error signals through a long-range temporal interval based on the vanishing gradient effect. LSTM [21] was developed as an extension of RNN by using memory cells to store, modify, and access internal states to mitigate the vanishing gradient problem. LSTM has achieved promising results in human activity recognition. Wu et al. [22] proposed a hybrid deep learning model for video classification in which spatial and short-term motion features were extracted by a CNN model and used to train an LSTM model. Sun et al. [23] proposed Lattice-LSTM, which extends LSTM by learning the independent hidden state transitions of memory cells for individual spatial locations. Donahue et al. [24] introduced a recurrent model that can be trained jointly to learn temporal dynamics and convolutional perceptual representations.

3.1 LSTM Networks

The proposed gaze tracking model consists of convolutional layers and an LSTM network for learning the temporal information of facial features from a sequence of video frames. LSTM is an RNN model designed specifically to resolve the problems of vanishing and exploding gradients that occur when computing backpropagation over time. Each cell in an LSTM network is composed of input, output, and forget gates. The cell remembers values over an arbitrary time interval using these three gates to regulate the flow of information in and out of the cell. Figure 1(a) presents general approaches to gaze tracking, where conventional image processing techniques are used to detect the face and ROIs such as eyes, pupil centers, and head poses for feature extraction. Subsequently, a regression model determines the location of the gaze point based on the extracted features. In CNN-based gaze tracking methods, a CNN serves as a feature extractor for extracting meaningful features. These features are evaluated through fully connected layers to determine the gaze point. Figure 1(b) presents the structure of the proposed gaze tracking method.

The proposed gaze tracking method uses a pre-trained CNN from the iTracker model to extract features and an LSTM to extract the temporal information of features. Given an input video frame I(t) with dimension of 224 × 224 pixels at time t, a set of convolutional layers in the iTracker model serve as facial feature extractors for the ROIs. The iTracker CNN was trained using the GazeCapture dataset, which contains 2.5 million images of human faces. The iTracker CNN detected ROIs of the face, such as the face region cropped to a size of 64 × 64 pixels, both eye regions with dimensions of 64 × 64 pixels, and a face grid with dimensions of 64 × 64 pixels representing the location of the face in each frame. Figure 2 presents the computational blocks of the iTracker model. For feature extraction from the eyes, the four convolutional layers are composed of filters with dimensions of 11×11 (96 kernels), 5×5 (256 kernels), 3×3 (384 kernels), and 1×1 (64 kernels). For the face features, the five convolutional layers have the same filter dimensions as those for the eyes. The iTracker CNN transfers the features extracted using the convolutional layers to fully connected layers with sizes of 256, 256, 128, 400, and 400. The outputs from FC-E1, FC-F2, and FC-FG2 are concatenated to form a feature vector x(t) of size 784×1. The feature vector x(t) encodes distinctive facial information such as the face shape, color, texture, and eye regions in the embedded space.

An LSTM network consists of an input gate i(t), forget gate f(t), output gate o(t), input modulation gate g(t), and internal memory cell c(t). Let x(t) be the extracted feature at time t. The previous cell output h(t−1) and previous internal memory state c(t−1) are concatenated together and enter the LSTM cell. Let $\sigma(x) = 1/(1 + e^{-x})$ be a sigmoid function. The gate signals at time step t in an LSTM network are updated as

$$
\begin{aligned}
i(t) &= \sigma\left(W_{xi}\,x(t) + W_{hi}\,h(t-1) + b_i\right),\\
f(t) &= \sigma\left(W_{xf}\,x(t) + W_{hf}\,h(t-1) + b_f\right),\\
o(t) &= \sigma\left(W_{xo}\,x(t) + W_{ho}\,h(t-1) + b_o\right).
\end{aligned}
$$

An input modulation gate is updated as

$$
g(t) = \varphi\left(W_{xg}\,x(t) + W_{hg}\,h(t-1) + b_g\right),
$$

where $\varphi(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ is the hyperbolic tangent function. The internal memory cell unit c(t) is the sum of the previous memory cell unit c(t−1) modulated by the forget gate f(t) and the input modulation gate g(t) modulated by the input gate i(t).

$$
c(t) = f(t) \odot c(t-1) + g(t) \odot i(t).
$$

Here, $\odot$ is an element-wise multiplication operator. The cell output of the LSTM is obtained as

$$
h(t) = o(t) \odot \varphi\left(c(t)\right).
$$

The matrices $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weights for the input feature x(t) corresponding to the input, forget, output, and input modulation gates, respectively. Similarly, the matrices $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ are the weights for the previous cell output h(t−1) and the vectors $b_i$, $b_f$, $b_o$, $b_g$ are the corresponding biases. The internal memory state c(t) allows the network to learn when to forget the previous cell output and when to update the cell output given new information. Because i(t) and f(t) are sigmoidal, their values lie within the range [0, 1]. i(t) and f(t) act as learned gating signals that determine when the LSTM forgets its previous memory or considers its current input selectively. Similarly, the output gate o(t) learns how much of the memory cell is transferred to the output cell. Figure 3 presents a signal flow diagram for the LSTM network.
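The gate updates described above can be illustrated with a small dependency-free sketch of a single LSTM time step. This is only an illustration of the equations, not the authors' TensorFlow implementation; the function names (`lstm_step`, `affine`), the `params` dictionary layout, and the toy dimensions are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, x, W_h, h, b):
    """Return W_x x + W_h h + b for matrices stored as lists of rows."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + b_k
        for row_x, row_h, b_k in zip(W_x, W_h, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step following the gate equations in the text.

    `params` maps each gate name ("i", "f", "o", "g") to a tuple
    (W_x, W_h, b); this layout is an assumption of the sketch.
    """
    pre = {k: affine(W_x, x, W_h, h_prev, b)
           for k, (W_x, W_h, b) in params.items()}
    i = [sigmoid(z) for z in pre["i"]]    # input gate
    f = [sigmoid(z) for z in pre["f"]]    # forget gate
    o = [sigmoid(z) for z in pre["o"]]    # output gate
    g = [math.tanh(z) for z in pre["g"]]  # input modulation gate
    # c(t) = f(t) * c(t-1) + g(t) * i(t), element-wise
    c = [f_k * c_k + g_k * i_k
         for f_k, c_k, g_k, i_k in zip(f, c_prev, g, i)]
    # h(t) = o(t) * tanh(c(t)), element-wise
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Toy sizes: 3-dimensional input, 2 hidden units, all-zero weights.
zero = ([[0.0] * 3] * 2, [[0.0] * 2] * 2, [0.0, 0.0])
params = {k: zero for k in "ifog"}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero weights every sigmoid gate equals 0.5 and g(t) is zero, so the cell state is simply halved; this makes the role of the forget gate easy to check by hand.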

Two identical LSTM networks are concatenated to learn variations in facial features over time. The output feature vector h(t) of size 500 × 1 from the LSTM network is fed into two fully connected layers denoted as FC1 and FC2 to map the facial features to the location of the estimated gaze point on the screen. FC1 is composed of 128 hidden units and FC2 contains two hidden units. FC1 processes the feature vector h(t) using a rectified linear unit function ReLU(x) = max(x, 0) in the activation layer. FC2 linearly transforms the output of FC1 into the position of the gaze point (û(t), v̂(t)) as follows:

$$
h_1(t) = \mathrm{ReLU}\left(W_1 h(t) + b_1\right), \qquad
\begin{bmatrix} \hat{u}(t) \\ \hat{v}(t) \end{bmatrix} = W_2 h_1(t) + b_2,
$$

where W1, W2 denote kernel matrices and b1, b2 are the biases of FC1 and FC2, respectively.
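As a rough illustration of the two fully connected layers, the sketch below maps a toy LSTM output to a gaze point. The function name `gaze_head` and the small weight values are hypothetical; the paper's layers use a 500-dimensional input and 128 hidden units.

```python
def relu(z):
    """Rectified linear unit, ReLU(z) = max(z, 0)."""
    return max(z, 0.0)

def gaze_head(h, W1, b1, W2, b2):
    """Map an LSTM output h(t) to a gaze point (u_hat, v_hat).

    FC1 computes ReLU(W1 h + b1); FC2 is a plain linear layer.
    Matrices are lists of rows.
    """
    h1 = [relu(sum(w * v for w, v in zip(row, h)) + b)
          for row, b in zip(W1, b1)]
    out = [sum(w * v for w, v in zip(row, h1)) + b
           for row, b in zip(W2, b2)]
    u_hat, v_hat = out  # FC2 has exactly two output units
    return u_hat, v_hat

# Toy example: 2-dimensional h(t), 3 hidden units (illustrative values).
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
b2 = [0.5, -0.5]
u_hat, v_hat = gaze_head([1.0, -2.0], W1, b1, W2, b2)
```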

3.2 Training the C-LSTM Network for Gaze Estimation

The C-LSTM network performs the task of predicting the coordinates (û(t), v̂(t)) of a subject’s gaze point in an image frame. The C-LSTM network is trained to minimize the mean-squared error between the predicted gaze point (û(t), v̂(t)) and ground-truth gaze point (u(t), v(t)) at time t as

$$
\varepsilon = \frac{1}{N} \sum_{t=1}^{N} \left[ \left(u(t) - \hat{u}(t)\right)^2 + \left(v(t) - \hat{v}(t)\right)^2 \right],
$$

where N denotes the total number of image frames used for training.
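The loss can be computed directly from paired ground-truth and predicted gaze points. The following is a minimal sketch with made-up coordinates; `gaze_mse` is a hypothetical helper name, not part of any library.

```python
def gaze_mse(truth, pred):
    """Mean squared Euclidean distance between true and predicted gaze points.

    `truth` and `pred` are equal-length lists of (u, v) pairs.
    """
    if not truth or len(truth) != len(pred):
        raise ValueError("need two non-empty lists of equal length")
    total = sum((u - uh) ** 2 + (v - vh) ** 2
                for (u, v), (uh, vh) in zip(truth, pred))
    return total / len(truth)

# Illustrative values only.
truth = [(0.0, 0.0), (1.0, 2.0)]
pred = [(3.0, 4.0), (1.0, 2.0)]
loss = gaze_mse(truth, pred)  # (9 + 16 + 0) / 2
```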

The model is trained using a stochastic gradient descent optimizer over 80 epochs with a batch size of 256. Figure 4 presents the training and validation errors for the C-LSTM network. The model was monitored by checking the validation error after each epoch and by saving the best model with the lowest validation error. The model was executed on an NVIDIA DIGITS deep learning server with four TITAN-X GPUs with 12 GB of memory per GPU, 64 GB of DDR4 RAM, and an Intel Core i7-5930K 6-core 3.5 GHz desktop CPU. The model was implemented using the TensorFlow software. The model was written in Python and executed on the Ubuntu 16.04 operating system. To enable GPU support, the system requires an assortment of drivers and libraries for NVIDIA GPUs. The following NVIDIA software must be installed on the system: NVIDIA GPU drivers (CUDA 9.0 requires version 384.x or higher), the CUDA Toolkit (TensorFlow supports CUDA 9.0), CUPTI (which ships with the CUDA Toolkit), and the cuDNN SDK (≥ 7.2).

4.1 Validation with Benchmarking Datasets

The GazeCapture dataset consists of 1,474 video clips from 1,474 volunteers. Each volunteer was asked to use an iOS app called GazeCapture naturally on a mobile device (iPhone or iPad). In each session, the application sequentially displayed 60 random points and required the user to focus on each point. Each point was displayed for 2 seconds before being randomly moved to another location. The task duration was approximately 5 minutes. This resulted in the generation of 1,474 video clips with corresponding locations of gaze points on the monitor screen. Users were encouraged to move their heads continuously and change the distances to their mobile devices under a variety of pose, appearance, and illumination conditions.

The GazeCapture dataset was used to train the C-LSTM network. Among the 1,474 video clips in the dataset, 900 were selected for training, 100 for validation, and 474 for testing. Each video was split into a set of sequences with six frames per sequence, resulting in 1,490,959 frames (N = 1,490,959) containing both the face and eyes being selected from a total of 2,445,504 frames. Figure 5 presents four sample subjects in the GazeCapture dataset. The images were captured in an unconstrained environment under various location and illumination conditions. The CNN model of iTracker extracts the facial features from the sequences. These features are then used to optimize the weight parameters of the LSTM network and fully connected layers in the proposed model. The proposed model provides excellent performance with an average gaze point error of less than 0.5 cm. To improve performance further, the dataset must be diversified.
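One simple way to form six-frame sequences is sketched below. The text does not state whether the sequences overlap or how incomplete tails are handled, so the non-overlapping windows and the dropped remainder are assumptions of this sketch, and `split_into_sequences` is a hypothetical helper.

```python
def split_into_sequences(frames, seq_len=6):
    """Cut a list of frames into consecutive, non-overlapping sequences.

    Trailing frames that do not fill a whole sequence are dropped
    (an assumption; the paper does not specify this detail).
    """
    usable = len(frames) - len(frames) % seq_len
    return [frames[k:k + seq_len] for k in range(0, usable, seq_len)]

# Stand-in for 20 valid frames of one video (frame indices only).
sequences = split_into_sequences(list(range(20)))
```

Here 20 frames yield three sequences of six frames each, and the last two frames are discarded.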

To analyze how different regions of the face contribute to the estimation of the gaze point, we visualized the internal outputs of the C-LSTM network to determine how an input is decomposed into the different filters learned by the network. Figure 5 presents the activation outputs of some convolution layers in the C-LSTM network for four sample image frames in the GazeCapture dataset. From top to bottom, Figure 5 shows image frames (224 × 224 pixels), convolutional layers C-F1 (224×224 pixels) and C-F3 (112×112 pixels), and the activation output x(t). The bottom row presents a 2D square image (28 × 28 pixels) representation of the feature vector x(t) with dimensions of 784×1. The model works as expected by detecting some useful information in the face images, including the shape, nose, eyes, and mouth, and encoding them as features.

To avoid overfitting and generalize the gaze estimator to additional datasets, we employed 10-fold stratified cross-validation. We split 1,000 videos into 10 folds and used nine folds for training, and the last fold for validation. The average validation error was 0.946 cm for the predicted gaze point. Next, we compared the performance of our model to that of other gaze tracking methods based on deep learning. The iTracker model uses a multilayer CNN to capture multiple features extracted from multiple regions of the face, namely left and right eyes, face, and face grid. Zhang et al. [15] performed gaze estimation using multimodal CNNs combining eye images and head-pose information. They expanded this concept to a new model that can perform both 2D and 3D gaze estimation [16] by encoding full-face images in a CNN with spatial weights applied to the feature maps. Table 1 presents the standalone horizontal and vertical errors of the four methods evaluated on a testing dataset consisting of 474 videos in terms of mean absolute error (MAE), which is calculated as

$$
\mathrm{MAE}_u = \frac{1}{N} \sum_{t=1}^{N} \left| u(t) - \hat{u}(t) \right|,
$$

where u(t) and û (t) denote the actual and predicted locations of the gaze point along the u axis, respectively, and N is the number of video frames used. The MAE calculation was repeated for the v axis. We implemented the methods proposed in [15] using only eye images extracted from a sequence of image frames to ensure compatibility with the GazeCapture dataset. Table 1 reveals that the proposed method outperforms the iTracker model and two models proposed by Zhang et al. [15,16]. Our proposed scheme outperforms state-of-the-art techniques because it uses the information learned from previous frames to improve the accuracy of gaze estimation in the current frame.
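For concreteness, the per-axis MAE can be computed as in the short sketch below. The coordinate values are made up for illustration and `mean_absolute_error` is a local helper, not a library call.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference along one screen axis."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative u-axis coordinates in centimeters.
u_true = [1.0, 2.0, 3.0]
u_pred = [1.5, 2.0, 2.0]
mae_u = mean_absolute_error(u_true, u_pred)  # (0.5 + 0 + 1) / 3
```

The same function applies unchanged to the v axis.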

The next experiment considered the MPIIGaze dataset [16], which contains various face images collected from 15 laptop computer users using built-in webcams for several months during their daily lives. Each volunteer was asked to look at a marker in the form of a small grey circle with a white dot in the center, which was displayed randomly on a blank monitor screen every 10 minutes. The volunteers were supposed to fixate their gaze point on the marker and hit the spacebar on the keyboard to confirm the focus of their gaze point on the marker. When the spacebar was pressed, the eye region was cropped to a rectangular image with dimensions of 146 × 50 pixels using detection results based on the face detector and facial landmark detector. The images were then resized to 224 × 224 pixels using bilinear interpolation to increase the image size. A sequence of positions of a marker acting as a target and the actual gaze point were recorded on a laptop screen with a spatial resolution of 1,280 × 720 pixels. We modified the iTracker CNN model to consider eye images only and removed the convolution layers for the face. Figure 6 presents sample images of a subject wearing eyeglasses from the MPIIGaze dataset with dimensions of 146 × 50 pixels and the features represented by the activation outputs of convolutional layers C-F1 (224 × 224 pixels) and C-E3 (112 × 112 pixels), and fully connected layer FC-E1 (28 × 28 pixels) of the proposed model. These five eye images corresponded to the positions of the target marker on the screen at pixels (503, 470), (600, 450), (730, 403), (831, 400), and (910, 434).

We adopted the same approach as that proposed in [16] to prepare the input data for the proposed method. Specifically, we normalized the images in MPIIGaze by cropping eye images to a fixed resolution of 60 × 36 pixels. The eye images were then histogram-equalized to form the input images. Ground-truth gaze positions were also converted into the normalized camera space to obtain a gaze angle vector that consisted of pitch (nodding) θx and yaw (shaking) θy angles. To apply the proposed method to the MPIIGaze dataset, we modified the loss function to use the mean-squared error between the estimated gaze angle vectors (θ̂x(t), θ̂y(t)) and actual gaze angle vectors (θx(t), θy(t)) at time t as follows:

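$$
\varepsilon_\theta = \frac{1}{M} \sum_{t=1}^{M} \left[ \left(\theta_x(t) - \hat{\theta}_x(t)\right)^2 + \left(\theta_y(t) - \hat{\theta}_y(t)\right)^2 \right].
$$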

All other settings were the same. The normalized training dataset consisted of 630,000 eye images (M = 630,000), and the testing set consisted of 3,000 eye images captured from the left and right eyes in MPIIGaze. We used 80% of the images in the training dataset for training and 20% of the images for validation. We compared the results to those of the other methods reported in [16], namely support vector regression (SVR), adaptive learning regression (ALR), k-nearest neighbors (kNN), a random forest (RF), MnistNet, CNN model with spatial weights, and multimodal CNN. Table 2 lists the mean-squared gaze angle errors of the various methods evaluated on the MPIIGaze dataset. The proposed method achieved the smallest error of 6.1° on the dataset. This represents significant performance improvements of 4.7° and 3.7° compared to the two state-of-the-art methods, namely the multimodal CNN and spatial weight CNN, respectively.

4.2 Real-Time Gaze Estimation

We performed gaze tracking experiments in real time using a personal computer. The experimental setup included a 27-inch Samsung LCD monitor with screen dimensions of 62.62 × 47.38 cm and a screen resolution of 1,920 × 1,080 pixels, an attached Logitech C310 webcam with a resolution of 1,280 × 720 pixels, and a personal computer with an Intel Core i5-7600 CPU, GeForce GTX 1060 6 GB GPU, and 16 GB of RAM. The camera was mounted on top of the monitor screen at approximately the eye level of the computer user. The camera captured user activity in real time at a rate of 30 fps. The predicted gaze point was estimated based on the current frame image and temporal information retrieved from the previous five frames. To evaluate the accuracy of the estimated gaze points, the monitor displayed calibration markers sequentially at nine fixed locations (marked with “+” symbols) on the screen, equally spaced by 440 pixels vertically and 840 pixels horizontally, which are equivalent to 19.3 and 27.4 cm, respectively. Figure 7 presents the experimental setup and locations of the markers on the monitor screen. In the figure, the red dot represents the current location of the target marker and the orange cross in the small circle indicates the location of the current gaze point. During data collection sessions, the marker appeared at one location at a time and moved to the next location after 5 seconds. This continued until the calibration marker appeared at all locations. The volunteers participated in the experiment while sitting in front of the monitor at a distance of approximately 50 cm to 1 m. The evaluation was conducted with 10 volunteers (five males and five females) in two settings, namely wearing glasses and not wearing glasses, as shown in Table 3.
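The pixel-to-centimeter equivalence quoted for the marker spacing follows from the stated screen dimensions and resolution. The sketch below reproduces that arithmetic; the constant and function names are hypothetical.

```python
# Screen geometry as stated in the text.
SCREEN_CM = (62.62, 47.38)   # width, height in centimeters
SCREEN_PX = (1920, 1080)     # width, height in pixels

def px_to_cm(dx_px, dy_px):
    """Convert horizontal/vertical pixel distances to centimeters."""
    scale_x = SCREEN_CM[0] / SCREEN_PX[0]  # cm per pixel, horizontal
    scale_y = SCREEN_CM[1] / SCREEN_PX[1]  # cm per pixel, vertical
    return dx_px * scale_x, dy_px * scale_y

# Marker spacing: 840 px horizontally, 440 px vertically.
spacing_cm = px_to_cm(840, 440)
```

Rounded to one decimal place, the result matches the 27.4 cm horizontal and 19.3 cm vertical spacing given above.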

Because the proposed model was trained on the GazeCapture dataset, the actual gaze points must be calibrated on the monitor screen being used. In each experiment session, a user sat in front of the monitor and was asked to look at a calibration marker on the monitor screen. Given a set of training points $\{(\hat{u}_1, u_1^c), \ldots, (\hat{u}_n, u_n^c)\}$, $\hat{u}_i$ denotes the x coordinate of the recorded predicted gaze point and $u_i^c$ is the corresponding x coordinate of the calibration point, where n is the number of training points recorded in the calibration process.
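To make the role of the calibration pairs concrete, the sketch below fits a per-axis correction from predicted to calibration coordinates. Note that the paper's calibration is based on support vector regression; this example deliberately substitutes an ordinary least-squares line as a simpler, dependency-free stand-in, and the function names and data values are hypothetical.

```python
def fit_linear_calibration(predicted, targets):
    """Least-squares fit of targets ~ a * predicted + b for one axis.

    A simplified stand-in for the SVR-based calibration in the paper.
    """
    n = len(predicted)
    if n < 2 or n != len(targets):
        raise ValueError("need at least two matched calibration points")
    mean_p = sum(predicted) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_p == 0:
        raise ValueError("predicted coordinates must not all be equal")
    cov = sum((p - mean_p) * (t - mean_t)
              for p, t in zip(predicted, targets))
    a = cov / var_p
    b = mean_t - a * mean_p
    return a, b

def calibrate(u_hat, a, b):
    """Apply the fitted correction to a new predicted coordinate."""
    return a * u_hat + b

# Made-up calibration pairs (predicted x, calibration-marker x) in cm.
u_hat = [2.0, 10.0, 18.0]
u_cal = [5.0, 21.0, 37.0]
a, b = fit_linear_calibration(u_hat, u_cal)
u_corrected = calibrate(4.0, a, b)
```

The same fit would be repeated independently for the y coordinate. A nonlinear regressor such as SVR can capture screen-dependent distortions that a single line cannot.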

After the calibration process, we evaluated the accuracy of gaze tracking. The video frames captured for each user were fed into the proposed model to compute the spatial coordinates of the initial gaze point. The initial gaze points were then calibrated to obtain the final estimated gaze points on the monitor screen. Next, a volunteer was asked to gaze at a target marker on the screen. Ten target marker points were displayed on the monitor screen at random locations and each point was presented for 5 seconds. The predicted gaze points were recorded in this second session. Table 4 presents the gaze estimation errors in terms of MAE in centimeters between the estimated gaze points and actual locations for the male and female subjects. The gaze estimation accuracies were improved significantly compared to the iTracker model without using LSTM from 8.84 to 2.31 cm along the x axis and from 10.91 to 3.41 cm along the y axis on average. Table 5 presents the gaze estimation errors in terms of MAE in centimeters between the estimated gaze points and actual locations for subjects wearing glasses and not wearing glasses. The gaze estimation accuracies improved significantly with LSTM from 8.00 to 2.24 cm along the x axis and from 10.31 to 3.45 cm along the y axis on average.

A gaze tracking experiment session was conducted to generate a heat map representing the density of gaze points within different objects on a monitor screen. The gaze points estimated using the proposed method were aggregated based on location and we assigned a radius of influence to the gaze points. The radius of influence is set to 50 pixels. As the density of the gaze points increases in an area, the heat map displays a color indicating a higher intensity. The subjects were presented with an image containing three rabbits and three turtles on the monitor screen. They were asked to look at only the rabbits for 3 minutes. Figure 8 presents the images and heat maps of the movement of the predicted gaze points on different rabbits before and after the calibration process. As the density of the gaze points on the rabbits increases, a more saturated red color appears on the rabbits in the heat map, indicating regions in which the subjects were interested. Before calibration, the predicted gaze points fluctuated around the objects that the subjects were looking at, so the corresponding heat map regions are spread out. After calibration, the estimated gaze points become more stable and focused on the objects, so the heat map regions are highly concentrated.
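A simple reading of the radius of influence is that each gaze point contributes to the heat map within 50 pixels of its location. The sketch below counts gaze points around a query pixel under that assumption; the paper does not specify the exact kernel or color mapping, so `gaze_density` and the sample points are illustrative only.

```python
def gaze_density(points, query, radius=50.0):
    """Count gaze points whose distance to `query` is at most `radius` pixels."""
    qx, qy = query
    r2 = radius * radius
    return sum(1 for x, y in points
               if (x - qx) ** 2 + (y - qy) ** 2 <= r2)

# Made-up gaze points in screen pixels: three clustered, one far away.
points = [(100, 100), (120, 110), (130, 140), (400, 300)]
density_near = gaze_density(points, (110, 110))
density_far = gaze_density(points, (800, 600))
```

Evaluating this count over a grid of query pixels and mapping larger counts to more saturated colors gives a basic density heat map.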

This paper presented a C-LSTM network for enhancing gaze estimation and tracking. The proposed C-LSTM network learns temporal changes in facial features to improve the accuracy of gaze point estimation. Changes in facial features when a subject is engaged in looking at an object over a period of time can be used to estimate the position of the gaze point precisely. Given a sequence of input video frames, a set of convolutional layers extracts facial features from ROIs such as the left eye, right eye, face, and face grid of the subject. The proposed scheme uses an iTracker CNN trained on 2.5 million image frames from the GazeCapture dataset to extract feature vectors. After training, the current gaze point is estimated using the facial features of the current frame, as well as the changes in gaze point coordinates of previous frames. Experimental results demonstrated that the proposed C-LSTM network achieves a significant improvement in the accuracy of gaze tracking. On the GazeCapture dataset, the proposed method achieved MAEs of 0.82 and 0.92 cm along the x and y axes, respectively, outperforming state-of-the-art deep learning techniques for gaze estimation. On the MPIIGaze dataset, it achieved a low gaze angle error of 6.1°. Additionally, the proposed model was tested in real time at a rate of 30 fps to visualize gaze points and generate heat maps. In the future, we plan to diversify our datasets to improve gaze estimation accuracy further.

This research was supported by the Faculty Research Fund of Sejong University (2021).

Fig. 1.

Schematic of gaze tracking techniques: (a) general approaches and (b) proposed C-LSTM gaze tracking scheme.

Fig. 2.

Feature extractor adopted from the iTracker model.

Fig. 3.

Signal flow in the gaze estimator based on an LSTM network.

Fig. 4.

Training and validation errors for the C-LSTM network.

Fig. 5.

Sample subjects in the GazeCapture dataset and visualization of the activation outputs of convolution layers. From top to bottom: images, activation outputs of C-F1 and C-F3, and a 2D representation of x(t).

Fig. 6.

Sample images in the MPIIGaze dataset and the positions of target markers on the monitor screen. Five eye images were captured for five corresponding positions of the target marker. From top to bottom: eye images and activation outputs of C-F1, C-E3, and FC-E1.

Fig. 7.

Locations of target markers for calibration on a monitor screen with dimensions of 62.6 × 47.4 cm (1920 × 1080 pixels). Nine (3 × 3) locations are equally spaced on the monitor screen.

Fig. 8.

Heat maps of gaze tracking when a subject was asked to look at only rabbits: (a) illustration of test screen, (b) heat map of predicted gaze points recorded before calibration, (c) heat map of predicted gaze points recorded after calibration with glasses, and (d) heat map of predicted gaze points recorded after calibration without glasses.

Table 1. Performance comparisons based on the MAE (cm) of the proposed and state-of-the-art methods.

| Methods | x-axis | y-axis |
| --- | --- | --- |
| iTracker [13] | 1.45 ± 1.43 | 1.67 ± 1.62 |
| Multimodal CNN [15] | 2.53 ± 1.87 | 2.38 ± 1.75 |
| CNN with spatial weights [16] | 1.53 ± 1.54 | 1.71 ± 1.67 |
| Proposed | 0.82 ± 1.12 | 0.92 ± 1.22 |

Table 2. Mean-squared gaze angle error (Eθ) of state-of-the-art gaze angle estimation techniques on the MPIIGaze dataset with 45,000 testing eye images.

| Method | Mean-squared gaze angle error, Eθ (°) |
| --- | --- |
| SVR | 16.5 |
| ALR | 16.4 |
| kNN | 16.2 |
| RF | 15.4 |
| MinistNet | 13.9 |
| Spatial weight CNN | 10.8 |
| Multimodal CNN | 9.8 |
| Proposed | 6.1 |

Table 3. Number of volunteers participating in the gaze tracking experiments.

| | With glasses | Without glasses |
| --- | --- | --- |
| Male | 3 | 2 |
| Female | 1 | 4 |

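Table 4. Gaze estimation errors (cm) of the proposed method for male and female subjects.

| | w/o LSTM, x axis | w/o LSTM, y axis | w/ LSTM, x axis | w/ LSTM, y axis |
| --- | --- | --- | --- | --- |
| Male | 9.52 | 11.35 | 2.62 | 3.52 |
| Female | 8.16 | 10.47 | 2.00 | 3.29 |
| Average | 8.84 | 10.91 | 2.31 | 3.41 |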

Table 5. Gaze estimation errors (cm) of the proposed method for subjects with and without glasses.

| | w/o LSTM, x axis | w/o LSTM, y axis | w/ LSTM, x axis | w/ LSTM, y axis |
| --- | --- | --- | --- | --- |
| Glasses | 8.76 | 11.10 | 2.94 | 3.83 |
| No glasses | 7.24 | 9.53 | 1.54 | 3.07 |
| Average | 8.00 | 10.31 | 2.24 | 3.45 |

1. Guestrin, ED, and Eizenman, M (2008). Remote point-of-gaze estimation requiring a single-point calibration for applications with infants. Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, Savannah, GA, pp. 267-274. https://doi.org/10.1145/1344471.1344531
2. Liang, Y, Reyes, ML, and Lee, JD (2007). Real-time detection of driver cognitive distraction using support vector machines. IEEE Transactions on Intelligent Transportation Systems. 8, 340-350. https://doi.org/10.1109/TITS.2007.895298
3. Morimoto, CH, and Mimica, MR (2005). Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding. 98, 4-24. https://doi.org/10.1016/j.cviu.2004.07.010
4. Duchowski, AT (2002). A breadth-first survey of eye-tracking applications. Behavior Research Methods, Instruments, & Computers. 34, 455-470. https://doi.org/10.3758/BF03195475
5. Funes Mora, KA, and Odobez, JM (2014). Geometric generative gaze estimation (g3e) for remote RGB-D cameras. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, pp. 1773-1780. https://doi.org/10.1109/cvpr.2014.229
6. Valenti, R, and Gevers, T (2008). Accurate eye center location and tracking using isophote curvature. Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, pp. 1-8. https://doi.org/10.1109/cvpr.2008.4587529
7. Lee, H, Iqbal, N, Chang, W, and Lee, SY (2013). A calibration method for eye-gaze estimation systems based on 3D geometrical optics. IEEE Sensors Journal. 13, 3219-3225. https://doi.org/10.1109/JSEN.2013.2268247
8. Valenti, R, Sebe, N, and Gevers, T (2012). What are you looking at? International Journal of Computer Vision. 98, 324-334. https://doi.org/10.1007/s11263-011-0511-6
9. Utsumi, A, Okamoto, K, Hagita, N, and Takahashi, K (2012). Gaze tracking in wide area using multiple camera observations. Proceedings of the Symposium on Eye Tracking Research and Applications, Santa Barbara, CA, pp. 273-276. https://doi.org/10.1145/2168556.2168614
10. Manolova, A, Panev, S, and Tonchev, K (2014). Human gaze tracking with an active multi-camera system. Biometric Authentication. Cham, Switzerland: Springer, pp. 176-188. https://doi.org/10.1007/978-3-319-13386-7_14
11. Schneider, T, Schauerte, B, and Stiefelhagen, R (2014). Manifold alignment for person independent appearance-based gaze estimation. Proceedings of 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, pp. 1167-1172. https://doi.org/10.1109/ICPR.2014.210
12. Yucel, Z, Salah, AA, Mericli, C, Mericli, T, Valenti, R, and Gevers, T (2013). Joint attention by gaze interpolation and saliency. IEEE Transactions on Cybernetics. 43, 829-842. https://doi.org/10.1109/TSMCB.2012.2216979
13. Krafka, K, Khosla, A, Kellnhofer, P, Kannan, H, Bhandarkar, S, Matusik, W, and Torralba, A (2016). Eye tracking for everyone. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, pp. 2176-2184. https://doi.org/10.1109/cvpr.2016.239
14. Yosinski, J, Clune, J, Bengio, Y, and Lipson, H (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems. 27, 3320-3328.
15. Zhang, X, Sugano, Y, Fritz, M, and Bulling, A (2015). Appearance-based gaze estimation in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 4511-4520. https://doi.org/10.1109/cvpr.2015.7299081
16. Zhang, X, Sugano, Y, Fritz, M, and Bulling, A (2019). MPIIGaze: real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 41, 162-175. https://doi.org/10.1109/TPAMI.2017.2778103
17. Mbouna, RO, Kong, SG, and Chun, MG (2013). Visual analysis of eye state and head pose for driver alertness monitoring. IEEE Transactions on Intelligent Transportation Systems. 14, 1462-1469. https://doi.org/10.1109/TITS.2013.2262098
18. Mbouna, RO, and Kong, SG (2012). Pupil center detection with a single webcam for gaze tracking. Journal of Measurement Science and Instrumentation. 3, 133-136.
19. Kong, SG, and Mbouna, RO (2015). Head pose estimation from a 2D face image using 3D face morphing with depth parameters. IEEE Transactions on Image Processing. 24, 1801-1808. https://doi.org/10.1109/TIP.2015.2405483
20. Lai, CC, Shih, SW, and Hung, YP (2015). Hybrid method for 3-D gaze tracking using glint and contour features. IEEE Transactions on Circuits and Systems for Video Technology. 25, 24-37. https://doi.org/10.1109/TCSVT.2014.2329362
21. Hochreiter, S, and Schmidhuber, J (1997). Long short-term memory. Neural Computation. 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
22. Wu, Z, Wang, X, Jiang, YG, Ye, H, and Xue, X (2015). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, pp. 461-470. https://doi.org/10.1145/2733373.2806222
23. Sun, L, Jia, K, Chen, K, Yeung, DY, Shi, BE, and Savarese, S (2017). Lattice long short-term memory for human action recognition. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2166-2175. https://doi.org/10.1109/iccv.2017.236
24. Donahue, J, Anne Hendricks, L, Guadarrama, S, Rohrbach, M, Venugopalan, S, Saenko, K, and Darrell, T (2015). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 2625-2634. https://doi.org/10.21236/ada623249

Minh Thanh Vo received the B.S. in computer science from VNU University of Science, Ho Chi Minh City, Vietnam in 2016. He received the M.S. degree from Sejong University, Seoul, Korea. His research interests include gaze estimation, 3D face modeling, and machine learning.

E-mail: vmthanh@sejong.ac.kr

Seong G. Kong received the B.S. and M.S. degrees from Seoul National University, Seoul, Korea, and the Ph.D. degree from the University of Southern California, Los Angeles, CA, USA. He is currently Professor of Computer Engineering at Sejong University, Seoul, Korea. He was a recipient of the best paper award from the International Conference on Pattern Recognition in 2004, the Honorable Mention Paper Award from the American Society of Agricultural and Biological Engineers, and the Most Cited Paper Award from Computer Vision and Image Understanding in 2007 and 2008. His professional services include Associate Editor of IEEE Transactions on Neural Networks, Guest Editor of a special issue of International Journal of Control, Automation, and Systems, and Guest Editor of a special issue of Journal of Sensors.

E-mail: skong@sejong.edu



Keywords: Gaze tracking, Gaze estimation, Long short-term memory, Convolutional neural networks

1. Introduction

Eye gaze is an important nonverbal cue in human communication. Gaze estimation and tracking refer to the process of measuring and tracing the location or direction of the gaze point at which a human subject is looking. The gaze point reflects the focus of the user’s attention, making it a significant observable indicator for designing human-computer interaction systems. Vision-based gaze tracking devices often use a camera mounted on a computer monitor screen or mobile device to capture a video stream of the user’s eye movements while they use a device. An internal computing unit then processes the image data in each frame of the video stream to estimate the spatial coordinates of the gaze point on the monitor screen. Estimating the gaze point can be useful in a wide variety of applications, including human-computer interfaces, marketing surveys, and recommendation systems based on customer preferences. Gaze estimation helps designers create smart marketing analysis systems because the position and duration of the gaze point can reveal interest in a particular product when a customer browses a webpage in an online shopping mall. Major use cases range from gaze-controlled functions for assisting people with disabilities [1] and driver support in automotive platforms [2] to visual behavior analysis [3], student attention analysis, targeted delivery of advertisements, and cognitive studies [4].

Many successful approaches for estimating gaze points have been studied over the past few decades. Gaze estimation methods can be classified into three categories: geometric, feature-based, and appearance-based methods. Geometric methods use information on eye shape, including eye open-close states, eye size, and eye color, to determine gaze points [5–7]. These methods have achieved promising results, but they often require multiple cameras, and changes in illumination and light source types tend to reduce accuracy. Feature-based approaches [8–10] attempt to extract distinctive features such as contours, eye corners, and corneal reflections or glints. Such methods work well in controlled indoor environments. However, small eyes or facial attachments such as glasses, hats, and eyelashes can degrade gaze tracking performance. Appearance-based approaches have attracted significant attention because they have paved the way for gaze estimation in everyday settings and uncontrolled environments [11,12].

The recent successes of deep learning in object recognition have inspired its extension to gaze tracking domains. The latest appearance-based techniques use convolutional neural networks (CNNs) for gaze estimation [13]. Although mutual connections exist between eye or face images and gaze directions, such methods may only consider training datasets of static images and often do not consider temporal information when people change their gaze direction. Variations in facial features contain vital information that helps improve gaze estimation accuracy. By understanding the history of variations, the current gaze point can be estimated more precisely. Long short-term memory (LSTM) is a short-term memory model that persists for a long period of time, making it well-suited to classification and prediction problems involving time series. The research question addressed in this study is how we can apply the concept of LSTM to improve the accuracy of gaze estimation.

This paper presents a gaze estimation and tracking method using convolutional long short-term memory (C-LSTM) networks to encode temporal changes in facial features when a subject is naturally engaged in the act of looking through a set of information over a certain period of time using a computer or mobile device. The proposed scheme performs end-to-end training for a deep learning model to estimate gaze points. Given a sequence of input video frames from the camera mounted on a computer, the iTracker model [13] detects and locates regions of interest (ROIs) such as the left eye, right eye, face, and face grid. The face grid denotes a binary mask that is used to indicate the location and size of the face within an image. A set of convolutional layers in the iTracker model separately extracts facial features from the ROIs. Popular CNN models pre-trained on large image datasets such as AlexNet, VGG, and GoogleNet can be used as alternative feature extractors [14] because meaningful visual information such as the edges, contours, and shapes of an object in an image remains unchanged across various problems. The proposed scheme uses an iTracker CNN trained on 2.5 million image frames from the GazeCapture dataset [13] to extract feature vectors. A sequence of facial feature vectors with known spatial coordinates for the corresponding gaze points was used to train the LSTM network to learn the relationships between changes in facial features reflecting eye movements, head rotations, and facial expressions, and the corresponding positions of the gaze points. After training, the current gaze point is estimated using the facial features of the current frame, as well as changes in the gaze point coordinates of previous frames. To compensate for the differences associated with different subjects, a calibration process based on support vector regression is performed to fine-tune the coordinates of the predicted gaze points.

To evaluate the performance of gaze tracking, the proposed method is compared to three state-of-the-art appearance-based gaze tracking methods: the iTracker model [13], multimodal CNN model [15], and spatial weight CNN model [16]. Two standard databases are used for benchmarking: GazeCapture [13] and MPIIGaze [16]. The GazeCapture dataset is a large-scale dataset for eye tracking that contains data from over 1,450 people with almost 2.5 million frames. The proposed method outperforms the state-of-the-art approaches after calibration with estimation errors of 0.82 and 0.92 cm on the horizontal and vertical axes, respectively. The proposed method was also tested on the MPIIGaze dataset and it achieved a low gaze angle error of 6.1°.

2. Related Work

General procedures for gaze tracking include preprocessing, feature extraction, and estimation of the gaze point. The preprocessing step includes face detection to detect and crop the face regions in the scene. Then, various ROIs are selected, including the eyes [17] and pupils [18]. Head pose angles are used to compensate for the disparity caused by non-frontal head poses [19]. In the next step, features are extracted from an input image using different methods. The processed features are used in various regression functions such as support vector regression and linear regression to map features to predicted gaze points. Based on the advantages of deep learning and convolutional methods, CNN models can be trained and generalized to extract meaningful features from images for both low-level representation (e.g., edges, contours, and blobs) and high-level representation (e.g., heads and eyes) without human intervention. The fully connected layers in CNN models then serve as regression functions to map learned features to the predicted spatial coordinates of gaze points.

Geometric gaze tracking approaches attempt to infer gaze directions from the elliptical shape of the observed limbus or complex shape incorporating the eyelids. Funes Mora et al. [5] proposed a generative process to generate eye images and set up an understandable 3D gaze point. Lee et al. [7] proposed a calibration method utilizing the 3D geometrical relationship between the light source position and images reflected in various positions and poses. However, such methods require an external illumination source such as an infrared light, as well as a calibration process. Feature-based gaze tracking approaches find distinctive features such as contours, eye corners, corneal reflections, or glint reflections. Valenti and Gevers [6] enhanced gaze estimation using prior knowledge regarding a scene to construct a probability distribution between the scene and gaze points. Manolova et al. [10] added an additional fixed camera into a multi-camera system to determine the position of the human face and components such as the eyes, and used this information to correct the eye landmarks and head pose. Lai et al. [20] used glints and contour features in a 3D gaze tracking method that directly estimates the line of sight in a 3D space. However, these approaches are affected by illumination changes, so they only work in controlled environments such as indoor settings. Appearance-based methods directly infer gaze locations from eye or face images. Early works required a fixed head pose and specific user for training. Several later approaches improved performance by focusing on pose-independent gaze estimation, but they still required specific people for training. Schneider et al. [11] exploited a manifold alignment scheme for person independence to improve gaze estimation. Yucel et al. [12] used Gaussian process regression and neural networks to interpolate gaze directions.

With recent advances in deep learning and large databases for gaze estimation, appearance-based approaches have attracted significant attention from researchers using deep learning schemes for gaze tracking. Based on the availability of large-scale datasets for gaze tracking collected in various settings in unconstrained environments, Krafka et al. [13] presented iTracker, which is a multi-region CNN model that captures both eye and face images as inputs to estimate gaze points. Zhang et al. [15] proposed a multimodal CNN model to take advantage of both eye images and head pose information. They further improved performance by introducing a full-face CNN model that integrates spatial weights to encode information from multiple regions into a standard CNN. However, these methods only process still images without taking advantage of the temporal information in a sequence of video frames. Recurrent neural networks (RNNs) can learn temporal information from a time sequence of data. However, when motion changes over a long period of time, an RNN has difficulty backpropagating error signals through a long-range temporal interval because of the vanishing gradient effect. LSTM [21] was developed as an extension of the RNN that uses memory cells to store, modify, and access internal states to mitigate the vanishing gradient problem. LSTM has achieved promising results in human activity recognition. Wu et al. [22] proposed a hybrid deep learning model for video classification in which spatial and short-term motion features were extracted by a CNN model and used to train an LSTM model. Sun et al. [23] proposed Lattice-LSTM, which extends LSTM by learning the independent hidden state transitions of memory cells for individual spatial locations. Donahue et al. [24] introduced a recurrent model that can be trained jointly to learn temporal dynamics and convolutional perceptual representations.

3.1 LSTM Networks

The proposed gaze tracking model consists of convolutional layers and an LSTM network for learning the temporal information of facial features from a sequence of video frames. LSTM is an RNN model designed specifically to resolve the problems of vanishing and exploding gradients that occur when computing backpropagation over time. Each cell in an LSTM network is composed of input, output, and forget gates. The cell remembers values over an arbitrary time interval using these three gates to regulate the flow of information in and out of the cell. Figure 1(a) presents general approaches to gaze tracking, where conventional image processing techniques are used to detect the face and ROIs such as eyes, pupil centers, and head poses for feature extraction. Subsequently, a regression model determines the location of the gaze point based on the extracted features. In CNN-based gaze tracking methods, a CNN serves as a feature extractor for extracting meaningful features. These features are evaluated through fully connected layers to determine the gaze point. Figure 1(b) presents the structure of the proposed gaze tracking method.

The proposed gaze tracking method uses a pre-trained CNN from the iTracker model to extract features and an LSTM to extract the temporal information of features. Given an input video frame I(t) with dimensions of 224 × 224 pixels at time t, a set of convolutional layers in the iTracker model serves as facial feature extractors for the ROIs. The iTracker CNN was trained using the GazeCapture dataset, which contains 2.5 million images of human faces. The iTracker CNN detected ROIs of the face, such as the face region cropped to a size of 64 × 64 pixels, both eye regions with dimensions of 64 × 64 pixels, and a face grid with dimensions of 64 × 64 pixels representing the location of the face in each frame. Figure 2 presents the computational blocks of the iTracker model. For feature extraction from the eyes, the four convolutional layers are composed of filters with dimensions of 11 × 11 (96 kernels), 5 × 5 (256 kernels), 3 × 3 (384 kernels), and 1 × 1 (64 kernels). For the face features, the five convolutional layers have the same filter dimensions as those for the eyes. The iTracker CNN transfers the features extracted using the convolutional layers to fully connected layers with sizes of 256, 256, 128, 400, and 400. The outputs from FC-E1, FC-F2, and FC-FG2 are concatenated to form a feature vector x(t) of size 784 × 1. The feature vector x(t) encodes distinctive facial information such as the face shape, color, texture, and eye regions in the embedded space.
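The concatenation step can be illustrated with a short sketch. The arrays below are zero-filled placeholders, not real activations, and the assignment of sizes to individual layers (256, 128, and 400) is an assumption chosen only so that the lengths sum to the stated 784. The reshape to 28 × 28 corresponds to the 2D representation of x(t) used for visualization in Figure 5.

```python
import numpy as np

# Placeholder activations standing in for the iTracker fully connected outputs.
# Which size belongs to which layer is an assumption; only the total (784)
# is stated in the text.
fc_e1 = np.zeros(256)   # eye branch output (assumed size)
fc_f2 = np.zeros(128)   # face branch output (assumed size)
fc_fg2 = np.zeros(400)  # face-grid branch output (assumed size)

x_t = np.concatenate([fc_e1, fc_f2, fc_fg2])  # feature vector x(t), length 784
x_img = x_t.reshape(28, 28)                   # square 2D view for visualization
```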

An LSTM network consists of an input gate i(t), forget gate f(t), output gate o(t), input modulation gate g(t), and internal memory cell c(t). Let x(t) be the extracted feature at time t. The previous cell output h(t−1) and previous internal memory state c(t−1) are concatenated together and enter the LSTM cell. Let $σ(x) = 1/(1 + e^{-x})$ be a sigmoid function. The gate signals at time step t in an LSTM network are updated as

$i(t)=σ(W_{xi}x(t)+W_{hi}h(t-1)+b_i),$
$f(t)=σ(W_{xf}x(t)+W_{hf}h(t-1)+b_f),$
$o(t)=σ(W_{xo}x(t)+W_{ho}h(t-1)+b_o).$

An input modulation gate is updated as

$g(t)=φ(W_{xg}x(t)+W_{hg}h(t-1)+b_g),$

where $φ(x) = (e^{x}-e^{-x})/(e^{x}+e^{-x})$ is the hyperbolic tangent function. The internal memory cell unit c(t) is the sum of the previous memory cell unit c(t−1) modulated by the forget gate f(t) and the input modulation gate g(t) modulated by the input gate i(t).

$c(t)=f(t)⊙c(t-1)+g(t)⊙i(t).$

Here, ⊙ is an element-wise multiplication operator. The cell output of the LSTM is obtained as

$h(t)=o(t)⊙φ(c(t)).$

The matrices $W_{xi}$, $W_{xf}$, $W_{xo}$, and $W_{xg}$ denote the weights for the input feature x(t) corresponding to the input, forget, output, and input modulation gates, respectively. Similarly, the matrices $W_{hi}$, $W_{hf}$, $W_{ho}$, and $W_{hg}$ are the weights for the previous cell output h(t−1), and the vectors $b_i$, $b_f$, $b_o$, and $b_g$ are the corresponding biases. The internal memory state c(t) allows the network to learn when to forget the previous cell output and when to update the cell output given new information. Because i(t) and f(t) are sigmoidal, their values lie within the range [0, 1]. The gates i(t) and f(t) act as learned controls that determine when the LSTM forgets its previous memory or selectively considers its current input. Similarly, the output gate o(t) learns how much of the memory cell is transferred to the cell output. Figure 3 presents a signal flow diagram for the LSTM network.
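The gate updates described above can be sketched as a single time step in NumPy. This is an illustrative sketch under stated assumptions, not the authors' TensorFlow implementation: the function name `lstm_step`, the dictionary layout of the weights, and the random initialization are all hypothetical. The input size (784) and hidden size (500) follow the dimensions given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step.

    W_x, W_h, and b are dicts keyed by gate name: 'i', 'f', 'o', 'g'
    (a hypothetical layout chosen for readability).
    """
    i = sigmoid(W_x["i"] @ x + W_h["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W_x["f"] @ x + W_h["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W_x["o"] @ x + W_h["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W_x["g"] @ x + W_h["g"] @ h_prev + b["g"])  # input modulation gate
    c = f * c_prev + g * i      # memory cell update (element-wise products)
    h = o * np.tanh(c)          # cell output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 784, 500
W_x = {k: rng.normal(scale=0.01, size=(n_hidden, n_in)) for k in "ifog"}
W_h = {k: rng.normal(scale=0.01, size=(n_hidden, n_hidden)) for k in "ifog"}
b = {k: np.zeros(n_hidden) for k in "ifog"}

# Run a short sequence of random stand-in feature vectors through the cell.
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for _ in range(6):
    x = rng.normal(size=n_in)
    h, c = lstm_step(x, h, c, W_x, W_h, b)
```

Because the output gate lies in (0, 1) and the hyperbolic tangent lies in (−1, 1), every component of h(t) stays strictly inside (−1, 1).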

Two identical LSTM networks are concatenated to learn variations in facial features over time. The output feature vector h(t) of size 500 × 1 from the LSTM network is fed into two fully connected layers denoted as FC1 and FC2 to map the facial features to the location of the estimated gaze point on the screen. FC1 is composed of 128 hidden units and FC2 contains two output units. FC1 processes the feature vector h(t) using a rectified linear unit function ReLU(x) = max(x, 0) in the activation layer. FC2 linearly transforms the output of FC1 into the position of the gaze point (û(t), v̂(t)) as follows:

$h_1(t)=\mathrm{ReLU}(W_1h(t)+b_1),$
$\begin{bmatrix}\hat{u}(t)\\ \hat{v}(t)\end{bmatrix}=W_2h_1(t)+b_2,$

where $W_1$ and $W_2$ denote kernel matrices and $b_1$ and $b_2$ are the biases of FC1 and FC2, respectively.
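The two fully connected layers can be sketched in NumPy as follows. The function name `gaze_head` and the randomly initialized weights are hypothetical placeholders; only the layer sizes (500 inputs, 128 hidden units, 2 outputs) come from the text.

```python
import numpy as np

def gaze_head(h, W1, b1, W2, b2):
    """Map an LSTM output h(t) to an estimated gaze point (u_hat, v_hat)."""
    h1 = np.maximum(W1 @ h + b1, 0.0)  # FC1 followed by ReLU
    return W2 @ h1 + b2                # FC2: linear map to two coordinates

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(128, 500)), np.zeros(128)
W2, b2 = rng.normal(scale=0.01, size=(2, 128)), np.zeros(2)

# Random stand-in for an LSTM output vector h(t).
u_hat, v_hat = gaze_head(rng.normal(size=500), W1, b1, W2, b2)
```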

3.2 Training the C-LSTM Network for Gaze Estimation

The C-LSTM network performs the task of predicting the coordinates (û(t), v̂(t)) of a subject’s gaze point in an image frame. The C-LSTM network is trained to minimize the mean-squared error between the predicted gaze point (û(t), v̂(t)) and ground-truth gaze point (u(t), v(t)) at time t as

$ε=\frac{1}{N}\sum_{t=1}^{N}\left[(u(t)-\hat{u}(t))^{2}+(v(t)-\hat{v}(t))^{2}\right],$

where N denotes the total number of image frames used for training.
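As a small worked example of this loss, the sketch below computes the mean of the squared Euclidean distances between ground-truth and predicted gaze points. The function name `gaze_mse` and the sample coordinates are hypothetical.

```python
import numpy as np

def gaze_mse(true_uv, pred_uv):
    """Mean over frames of (u - u_hat)^2 + (v - v_hat)^2."""
    true_uv = np.asarray(true_uv, dtype=float)
    pred_uv = np.asarray(pred_uv, dtype=float)
    sq_dist = np.sum((true_uv - pred_uv) ** 2, axis=1)  # per-frame squared error
    return float(sq_dist.mean())

# Two hypothetical frames: one off by (3, 4), one predicted exactly.
loss = gaze_mse([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])
```

Here the first frame contributes 3² + 4² = 25 and the second contributes 0, so the loss is 25 / 2 = 12.5.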

The model was trained using a stochastic gradient descent optimizer over 80 epochs with a batch size of 256. Figure 4 presents the training and validation errors for the C-LSTM network. The model was monitored by checking the validation error after each epoch, and the model with the lowest validation error was saved. The model was trained on an NVIDIA DIGITS deep learning server with four TITAN-X GPUs with 12 GB of memory per GPU, 64 GB of DDR4 RAM, and an Intel Core i7-5930K 6-core 3.5 GHz desktop CPU. The model was implemented in Python using TensorFlow and executed on the Ubuntu 16.04 operating system. To enable GPU support, the system requires an assortment of drivers and libraries for NVIDIA GPUs. The following NVIDIA software must be installed on the system: NVIDIA GPU drivers (CUDA 9.0 requires version 384.x or higher), the CUDA Toolkit (TensorFlow supports CUDA 9.0), CUPTI (which ships with the CUDA Toolkit), and the cuDNN SDK (≥ 7.2).

4.1 Validation with Benchmarking Datasets

The GazeCapture dataset consists of 1,474 video clips from 1,474 volunteers. Each volunteer was asked to use an iOS app called GazeCapture naturally on a mobile device (iPhone or iPad). In each session, the application sequentially displayed 60 random points and required the user to focus on each point. Each point was displayed for 2 seconds before being randomly moved to another location. The task duration was approximately 5 minutes. This resulted in the generation of 1,474 video clips with corresponding locations of gaze points on the monitor screen. Users were encouraged to move their heads continuously and change the distances to their mobile devices under a variety of pose, appearance, and illumination conditions.

The GazeCapture dataset was used to train the C-LSTM network. Among the 1,474 video clips in the dataset, 900 were selected for training, 100 for validation, and 474 for testing. Each video was split into a set of sequences with six frames per sequence, resulting in the selection of 1,490,959 frames (N = 1,490,959) containing both the face and eyes from a total of 2,445,504 frames. Figure 5 presents four sample subjects in the GazeCapture dataset. The images were captured in an unconstrained environment under various location and illumination conditions. The CNN model of iTracker extracts the facial features from the sequences. These features are then used to optimize the weight parameters of the LSTM network and fully connected layers in the proposed model. The proposed model provides excellent performance with an average gaze point error of less than 0.5 cm. To improve performance further, the dataset must be diversified.
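The splitting of a video into six-frame sequences can be sketched as below. The text does not state whether sequences overlap or how leftover frames are handled, so this sketch assumes non-overlapping windows and drops any incomplete final window. The function name `split_into_sequences` is hypothetical.

```python
def split_into_sequences(frames, seq_len=6):
    """Split a list of frames into consecutive, non-overlapping sequences.

    Assumption: a trailing group shorter than seq_len is discarded.
    """
    n_full = len(frames) // seq_len
    return [frames[k * seq_len:(k + 1) * seq_len] for k in range(n_full)]

# A hypothetical 20-frame clip, with integers standing in for frames.
clips = split_into_sequences(list(range(20)))
```

With 20 frames, this yields three sequences of six frames, and the last two frames are discarded.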

To analyze how different regions of the face contribute to the estimation of the gaze point, we visualized the internal outputs of the C-LSTM network to determine how an input is decomposed into the different filters learned by the network. Figure 5 presents the activation outputs of some convolution layers in the C-LSTM network for four sample image frames in the GazeCapture dataset. From top to bottom, Figure 5 shows image frames (224 × 224 pixels), convolutional layers C-F1 (224×224 pixels) and C-F3 (112×112 pixels), and the activation output x(t). The bottom row presents a 2D square image (28 × 28 pixels) representation of the feature vector x(t) with dimensions of 784×1. The model works as expected by detecting some useful information in the face images, including the shape, nose, eyes, and mouth, and encoding them as features.
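The 2D representation of the feature vector x(t) can be reproduced with a short sketch. Row-major ordering and min-max scaling to [0, 1] for display are assumptions, since the text specifies only the 784 × 1 and 28 × 28 dimensions.

```python
from typing import Iterable, List


def vector_to_square_image(vector: Iterable[float], side: int = 28) -> List[List[float]]:
    """Reshape a feature vector of length side*side into a side x side image.

    Values are min-max scaled to [0, 1] for display (assumed), and the
    vector is laid out row by row (assumed row-major order).
    """
    values = [float(v) for v in vector]
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(values)}")
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant vector
    scaled = [(v - lo) / scale for v in values]
    return [scaled[r * side:(r + 1) * side] for r in range(side)]
```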

To avoid overfitting and generalize the gaze estimator to additional datasets, we employed 10-fold stratified cross-validation. We split 1,000 videos into 10 folds and used nine folds for training, and the last fold for validation. The average validation error was 0.946 cm for the predicted gaze point. Next, we compared the performance of our model to that of other gaze tracking methods based on deep learning. The iTracker model uses a multilayer CNN to capture multiple features extracted from multiple regions of the face, namely left and right eyes, face, and face grid. Zhang et al. [15] performed gaze estimation using multimodal CNNs combining eye images and head-pose information. They expanded this concept to a new model that can perform both 2D and 3D gaze estimation [7] by encoding full-face images in a CNN with spatial weights applied to the feature maps. Table 1 presents the standalone horizontal and vertical errors of the four methods evaluated on a testing dataset consisting of 474 videos in terms of mean absolute error (MAE), which is calculated as

$\mathrm{MAE}_u=\frac{1}{N}\sum_{t=1}^{N}\left|u(t)-\hat{u}(t)\right|,$

where u(t) and û(t) denote the actual and predicted locations of the gaze point along the u axis, respectively, and N is the number of video frames used. The MAE calculation was repeated for the v axis. We implemented the method proposed in [15] using only eye images extracted from a sequence of image frames to ensure compatibility with the GazeCapture dataset. Table 1 reveals that the proposed method outperforms the iTracker model and the two models proposed by Zhang et al. [15,16]. Our proposed scheme outperforms these state-of-the-art techniques because it uses the information learned from previous frames to improve the accuracy of gaze estimation in the current frame.
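The per-axis MAE defined above can be computed as in the following minimal sketch. The function names and the representation of gaze points as (u, v) pairs in centimeters are illustrative assumptions.

```python
from typing import Sequence, Tuple


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE = (1/N) * sum over t of |actual(t) - predicted(t)| for one axis."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def gaze_mae(
    actual_points: Sequence[Tuple[float, float]],
    predicted_points: Sequence[Tuple[float, float]],
) -> Tuple[float, float]:
    """Return (MAE_u, MAE_v) for sequences of (u, v) gaze points."""
    actual_u, actual_v = zip(*actual_points)
    predicted_u, predicted_v = zip(*predicted_points)
    return (
        mean_absolute_error(actual_u, predicted_u),
        mean_absolute_error(actual_v, predicted_v),
    )
```

The two axes are evaluated independently, which matches the separate x-axis and y-axis columns reported in Table 1.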

The next experiment considered the MPIIGaze dataset [16], which contains various face images collected from 15 laptop computer users using built-in webcams for several months during their daily lives. Each volunteer was asked to look at a marker in the form of a small grey circle with a white dot in the center, which was displayed randomly on a blank monitor screen every 10 minutes. The volunteers were supposed to fixate their gaze point on the marker and hit the spacebar on the keyboard to confirm the focus of their gaze point on the marker. When the spacebar was pressed, the eye region was cropped to a rectangular image with dimensions of 146 × 50 pixels using detection results based on the face detector and facial landmark detector. The images were then resized to 224 × 224 pixels using bilinear interpolation to increase the image size. A sequence of positions of a marker acting as a target and the actual gaze point were recorded on a laptop screen with a spatial resolution of 1,280 × 720 pixels. We modified the iTracker CNN model to consider eye images only and removed the convolution layers for the face. Figure 6 presents sample images of a subject wearing eyeglasses from the MPIIGaze dataset with dimensions of 146 × 50 pixels and the features represented by the activation outputs of convolutional layers C-F1 (224 × 224 pixels) and C-E3 (112 × 112 pixels), and fully connected layer FC-E1 (28 × 28 pixels) of the proposed model. These five eye images corresponded to the positions of the target marker on the screen at pixels (503, 470), (600, 450), (730, 403), (831, 400), and (910, 434).

We adopted the same approach as that proposed in [16] to prepare the input data for the proposed method. Specifically, we normalized the images in MPIIGaze by cropping eye images to a fixed resolution of 60 × 36 pixels. The eye images were then histogram-equalized to form the input images. Ground-truth gaze positions were also converted into the normalized camera space to obtain a gaze angle vector that consisted of pitch (nodding) θx and yaw (shaking) θy angles. To apply the proposed method to the MPIIGaze dataset, we modified the loss function to use the mean-squared error between the estimated gaze angle vectors (θ̂x(t), θ̂y(t)) and actual gaze angle vectors (θx(t), θy(t)) at time t as follows:

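$\varepsilon_\theta=\frac{1}{M}\sum_{t=1}^{M}\left[\left(\theta_x(t)-\hat{\theta}_x(t)\right)^2+\left(\theta_y(t)-\hat{\theta}_y(t)\right)^2\right].$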

All other settings were the same. The normalized training dataset consisted of 630,000 eye images (M = 630,000), and the testing images were 3,000 eye images captured from the left and right eyes in MPIIGaze. We used 80% of the images in the training dataset for training and 20% of the images for validation. We compared the results to those of the other methods reported in [16], namely support vector regression (SVR), adaptive learning regression (ALR), k-nearest neighbors (kNN), a random forest (RF), MinistNet, CNN model with spatial weights, and multimodal CNN. Table 2 lists the mean-squared gaze angle errors of the various methods evaluated on the MPIIGaze dataset. The proposed method achieved the smallest error of 6.1° on the dataset. This represents significant performance improvements of 4.7° and 3.7° compared to the two state-of-the-art methods, namely the spatial weight CNN (10.8°) and multimodal CNN (9.8°), respectively.
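The mean-squared gaze angle error used as the loss for MPIIGaze can be written as a short sketch. Representing each gaze angle vector as a (pitch, yaw) pair and the function name are assumptions for illustration; the actual TensorFlow loss implementation is not given in this paper.

```python
from typing import Sequence, Tuple

Angles = Tuple[float, float]  # (pitch, yaw), i.e., (theta_x, theta_y)


def gaze_angle_mse(actual: Sequence[Angles], predicted: Sequence[Angles]) -> float:
    """Mean over M samples of (theta_x - est_x)^2 + (theta_y - est_y)^2."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for (theta_x, theta_y), (est_x, est_y) in zip(actual, predicted):
        total += (theta_x - est_x) ** 2 + (theta_y - est_y) ** 2
    return total / len(actual)
```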

4.2 Real-Time Gaze Estimation

We performed gaze tracking experiments in real time using a personal computer. The experimental setup included a 27-inch Samsung LCD monitor with screen dimensions of 62.62 × 47.38 cm and a screen resolution of 1,920 × 1,080 pixels, an attached Logitech C310 webcam with a resolution of 1,280 × 720 pixels, and a personal computer with an Intel Core i5-7600 CPU, GeForce GTX 1060 6 GB GPU, and 16 GB of RAM. The camera was mounted on top of the monitor screen at approximately the eye level of the computer user. The camera captured user activity in real time at a rate of 30 fps. The predicted gaze point was estimated based on the current frame image and temporal information retrieved from the previous five frames. To evaluate the accuracy of the estimated gaze points, the monitor displayed calibration markers sequentially at nine fixed locations (marked with “+” symbols) on the screen, equally spaced by 440 pixels vertically and 840 pixels horizontally, which are equivalent to 19.3 and 27.4 cm, respectively. Figure 7 presents the experimental setup and locations of the markers on the monitor screen. In the figure, the red dot represents the current location of the target marker and the orange cross in the small circle indicates the location of the current gaze point. During data collection sessions, the marker appeared at one location at a time and moved to the next location after 5 seconds. This continued until the calibration marker appeared at all locations. The volunteers participated in the experiment while sitting in front of the monitor at a distance of approximately 50 cm to 1 m. The evaluation was conducted with 10 volunteers (five males and five females) in two settings, namely wearing glasses and not wearing glasses, as shown in Table 3.
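The conversion of the marker spacing from pixels to centimeters follows directly from the stated screen dimensions and resolution. The sketch below reproduces this arithmetic; the constant and function names are illustrative.

```python
# Screen dimensions and resolution stated in the experimental setup.
SCREEN_WIDTH_CM, SCREEN_HEIGHT_CM = 62.62, 47.38
SCREEN_WIDTH_PX, SCREEN_HEIGHT_PX = 1920, 1080


def pixels_to_cm(dx_px: float, dy_px: float) -> tuple:
    """Convert horizontal and vertical pixel distances to centimeters."""
    dx_cm = dx_px * SCREEN_WIDTH_CM / SCREEN_WIDTH_PX
    dy_cm = dy_px * SCREEN_HEIGHT_CM / SCREEN_HEIGHT_PX
    return dx_cm, dy_cm


# Marker spacing: 840 pixels horizontally and 440 pixels vertically.
dx_cm, dy_cm = pixels_to_cm(840, 440)
print(f"{dx_cm:.1f} cm horizontally, {dy_cm:.1f} cm vertically")
# prints "27.4 cm horizontally, 19.3 cm vertically"
```

The same scale factors convert gaze estimation errors measured in pixels into the centimeter units reported in Tables 4 and 5.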

Because the proposed model was trained on the GazeCapture dataset, the actual gaze points must be calibrated on the monitor screen being used. In each experiment session, a user sat in front of the monitor and was asked to look at a calibration marker on the monitor screen. Given a set of training points $\{(\hat{u}_1, u_1^c),\cdots,(\hat{u}_n, u_n^c)\}$, $\hat{u}_i$ denotes the x coordinate of the recorded predicted gaze point and $u_i^c$ is the corresponding x coordinate of the calibration point, where n is the number of training points recorded in the calibration process.
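The calibration mapping itself is not specified in the text above. As one plausible illustration only, the sketch below assumes a per-axis linear mapping from the predicted coordinate to the calibration coordinate, fitted by ordinary least squares on the n training points; the function names are hypothetical, and the actual method used in the experiments may differ.

```python
from typing import Sequence, Tuple


def fit_linear_calibration(
    predicted: Sequence[float], targets: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of target ~= a * predicted + b for one axis.

    This linear model is an assumption made for illustration.
    """
    n = len(predicted)
    if n != len(targets) or n < 2:
        raise ValueError("need at least two paired calibration points")
    mean_p = sum(predicted) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_p == 0:
        raise ValueError("predicted coordinates must not all be equal")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, targets))
    a = cov / var_p
    b = mean_t - a * mean_p
    return a, b


def apply_calibration(value: float, a: float, b: float) -> float:
    """Map a raw predicted coordinate to a calibrated screen coordinate."""
    return a * value + b
```

Under this assumption, the fit would be repeated for the y coordinate, and the resulting coefficients would be applied to every subsequent prediction in the session.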

After the calibration process, we evaluated the accuracy of gaze tracking. The video frames captured for each user were fed into the proposed model to compute the spatial coordinates of the initial gaze point. The initial gaze points were then calibrated to obtain the final estimated gaze points on the monitor screen. Next, a volunteer was asked to gaze at a target marker on the screen. Ten target marker points were displayed on the monitor screen at random locations and each point was presented for 5 seconds. The predicted gaze points were recorded in this second session. Table 4 presents the gaze estimation errors in terms of MAE in centimeters between the estimated gaze points and actual locations for the male and female subjects. Compared to the iTracker model without LSTM, the average gaze estimation error decreased from 8.84 to 2.31 cm along the x axis and from 10.91 to 3.41 cm along the y axis. Table 5 presents the gaze estimation errors in terms of MAE in centimeters between the estimated gaze points and actual locations for subjects wearing glasses and not wearing glasses. With LSTM, the average gaze estimation error decreased from 8.00 to 2.24 cm along the x axis and from 10.31 to 3.45 cm along the y axis.

A gaze tracking experiment session was conducted to generate a heat map representing the density of gaze points within different objects on a monitor screen. The gaze points estimated using the proposed method were aggregated based on location and we assigned a radius of influence to the gaze points. The radius of influence is set to 50 pixels. As the density of the gaze points increases in an area, the heat map displays a color indicating a higher intensity. The subjects were presented with an image containing three rabbits and three turtles on the monitor screen. They were asked to look at only the rabbits for 3 minutes. Figure 8 presents the images and heat maps of the movement of the predicted gaze points on different rabbits before and after the calibration process. As the density of the gaze points on the rabbits increases, a more saturated red color appears on the rabbits in the heat map, indicating regions in which the subjects were interested. Before calibration, the predicted gaze points fluctuated around the objects that the subjects were looking at, so the corresponding heat map regions are spread out. After calibration, the estimated gaze points become more stable and focused on the objects, so the heat map regions are highly concentrated.
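The aggregation of gaze points into a heat map can be sketched as follows. The text specifies only the 50-pixel radius of influence; the linear falloff of each point's contribution within that radius is an assumption, and the function name is illustrative.

```python
import math
from typing import Iterable, List, Tuple


def gaze_heat_map(
    points: Iterable[Tuple[float, float]],
    width: int,
    height: int,
    radius: float = 50.0,
) -> List[List[float]]:
    """Accumulate gaze density on a width x height pixel grid.

    Each gaze point (x, y) adds a weight that decreases linearly from 1 at
    the point to 0 at `radius` pixels (assumed kernel shape).
    """
    heat = [[0.0] * width for _ in range(height)]
    for gx, gy in points:
        # Visit only the pixels inside the bounding box of the circle.
        x_min = max(0, math.floor(gx - radius))
        x_max = min(width - 1, math.ceil(gx + radius))
        y_min = max(0, math.floor(gy - radius))
        y_max = min(height - 1, math.ceil(gy + radius))
        for y in range(y_min, y_max + 1):
            for x in range(x_min, x_max + 1):
                distance = math.hypot(x - gx, y - gy)
                if distance < radius:
                    heat[y][x] += 1.0 - distance / radius
    return heat
```

Pixels covered by many nearby gaze points accumulate larger values, which would be mapped to more saturated colors when the heat map is rendered.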

5. Conclusion

This paper presented a C-LSTM network for enhancing gaze estimation and tracking. The proposed C-LSTM network learns temporal changes in facial features to improve the accuracy of gaze point estimation. Changes in facial features when a subject is engaged in looking at an object over a period of time can be used to estimate the position of the gaze point precisely. Given a sequence of input video frames, a set of convolutional layers extracts facial features from ROIs such as the left eye, right eye, face, and face grid of the subject. The proposed scheme uses an iTracker CNN trained on 2.5 million image frames from the GazeCapture dataset to extract feature vectors. After training, the current gaze point is estimated using the facial features of the current frame, as well as the changes in gaze point coordinates of previous frames. Experimental results demonstrated that the proposed C-LSTM network achieves a significant improvement in terms of the accuracy of gaze tracking. On the GazeCapture dataset, the proposed method outperformed state-of-the-art deep learning techniques for gaze estimation, with MAEs of 0.82 and 0.92 cm along the x and y axes, respectively. The proposed method was also tested on the MPIIGaze dataset and it achieved a low gaze angle error of 6.1°. Additionally, the proposed model was tested in real time at a rate of 30 fps to visualize gaze points and generate heat maps. In the future, we plan to diversify our datasets to improve gaze estimation accuracy further.

Figure 1. Schematic of gaze tracking techniques: (a) general approaches and (b) proposed C-LSTM gaze tracking scheme.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 117-127. https://doi.org/10.5391/IJFIS.2022.22.2.117

Figure 2. Feature extractor adopted from the iTracker model.

Figure 3. Signal flow in the gaze estimator based on an LSTM network.

Figure 4. Training and validation errors for the C-LSTM network.

Figure 5. Sample subjects in the GazeCapture dataset and visualization of the activation outputs of convolution layers. From top to bottom: images, activation outputs of C-F1 and C-F3, and a 2D representation of x(t).

Figure 6. Sample images in the MPIIGaze dataset and the positions of target markers on the monitor screen. Five eye images were captured for five corresponding positions of the target marker. From top to bottom: eye images and activation outputs of C-F1, C-E3, and FC-E1.

Figure 7. Locations of target markers for calibration on a monitor screen with dimensions of 62.6 × 47.4 cm (1920 × 1080 pixels). Nine (3 × 3) locations are equally spaced on the monitor screen.

Figure 8. Heat maps of gaze tracking when a subject was asked to look at only rabbits: (a) illustration of test screen, (b) heat map of predicted gaze points recorded before calibration, (c) heat map of predicted gaze points recorded after calibration with glasses, and (d) heat map of predicted gaze points recorded when a subject was looking at rabbits after calibration without glasses.

Table 1. Performance comparisons based on the MAE (cm) of the proposed and state-of-the-art methods.

| Methods | x-axis | y-axis |
| --- | --- | --- |
| iTracker [13] | 1.45 ± 1.43 | 1.67 ± 1.62 |
| Multimodal CNN [15] | 2.53 ± 1.87 | 2.38 ± 1.75 |
| CNN with spatial weights [16] | 1.53 ± 1.54 | 1.71 ± 1.67 |
| Proposed | 0.82 ± 1.12 | 0.92 ± 1.22 |

Table 2. Mean-squared gaze angle error ($\varepsilon_\theta$) of the state-of-the-art gaze angle estimation techniques on the MPIIGaze dataset with 45,000 testing eye images.

| Method | Mean-squared gaze angle error, $\varepsilon_\theta$ (°) |
| --- | --- |
| SVR | 16.5 |
| ALR | 16.4 |
| kNN | 16.2 |
| RF | 15.4 |
| MinistNet | 13.9 |
| Spatial weight CNN | 10.8 |
| Multimodal CNN | 9.8 |
| Proposed | 6.1 |

Table 3. Number of volunteers participating in the gaze tracking experiments.

| | With glasses | Without glasses |
| --- | --- | --- |
| Male | 3 | 2 |
| Female | 1 | 4 |

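Table 4. Gaze estimation errors (cm) of the proposed method for male and female subjects.

| | w/o LSTM, x axis | w/o LSTM, y axis | w/ LSTM, x axis | w/ LSTM, y axis |
| --- | --- | --- | --- | --- |
| Male | 9.52 | 11.35 | 2.62 | 3.52 |
| Female | 8.16 | 10.47 | 2.00 | 3.29 |
| Average | 8.84 | 10.91 | 2.31 | 3.41 |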

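Table 5. Gaze estimation errors (cm) of the proposed method for subjects with and without glasses.

| | w/o LSTM, x axis | w/o LSTM, y axis | w/ LSTM, x axis | w/ LSTM, y axis |
| --- | --- | --- | --- | --- |
| Glasses | 8.76 | 11.10 | 2.94 | 3.83 |
| No glasses | 7.24 | 9.53 | 1.54 | 3.07 |
| Average | 8.00 | 10.31 | 2.24 | 3.45 |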

References

1. Guestrin, ED, and Eizenman, M (2008). Remote point-of-gaze estimation requiring a single-point calibration for applications with infants. Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, Savannah, GA, pp. 267-274. https://doi.org/10.1145/1344471.1344531
2. Liang, Y, Reyes, ML, and Lee, JD (2007). Real-time detection of driver cognitive distraction using support vector machines. IEEE Transactions on Intelligent Transportation Systems. 8, 340-350. https://doi.org/10.1109/TITS.2007.895298
3. Morimoto, CH, and Mimica, MR (2005). Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding. 98, 4-24. https://doi.org/10.1016/j.cviu.2004.07.010
4. Duchowski, AT (2002). A breadth-first survey of eye-tracking applications. Behavior Research Methods, Instruments, & Computers. 34, 455-470. https://doi.org/10.3758/BF03195475
5. Funes Mora, KA, and Odobez, JM (2014). Geometric generative gaze estimation (g3e) for remote RGB-D cameras. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, pp. 1773-1780. https://doi.org/10.1109/cvpr.2014.229
6. Valenti, R, and Gevers, T (2008). Accurate eye center location and tracking using isophote curvature. Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, pp. 1-8. https://doi.org/10.1109/cvpr.2008.4587529
7. Lee, H, Iqbal, N, Chang, W, and Lee, SY (2013). A calibration method for eye-gaze estimation systems based on 3D geometrical optics. IEEE Sensors Journal. 13, 3219-3225. https://doi.org/10.1109/JSEN.2013.2268247
8. Valenti, R, Sebe, N, and Gevers, T (2012). What are you looking at? International Journal of Computer Vision. 98, 324-334. https://doi.org/10.1007/s11263-011-0511-6
9. Utsumi, A, Okamoto, K, Hagita, N, and Takahashi, K (2012). Gaze tracking in wide area using multiple camera observations. Proceedings of the Symposium on Eye Tracking Research and Applications, Santa Barbara, CA, pp. 273-276. https://doi.org/10.1145/2168556.2168614
10. Manolova, A, Panev, S, and Tonchev, K (2014). Human gaze tracking with an active multi-camera system. Biometric Authentication. Cham, Switzerland: Springer, pp. 176-188. https://doi.org/10.1007/978-3-319-13386-7_14
11. Schneider, T, Schauerte, B, and Stiefelhagen, R (2014). Manifold alignment for person independent appearance-based gaze estimation. Proceedings of 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, pp. 1167-1172. https://doi.org/10.1109/ICPR.2014.210
12. Yucel, Z, Salah, AA, Mericli, C, Mericli, T, Valenti, R, and Gevers, T (2013). Joint attention by gaze interpolation and saliency. IEEE Transactions on Cybernetics. 43, 829-842. https://doi.org/10.1109/TSMCB.2012.2216979
13. Krafka, K, Khosla, A, Kellnhofer, P, Kannan, H, Bhandarkar, S, Matusik, W, and Torralba, A (2016). Eye tracking for everyone. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, pp. 2176-2184. https://doi.org/10.1109/cvpr.2016.239
14. Yosinski, J, Clune, J, Bengio, Y, and Lipson, H (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems. 27, 3320-3328.
15. Zhang, X, Sugano, Y, Fritz, M, and Bulling, A (2015). Appearance-based gaze estimation in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 4511-4520. https://doi.org/10.1109/cvpr.2015.7299081
16. Zhang, X, Sugano, Y, Fritz, M, and Bulling, A (2019). MPIIGaze: real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 41, 162-175. https://doi.org/10.1109/TPAMI.2017.2778103
17. Mbouna, RO, Kong, SG, and Chun, MG (2013). Visual analysis of eye state and head pose for driver alertness monitoring. IEEE Transactions on Intelligent Transportation Systems. 14, 1462-1469. https://doi.org/10.1109/TITS.2013.2262098
18. Mbouna, RO, and Kong, SG (2012). Pupil center detection with a single webcam for gaze tracking. Journal of Measurement Science and Instrumentation. 3, 133-136.
19. Kong, SG, and Mbouna, RO (2015). Head pose estimation from a 2D face image using 3D face morphing with depth parameters. IEEE Transactions on Image Processing. 24, 1801-1808. https://doi.org/10.1109/TIP.2015.2405483
20. Lai, CC, Shih, SW, and Hung, YP (2015). Hybrid method for 3-D gaze tracking using glint and contour features. IEEE Transactions on Circuits and Systems for Video Technology. 25, 24-37. https://doi.org/10.1109/TCSVT.2014.2329362
21. Hochreiter, S, and Schmidhuber, J (1997). Long short-term memory. Neural Computation. 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
22. Wu, Z, Wang, X, Jiang, YG, Ye, H, and Xue, X (2015). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, pp. 461-470. https://doi.org/10.1145/2733373.2806222
23. Sun, L, Jia, K, Chen, K, Yeung, DY, Shi, BE, and Savarese, S (2017). Lattice long short-term memory for human action recognition. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2166-2175. https://doi.org/10.1109/iccv.2017.236
24. Donahue, J, Anne Hendricks, L, Guadarrama, S, Rohrbach, M, Venugopalan, S, Saenko, K, and Darrell, T (2015). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 2625-2634. https://doi.org/10.21236/ada623249