Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(2): 117-127

Published online June 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.2.117

© The Korean Institute of Intelligent Systems

Enhanced Gaze Tracking Using Convolutional Long Short-Term Memory Networks

Minh-Thanh Vo and Seong G. Kong

Department of Computer Engineering, Sejong University, Seoul, Korea

Correspondence to: Seong G. Kong (skong@sejong.edu)

Received: June 16, 2022; Revised: March 6, 2022; Accepted: March 29, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents convolutional long short-term memory (C-LSTM) networks for improving the accuracy of gaze estimation. C-LSTM networks learn temporal variations in facial features while a human subject looks at objects displayed on a monitor screen equipped with a live camera. Given a sequence of input video frames, a set of convolutional layers individually extracts facial features from regions of interest such as the left eye, right eye, face, and face grid of the subject. Subsequently, an LSTM network encodes the relationships between changes in facial features over time and the position of the gaze point. C-LSTM networks are trained on a set of input-output data pairs of facial features and corresponding positions of the gaze point, and the spatial coordinates of the gaze point are determined based on the facial features of the current frame and information from previous frames to improve the accuracy of gaze estimation. Experimental results demonstrate that the proposed scheme achieves a significant improvement in gaze tracking performance, with average gaze estimation errors of 0.82 and 0.92 cm along the horizontal and vertical axes, respectively, on the GazeCapture dataset and an average angular error of 6.1° on the MPIIGaze dataset.

Keywords: Gaze tracking, Gaze estimation, Long short-term memory, Convolutional neural networks
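The pipeline described in the abstract, per-region convolutional feature extraction followed by an LSTM that maps a sequence of frame features to the on-screen gaze point, can be summarized in code. The following PyTorch-style sketch is only an illustration of that structure, not the authors' implementation: the channel counts, kernel sizes, feature dimension, face-grid size, and sequence handling are assumptions, since the exact layer configuration is not reproduced in this excerpt.

# Minimal C-LSTM gaze estimator sketch (illustrative; all layer sizes are assumptions).
import torch
import torch.nn as nn

class RegionBranch(nn.Module):
    """Small CNN that turns one region of interest (eye or face crop) into a feature vector."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):                      # x: (batch, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class CLSTMGazeNet(nn.Module):
    """Per-frame CNN features for the left eye, right eye, face, and face grid,
    followed by an LSTM over the frame sequence and a regressor to (x, y)."""
    def __init__(self, feat_dim=128, hidden=128):
        super().__init__()
        self.left_eye, self.right_eye, self.face = (RegionBranch(feat_dim) for _ in range(3))
        self.grid_fc = nn.Linear(25 * 25, feat_dim)     # face grid as a flattened binary mask (size assumed)
        self.lstm = nn.LSTM(4 * feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                # (x, y) gaze point on the screen

    def forward(self, left, right, face, grid):
        # Each input: (batch, seq_len, ...). Extract per-frame features, then run the LSTM.
        b, t = left.shape[:2]
        feats = torch.cat([
            self.left_eye(left.flatten(0, 1)),
            self.right_eye(right.flatten(0, 1)),
            self.face(face.flatten(0, 1)),
            self.grid_fc(grid.flatten(0, 1).flatten(1)),
        ], dim=1).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])                    # gaze estimate for the current frame

In this sketch, the last LSTM output is regressed to the gaze point, so the estimate for the current frame also reflects the preceding frames, which is the role of the recurrent stage described in the abstract.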

This research was supported by the Faculty Research Fund of Sejong University (2021).

There are no potential conflicts of interest relevant to this article.

Minh Thanh Vo received the B.S. degree in computer science from VNU University of Science, Ho Chi Minh City, Vietnam, in 2016, and the M.S. degree from Sejong University, Seoul, Korea. His research interests include gaze estimation, 3D face modeling, and machine learning.

E-mail: vmthanh@sejong.ac.kr

Seong G. Kong received the B.S. and M.S. degrees from Seoul National University, Seoul, Korea, and the Ph.D. degree from the University of Southern California, Los Angeles, CA, USA. He is currently a Professor of Computer Engineering at Sejong University, Seoul, Korea. He was a recipient of the Best Paper Award from the International Conference on Pattern Recognition in 2004, the Honorable Mention Paper Award from the American Society of Agricultural and Biological Engineers, and the Most Cited Paper Award from Computer Vision and Image Understanding in 2007 and 2008. His professional services include serving as an Associate Editor of IEEE Transactions on Neural Networks, Guest Editor of a special issue of the International Journal of Control, Automation, and Systems, and Guest Editor of a special issue of the Journal of Sensors.

E-mail: skong@sejong.edu


Figure 1. Schematic of gaze tracking techniques: (a) general approaches and (b) proposed C-LSTM gaze tracking scheme.

Figure 2. Feature extractor adopted from the iTracker model.

Figure 3. Signal flow in the gaze estimator based on an LSTM network.
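To make the signal flow in Figure 3 concrete, the short sketch below shows one way such an estimator could be driven at inference time: per-frame region crops are kept in a rolling buffer so that the gaze estimate for the latest frame also uses information from previous frames. The window length and the crop_regions() helper are hypothetical stand-ins for preprocessing steps not detailed in this excerpt.

# Illustrative frame-by-frame inference loop (window length and helper names are assumptions).
from collections import deque
import torch

WINDOW = 10                       # number of past frames kept for the LSTM (assumed value)
model = CLSTMGazeNet().eval()     # model class from the sketch above; trained weights would be loaded in practice
buffer = deque(maxlen=WINDOW)     # rolling buffer of per-frame region crops

def estimate_gaze(frame):
    """Return the (x, y) gaze estimate for the current frame using the buffered history."""
    # crop_regions() is a hypothetical helper returning left-eye, right-eye, face,
    # and face-grid tensors for a single frame.
    buffer.append(crop_regions(frame))
    left, right, face, grid = (torch.stack(r).unsqueeze(0) for r in zip(*buffer))
    with torch.no_grad():
        return model(left, right, face, grid).squeeze(0)   # (x, y) for the current frame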

Figure 4. Training and validation errors for the C-LSTM network.

Figure 5. Sample subjects in the GazeCapture dataset and visualization of the activation outputs of convolution layers. From top to bottom: images, activation outputs of C-F1 and C-F3, and a 2D representation of x(t).

Figure 6. Sample images in the MPIIGaze dataset and the positions of target markers on the monitor screen. Five eye images were captured for five corresponding positions of the target marker. From top to bottom: eye images and activation outputs of C-F1, C-E3, and FC-E1.

Figure 7. Locations of target markers for calibration on a monitor screen with dimensions of 62.6 × 47.4 cm (1920 × 1080 pixels). Nine (3 × 3) locations are equally spaced on the monitor screen.
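As a quick arithmetic check of the layout in Figure 7, the snippet below computes one possible set of nine equally spaced marker positions on a 1920 × 1080 screen, together with the pixel-to-centimeter scale implied by the 62.6 × 47.4 cm dimensions. Placing the markers at the centers of a 3 × 3 partition of the screen is an assumption; the exact margins used in the experiments are not given in this excerpt.

# Marker grid and pixel-to-cm scale for the calibration screen in Figure 7.
# Centers of a 3 x 3 partition are assumed; the paper's exact margins are not reproduced here.
W_PX, H_PX = 1920, 1080          # screen resolution (pixels)
W_CM, H_CM = 62.6, 47.4          # screen dimensions (cm), from the Figure 7 caption

cm_per_px_x = W_CM / W_PX        # ~0.033 cm per pixel horizontally
cm_per_px_y = H_CM / H_PX        # ~0.044 cm per pixel vertically

markers_px = [(int(W_PX * (2*i + 1) / 6), int(H_PX * (2*j + 1) / 6))
              for j in range(3) for i in range(3)]
markers_cm = [(x * cm_per_px_x, y * cm_per_px_y) for x, y in markers_px]

print(markers_px[:3])            # [(320, 180), (960, 180), (1600, 180)]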

Figure 8. Heat maps of gaze tracking when a subject was asked to look only at rabbits: (a) illustration of the test screen, (b) heat map of predicted gaze points recorded before calibration, (c) heat map of predicted gaze points recorded after calibration with glasses, and (d) heat map of predicted gaze points recorded after calibration without glasses.
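The heat maps in Figure 8 aggregate predicted gaze points over the screen. As a generic illustration (not the authors' visualization code), a 2D histogram of the predicted points can produce such a map; the bin counts below are arbitrary.

# Generic gaze heat map from a list of predicted (x, y) screen points (illustrative).
import numpy as np

def gaze_heatmap(points_px, width=1920, height=1080, bins=(96, 54)):
    """Accumulate predicted gaze points into a normalized 2D histogram over the screen."""
    xs, ys = zip(*points_px)
    heat, _, _ = np.histogram2d(xs, ys, bins=bins, range=[[0, width], [0, height]])
    return heat.T / max(heat.max(), 1)     # transpose to (rows, cols), normalize to [0, 1]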

Table 1. Performance comparison based on the MAE (cm) of the proposed and state-of-the-art methods.

Method                           x-axis         y-axis
iTracker [13]                    1.45 ± 1.43    1.67 ± 1.62
Multimodal CNN [15]              2.53 ± 1.87    2.38 ± 1.75
CNN with spatial weights [16]    1.53 ± 1.54    1.71 ± 1.67
Proposed                         0.82 ± 1.12    0.92 ± 1.22
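For reference, the per-axis MAE in Table 1 is the mean absolute difference between the predicted and ground-truth gaze coordinates along each screen axis. A minimal sketch of the computation, with placeholder array names, is shown below.

# Per-axis mean absolute error (cm) between predicted and ground-truth gaze points.
import numpy as np

def per_axis_mae(pred_cm, true_cm):
    # pred_cm, true_cm: arrays of shape (N, 2) holding (x, y) gaze points in centimeters.
    err = np.abs(np.asarray(pred_cm) - np.asarray(true_cm))
    return err[:, 0].mean(), err[:, 1].mean()   # (MAE along x, MAE along y)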

Table 2. Mean-squared gaze angle error (Eθ) of state-of-the-art gaze angle estimation techniques on the MPIIGaze dataset with 45,000 test eye images.

Method                Eθ (°)
SVR                   16.5
ALR                   16.4
kNN                   16.2
RF                    15.4
MnistNet              13.9
Spatial weight CNN    10.8
Multimodal CNN         9.8
Proposed               6.1

Table 3. Number of volunteers participating in the gaze tracking experiments.

          With glasses    Without glasses
Male      3               2
Female    1               4

Table 4. Gaze estimation errors (cm) of the proposed method for male and female subjects.

            w/o LSTM               w/ LSTM
            x-axis     y-axis      x-axis     y-axis
Male        9.52       11.35       2.62       3.52
Female      8.16       10.47       2.00       3.29
Average     8.84       10.91       2.31       3.41

Table 5. Gaze estimation errors (cm) of the proposed method for subjects with and without glasses.

              w/o LSTM               w/ LSTM
              x-axis     y-axis      x-axis     y-axis
Glasses       8.76       11.10       2.94       3.83
No glasses    7.24       9.53        1.54       3.07
Average       8.00       10.31       2.24       3.45
