International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(2): 117-127
Published online June 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.2.117
© The Korean Institute of Intelligent Systems
Minh-Thanh Vo and Seong G. Kong
Department of Computer Engineering, Sejong University, Seoul, Korea
Correspondence to: Seong G. Kong (skong@sejong.edu)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper presents convolutional long short-term memory (C-LSTM) networks for improving the accuracy of gaze estimation. C-LSTM networks learn temporal variations in facial features while a human subject looks at objects displayed on a monitor screen equipped with a live camera. Given a sequence of input video frames, a set of convolutional layers individually extracts facial features from regions of interest such as the left eye, right eye, face, and face grid of the subject. Subsequently, an LSTM network encodes the relationships between changes in facial features over time and the position of the gaze point. C-LSTM networks are trained on a set of input-output data pairs of facial features and corresponding positions of the gaze point, and the spatial coordinates of the gaze point are determined from the facial features of the current frame together with information from previous frames to improve the accuracy of gaze estimation. Experimental results demonstrate that the proposed scheme achieves a significant improvement in gaze tracking performance, with average gaze estimation errors of 0.82 and 0.92 cm along the horizontal and vertical axes, respectively, on the GazeCapture dataset and an average angular error of 6.1° on the MPIIGaze dataset.
Keywords: Gaze tracking, Gaze estimation, Long short-term memory, Convolutional neural networks
There are no potential conflicts of interest relevant to this article.
E-mail: vmthanh@sejong.ac.kr
E-mail: skong@sejong.edu
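The architecture summarized in the abstract (separate convolutional feature extractors for the left eye, right eye, face, and face grid, followed by an LSTM that encodes feature changes over consecutive frames and regresses the 2D gaze point) can be sketched as below. This is a minimal illustrative sketch in PyTorch; the layer sizes, feature dimensions, face-grid resolution, and module names are assumptions and not the exact configuration reported in the paper.

```python
# Minimal sketch of a C-LSTM gaze estimator (illustrative layer sizes).
import torch
import torch.nn as nn


class RegionCNN(nn.Module):
    """Convolutional feature extractor for one region of interest (eye or face)."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, out_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))


class CLSTMGaze(nn.Module):
    """Per-frame CNN features -> LSTM over time -> 2D gaze point (x, y)."""

    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.eye_l = RegionCNN(feat_dim)
        self.eye_r = RegionCNN(feat_dim)
        self.face = RegionCNN(feat_dim)
        self.grid_fc = nn.Linear(25 * 25, feat_dim)   # binary face grid, 25x25 assumed
        self.lstm = nn.LSTM(4 * feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)          # gaze point coordinates

    def forward(self, eye_l, eye_r, face, grid):
        # Image inputs: (batch, time, channels, H, W); grid: (batch, time, 25*25).
        b, t = face.shape[:2]

        def per_frame(net, x):
            # Apply a region CNN to every frame, then restore the time axis.
            return net(x.flatten(0, 1)).view(b, t, -1)

        feats = torch.cat(
            [per_frame(self.eye_l, eye_l), per_frame(self.eye_r, eye_r),
             per_frame(self.face, face), self.grid_fc(grid)], dim=-1)
        out, _ = self.lstm(feats)      # temporal encoding of feature changes
        return self.head(out[:, -1])   # gaze estimate for the current (last) frame
```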
Figure: Schematic of gaze tracking techniques: (a) general approaches and (b) the proposed C-LSTM gaze tracking scheme.
Figure: Feature extractor adopted from the iTracker model.
Figure: Signal flow in the gaze estimator based on an LSTM network.
Figure: Training and validation errors of the C-LSTM network.
Figure: Sample subjects in the GazeCapture dataset and visualization of the activation outputs of the convolutional layers. From top to bottom: images, activation outputs of C-F1 and C-F3, and a 2D representation of ...
Figure: Sample images in the MPIIGaze dataset and the positions of target markers on the monitor screen. Five eye images were captured for the five corresponding positions of the target marker. From top to bottom: eye images and activation outputs of C-F1, C-E3, and FC-E1.
Figure: Locations of target markers for calibration on a monitor screen with dimensions of 62.6 ...
Figure: Heat maps of gaze tracking when a subject was asked to look only at rabbits: (a) illustration of the test screen, (b) heat map of predicted gaze points recorded before calibration, (c) heat map of predicted gaze points recorded after calibration with glasses, and (d) heat map of predicted gaze points recorded when the subject was looking at rabbits after calibration without glasses.
Table 2. Mean-squared gaze angle error (Eθ) of state-of-the-art gaze angle estimation techniques on the MPIIGaze dataset with 45,000 test eye images.

| Method | Mean-squared gaze angle error, Eθ (°) |
|---|---|
| SVR | 16.5 |
| ALR | 16.4 |
| kNN | 16.2 |
| RF | 15.4 |
| MnistNet | 13.9 |
| Spatial weights CNN | 10.8 |
| Multimodal CNN | 9.8 |
| Proposed | 6.1 |
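For reference, the gaze angle error compared in Table 2 is conventionally defined as the angle between the predicted and ground-truth 3D gaze direction vectors, averaged over the test set. The snippet below shows this standard computation as a generic illustration; the function name and the averaging step are assumptions, not code from the paper.

```python
# Angular error (degrees) between predicted and ground-truth gaze directions.
import numpy as np


def angular_error_deg(pred, true):
    """pred, true: arrays of shape (N, 3) holding 3D gaze direction vectors."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))


# Per-sample errors are then averaged over all test eye images.
errors = angular_error_deg(np.array([[0.0, 0.0, -1.0]]), np.array([[0.1, 0.0, -1.0]]))
print(errors.mean())
```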
Table 3. Number of volunteers participating in the gaze tracking experiments.

| | With glasses | Without glasses |
|---|---|---|
| Male | 3 | 2 |
| Female | 1 | 4 |
Table 4. Gaze estimation errors (cm) of the proposed method for male and female subjects.

| | w/o LSTM (horizontal) | w/o LSTM (vertical) | w/ LSTM (horizontal) | w/ LSTM (vertical) |
|---|---|---|---|---|
| Male | 9.52 | 11.35 | 2.62 | 3.52 |
| Female | 8.16 | 10.47 | 2.00 | 3.29 |
| Average | 8.84 | 10.91 | 2.31 | 3.41 |
Table 5. Gaze estimation errors (cm) of the proposed method for subjects with and without glasses.

| | w/o LSTM (horizontal) | w/o LSTM (vertical) | w/ LSTM (horizontal) | w/ LSTM (vertical) |
|---|---|---|---|---|
| Glasses | 8.76 | 11.10 | 2.94 | 3.83 |
| No glasses | 7.24 | 9.53 | 1.54 | 3.07 |
| Average | 8.00 | 10.31 | 2.24 | 3.45 |
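Tables 4 and 5 each report two error values per condition; assuming these are mean absolute errors along the horizontal and vertical screen axes in centimeters, as in the abstract's convention (the tables themselves do not restate the definition), a generic computation would look like the sketch below. The function name and inputs are illustrative.

```python
# Per-axis gaze estimation error (cm); assumes predictions and ground truth are
# gaze point coordinates (x, y) in cm on the screen plane.
import numpy as np


def per_axis_error_cm(pred_xy, true_xy):
    """Mean absolute error along the horizontal and vertical screen axes."""
    diff = np.abs(np.asarray(pred_xy) - np.asarray(true_xy))
    ex, ey = diff.mean(axis=0)
    return ex, ey
```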