Visual Speech Recognition of Korean Words Using Convolutional Neural Network
International Journal of Fuzzy Logic and Intelligent Systems 2019;19(1):1-9
Published online March 25, 2019
© 2019 Korean Institute of Intelligent Systems.

Sung-Won Lee, Je-Hun Yu, Seung-Min Park, and Kwee-Bo Sim

Department of Electronic and Electrical Engineering, Chung-Ang University, Seoul, Korea
Correspondence to: Kwee-Bo Sim
Received June 5, 2018; Revised September 7, 2018; Accepted September 7, 2018.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In recent studies, speech recognition performance has been greatly improved by using hidden Markov models (HMMs) and convolutional neural networks (CNNs). HMMs are used to statistically model speech when constructing an acoustic model, while CNNs reduce the error rate by predicting speech from images of the mouth region. In this paper, we propose a visual speech recognition (VSR) system based on lip images. To implement VSR, we repeatedly recorded three subjects speaking 53 words chosen from an emergency medical service vocabulary book. Audio signals were used to extract images of consonants, vowels, and final consonants from the recorded video. The Viola–Jones algorithm was used for lip tracking on the extracted images. The lip-tracking images were grouped and then classified using CNNs. To classify the components of a syllable (consonants, vowels, and final consonants), the CNNs used the VGG-s structure and a modified LeNet-5 with additional layers. After all syllable components were classified, the word was identified by Euclidean distance. In this experiment, a classification rate of 72.327% on 318 test words was obtained with VGG-s; with the modified LeNet-5, however, the classification rate was 22.327%.
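The final step of the pipeline described above, matching classified syllable components to a vocabulary word by Euclidean distance, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `match_word`, the toy vocabulary, and the encoding of each word as a vector of component indices are all assumptions made for the example.

```python
import numpy as np

def match_word(prediction, vocabulary):
    """Return the vocabulary word whose syllable-component vector is
    closest (in Euclidean distance) to the predicted component vector.

    prediction: NumPy vector of predicted component values.
    vocabulary: dict mapping word -> reference component vector.
    """
    best_word, best_dist = None, float("inf")
    for word, vec in vocabulary.items():
        dist = np.linalg.norm(prediction - vec)  # Euclidean distance
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Toy vocabulary: two hypothetical words, each encoded as a vector of
# syllable-component indices (consonant, vowel, final consonant).
vocab = {
    "word_a": np.array([1.0, 0.0, 3.0]),
    "word_b": np.array([2.0, 1.0, 0.0]),
}

# A noisy CNN prediction lands nearest to "word_a".
pred = np.array([0.9, 0.1, 2.8])
print(match_word(pred, vocab))  # word_a
```

Nearest-neighbor matching of this kind tolerates occasional component misclassifications, since a word can still be recovered as long as the predicted vector remains closer to the correct entry than to any other vocabulary word.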
Keywords : Convolutional neural network, Human–robot interaction, Korean word recognition, Viola–Jones algorithm, Visual speech recognition