search for


A Deep-Learning Based Model for Emotional Evaluation of Video Clips
Int. J. Fuzzy Log. Intell. Syst. 2018;18(4):245-253
Published online December 25, 2018
© 2018 Korean Institute of Intelligent Systems.

Byoungjun Kim and Joonwhoan Lee

Division of Computer Science and Engineering, Chonbuk National University, Jeonnju, Korea
Correspondence to: Joonwhoan Lee (
Received December 2, 2018; Revised December 15, 2018; Accepted December 20, 2018.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Emotional evaluation of video clips is the difficult task because it includes not only stationary objects as the background but also dynamic objects as the foreground. In addition, there are many video analysis problems to be solved beforehand to properly address the emotion-related tasks. Recently, however, the convolutional neural network (CNN)-based deep learning approach, opens the possibility by solving the action recognition problem. Inspired by the CNN-based action recognition technology, this paper challenges to evaluate the emotion of video clips. In the paper, we propose a deep learning model to capture the video features and evaluate the emotion of a video clip on Thayer 2D emotion space. In the model, the pre-trained convolutional 3D neural network (C3D) generates short-term spatiotemporal features of the video, LSTM accumulates those consecutive time-varying features to characterize long-term dynamic behaviors, and multilayer perceptron (MLP) evaluates emotion of a video clip by regression on the emotion space. Due to the limited number of labeled data, the C3D is employed to extract diverse spatiotemporal from various layers by transfer learning technique. The pre-trained C3D on the Sports-1M dataset and long short term memory (LSTM) followed by the MLP for regression are trained in end-to-end manner to fine-tune the C3D, and to adjust weights of LSTM and the MLP-type emotion estimator. The proposed method achieves the concordance correlation coefficient values of 0.6024 for valence and 0.6460 for arousal, respectively. We believe this emotional evaluation of video could be easily associated with appropriate music recommendation, once the music is emotionally evaluated in the same high-level emotional space.
Keywords : Video emotion analysis, C3D, Transfer learning, LSTM