International Journal of Fuzzy Logic and Intelligent Systems 2018; 18(4): 245-253
Published online December 31, 2018
https://doi.org/10.5391/IJFIS.2018.18.4.245
© The Korean Institute of Intelligent Systems
Byoungjun Kim and Joonwhoan Lee
Division of Computer Science and Engineering, Chonbuk National University, Jeonju, Korea
Correspondence to:
Joonwhoan Lee (chlee@jbnu.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Emotional evaluation of video clips is a difficult task because a clip includes not only stationary objects as the background but also dynamic objects as the foreground. In addition, many video analysis problems must be solved beforehand to properly address emotion-related tasks. Recently, however, the convolutional neural network (CNN)-based deep learning approach has opened this possibility by solving the action recognition problem. Inspired by CNN-based action recognition technology, this paper takes up the challenge of evaluating the emotion of video clips. We propose a deep learning model that captures video features and evaluates the emotion of a video clip on the Thayer 2D emotion space. In the model, a pre-trained convolutional 3D neural network (C3D) generates short-term spatiotemporal features of the video, a long short-term memory (LSTM) network accumulates those consecutive time-varying features to characterize long-term dynamic behaviors, and a multilayer perceptron (MLP) evaluates the emotion of a video clip by regression on the emotion space. Due to the limited amount of labeled data, the C3D is employed to extract diverse spatiotemporal features from various layers by a transfer learning technique. The C3D pre-trained on the Sports-1M dataset and the LSTM followed by the MLP for regression are trained in an end-to-end manner to fine-tune the C3D and to adjust the weights of the LSTM and the MLP-type emotion estimator. The proposed method achieves concordance correlation coefficient values of 0.6024 for valence and 0.6460 for arousal. We believe this emotional evaluation of video could be easily associated with appropriate music recommendation, once the music is emotionally evaluated in the same high-level emotional space.
Keywords: Video emotion analysis, C3D, Transfer learning, LSTM
Automatic emotional evaluation of diverse media is a long-standing problem in multimedia analysis and artificial intelligence, which requires various component technologies, including feature extraction and selection from media, emotion analysis, and its evaluation. There has been much related research on automatic emotion evaluation. Yazdani et al. [1] performed multimedia content analysis for emotional characterization of music video clips after extracting audio and visual features. Arifin and Cheung [2] proposed a video content-based emotion analysis model for the high-level video parsing problem. Zhao et al. [3] proposed a video indexing and recommendation method based on emotion analysis focusing on the viewer’s facial expression recognition. However, emotional evaluation of video is still a difficult and challenging task because a video includes not only stationary objects as the background but also dynamic objects as the foreground.
Recently, diverse convolutional neural network (CNN)-type deep learning techniques have produced great success in video analysis. However, the traditional two-dimensional CNN (2DCNN) is not well suited for video because it handles only the spatial information of a single image. Therefore, recent research on video analysis has been conducted with the 3DCNN [4] and the recurrent neural network (RNN) [5]. Unlike the 2DCNN, the 3DCNN explores and exploits spatiotemporal features, even though the captured temporal features are restricted to the short-time duration of the temporal window. Also, the RNN can take a sequential input to characterize dynamic features and make temporal decisions associated with relatively long-term behavior. Therefore, it has been successfully applied to various fields such as time series and video analysis.
In general, most studies on deep learning-based video emotion analysis have focused on the dynamic tracking of human facial expressions in video clips and their classification. The authors of [6–8] proposed deep learning-based facial image classification that uses the dynamic tracking of human facial expressions in a video clip, where a 2DCNN and an RNN are combined in the structure. Fan et al. [9] proposed a facial emotion recognition method that combines a 2DCNN-RNN and a C3D. Vielzeuf et al. [10] proposed a model combining a convolutional 3D neural network (C3D) and long short-term memory (LSTM) for facial emotion recognition.
However, the structure combining a 2DCNN and an RNN for video analysis may not be sufficient to explore the information contained in video clips, because the 2DCNN extracts only spatial features from individual frames and shifts the responsibility of characterizing the dynamic behavior to the RNN. For this reason, optical flow has been extracted separately to characterize the dynamic behavior and then combined with the spatial information in video analysis [11]. In Vielzeuf’s approach of a C3D followed by LSTM, however, the C3D can effectively extract both spatial and temporal behavior within a short time duration, and the succeeding LSTM can summarize the dynamic behavior to successfully perform facial emotion recognition.
In our conjecture, there are two problems when we directly apply this facial emotion recognition structure to the evaluation of generic video emotion. One is the relevant scope of the spatiotemporal features in the C3D, and the other is the time duration over which humans perceive the emotion.
In this paper, we propose a deep learning model for the emotional evaluation of video clips based on these conjectures. In the proposed structure, various short-term spatiotemporal features with different scopes are extracted from video clips using the C3D, and those features are aggregated to characterize relatively long-term dynamic behavior through the LSTM [12].
Because a general video contains diverse scenes with various shot speeds, unlike facial videos, it is difficult to say that emotions depend only on the high-level spatiotemporal features of the C3D. Rather, the spatiotemporal features from various levels in the C3D could be useful for evaluating the emotion of a general video clip. Note that the features from different levels in the C3D have their own scopes of 3D receptive fields in both the spatial and temporal dimensions, where the features from lower layers have smaller receptive fields than those from higher layers. In the proposed model, the emotion of a video clip is assumed to depend on the diverse spatiotemporal features from low to high levels of the C3D.
In general, video clips have different lengths. Therefore, in the proposed model, the spatiotemporal feature sequence is zero-padded to a fixed temporal length before being fed into the LSTM for temporal summarization. According to our conjecture, the summarization of the spatiotemporal features from the C3D for a general emotion needs a longer duration than facial expression recognition. The number of recursions in the two-layer LSTM subnetwork reflects this conjecture in our model.
In the proposed model, the long-term dynamic features characterized by LSTM are mapped into Thayer 2D emotion space, which consists of valence and arousal dimensions, by the multilayer perceptron (MLP)-type regression model. Therefore, the final emotion is estimated by the regression as a pair of valence and arousal scores in the emotion space.
In general, this type of data-driven machine learning approach requires a large amount of training data. However, because the available video emotion databases contain only a small amount of labeled data, the proposed model adopts a transfer learning technique that takes the C3D pre-trained on the Sports-1M dataset to capture the diverse spatiotemporal features. In the end-to-end training, the parameters of the pre-trained C3D are fine-tuned while the weights of the LSTM and the regression network are adjusted simultaneously. After training, the emotion one can feel when experiencing a video clip is evaluated in the 2D emotion space.
The experimental results show concordance correlation coefficient (CCC) values of 0.6024 for valence and 0.6460 for arousal. This shows that a deliberately constructed deep learning structure, a C3D and then LSTMs followed by a regression network, can provide excellent performance in obtaining the emotion, which is one of the high-level video semantics. The contributions of this paper can be summarized as follows.
First, we propose a deep learning-based emotion evaluation model for general video clips, which consists of a C3D and LSTMs followed by a regression network.
Second, the proposed model is constructed based on two conjectures: one is that the emotion can be well captured by spatiotemporal features from various levels of the C3D, and the other is that the general emotion of a video clip requires the RNN to summarize dynamic features over a longer time than facial expressions do.
Third, the proposed model can produce excellent performance compared with other methods.
The paper is organized as follows: the related works are reviewed in Section 2, the details of our proposed method are given in Section 3, and the experimental results and analysis are presented in Section 4. Finally, concluding remarks are given in Section 5.
In general, the emotion that humans feel when experiencing multimedia content can be expressed with adjectives. But there are too many adjective terms to be used in the expression. For this reason, psychological experiments with statistical analysis have been conducted to reduce the vocabulary and to define a compressed emotion space, on which each adjective is distinguished by its position. For example, Kawamoto and Soen [13] analyzed emotions of color patterns with 13 adjective pairs, and Lee and Park [14] defined three pairs of adjectives, “warm-cold”, “heavy-light”, and “dynamic-static”, for the emotional evaluation of color images. Another well-known 2D emotion space, defined by Thayer [15], is the valence-arousal space. In this space, the valence dimension refers to the positive/negative intrinsic attractiveness, and the arousal dimension expresses the degree to which one is awake or one’s sense organs are stimulated to the point of perception. According to the amounts of valence and arousal, 12 distinctive adjectives can be located on the boundary of a unit circle as in Figure 1. This paper uses this continuous Thayer emotion space to evaluate video, and the result is given as a position on the unit circle in the 2D space.
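As an illustration of how a pair of valence and arousal scores locates an emotion on this circle, the small sketch below maps an estimated (valence, arousal) point to its angle and to the nearest of 12 evenly spaced adjective slots; the adjective list and its placement are placeholders for illustration, not the exact labels of Figure 1.

```python
# Illustrative only: map a (valence, arousal) score to the nearest of 12 adjectives
# assumed to be evenly spaced on the unit circle (the labels below are placeholders).
import math

ADJECTIVES = ["pleased", "happy", "excited", "annoying", "angry", "nervous",
              "sad", "bored", "sleepy", "calm", "peaceful", "relaxed"]

def nearest_adjective(valence, arousal):
    angle = math.degrees(math.atan2(arousal, valence)) % 360   # 0 deg = positive valence axis
    index = int(round(angle / 30.0)) % 12                      # 12 slots, 30 deg apart
    return ADJECTIVES[index]

print(nearest_adjective(0.6, 0.6))   # a point in the positive-valence, high-arousal quadrant
```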
In early research on CNN-based video analysis, transfer learning was used [16], where features are independently extracted from the individual frames of a video. However, this is problematic for video analysis because only spatial features are considered, so the temporal information in the video clip is ignored. Donahue et al. [17] proposed a network model combining a 2DCNN and LSTM for video analysis. This method extracts spatial features sequentially from consecutive frames using the 2DCNN and then shifts the responsibility of capturing the temporal information to the LSTM. But this structure provides limited performance because the 2DCNN extracts only spatial features, and the LSTM may be overburdened by having to extract temporal features and characterize the dynamic behavior simultaneously. In addition, the 2DCNN-with-LSTM structure usually extracts the most abstract spatial features from the last fully connected layer following the consecutive 2DCNN layers and feeds them into the LSTM. It does not utilize the low-level spatial features from the lower CNN layers, which may cause information loss.
Simonyan and Zisserman [11] have shown high performance on video analysis with a two-stream network that consists of two branches of 2DCNNs, one capturing spatial features and the other capturing temporal behavior from optical flow. But the two-stream network requires heavy computation because the optical flow must be computed in a preprocessing step.
Recently, the focus of video analysis has shifted with the emergence of the 3DCNN. The 3DCNN has provided good results in video analysis because it captures spatiotemporal features with 3D convolution kernels and pooling operations. Figure 2 shows the difference between 2D and 3D convolution operations. Note that the temporal scope of the features in a 3DCNN is restricted, because capturing long-term temporal features would require more layers, which makes training difficult. Therefore, the 3DCNN captures temporal features defined only over a short-time duration and is not appropriate for characterizing long-term dynamic behavior.
In general, a 3D convolution receives a stack of consecutive frames as input and computes the value at position $(x, y, z)$ of the $j$-th feature map in the $i$-th layer as

$$v_{ij}^{xyz} = f\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right),$$

where $(P_i, Q_i, R_i)$ is the spatiotemporal size of the 3D convolution kernel, $w_{ijm}^{pqr}$ is the kernel weight connected to the $m$-th feature map of the previous layer, $b_{ij}$ is the bias, and $f(\cdot)$ is the activation function.
The C3D can model appearance and motion information simultaneously and outperforms 2DCNN features on various video analysis tasks. In addition, a deeper architecture with a uniform 3 × 3 × 3 kernel size has been empirically verified to produce the best result [18]. The C3D takes an input of dimension 112 × 112 × 3 × 16 and consists of 5 groups of convolutional layers, 5 pooling layers, 2 fully connected layers, and a softmax layer. The numbers of filters in the convolution groups are 64, 128, 256, 512, and 512, respectively. All convolution kernels are 3 × 3 × 3 with a stride of 1 × 1 × 1 and appropriate zero-padding, so the convolution layers do not change the feature-map size. All pooling layers use max pooling with a kernel size of 2 × 2 × 2, except pool1, which preserves the temporal information in the early phase. Figure 3 shows the C3D structure.
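To make this configuration concrete, the following is a minimal Keras-style sketch of a C3D-like backbone with the filter counts, 3 × 3 × 3 kernels, and pooling sizes described above (input reordered to TensorFlow’s frames × height × width × channels convention). The layer names and the number of convolution layers per group are illustrative assumptions, not the released C3D weights of [18].

```python
# Minimal sketch of a C3D-style backbone (assumptions: layer names, convs per group).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_c3d_backbone(frames=16, height=112, width=112, channels=3):
    inputs = layers.Input(shape=(frames, height, width, channels))
    x = layers.Conv3D(64, 3, padding="same", activation="relu", name="conv1a")(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2), name="pool1")(x)   # keep early temporal resolution
    x = layers.Conv3D(128, 3, padding="same", activation="relu", name="conv2a")(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2), name="pool2")(x)
    x = layers.Conv3D(256, 3, padding="same", activation="relu", name="conv3a")(x)
    x = layers.Conv3D(256, 3, padding="same", activation="relu", name="conv3b")(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2), name="pool3")(x)
    x = layers.Conv3D(512, 3, padding="same", activation="relu", name="conv4a")(x)
    x = layers.Conv3D(512, 3, padding="same", activation="relu", name="conv4b")(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2), name="pool4")(x)
    x = layers.Conv3D(512, 3, padding="same", activation="relu", name="conv5a")(x)
    x = layers.Conv3D(512, 3, padding="same", activation="relu", name="conv5b")(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2), padding="same", name="pool5")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu", name="fc6")(x)   # highest-level feature vector
    return models.Model(inputs, x, name="c3d_backbone")
```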
In this section, the proposed emotion evaluation model is explained in detail, including how to extract the spatiotemporal features of video clips using C3D and how to evaluate the emotion using LSTM with MLP-type regression network.
The learning ability of a deep learning model is related to the network depth. In general, even though a deeper network increases expressiveness, it also makes training and optimization difficult. In addition, a deeper network with a large number of parameters requires a large amount of training data to avoid the overfitting problem. In the proposed structure, various short-term spatiotemporal features are extracted from a video clip using the C3D based on a transfer learning technique, and those features are aggregated to characterize the long-term dynamic behavior through the LSTM. In the evaluation process, the emotion is estimated by a regression mapping, realized by the MLP, onto the valence and arousal axes of the Thayer 2D emotion space. Figure 4 shows the structure of the proposed emotional evaluation model.
The C3D can act as a feature extractor for video analysis [18]. A video clip passed through the C3D produces spatiotemporal features through the successive convolutional layers and corresponding max-pooling layers. For video clips, it is difficult to say that emotions depend only on the highest-level spatiotemporal features of the C3D. Rather, the spatiotemporal features from various levels of the C3D layers could be useful for evaluating the emotion of a general video clip.
In the proposed method, each video clip passed through the C3D yields 960 features obtained by global average pooling (GAP) of the outputs of the 1st to the 4th max-pooling layers, and these are concatenated with the 4096-dimensional feature vector of the first fully connected layer. Note that the GAP results from the 1st through the 4th pooling layers represent globally averaged spatiotemporal features with receptive fields ranging from the smallest to the largest. Also, the outputs of the fully connected layer can be interpreted as the highest-level features for generic action recognition. Therefore, the spatiotemporal features in the proposed model include all the globally aggregated features with diverse spatiotemporal scopes. Note that local spatiotemporal features would be available if GAP were not applied, but this would increase the number of features and burden the following LSTM stage. This is why we adopt GAP to aggregate the spatiotemporal features at each level of the C3D. Figure 5 shows the low-level and high-level spatiotemporal feature extraction.
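Continuing the sketch above (and reusing its assumed layer names), the multi-level feature extraction can be expressed as GAP over the pool1–pool4 outputs, giving 64 + 128 + 256 + 512 = 960 features, concatenated with the 4096-dimensional fc6 output, i.e., 5056 features per 16-frame window.

```python
# Sketch of multi-level spatiotemporal feature extraction from the assumed backbone above.
from tensorflow.keras import layers, models

def build_feature_extractor(c3d):
    pooled = [
        layers.GlobalAveragePooling3D(name=f"gap_{name}")(c3d.get_layer(name).output)
        for name in ("pool1", "pool2", "pool3", "pool4")   # 64 + 128 + 256 + 512 = 960 features
    ]
    fc6 = c3d.get_layer("fc6").output                      # 4096 highest-level features
    features = layers.Concatenate(name="spatiotemporal_features")(pooled + [fc6])
    return models.Model(c3d.input, features, name="c3d_feature_extractor")
```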
In a recent study, LSTM was used to estimate the emotion of a video clip in the same Thayer emotion space, but hand-crafted audio and video features were extracted and only the selected features were exploited by the LSTM to estimate the degrees of arousal and valence [19]. The role of the LSTM in that work is similar to ours, in that it characterizes the long-term dynamic behavior of video clips. In our proposed model, however, the LSTM is adjusted concurrently with the fine-tuning of the C3D, so the spatiotemporal features are defined automatically by end-to-end training. Note that the duration of one set of C3D features is only 16 frames, so the temporal extent is not enough to define complicated long-term emotions unless the original video frames are subsampled. That is why the LSTM needs to be added to summarize the long-term temporal behavior.
In the proposed method, we evaluate the emotion from the extracted spatiotemporal features using the LSTM with the MLP-type regression network. The proposed structure consists of two LSTM layers and a two-layer MLP network. There are 1024 internal states in each of the two LSTMs. The input dimension of the MLP-type regression network equals the number of states in the second LSTM, the number of hidden units is 256, and there are two output units with hyperbolic tangent activation functions, which correspond to the estimated scores on the arousal and valence dimensions. Figure 6 shows the LSTM with the MLP-type regression network.
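A minimal Keras-style sketch of this temporal summarization and regression head follows, assuming the 5056-dimensional window features from the sketch above and at most 60 time steps. The dropout placement follows Table 1; whether the 0.7/0.5 values there denote keep probabilities or drop rates is our assumption, noted in the comments.

```python
# Sketch of the two-layer LSTM with an MLP-type regression head (assumed input size 5056).
from tensorflow.keras import layers, models

def build_emotion_regressor(time_steps=60, feature_dim=5056):
    inputs = layers.Input(shape=(time_steps, feature_dim))
    x = layers.Masking(mask_value=0.0)(inputs)      # ignore zero-padded windows
    x = layers.LSTM(1024, return_sequences=True)(x)
    x = layers.Dropout(0.3)(x)                      # assuming 0.7 in Table 1 is a keep probability
    x = layers.LSTM(1024)(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(256, activation="relu")(x)     # MLP hidden layer
    outputs = layers.Dense(2, activation="tanh", name="valence_arousal")(x)
    return models.Model(inputs, outputs, name="lstm_mlp_regressor")
```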
To evaluate the performance of the proposed model, a dataset of 12,900 video clips is constructed from the MediaEval 2015 Affective Impact of Movies task [20] and selected YouTube videos; about 9,800 video clips come from MediaEval 2015 and the rest from our own selection from YouTube. The MediaEval data includes movies with labels of arousal and valence values on {−1, 0, 1}, while the selected YouTube clips consist of dramas, sports, and so on. The criterion for video selection from YouTube is that the content is expected to elicit strong emotional reactions from viewers, such as happiness, excitement, fear, anger, and so on. According to these adjectives, video clips whose lengths range from 7 to 18 seconds have been collected. We then annotated scores on the valence and arousal dimensions with one of five levels, i.e., {−1, −0.5, 0, 0.5, 1}. Taking the valence dimension for instance, −1 indicates negative emotion in the video content, while 1 stands for positive emotion. Figure 7 shows screenshot examples of the video clips and their annotations.
Among the 12,900 clips, we randomly selected 11,000 video clips for training, 1,400 for validation, and 500 for testing. The weights pre-trained on the benchmark Sports-1M dataset are used to initialize the C3D, and the spatiotemporal features of each video clip are extracted using the proposed feature extraction method.
The connection of the C3D and LSTM networks is shown in Figure 8, where every set of 16 successive frames, overlapped by 8 frames, produces a set of spatiotemporal features to feed into the LSTM network; that is, the size of the temporal window is 16 frames with a stride of 8 frames. The LSTM network produces the summarized dynamic behavior over 60 temporal windows, which implies an emotion is evaluated for every sequence of 488 frames. Because the maximum number of windows, which corresponds to the number of recursions in the LSTM network, is assumed to be 60, clips with fewer than 60 windows are zero-padded.
The dynamic behavior captured from the time-varying features over 488 consecutive frames is then fed into the regression network to estimate the emotion.
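The windowing and padding can be sketched as follows; the function names are illustrative, and the zero-padding is applied to the per-window C3D feature sequence as described above (16-frame windows, stride 8, at most 60 windows, i.e., 16 + 59 × 8 = 488 frames).

```python
# Illustrative sketch of temporal windowing and zero-padding of the feature sequence.
import numpy as np

def window_starts(num_frames, window=16, stride=8, max_windows=60):
    """Start frame indices of the sliding windows fed to the C3D."""
    return list(range(0, max(num_frames - window, 0) + 1, stride))[:max_windows]

def pad_feature_sequence(features, max_windows=60):
    """Zero-pad a (num_windows, feature_dim) C3D feature sequence to a fixed length."""
    padded = np.zeros((max_windows, features.shape[1]), dtype=features.dtype)
    padded[:min(len(features), max_windows)] = features[:max_windows]
    return padded
```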
For the experiments, an Intel Core i5-6600 CPU with a GTX 1080 Ti GPU was used. The proposed model is implemented with the TensorFlow library. The loss function to be minimized in the end-to-end learning is the mean squared error between the estimated emotion and the ground truth of the video clips. Also, to avoid overfitting, weight decay is applied as listed in Table 1.
In addition, we use dropout when the outputs of the first and second LSTM layers are fed into the second LSTM layer*1 and the MLP-type regression network*2, respectively. The training configuration is summarized in Table 1.
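A rough sketch of this training setup (MSE loss, RMSProp with the learning rate of Table 1, and weight decay added as an L2 penalty) is shown below; build_end_to_end_model is a hypothetical composition of the earlier sketches, not the authors’ released code.

```python
# Sketch of the end-to-end training step under the Table 1 configuration (assumptions noted).
import tensorflow as tf

model = build_end_to_end_model()            # hypothetical composition of the sketches above
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)
mse = tf.keras.losses.MeanSquaredError()
weight_decay = 5e-4

@tf.function
def train_step(clips, targets):             # targets: (batch, 2) valence/arousal scores
    with tf.GradientTape() as tape:
        preds = model(clips, training=True)
        loss = mse(targets, preds)
        loss += weight_decay * tf.add_n(
            [tf.nn.l2_loss(v) for v in model.trainable_variables if "kernel" in v.name])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```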
For evaluation metrics, we choose the CCC, the Pearson correlation coefficient (PCC), and the R2-score. The CCC is a measure of agreement between two variables that takes values in [−1, 1], where 1 indicates perfect agreement and −1 indicates perfect disagreement.
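For reference, the CCC can be computed with Lin’s standard definition, 2cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), sketched below; PCC and the R2-score are available in common numerical libraries.

```python
# Lin's concordance correlation coefficient (standard definition), for reference.
import numpy as np

def ccc(y_true, y_pred):
    mu_t, mu_p = np.mean(y_true), np.mean(y_pred)
    var_t, var_p = np.var(y_true), np.var(y_pred)
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```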
Table 2 shows the results of the emotion evaluation with the proposed model when the highest-level spatiotemporal features from the C3D are used as input to the LSTM network with and without the lower-level features. As expected, combining the highest-level features with the lower-level features, as in the proposed method, provides better results. This implies that not only the highest-level features for action recognition but also the lower-level spatiotemporal features help improve the emotional evaluation of a video clip.
Figure 9 shows the evaluated valence and arousal scores in Thayer’s emotional space. The proposed model produces evaluation scores similar to what humans feel about the video clips. Note that the video clips are well aligned with the adjectives; for example, nervous video clips can be easily distinguished from pleasant ones.
Table 3 shows performance comparisons of the proposed model with different LSTM time steps, which correspond to the length of video used for emotion evaluation. Note that a proper length is important for good emotion evaluation performance. Evidently, a time step of 20 is not enough to evaluate the emotion of video clips as a human would. Given the 16-frame windows with a stride of 8 frames, time steps of 20 and 60 cover 168 and 488 frames, or about 5.6 and 16.3 seconds, respectively, at a frame rate of 30 frames/sec.
Table 4 shows the CCC values of different methods. Because the video clips used for emotional evaluation differ from each other and our method uses only the visual features of a video clip without audio, a direct comparison is not possible. Nevertheless, our method gives CCC values of 0.6024 for valence and 0.6460 for arousal. Note that our proposed method provides better performance than the methods of Baveye et al. [20] and Gan et al. [21], even though we use only visual information. Also, our method shows comparable or better performance than the method of Zhang and Zhang [19], which uses carefully designed and selected hand-crafted features. These results show that the proposed model is promising for the emotional evaluation of video clips.
In this paper, we have proposed a deep learning model for the emotional evaluation of a video clip. In the proposed model, the pre-trained C3D network generates short-term spatiotemporal video features at various levels, the LSTM network accumulates the consecutive time-varying features to characterize long-term dynamic behaviors, and the MLP network evaluates the emotion of a video clip by a regression mapping onto Thayer’s 2D emotion space.
Due to the limited amount of labeled data, the C3D features of diverse levels are extracted by a transfer learning technique. The C3D pre-trained on the Sports-1M dataset and the LSTM network followed by the MLP for regression mapping are trained in an end-to-end manner to fine-tune the C3D and to adjust the weights of the two LSTMs and the MLP-type emotion estimator.
For training the model and evaluating its performance, MediaEval 2015 data and our own selected and annotated YouTube data have been used. Through the experiments, we have empirically shown that both the low-level and high-level spatiotemporal features are useful for improving the accuracy of emotion evaluation on the valence and arousal dimensions, and that a properly configured LSTM network is useful for characterizing long-term temporal dynamic behavior such as emotion. The proposed method achieves CCC values of 0.6024 for valence and 0.6460 for arousal. We believe this emotional evaluation of video could be easily associated with appropriate music recommendation, once the music is emotionally evaluated in the same high-level emotional space. Our future work will be conducted in two directions. One is to find more effective deep model structures for audio and video spatiotemporal feature extraction. The other is to develop an efficient method for multimodal emotion evaluation.
No potential conflict of interest relevant to this article was reported.
Figure 1. Thayer’s 2D emotion space.
Figure 2. Differences between 2D and 3D convolution operations.
Figure 3. The C3D structure.
Figure 4. The proposed emotion evaluation model.
Figure 5. The proposed spatiotemporal feature extraction.
Figure 6. The LSTM with MLP-type regression structure.
Figure 7. Screenshot samples of video clips and their corresponding annotated valence (V) and arousal (A) values.
Figure 8. The connection of the C3D and the LSTM with the MLP-type regression network.
Figure 9. Thayer’s emotional space with the emotional evaluation results of the proposed method.
Table 1. Training configuration.

Parameter | Value
---|---
Batch size | 50
Number of epochs | 500
Dropout | 0.7*1, 0.5*2
Weight initialization in LSTM/MLP regression network | Xavier
Weight decay | 0.0005
Learning rate | 1e-4
Optimizer | RMSProp
Table 2. Emotional evaluation results of the video clips.

Method | Valence CCC | Valence PCC | Valence R2 | Arousal CCC | Arousal PCC | Arousal R2
---|---|---|---|---|---|---
High-level only | 0.5381 | 0.5464 | 0.2986 | 0.5682 | 0.5701 | 0.3251
Proposed method | 0.6024 | 0.6049 | 0.3661 | 0.6460 | 0.6471 | 0.4188
Table 3. Emotional evaluation of video clips for different LSTM time steps.

LSTM time step | Valence CCC | Valence PCC | Valence R2 | Arousal CCC | Arousal PCC | Arousal R2
---|---|---|---|---|---|---
20 | 0.3078 | 0.3251 | 0.1124 | 0.2671 | 0.2876 | 0.1754
60 | 0.6024 | 0.6049 | 0.3661 | 0.6460 | 0.6471 | 0.4188
Table 4. Performance comparisons of the emotional evaluation of video clips (CCC).

Method | Valence CCC | Arousal CCC
---|---|---
Baveye et al. [20], with audio-visual information | 0.542 | 0.645
Gan et al. [21], with audio-visual information | 0.398 | 0.429
Zhang and Zhang [19], with only video information | 0.645 | 0.542
Proposed method, with only video information | 0.6024 | 0.6460
E-mail: breed213@jbnu.ac.kr
E-mail: chlee@jbnu.ac.kr