International Journal of Fuzzy Logic and Intelligent Systems 2019; 19(2): 88-96
Published online June 25, 2019
https://doi.org/10.5391/IJFIS.2019.19.2.88
© The Korean Institute of Intelligent Systems
Bhuwan Bhattarai and Joonwhoan Lee
Chonbuk National University, Jeonju, Korea
Correspondence to: Joonwhoan Lee (chlee@chonbuk.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper proposes automatic mood detection of music based on a combination of transfer learning and a multilayer perceptron. A five-layer convolutional neural network pre-trained on the Million Song dataset is used to extract features from the EmoMusic dataset. We obtain a set of features from each of the five layers, which is fed into a multilayer perceptron (MLP)-based regressor. Through the regression network we estimate the mood of music in Thayer’s two-dimensional emotion space, whose axes correspond to arousal and valence. Because the EmoMusic dataset does not provide enough data for training, we triple its size by augmenting it with time stretching. We perform experiments with the augmented data as well as the original EmoMusic dataset. A box-and-whisker plot, along with the mean over 10-fold cross-validation, is used to evaluate the proposed mood detection. In terms of the percentage of
Keywords: Multilayer perceptron (MLP), Mood detection, Music, Transfer learning, Global average pooling
People enjoy music in their leisure time, and nowadays there is more and more music on personal computers, on mobile phones, and on the Internet. This huge amount of music can be indexed by mood, which helps us easily retrieve an appropriate set of songs. There are two major approaches to automatically evaluating the mood of music: one is mood classification [1, 2] and the other is mood regression. In mood classification, a category named by an adjective is automatically assigned to a song. In mood regression, the mood is evaluated by values on appropriate scales, each of which represents the degree of a specific emotional state.
Whichever type of detection we take, it is necessary to specify an appropriate emotion space in which to evaluate the mood. Even though there is no standardized space, Thayer’s two-dimensional emotion space model in Figure 1 is well recognized as such a space, in which the fundamental aspects of music mood are represented in a two-dimensional space of valence and arousal [3]. In the model, the valence axis describes a continuous scale from pleasantness (happy, pleased, hopeful, etc.) to unpleasantness (unhappy, annoyed, hopeless, etc.). The arousal axis represents the degree of arousal, from calming to exciting. Each axis spans bipolar scale values from −1 to 1. An emotion is represented by a point or region [4] based on the coordinate values, each of which corresponds to the positive or negative degree of the feeling. Mood categories can also be designated within the two-dimensional emotion space, as in Figure 1 [5].
The authors of [2, 6] conclude that a number of musical features, such as tempo, pitch, rhythm, and harmony, allow listeners to perceive and correctly identify specific intended emotional expressions such as sadness, fear, humor, happiness, and excitement. A study of lyric-based music mood recognition [7] was performed to observe the relationship between lyric content and mood using OPM songs; the KeyGraph keyword-generation algorithm and word-level features such as TF-IDF were examined with various parameters and thresholds to extract keywords from the lyrics. Automatic music mood recognition using support vector regression [8, 9] has been studied by mapping various musical features into Thayer’s two-dimensional emotion space model, which can then predict the values for valence and arousal. Similarly, a 5-layer convolutional neural network (CNN) trained on the Million Song dataset [10] for music tagging has been used to extract features for detecting the music mood with respect to valence and arousal [11]. A similar audio network [
A deep neural network needs a large amount of labeled data, but the music information retrieval (MIR) community still suffers from a scarcity of such data, especially for music mood. Transfer learning, also called inductive transfer, is one way to resolve the problem of a limited amount of labeled data. Transfer learning reuses knowledge of a source domain to solve a new task in a target domain; for example, knowledge gained while learning to recognize cars can be applied to recognizing trucks [13]. Transfer learning has been applied successfully in many domains, such as visual object recognition [14], web document classification [15, 16], and human action recognition [17], and there have been several studies [18–21] in the field of machine learning using the transfer learning approach. In [22], knowledge learned from natural RGB images is transferred to malignancy classification of contrast-enhanced MR images. In [23], POS tagging on Penn Treebank data serves as the source task whose knowledge is transferred to the target task of POS tagging on microblogs, for which only limited sequence-tagging annotation is available. Similarly, the patterns of human epithelial type 2 (HEp-2) cells have been studied by applying transfer learning from a pre-trained deep CNN to extract generic features [24].
In this work, we extract features from the original and augmented EmoMusic dataset by reusing the layer-wise weights of the pre-trained network [11], which was trained on the Million Song dataset for music tagging. After feature extraction, a multilayer perceptron (MLP) with three or four hidden layers, trained with 50% dropout, is used to predict the scores in valence and arousal. In the experiments, the amount of data is increased by augmenting the original EmoMusic dataset using time stretching; the ground-truth labels of the augmented data in the valence and arousal dimensions are set to be identical to those of the original data. The results of 10-fold cross-validation show that the MLP with the original EmoMusic dataset is similar in accuracy to the state of the art, SOA(1) [11] and SOA(2) [25]. The highest
The organization of this paper is as follows. Section 2 describes the audio preprocessing, feature extraction, mood detection using the MLP-based regressor, and the data augmentation method. Section 3 presents the experimental results. Finally, we draw conclusions in Section 4.
Each audio file in the EmoMusic dataset is preprocessed using the Librosa library [26] to generate a Mel-spectrogram. The single-channel audio files are 29 seconds long, with a sampling rate of 12,000 Hz. The FFT size is 512 with a hop length of 256, which produces 1,360 frames per song. The number of Mel bins is 96, so the generated Mel-spectrogram has size 96×1360; this is fed into the pre-trained network for feature extraction.
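For concreteness, the following Python sketch shows how such a Mel-spectrogram could be produced with Librosa under the parameters stated above; the function name and the log scaling of the spectrogram are our own choices and are not specified in the paper.

```python
import librosa

def melspectrogram_96x1360(path, sr=12000, n_fft=512, hop=256, n_mels=96):
    """Load a 29-second EmoMusic clip and compute its 96 x 1360 Mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)       # resample to 12 kHz, single channel
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)                   # log scaling (our assumption); shape ~ (96, 1360)
```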
The goal of the pre-trained CNN is to automatically generate multiple tags. The network was trained on the Million Song dataset, for which it predicts among 50 tags. This pre-trained convolutional network consists of five consecutive blocks, each of which has a set of 3×3 convolution filters, a max-pooling operation, and batch normalization, followed by a fully connected layer for the multiple tagging [11]. Every unit in each layer has the exponential linear unit (ELU) as its activation function. The 50 units of the last fully connected layer have sigmoidal activation functions to generate class memberships corresponding to the tags.
Figure 2 shows the CNN after removing the fully connected layer for tag generation, which has nothing to do with feature extraction. The structure of the convolutional network for feature extraction is summarized in Table 1.
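A minimal Keras sketch of this feature-extraction network is given below, following Table 1. The weights would be copied from the pre-trained tagging network of [11]; the ordering of operations inside each block and the padding choices (made so that the pooling sizes of Table 1 fit a 96×1360 input) are our assumptions, not details stated in the paper.

```python
from tensorflow.keras import layers, models

def feature_extractor(input_shape=(96, 1360, 1)):
    """Sketch of the five-block CNN of Table 1, with the tagging head removed and
    the output of every block exposed so that layer-wise features can be pooled."""
    pool_sizes = [(2, 4), (4, 4), (4, 5), (2, 4), (4, 4)]      # Mel x time pooling, Table 1
    x = inputs = layers.Input(shape=input_shape)
    block_outputs = []
    for pool in pool_sizes:
        x = layers.Conv2D(32, (3, 3), padding='same')(x)       # 3x3 filters, 32 channels
        x = layers.BatchNormalization()(x)
        x = layers.Activation('elu')(x)                        # ELU activation
        x = layers.MaxPooling2D(pool_size=pool, padding='same')(x)
        block_outputs.append(x)
    return models.Model(inputs, block_outputs)                 # weights would come from [11]
```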
After preprocessing, the 96 × 1360 Mel-spectrogram of the EmoMusic dataset is fed into the pre-trained network to generate feature maps. Each feature map has its own size, and each unit in a map has its own receptive field in the Mel-frequency and time dimensions.
We assume that the whole piece of music maintains the same mood throughout. Note that one could instead divide the whole piece into several segments, detect the mood of each time segment, and summarize the moods of the segments to obtain the mood of the whole piece.
Because each layer has 32 channels with a different set of filters, we can take 32 features per layer by global average pooling. The 32 features from each of the 5 layers can be used separately or concatenated to obtain a higher-dimensional feature vector. Therefore, the maximum dimension of the output feature is 160 when all five layers are concatenated.
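The following sketch illustrates this pooling and concatenation step, assuming the block outputs are channels-last arrays (e.g., as returned by the feature extractor's predict()); the helper name is ours.

```python
import numpy as np

def layerwise_features(block_outputs):
    """Global average pooling of each block's output: every 32-channel map is
    averaged over its Mel-frequency and time axes, giving 32 values per layer;
    concatenating the five layers yields a 160-dimensional feature vector."""
    pooled = [np.asarray(f).mean(axis=(1, 2)) for f in block_outputs]   # each (batch, 32)
    return pooled, np.concatenate(pooled, axis=-1)                      # (batch, 160)
```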
Note that one could also divide a feature map into several groups along the Mel-frequency bins and thereby take more features per map. We do not consider this method in the paper, because the number of features could become too large and a feature-reduction technique would then be necessary.
After obtaining the features from the pre-trained network, the mood regressor is constructed as an MLP. Based on empirical experiments, we use three or four hidden layers and one output layer, as shown in Figure 3. The numbers of units in the hidden layers are also set empirically according to the data size in each experiment. The last layer has two nodes, each of which evaluates the mood on one of Thayer’s two-dimensional scales, valence and arousal. Because the value of each bipolar dimension spans [−1, 1], the hyperbolic tangent is adopted as the activation function [27]. We also constructed an MLP of the same size for both the original and the augmented data in our experiments. However, we do not report that result, because the amount of original data is not sufficient for a large MLP, so the resulting accuracy decreases and the network appears to overfit.
To avoid overfitting during training, a dropout rate of 50% [28] is applied to the units in each hidden layer. The network is trained with 10-fold cross-validation with stratified splits. The Adam optimizer [29] is used to update the weights and biases so as to minimize the mean squared error (MSE) loss. A batch size of 100 and 300 epochs are used during training.
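A hedged Keras sketch of the MLP regressor and its training setup is shown below, following the structure of Table 4 and the hyperparameters stated above; the variable names X and y are hypothetical placeholders for the pooled features and the normalized labels.

```python
from tensorflow.keras import layers, models

def build_mlp(input_dim=160, hidden=(1000, 500, 100, 20)):
    """MLP regressor following Table 4: ReLU hidden layers with 50% dropout and a
    two-unit tanh output for the normalized valence and arousal scores."""
    x = inputs = layers.Input(shape=(input_dim,))
    for units in hidden:
        x = layers.Dense(units, activation='relu')(x)
        x = layers.Dropout(0.5)(x)                    # 50% dropout on each hidden layer
    outputs = layers.Dense(2, activation='tanh')(x)   # valence and arousal in [-1, 1]
    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')       # Adam optimizer, MSE loss
    return model

# Hypothetical usage with features X of shape (N, 160) and labels y of shape (N, 2):
# build_mlp().fit(X, y, batch_size=100, epochs=300)
```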
As mentioned earlier, there are 32 features from each layer of the pre-trained network, obtained by global average pooling. In the experiments, we compare the performance of the hierarchical features from the different layers of the pre-trained CNN. In addition to the 5 sets of 32-dimensional features, the set of all features from the 5 layers concatenated together is included in the comparison.
For the comparison, we use each of these six feature sets for training and obtain the percentage of
In a data-driven approach, a large amount of data is essential to avoid overfitting. Therefore, augmentation is required to artificially increase the amount of data, especially when it is insufficient. Because an expensive psychological experiment is necessary to obtain labeled data for mood detection, the labeled set is usually not large enough for a data-driven approach.
The EmoMusic dataset has only 744 labeled examples for mood regression. Therefore, we increase the size of the EmoMusic dataset by time stretching with modest scaling rates. Note that audio that is stretched or shortened too much changes noticeably in tempo, which can make the perceived mood differ from that of the original data. We therefore use a randomly selected scaling factor in a small interval around 1.0, namely [0.95, 1.05]. A scaling factor larger (smaller) than 1 produces time-stretched (shortened) audio. After augmentation, we have 3 × 744 EmoMusic examples, consisting of two time-stretched versions plus the original data.
We use the Librosa library to increase the amount of data by time stretching. For each clip in the EmoMusic dataset, the short-time Fourier transform (STFT) of the audio is first computed, and two stretching or shrinking rates are drawn at random from the uniform distribution over the interval above. The STFT matrix of the time-frequency representation is then stretched or shortened with the phase vocoder implemented in [30]. After the time stretching, the resulting time-frequency representation is transformed back into time-domain audio with the inverse STFT (ISTFT).
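The sketch below illustrates this augmentation step with Librosa's STFT, phase vocoder, and inverse STFT; the helper name and the reuse of the preprocessing FFT parameters are our assumptions. In practice, librosa.effects.time_stretch wraps the same phase-vocoder procedure.

```python
import numpy as np
import librosa

def stretch_clip(y, rng, hop=256, n_fft=512):
    """Create one augmented copy of a clip: STFT -> phase vocoder -> inverse STFT,
    with the stretch rate drawn uniformly from [0.95, 1.05]."""
    rate = rng.uniform(0.95, 1.05)                    # modest rate, so the mood is preserved
    D = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    D_stretched = librosa.phase_vocoder(D, rate=rate, hop_length=hop)
    return librosa.istft(D_stretched, hop_length=hop)

# Two stretched copies per original clip give 3 x 744 examples in total, e.g.:
# rng = np.random.default_rng(0)
# augmented = [stretch_clip(y, rng) for _ in range(2)]
```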
Each piece of music in the EmoMusic dataset used in our experiments has emotion scores on the two scales of valence and arousal in Thayer’s emotion space. The EmoMusic dataset covers eight genres of music, namely Blues, Electronic, Rock, Classical, Folk, Jazz, Country, and Pop, and was annotated via crowdsourcing on Amazon’s Mechanical Turk.
Two types of dataset are used in our experiments, as listed in Table 2. One is the original EmoMusic dataset [31] and the other is our augmented EmoMusic data. The original dataset consists of 744 audio clips, and the augmented dataset additionally includes the 2×744 time-stretched clips. The emotional score of an augmented clip is the same as that of the original clip: a pair of continuous values from 1 to 9 on the arousal and valence scales. To match the hyperbolic tangent activation function, each score is normalized to the range from −1 to 1.
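Assuming a linear rescaling (the exact mapping is not stated in the paper), the normalization can be written as:

```python
def normalize_score(score):
    """Linearly rescale an annotation from the [1, 9] range to [-1, 1],
    matching the tanh output of the MLP (the linear mapping is our assumption)."""
    return (score - 5.0) / 4.0
```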
The
In each experiment in Table 2,
The original EmoMusic data contains only 744 excerpts of music, which limits the MLP to a modest number of hidden nodes. We therefore use a network that is small in both the number of hidden layers and the number of nodes per hidden layer. The structure of the MLP for the original data is summarized in Table 3. The results of experiment 1 are shown in Figure 4. The percentage of
Table 4 summarizes the structure of the MLP for experiment 2. We add an additional hidden layer and increase the number of nodes in the hidden layers, because the amount of data is larger in this experiment.
The results of the experiment are shown in Figure 5. The highest mean scores over the 10 folds are achieved by
From the results of our experiments we can conclude that higher-layer features, as well as the assorted features of all layers, are better than lower-layer features. The box-and-whisker plots in Figures 4 and 5 show the distribution of
The results from experiment 2 show a smaller interquartile range (IQR) and shorter whiskers than those from experiment 1, which implies that the data augmented by time stretching helps to obtain stable mood detection with the MLP-based regressor. This result coincides well with the conclusion that deep learning and other modern nonlinear machine learning techniques obtain better results with data augmentation [34, 35].
Considering the stable results from experiment 2, we can cautiously conclude that the features from the higher layers provide better results for mood estimation of music. CNN features in higher layers generally have wider receptive fields, which implies that features extracted over longer time durations are more effective for detecting the mood of music.
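As a rough back-of-the-envelope check, counting only the cumulative time-axis pooling factors of Table 1 (and ignoring the additional context contributed by the 3×3 convolutions), one unit in the fifth layer already spans roughly 27 seconds of the 29-second clip:

```python
# Approximate time span of one unit in each layer, from the time-axis pooling
# factors in Table 1 only (4, 4, 5, 4, 4); the 3x3 convolutions add a little more.
hop, sr = 256, 12000
frame_sec = hop / sr                     # ~21.3 ms per spectrogram frame
factor = 1
for layer, pool_t in enumerate([4, 4, 5, 4, 4], start=1):
    factor *= pool_t
    print(f"layer {layer}: ~{factor * frame_sec:.2f} s")  # 0.09, 0.34, 1.71, 6.83, 27.31 s
```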
We compare the mean over 10-fold results from
In Table 5, the MLP with the augmented data achieves the best scores of 79.64% and 85.61% on the valence and arousal scales, respectively. The MLP with the original data in experiment 1 also achieves a better result in valence, which is around 6% higher than SOA(1) and 2% higher than SOA(2); its arousal score, however, is 3% lower than SOA(2) and 2% higher than SOA(1).
This paper has proposed automatic mood detection of music based on a combination of transfer learning and a multilayer perceptron. A 5-layer CNN pre-trained on the Million Song dataset is used to extract features from the EmoMusic dataset. We obtain a set of features from each of the five layers, which is fed into an MLP-based regressor. Through the regression network we estimate the mood of music in Thayer’s two-dimensional emotion space, whose axes correspond to arousal and valence. Because the EmoMusic dataset does not provide enough data for training, we triple its size by augmenting it with time stretching. We perform experiments with the augmented data as well as the original EmoMusic dataset. A box-and-whisker plot, along with the mean over ten-fold cross-validation, is used to evaluate the proposed mood detection. In terms of the percentage of
Based on our results, we conclude that the data augmentation technique plays a significant role in the stable and efficient detection of music mood. In addition, the features taken from convolution filters with larger receptive fields are more effective for mood detection with the MLP-based regressor.
We believe that the performance can be further improved by an ensemble strategy over time-segmented emotional evaluations and by a deliberate feature selection method.
No potential conflict of interest relevant to this article was reported.
The authors would like to thank the Korean Ministry of Education for funding the research leading to these results. This work was supported by the National Research Foundation of Korea (NRF) under the Basic Science Research Program (NRF-2015R1D1A1A01058062).
Table 1. Structure of CNN for feature extraction.
Layer | Max-pooling | Filter-size | Channels | Activation |
---|---|---|---|---|
1 | 2 × 4 | 3 × 3 | 32 | ELU |
2 | 4 × 4 | 3 × 3 | 32 | ELU |
3 | 4 × 5 | 3 × 3 | 32 | ELU |
4 | 2 × 4 | 3 × 3 | 32 | ELU |
5 | 4 × 4 | 3 × 3 | 32 | ELU |
Table 2. EmoMusic and augmented data sets for experiment.
Experiments | Included data | No. of data |
---|---|---|
Experiment 1 | Original EmoMusic | 744 |
Experiment 2 | Original EmoMusic+Augmented data | 2232 |
Table 3. MLP structure for experiment 1 of original data.
Layer | No. of nodes | Activation | Dropout (%) |
---|---|---|---|
Input | 32 or 60 | - | - |
Hidden 1 | 200 | ReLU | 50 |
Hidden 2 | 100 | ReLU | 50 |
Hidden 3 | 20 | ReLU | 50 |
Output | 2 | Tanh | - |
Table 4. MLP structure for experiment 2 of augmented data.
Layer | No. of nodes | Activation | Dropout (%) |
---|---|---|---|
Input | 32 or 160 | - | - |
Hidden 1 | 1000 | ReLU | 50 |
Hidden 2 | 500 | ReLU | 50 |
Hidden 3 | 100 | ReLU | 50 |
Hidden 4 | 20 | ReLU | 50 |
Output | 2 | Tanh | - |
Table 5. Comparison with SOA(1) and SOA(2) (%).
Method | Valence | Arousal |
---|---|---|
MLP (original) | 51.88 | 67.07 |
MLP (augmented) | 79.64 | 85.61 |
SOA(1) | 45.72 | 65.31 |
SOA(2) | 50.0 | 70.0 |
E-mail: bhubon240@gmail.com
E-mail: chlee@chonbuk.ac.kr
Figure 1. Thayer’s two-dimensional emotion space, named the arousal-valence space [
Figure 2. The structure of the CNN for feature extraction.
Figure 3. Multilayer perceptron for the mood detector (See
Figure 4. Results of the MLP for experiment 1 with the original data.
Figure 5. Results of the MLP for experiment 2 with the augmented data.