Int. J. Fuzzy Log. Intell. Syst. 2018; 18(2): 154-160
Published online June 25, 2018
https://doi.org/10.5391/IJFIS.2018.18.2.154
© The Korean Institute of Intelligent Systems
Yagya Raj Pandeya and Joonwhoan Lee
Department of Computer Science and Engineering, Chonbuk National University, Jeonju, Korea
Correspondence to: Joonwhoan Lee (chlee@jbnu.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The domestic cat (Felis catus) is an ancient human companion that can alert humans to environmental changes through its remarkable variety of vocalizations. Classifying cat sounds with deep neural networks suffers from a scarcity of labeled data, which impelled us to build the CatSound dataset across 10 sound categories. Even this dataset was not large enough for an end-to-end, data-driven approach, so we chose transfer learning for feature extraction. The extracted features are fed to six different classifiers, and an ensemble technique is applied to the predicted probabilities of all classifiers. Both the ensemble and data augmentation improve performance in this research. Finally, the results are evaluated using confusion matrices and receiver operating characteristic curves.
Keywords: Labeled dataset, Transfer learning, Ensemble method, Data augmentation
The sounds of pet animals can be very helpful to human beings, for example for security or early warning of natural disasters. Pet sound classification and recognition using deep learning is therefore a growing area of research. Data-driven approaches to acoustic signals have attracted strong interest in recent years, with some studies focusing on animal sound classification [1, 2] and on animal population identification based on sound characteristics [3]. Marine mammal sound classification and its impact on marine life are studied in [4, 5]. Several studies [6–12] focus on bird sound identification, classification, and their challenges. The study in [13] performs insect species classification from sound signals using machine learning techniques, and [14] applies machine learning to predict unusual animal sound behavior during earthquakes and other natural disasters. Recently, the possibility of transfer learning for bird sound identification using the fundamental characteristics of music was explored in [15]. To the best of our knowledge, there is no detailed study of pet animal sounds, especially the classification of domestic cat sounds, even though cats are close companions of human beings.
In this paper, our work focuses on automatically classifying unseen domestic cat sounds using transfer learning. The research started with building a robust cat sound dataset that can handle the biodiversity, species variation, and age variation of domestic cats. To increase the generalization and robustness of the deep neural network, we perform audio data augmentation as in [16]. The next step was feature extraction using a pretrained neural network, followed by classification of these features using popular machine learning techniques. Finally, we comparatively analyze the results of the various classifiers and examine the misclassified cat sounds. The rest of the paper is organized as follows: Section 2 describes the preparation of the domestic cat sound dataset and its challenges. Section 3 gives an overview of transfer learning and the classification of the extracted features. Section 4 describes the evaluation setup, the results and discussion are included in Section 5, and the conclusion and future work are given in Section 6.
Hearing is the second most important human sense, after vision, for recognizing an animal. Automatic recognition of unseen cat sounds using a deep neural network needs a large amount of annotated data for successful training. It was a great challenge for us to collect domestic cat sounds and to separate the meaningful audio segments so that our network architectures could learn the proper semantics of the audio. Finally, we were able to collect cat video and audio files, mostly from online sources such as YouTube.
The pie chart in Figure 1 illustrates the cat sound dataset, where each class is named according to the sound a cat makes in a particular mood. The number of sound files in each class is nearly 300, about 10% of the total, so we consider the CatSound dataset balanced. It contains more than three hours of domestic cat sounds across 10 sound classes.
This section covers feature extraction from the CatSound dataset using transfer learning and the classification of these features using various classifiers implemented with Python libraries.
Mel-spectrograms are extracted from the CatSound data in real time on the GPU using Kapre [20]. The input has a single channel, 96 mel bins, and 1,360 temporal frames, as described in [21]. To increase the data size and better generalize the cat sound features, we use audio data augmentation as in [16], with random selection among time stretching, pitch shifting, dynamic range compression, and noise insertion. We augment the original dataset using LibROSA [22], making one to three augmented clones of each original file; the resulting datasets are named the 1×, 2×, and 3× augmented datasets.
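As a concrete illustration, the following minimal sketch shows how such augmented clones could be generated with LibROSA. The parameter ranges, the file name, and the random-choice scheme are illustrative assumptions rather than the exact settings of this work, and the dynamic range compression branch is omitted for brevity.

```python
import numpy as np
import librosa

def augment(y, sr, rng):
    """Return one randomly augmented copy of waveform y (illustrative ranges)."""
    choice = rng.integers(0, 3)
    if choice == 0:                                    # time stretching
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.25))
    elif choice == 1:                                  # pitch shifting
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=int(rng.integers(-2, 3)))
    else:                                              # additive noise
        y = y + 0.005 * rng.standard_normal(len(y))
    return y

rng = np.random.default_rng(0)
# "cat_meow.wav" is a hypothetical file name standing in for a CatSound clip.
y, sr = librosa.load("cat_meow.wav", sr=22050, mono=True)
clones = [augment(y, sr, rng) for _ in range(3)]       # clones for 1x, 2x, 3x sets
```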
We use the pretrained five-layer convolutional neural network (CNN) of [21] as a feature extractor for the CatSound dataset. This network was trained on the Million Song Dataset [23] and achieved very good accuracy for music classification and regression. Cat sounds differ greatly from studio-recorded music in frequency variation, signal-to-noise ratio, sudden changes in pitch and magnitude, frequent interruption by environmental sounds, species biodiversity, aging, and animal mood. Despite these differences, the lack of sufficient labeled data led us to use this pretrained network as the source network for transfer learning.
The experimental results show that, even though cat sounds and music have very different characteristics, all audio signals share some features that are very helpful for a small dataset when transfer learning is performed properly. The feature extraction and classification pipeline built on the pretrained CNN architecture is shown in Figure 2. The source network has 32 feature maps in each layer, and global average pooling maps each layer into a 32-dimensional feature vector. Finally, we concatenate the features of all five layers to obtain a 160-dimensional feature vector. These feature vectors are the input to our classifiers, whose predicted probabilities are averaged for the final ensemble result.
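The pooling-and-concatenation step can be sketched in Keras as follows. The stand-in network below only mimics the shape of the source network (five convolutional blocks with 32 feature maps each); in the actual pipeline the pretrained weights of [21] would be loaded instead of random ones.

```python
import numpy as np
from tensorflow.keras import layers, Model

# Stand-in for the pretrained source network: five conv blocks, 32 maps each.
inp = layers.Input(shape=(96, 1360, 1))                # 96 mel bins x 1,360 frames
x, pooled = inp, []
for i in range(5):
    x = layers.Conv2D(32, 3, padding="same", activation="relu",
                      name=f"conv{i + 1}")(x)
    x = layers.MaxPooling2D(2)(x)
    pooled.append(layers.GlobalAveragePooling2D()(x))  # 32-d vector per layer

extractor = Model(inp, pooled)

mel = np.random.rand(4, 96, 1360, 1).astype("float32")      # dummy batch
features = np.concatenate(extractor.predict(mel), axis=-1)  # shape (4, 160)
```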
A comparative study of various classifiers is necessary to analyze the diverse features of cat sounds. We select six popular classifiers, briefly described here. The random forest (RF) [24] classifier makes its final prediction by majority voting (bagging [25] or boosting [26]) over a prespecified number of decision trees. The k-nearest neighbor (KNN) [27] classifier assigns an unknown object to a class by majority voting among its k nearest neighbors. The extremely randomized trees (Extra Trees) [28] classifier chooses cut-points at random over the entire given feature range. The linear discriminant analysis (LDA) [29] and quadratic discriminant analysis (QDA) [30] classifiers are Bayesian classifiers that seek the component axes that maximize class separation. A support vector machine (SVM) [31] with a radial basis function (RBF) kernel is used to classify the extracted features nonlinearly. Ten-fold cross-validation was performed while training the SVM; cross-validation theory predicts better performance as the number of folds increases up to a certain limit, but at greater computational cost, and [32] suggests 10-fold cross-validation for tuning a model's hyperparameters in supervised learning. The comparative results of these classifiers are presented in Section 5.
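A minimal scikit-learn sketch of these six classifiers is given below. Here `X_train` and `y_train` are assumed to hold the 160-dimensional CNN features and their labels, and the hyperparameter values are illustrative, since the exact settings are not reported here.

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

classifiers = {
    "RF": RandomForestClassifier(n_estimators=200),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=200),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    # RBF-kernel SVM tuned with 10-fold cross-validation, as in the text;
    # probability=True enables predict_proba for the later ensemble.
    "SVM": GridSearchCV(SVC(kernel="rbf", probability=True),
                        {"C": [1, 10, 100], "gamma": ["scale", 0.01]}, cv=10),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)   # X_train: (n, 160) features, y_train: labels
```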
Ensembling is a well-known machine learning technique that combines the predictive power of different classifiers and makes the overall system more robust, as described in [33]. It can be done by majority voting or by averaging the predicted probabilities of all classifiers. We chose probability averaging to ensemble the six classifiers described in Section 3.3. The ensemble outperforms every individual classifier on all datasets (original plus augmented) and on all evaluation metrics described in Section 4. Table 1 and Figure 3 show the performance boost obtained by ensembling the classifiers' predicted probabilities.
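Assuming the fitted `classifiers` dictionary from the sketch above and test features `X_test`, probability averaging reduces to a few lines:

```python
import numpy as np

# Each classifier votes with its full class-probability distribution;
# the class with the highest mean probability wins.
probs = np.mean([clf.predict_proba(X_test) for clf in classifiers.values()],
                axis=0)               # shape (n_samples, 10)
y_pred = probs.argmax(axis=1)         # assumes integer class labels 0-9
```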
To evaluate on the original and augmented CatSound datasets, the CNN features are split into 90% for training the classifiers and the remaining 10% for testing their predictions. We find that augmentation boosts the classifier results and reduces the rate of confusion. To evaluate the performance of the classifiers described in Section 3.3, accuracy, F-score (or F1-score) [34], and the area under the receiver operating characteristic curve (ROC-AUC) [35] are used. Accuracy is the percentage of correctly classified unknown data samples, and the F-score is the harmonic mean of precision and recall. The ROC-AUC scores correspond to the curves in the ROC plot [34] discussed in Section 5.
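With scikit-learn, these three metrics can be computed as in the following sketch, which assumes integer class labels 0-9 in `y_test` and reuses the ensemble outputs `probs` and `y_pred` from Section 3.4:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro")
# Binarize the 10-class labels so ROC-AUC can be computed one-vs-rest.
auc = roc_auc_score(label_binarize(y_test, classes=list(range(10))),
                    probs, average="macro")
print(f"accuracy={acc:.4f}  F1={f1:.4f}  ROC-AUC={auc:.4f}")
```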
The best performing results, obtained with the 3× augmented dataset, are shown in Table 1.
The evaluation of the classifiers with and without the augmentation scheme is presented in Figure 4. The figure shows that augmentation boosts the performance of every classifier, to a different degree for each evaluation metric.
The confusion matrix is a tabular representation of test results that lets us further understand the strengths and weaknesses of the network model and classifiers. Figure 5 shows the confusion matrix of our best classifier, QDA, on the 1× augmented dataset. Analyzing the confusion matrices of all six classifiers, we reach some conclusions. The Resting ("purring") and Defense ("hissing") sounds are distinct from the other cat sounds in the CatSound dataset, so these classes are confused relatively rarely. On the other hand, the class pairs MotherCall ("pilling") and HuntingMind ("trilling or chatting"), Happy ("meow-meow") and Paining ("miyoou"), and Angry ("momo-mooh") and Mating ("gay-gay-gay") sound similar, and we find that all of our classifiers confuse them frequently.
The ROC curve is a graphical plot that illustrates the diagnostic ability of a classifier using two parameters, the true positive rate (TPR) and the false positive rate (FPR), at various decision thresholds. Each threshold produces a different point in ROC space, and the classifier outputs a positive result when its score is above that threshold. ROC curves are typically used for binary classification, so we binarize the outputs of our 10-class problem. The AUC equals the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. The ROC plot in Figure 3 shows six classifier curves and one averaged curve for the ensemble of all classifiers, together with the ROC-AUC score of each curve.
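A per-class ROC plot of this kind can be produced as follows, reusing the binarized labels and the ensemble probabilities `probs` from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y_bin = label_binarize(y_test, classes=list(range(10)))
for k in range(10):
    fpr, tpr, _ = roc_curve(y_bin[:, k], probs[:, k])   # one-vs-rest for class k
    plt.plot(fpr, tpr, label=f"class {k} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--")                          # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```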
Automatic biodiversity estimation using animal sounds will be a future area of research, but data-driven approaches need a large amount of labeled data. One way to overcome that scarcity is transfer learning: even when the source network is trained on music data, transfer learning is still beneficial for a small animal sound dataset. Data augmentation and majority voting (bagging or boosting) can further boost system performance even with a small animal sound dataset.
In the future, this research can be extended for better results by enlarging the labeled dataset or by selecting a pre-trained network trained on animal sounds for transfer learning, because music data are very different from animal sounds.
No potential conflict of interest relevant to this article was reported.
The authors would like to thank the Korean Ministry of Education for funding the research leading to these results. This work was supported by the National Research Foundation of Korea (NRF) under the Basic Science Research Program (NRF-2015R1D1A1A01058062).
Figure 1. CatSound dataset class representation.
Figure 2. Overview of feature extraction from the CNN. From each layer of the CNN, the globally averaged 32-dimensional features are concatenated into one feature vector and fed into the various classifiers. The predicted probabilities of the classifiers are ensembled for the final prediction.
Figure 3. ROC curves of the best performing classifiers on the 3× augmented dataset.
Figure 4. Comparison of the average accuracy, F1-score, and area under the curve of our classifiers (six classifiers and one ensemble) on the original and augmented datasets.
Figure 5. Confusion matrix of the best performing ensemble classifier on the 3× augmented dataset.
Table 1. Best performance of the classifiers on the 3× augmented dataset
| Classifiers | Accuracy (%) | F1-score | AUC score |
|---|---|---|---|
| RF | 78.99 | 0.79 | 0.978 |
| KNN | 79.07 | 0.79 | 0.884 |
| Extra Trees | 77.30 | 0.77 | 0.977 |
| LDA | 73.67 | 0.74 | 0.967 |
| QDA | 80.76 | 0.81 | 0.974 |
| SVM | 78.57 | 0.78 | 0.978 |
| Ensemble | 87.76 | 0.88 | 0.990 |