International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(1): 1-10
Published online March 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.1.1
© The Korean Institute of Intelligent Systems
Tosin Akinwale Adesuyi1, Byeong Man Kim1, and Jongwan Kim2
1Department of Computer and Software Engineering, Kumoh National Institute of Technology, Gumi, Korea
2Division of Computer and Information Engineering, Daegu University, Gyeongsan, Korea
Correspondence to: Byeong Man Kim (bmkim@kumoh.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sound is an essential element of human relationships and communication. The sound recognition process involves three phases: signal preprocessing, feature extraction, and classification. This paper describes research on the classification of snoring data, an important indicator of sleep health in humans. Current sound classification methods based on deep learning do not yield desirable results because some of the salient features required to sufficiently discriminate sounds, and thereby improve classification accuracy, are poorly captured during training. In this study, we propose a new convolutional neural network (CNN) model for sound classification using multi-feature extraction. The extracted features form a new dataset that serves as the input to the CNN. Experiments were conducted on snoring and non-snoring datasets. The proposed model achieved an accuracy of 99.7% on snoring sounds, demonstrating an almost perfect classification and superior results compared with existing methods.
Keywords: Sound recognition, Snoring sound, CNN, Multi-feature extraction
Sound can be described as pressure caused by the vibration of an object and is an important aspect of human relationships and communication. Research on sound recognition has recently received considerable attention because it is applied in various fields such as bio-acoustic monitoring, wild-animal intrusion detection, environmental sound recognition, and audio surveillance [1]. The sound recognition process includes three steps: signal preprocessing, feature extraction, and classification [2]. This study focuses on snoring sound classification because snoring has a significant impact on sleep health.
Sleep is common to humans and animals, and snoring is a phenomenon that occurs during disordered sleep. Snoring is classified into primary snoring, which involves no comorbidity, and snoring associated with obstructive sleep apnea (OSA), which poses health risks [3]. Snoring noise is estimated to reach approximately 90 to 100 dB, which can cause hearing loss in a person sleeping next to the snorer [4]. OSA is a sleep-related respiratory disorder that repeatedly causes a partial or complete obstruction of the upper airways during sleep [5]. A straightforward way to diagnose OSA is polysomnography; however, this method is expensive [6]. Signal processing and artificial intelligence (AI) are the most promising of the existing approaches for diagnosing snoring, as in the case of apnea [7]. In the AI field, classifiers such as the support vector machine (SVM), K-nearest neighbor (KNN), recurrent neural network (RNN), and convolutional neural network (CNN) are appropriate solutions. However, these classifiers rely on the feature extraction techniques used in signal processing to extract features from the snoring audio signal and thus enable accurate classification. In this case, the snoring data and feature extraction are the most important factors because they are required to train the classifier. Several approaches to snoring dataset acquisition, along with frequently used feature extraction methods, are described in [7]. According to the dataset made available by the INTERSPEECH 2017 ComParE Snoring Challenge [8], OSA is categorized into four classes based on the source of the obstruction that leads to the snoring: velum (V), oropharyngeal (O), tongue (T), and epiglottis (E). Distinguishing real snoring sounds from background noise and other sleep sounds (e.g., sleep talking), which are effectively noise, remains a challenge that needs to be addressed. Addressing this challenge with a deep learning model enables more accurate classification and detection of snoring versus non-snoring sounds. The authors in [9] stressed that this step is necessary but insufficient and further emphasized that the right choice of feature extraction techniques is needed to obtain a desirable result.
In this paper, we present a one-dimensional (1D) CNN for snoring and non-snoring sound classification based on several signal processing approaches for feature extraction. The main contribution of this study is to overcome the challenges mentioned above and provide a deep learning model that serves as an alternative method for distinguishing snoring sounds with an extremely high classification accuracy. The contributions of this study are summarized as follows:
We generated spectrogram images from our dataset and fed them into the 2D-CNN network; however, the resulting accuracy was poor.
We extracted unique features from our snoring/non-snoring sound dataset according to their importance in boosting the classification accuracy.
The extracted features served as a new dataset (numerical data) and were fed into a 1D-CNN network. The resulting classification accuracy improved significantly.
Finally, we present the research results obtained by comparing and evaluating the proposed method with other existing approaches.
The rest of this paper is organized as follows: Section 2 describes previous related studies, including the feature extraction process and classification methods. Sections 3 and 4 present the proposed approach and experiment results and discussion, respectively. Finally, Section 5 provides some concluding remarks regarding this study.
As groundwork for the proposed method, a detailed review of related studies was conducted. First, we describe snoring data acquisition methods. Based on previous research [7], snoring data sources can be grouped into four categories: publicly available snore sound corpora, snore data provided by medical organizations, snore data recorded from subjects, and crowdsourced snore data. A typical example of an online snoring sound corpus is the Munich-Passau Snore Sound Corpus, which consists of 828 snore audio samples grouped into four categories: V, O, T, and E [8]. Some researchers have collected snore data from patients with apnea in hospitals or other medical organizations [10]. Snoring audio can also be captured from consenting subjects during natural or induced sleep [11]. Finally, snoring and related data can be collected and aggregated from different online sources [7]. In this study, we used the third method, collecting snoring data directly from subjects.
Conventional classification methods include the SVM, KNN, CNN, RNN, and their hybrids. The SVM was originally developed to classify data into two classes. According to the authors in [12], it is regarded as a technique for classifying both linear and nonlinear datasets. A KNN searches for the k training samples nearest to a given input and assigns the class most common among those neighbors.
Figure 1 shows the generic architecture for the snoring sound classification process [7]. In previous studies, snoring signals were converted into spectrogram images, and feature extraction techniques were then applied. Finally, the extracted features were passed into a classifier for classification.
Several approaches have been proposed for classifying snoring audio. Many of them are based either on the application of feature extraction techniques combined with one or more classifiers, or on the direct use of classifiers alone. A rapidly developing field called music information retrieval (MIR), which encompasses musicology, digital signal processing, and machine learning (ML) [15], has recently gained the interest of researchers. In the literature, the approaches used in the field of MIR are efficient for extracting salient features from sound signals. These include the Mel-frequency cepstral coefficient (MFCC), RollOff, short-time Fourier transform (STFT), spectral features, zero-crossing rate (ZCR), local binary pattern (LBP), histogram of oriented gradients (HOG), and pulse transit time (PTT). In this study, we combined several of these approaches to extract features from our data before training with a CNN. The results obtained demonstrate the efficiency and effectiveness of MIR techniques.
In recent years, deep neural networks such as AlexNet and the region-based CNN (R-CNN) have been proposed to overcome the limitations of classic neural networks [16]. In this study, a new CNN was designed to classify sounds using the extracted feature sets. The CNN architecture used to classify the snoring and non-snoring feature datasets consists of four convolutional layers, a max-pooling layer, a dropout layer, and a softmax activation function.
We propose a 1D-CNN model for snoring sound classification based on multi-feature extraction. The essence of the multi-feature extraction process is to extract salient features that distinctly distinguish sound signals from closely related signals, which may not be achievable using a single extraction technique. In this study, seven features were applied: the STFT, root mean square (RMS), spectral centroid, spectral bandwidth, RollOff, ZCR, and MFCC. The overall architecture of the proposed system is shown in Figure 2.
The authors in [7] identified the four most common categories of snoring data acquisition: a publicly available snore sound corpus, snore data provided by a medical organization, snore data created through subjects, and crowdsourced snore data. We adopted the third category by creating snoring data from six individuals [17], all males between 25 and 30 years of age. Two Samsung Galaxy S8 phones were placed 1 m from the heads of the subjects, and recording was conducted during active sleeping hours, between 11:30 pm and 5:30 am, for two nights. All recorded snoring audio clips were divided into sections 40 seconds in length sampled at 44.1 kHz; hence, a total of 1,080 snoring episodes were generated. However, owing to the presence of sections without audio, the number of snoring episodes was reduced to 881. A total of 880 segmented non-snoring audio clips, including clock ticks, sleep talk, yawning, pet snoring, door sounds, and bed-spring sounds, were introduced. Snoring clips were labeled as 1, whereas non-snoring clips were labeled as 0, forming a total data size of 1,761 [17].
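As a rough illustration of this preprocessing step, the sketch below splits a recording into 40-second sections at 44.1 kHz and drops empty sections; the silence threshold and Librosa-based loading are assumptions, as the paper does not describe its tooling.

```python
import numpy as np
import librosa

SR, CLIP_SECONDS = 44100, 40  # sampling rate and section length from the paper

def segment_recording(path):
    """Split one night's recording into 40-second sections and drop
    sections without audible content."""
    y, _ = librosa.load(path, sr=SR)
    n = SR * CLIP_SECONDS
    clips = [y[i:i + n] for i in range(0, len(y) - n + 1, n)]
    # Silence threshold is an assumption; the paper only says that
    # sections without audio were discarded.
    return [c for c in clips if np.max(np.abs(c)) > 0.01]
```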
A graphical overview of the data acquisition is shown in Figure 3. Next, we transformed the audio into spectrogram images and built a 2D-CNN to train on them. However, the results were not encouraging. Hence, we proceeded to extract features from the audio dataset, resulting in a new set of data. This feature set led to the use of a 1D-CNN network because the dataset contains numeric values in one dimension. The spreadsheet excerpt at the bottom of Figure 3 shows a cross-section of the extracted features.
Seven feature extraction approaches were used to generate the feature dataset, as shown in Figure 3. This feature set was fed into a 1D-CNN for classification. The energy of an audio signal refers to the total magnitude, or loudness, of the signal.
The RMS is used to characterize the average level of a continuously varying audio signal, and its value represents the overall waveform. The RMS energy for each frame of the audio signal is defined as

$$\mathrm{RMS}_t = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x_n^2},$$

where $x_n$ is the $n$-th sample of frame $t$ and $N$ is the number of samples per frame.
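For illustration, the framewise RMS defined above can be computed directly in NumPy; the frame and hop lengths here are common defaults rather than values reported in the paper (librosa.feature.rms provides an equivalent built-in):

```python
import numpy as np

def frame_rms(x, frame_length=2048, hop_length=512):
    """Framewise RMS energy of a 1-D signal, per the equation above."""
    starts = range(0, len(x) - frame_length + 1, hop_length)
    return np.array([np.sqrt(np.mean(x[s:s + frame_length] ** 2))
                     for s in starts])
```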
The MFCCs of a signal are a set of audio features that provide a descriptive overview of the spectral envelope [7]. The number of MFCCs was set to 20, and the discrete cosine transform (DCT) type was set to 2. The DCT takes the log energies produced by the Mel filter bank and transforms them from the frequency domain into the cepstral (time-like) domain. In addition, the Mel filter bank estimates the amount of energy present at low frequencies.
The Mel frequency, $m$, corresponding to a physical frequency $f$ is computed as

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),$$

where $f$ is the frequency in hertz.
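A minimal sketch of this MFCC configuration using Librosa follows; the filename is hypothetical:

```python
import librosa

# Load one snoring clip at the 44.1 kHz rate used in this study (hypothetical file).
y, sr = librosa.load("snore_001.wav", sr=44100)

# 20 MFCCs with a type-2 DCT, matching the configuration described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, dct_type=2)
print(mfcc.shape)  # (20, number_of_frames)
```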
According to the authors of [18], the STFT of a signal for time frame $t$ and frequency bin $k$ is given by

$$X(t, k) = \sum_{n=0}^{N-1} x[n + tH]\, w[n]\, e^{-j 2\pi k n / N},$$

where $x[n]$ is the input signal, $w[n]$ is a window function of length $N$, and $H$ is the hop length between successive frames.
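A brief sketch of computing an STFT magnitude spectrogram with Librosa; the FFT size, hop length, and window are common defaults, not values reported in the paper:

```python
import numpy as np
import librosa

y, sr = librosa.load("snore_001.wav", sr=44100)  # hypothetical filename

# Complex STFT; its magnitude is the spectrogram consumed by the spectral features below.
D = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")
S = np.abs(D)  # shape: (1 + n_fft // 2, number_of_frames)
```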
The spectral centroid identifies the center of mass of an audio spectrum by estimating the weighted mean of the frequencies within the audio signal. In its computation, the magnitudes of each frame are normalized so that they sum to one:

$$C_t = \frac{\sum_{k} f_k\, S_t(k)}{\sum_{k} S_t(k)},$$

where $S_t(k)$ is the magnitude of frequency bin $k$ at frame $t$ and $f_k$ is the center frequency of bin $k$.
The spectral bandwidth expresses the spectral spread of an audio signal. The spectral bandwidth at frame $t$ is defined as

$$B_t = \left( \sum_{k} S_t(k)\, \bigl(f_k - C_t\bigr)^{p} \right)^{1/p},$$

where $S_t(k)$ is the normalized magnitude of bin $k$ at frame $t$, $f_k$ is the center frequency of bin $k$, $C_t$ is the spectral centroid of the frame, and $p$ is an order parameter (typically 2).
The spectral RollOff frequency is an audio feature that specifies the frequency below which the majority of the total spectral energy is contained. According to the documentation of Librosa (a Python library for analyzing music and audio), the roll-off frequency of each frame is defined as the center frequency of the spectrogram bin such that at least 85% of the spectral energy of the frame is contained in that bin and the bins below it [19].
The ZCR is the rate at which the sign of the signal changes from positive to negative, or vice versa. The ZCR splits a signal into frames of $T$ samples and counts the sign changes within each frame:

$$\mathrm{ZCR} = \frac{1}{T - 1} \sum_{t=1}^{T-1} \mathbb{1}\bigl[s_t \, s_{t-1} < 0\bigr],$$

where $s_t$ is the signal value at sample $t$ and $\mathbb{1}[\cdot]$ is an indicator function that equals 1 when its argument is true and 0 otherwise.
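The indicator-based definition above can be implemented directly; this small sketch computes the ZCR of a single frame:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample sign changes in one frame,
    per the indicator-function definition above."""
    s = np.sign(frame)
    s[s == 0] = 1  # treat exact zeros as positive
    return np.mean(s[1:] * s[:-1] < 0)
```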
These seven features are representative features frequently used in signal processing.
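The paper does not spell out how the seven features are aggregated into the 26 feature columns mentioned in Section 4; the sketch below is one plausible reading, assuming each feature is summarized by its framewise mean and the STFT contribution is taken as the mean chroma-STFT energy (6 summaries + 20 mean MFCCs = 26 values):

```python
import numpy as np
import librosa

def extract_features(path, sr=44100):
    """Build one 26-value feature row from an audio clip (aggregation
    scheme assumed: the framewise mean of each feature)."""
    y, _ = librosa.load(path, sr=sr)
    row = [
        np.mean(librosa.feature.chroma_stft(y=y, sr=sr)),         # STFT (assumed chroma summary)
        np.mean(librosa.feature.rms(y=y)),                        # RMS
        np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),   # spectral centroid
        np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)),  # spectral bandwidth
        np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),    # RollOff (85% by default)
        np.mean(librosa.feature.zero_crossing_rate(y)),           # ZCR
    ]
    row += list(np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1))
    return np.array(row)  # 6 + 20 = 26 columns, as in Section 4
```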
We designed a new CNN model to classify snoring data based on multi-feature extraction. To achieve this goal, we utilized TensorFlow, an open-source library developed by Google that focuses primarily on deep learning. It has a comprehensive and flexible ecosystem of tools, libraries, and community resources that allows researchers and developers to easily build and deploy state-of-the-art ML-powered applications. TensorFlow offers multiple levels of abstraction for building and training models, including the high-level Keras API, which makes it easy to get started with TensorFlow and ML. Together with TensorFlow 2.0, Keras has been adopted more widely than any other deep-learning solution. We used the Keras and TensorFlow libraries for the 1D-CNN and 2D-CNN experiments, respectively. In this study, experiments were conducted using 2D-CNN and 1D-CNN training approaches with different hardware configurations. We utilized an Intel Core i7 @ 3.60 GHz CPU with 32 GB of RAM for the 1D-CNN, and a second system with a single NVIDIA GTX 1080Ti GPU, an Intel Core i7 @ 3.60 GHz CPU, and 32 GB of RAM for the 2D-CNN. Figure 5 shows conceptual diagrams of the 1D-CNN and 2D-CNN models used in this study.
Initially, we experimented by converting our audio dataset into spectrogram images. The resulting images were then fed into a 2D-CNN classifier. The architecture of the proposed 2D-CNN classifier is shown in Figure 6. The input size of the classifier was 32×32×3, and the three convolutional layers had ReLU activations. Each convolutional layer is followed by a max-pooling layer that reduces the spatial dimensions. In addition, the classifier had two fully connected layers, with batch normalization applied after each of them. We applied a softmax activation function after the second fully connected batch-normalization layer. The Adam optimizer was applied to minimize the cross-entropy loss. Both the number of epochs and the batch size were 100, and the learning rate was 0.0005. Consequently, an accuracy of 86.80% was achieved, as shown in Figure 7. However, this performance was unsatisfactory. Therefore, we attempted to improve the results using the feature extraction process.
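A Keras sketch consistent with this description is given below; the per-layer filter counts and the width of the first dense layer are not reported in the paper and are assumptions here:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                               # spectrogram input
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),  # filter counts assumed
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),   # first dense layer (width assumed)
    layers.BatchNormalization(),
    layers.Dense(2),                       # second dense layer
    layers.BatchNormalization(),
    layers.Activation("softmax"),          # softmax after the second batch normalization
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(spectrogram_images, labels, epochs=100, batch_size=100)
```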
We applied multi-feature extraction to significantly improve the classification accuracy. At the end of this process, the extracted feature data are forwarded to the 1D-CNN classifier, as shown in Figure 8. The generated feature data consisted of 26 columns and 1,761 rows. The feature data were divided into 80% for training and 20% for testing. In the proposed 1D-CNN shown in Figure 8, the input shape of 26×1 was reshaped into 13×2. Four convolutional layers were used, along with two fully connected (dense) layers. The first convolutional layer (Conv1D 1) had 256 filters, a kernel size of 2, and a ReLU activation function. The second convolutional layer used 128 filters, with the same activation function and kernel size as the first; it is followed by a max-pooling layer with a window size of 2, which reduces the spatial dimension of the input entering the third convolutional layer. The third and fourth convolutional layers (Conv1D 3 and Conv1D 4) used the same filter count of 64, a kernel size of 2, and a ReLU activation function. After the fourth layer, we introduced a dropout mechanism that sets 50% of the active neurons in the network to zero, which helped prevent overfitting during feature-data training and increased the classification accuracy. Subsequently, the output was flattened and passed to a fully connected layer (dense 1) with 32 neurons and ReLU activation. The second dense layer is the output layer, with two neurons and a softmax activation function. The cross-entropy loss function used to calculate the cost is
$$L = -\frac{1}{M} \sum_{m=1}^{M} \sum_{c=1}^{C} y_{m,c} \log \hat{y}_{m,c},$$

where $M$ is the number of training samples, $C$ is the number of classes (two in this study), $y_{m,c}$ is 1 if sample $m$ belongs to class $c$ and 0 otherwise, and $\hat{y}_{m,c}$ is the softmax probability predicted for class $c$.
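The following Keras sketch mirrors the layer-by-layer description above; the padding mode and any optimizer settings beyond those stated are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(13, 2)),                               # 26 columns reshaped to 13x2
    layers.Conv1D(256, 2, activation="relu", padding="same"),  # Conv1D 1
    layers.Conv1D(128, 2, activation="relu", padding="same"),  # Conv1D 2
    layers.MaxPooling1D(pool_size=2),                          # window size 2
    layers.Conv1D(64, 2, activation="relu", padding="same"),   # Conv1D 3
    layers.Conv1D(64, 2, activation="relu", padding="same"),   # Conv1D 4
    layers.Dropout(0.5),                                       # 50% dropout against overfitting
    layers.Flatten(),
    layers.Dense(32, activation="relu"),                       # dense 1
    layers.Dense(2, activation="softmax"),                     # output: snoring vs. non-snoring
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Feature data: shape (1761, 13, 2) with 0/1 labels; 80/20 train/test split.
# model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test))
```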
As shown in Figure 7, the proposed network achieved an accuracy of more than 95% after 20 epochs and reached 99.7% accuracy at the end of 100 epochs. The multi-feature approach used to extract the features can differentiate between snoring and non-snoring data, thereby paving the way for the 1D-CNN classifier to achieve such a high classification accuracy. Moreover, the architecture of the classifier, including its parameter settings, also contributes significantly to the results. Based on these results, a deep learning model was created that detects snoring/OSA with a high classification accuracy in the presence of noise disturbances. Table 1 lists the variation in accuracy when only subsets of the features were combined. The MFCC had an extremely large impact on the results achieved, whereas the spectral features alone did not significantly affect the classification accuracy. Nonetheless, the combination of all seven features provided the best overall accuracy.
Table 2 presents the classification results of existing research on snoring data [6,9–11,20–23]. From Table 2, it is evident that multi-feature extraction (ZCR+STFT+MFCC) combined with an RNN classifier yielded the best existing result of 98.8%, obtained with one of the largest data sizes [7]. Compared with this highest result, the accuracy of the proposed model using MFCC+ZCR+STFT+Spectral+RMS and the 1D-CNN, presented in Table 1, is 99.7%, a 0.9% improvement. Considering the error rate instead of the accuracy, the 1.2% error rate was reduced to 0.3%, i.e., to one-fourth of its previous value. This confirms that the proposed method is highly effective in classifying snoring data, which can be a precursor to sleep apnea. By contrast, the absence of a feature extraction technique may seriously degrade the result of a classifier and thus reduce the classification accuracy, as seen in [22].
We presented a multi-feature extraction approach for the classification of snoring sounds using a CNN classifier. The features include the STFT, RMS, spectral centroid, spectral bandwidth, RollOff, ZCR, and MFCC. The features extracted from the sound dataset were used to form a feature dataset for training. Both a 1D-CNN and a 2D-CNN were designed to classify the sound dataset using the extracted feature sets. The experimental results showed that the proposed 1D-CNN achieved an accuracy of 99.7% on snoring sounds, which demonstrates that the proposed method is useful for identifying snoring sounds. Our study provides a multi-feature extraction process with the potential to extract, from audio data, the essential features that produce a high classification accuracy. The snoring CNN model can be adopted as an alternative, inexpensive method for diagnosing or detecting snoring/OSA. In addition, the snoring data acquisition approach can be adopted to capture audio datasets in the fields of signal processing, medicine, and music information retrieval.
In the future, we plan to expand the field of application by applying the proposed method to other datasets, such as speech, environmental, and animal sounds. For example, cat sound classification research has previously been conducted in Korea [24]; however, applying a deep learning approach to the classification of cat sounds is difficult owing to data sparsity. In addition, because the data collection and feature extraction methods proposed here require no face-to-face contact, they are expected to be useful in various ways for sound data in the work-from-home environments brought about by coronavirus disease 2019 (COVID-19).
No potential conflict of interest relevant to this article was reported.
Table 1. Variation in accuracy of feature combinations using the proposed 1D-CNN on the snoring dataset.

| Feature combination | Accuracy (%) |
|---|---|
| Spectral (Centroid+Bandwidth+RollOff) | 50.18 |
| STFT+RMS+Spectral+ZCR | 60.79 |
| STFT+RMS+Spectral | 89.46 |
| MFCC | 99.00 |
| MFCC+ZCR | 99.00 |
| Spectral+ZCR+MFCC | 99.20 |
| MFCC+ZCR+STFT+Spectral+RMS | 99.70 |
Table 2. Classification results of existing studies on snoring data.

| Study | Feature extraction technique | Classifier | Subjects | Training size | Test accuracy (%) |
|---|---|---|---|---|---|
| Demir et al. [20] | LBP+HOG | SVM | - | 828 | 72.00 |
| Lim et al. [11] | ZCR+STFT+MFCC | RNN | 8 | 5,600 | 98.80 |
| Kang et al. [6] | MFCC | CNN+LSTM | 24 | 24 | 88.28 |
| Arsenali et al. [9] | MFCC | RNN | 20 | 5,670 | 95.00 |
| Khan [21] | MFCC | CNN | - | 1,000 | 96.00 |
| Wang et al. [22] | - | Dual CNN+GRU | - | 828 | 63.80 |
| Tuncer et al. [10] | PTT signal+AlexNet+VGG16 | SVM+KNN | 100 | 100 | 92.78 |
| Dalal and Triggs [23] | SCAT+GMM+MAP | MLP | 224 | 282 | 67.71 |

GRU, gated recurrent unit; LSTM, long short-term memory; SCAT, deep scattering spectrum; GMM, Gaussian mixture model; MAP, maximum a posteriori; MLP, multilayer perceptron.
E-mail: atadesuyi@kumoh.ac.kr
E-mail: bmkim@kumoh.ac.kr
E-mail: jwkim@daegu.ac.kr
Figure 1. A generic architecture for sound classification.
Figure 2. Overall proposed system architecture for the 1D-CNN.
Figure 3. Acquisition sequence of snoring data.
Figure 4. A graphical view of MFCC features from a snore signal.
Figure 5. Conceptual diagrams of (a) the 1D-CNN model and (b) the 2D-CNN model.
Figure 6. A 2D-CNN architecture using spectrogram images.
Figure 7. Classification accuracy for snoring and non-snoring using multi-feature techniques and spectrogram images.
Figure 8. A 1D-CNN architecture for snoring sound classification.