
## Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(1): 1-10

Published online March 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.1.1

© The Korean Institute of Intelligent Systems

## Snoring Sound Classification Using 1D-CNN Model Based on Multi-Feature Extraction

Tosin Akinwale Adesuyi¹, Byeong Man Kim¹, and Jongwan Kim²

¹Department of Computer and Software Engineering, Kumoh National Institute of Technology, Gumi, Korea
²Division of Computer and Information Engineering, Daegu University, Gyeongsan, Korea

Correspondence to: Byeong Man Kim (bmkim@kumoh.ac.kr)

Received: June 30, 2021; Revised: September 23, 2021; Accepted: November 9, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Sound is an essential element of human relationships and communication. The sound recognition process involves three phases: signal preprocessing, feature extraction, and classification. This paper describes research on the classification of snoring data used to determine the importance of sleep health in humans. However, current sound classification methods using deep learning approaches do not yield desirable results for building good models. This is because some of the salient features required to sufficiently discriminate sounds and improve the accuracy of the classification are poorly captured during training. In this study, we propose a new convolutional neural network (CNN) model for sound classification using multi-feature extraction. The extracted features were used to form a new dataset that was used as the input to the CNN. Experiments were conducted on snoring and non-snoring datasets. The accuracy of the proposed model was 99.7% for snoring sounds, demonstrating an almost perfect classification and superior results compared to existing methods.

Keywords: Sound recognition, Snoring sound, CNN, Multi-feature extraction

Sound can be described as pressure caused by the vibration of an object. Sound is an important aspect in human relationships and communication. Recent research on sound recognition has received considerable attention because it is used in various fields such as bio-acoustic monitoring, wild animal intrusion detection, environmental sound, and audio surveillance [1]. The sound-recognition process includes three steps: signal preprocessing, feature extraction, and classification [2]. This study is focused on snoring sound classification because snoring has a significant impact on sleep health.

Sleep is common to humans and animals. A phenomenon called snoring can occur when there is a sleep disorder. Snoring is classified as either primary snoring, which involves no comorbidity, or obstructive sleep apnea (OSA), which poses health risks [3]. Snoring noise is estimated to reach approximately 90 to 100 dB, which can cause hearing loss in people next to the snoring person [4]. OSA is a sleep-related respiratory disorder that repeatedly causes a partial or complete obstruction of the upper airways during sleep [5]. One way to diagnose OSA is to use polysomnography; however, this method is expensive [6]. Signal processing and artificial intelligence (AI) are the most promising of the existing approaches for diagnosing snoring, as in the case of apnea [7]. In the AI field, classifiers such as the support vector machine (SVM), K-nearest neighbor (KNN), recurrent neural network (RNN), and convolutional neural network (CNN) are appropriate solutions. However, these classifiers rely on the feature extraction techniques used in signal processing to extract features from the snoring audio signal before an accurate classification can be made. Snoring data and feature extraction are therefore the most important factors because they are required to train the classifier. Several approaches to snoring dataset acquisition and frequently used feature extraction methods are described in [7]. According to the dataset made available by the INTERSPEECH 2017 ComParE Snoring Challenge [8], OSA is categorized into four classes based on the source of obstruction that leads to the snoring: velum (V), oropharyngeal (O), tongue (T), and epiglottis (E). Distinguishing real snoring sounds from background noise and from other sleep sounds (e.g., sleep talking), which are treated as noise, is a problem that needs to be addressed. Addressing this challenge using a deep learning model will facilitate a more accurate classification/detection that distinguishes snoring sounds from non-snoring sounds. The authors in [9] stressed that this step is necessary but insufficient and further emphasized the need for the right choice of feature extraction techniques to acquire a desirable result.

In this paper, we present a one-dimensional (1D) CNN for snoring and non-snoring sound classification based on several signal processing approaches for feature extraction. The main goal of this study is to overcome the challenges mentioned above and provide a deep learning model that serves as an alternative method for distinguishing snoring sounds with an extremely high classification accuracy. The contributions of this study are summarized as follows:

• We generated spectrogram images from our dataset and fed them into the 2D-CNN network; however, the resulting accuracy was poor.

• We extracted unique features from our snoring/non-snoring sound dataset according to their importance in boosting the classification accuracy.

• The extracted features served as a new dataset (numerical data) and were fed into a 1D-CNN network, which yielded a substantially higher accuracy than the 2D-CNN.

• Finally, we present the research results obtained by comparing and evaluating the proposed method with other existing approaches.

The rest of this paper is organized as follows: Section 2 describes previous related studies, including the feature extraction process and classification methods. Sections 3 and 4 present the proposed approach and experiment results and discussion, respectively. Finally, Section 5 provides some concluding remarks regarding this study.

To achieve this goal, a detailed review of related studies was conducted. First, we describe the snoring data acquisition methods. Based on previous research [7], snoring data sources can be grouped into four categories: snore sound corpora available online, snore data provided by medical organizations, snore data recorded from subjects, and crowdsourced snore data. A typical example of an online snoring sound corpus is the Munich-Passau Snore Sound Corpus, which consists of 828 snore audio samples grouped into four categories: V, O, T, and E [8]. Some researchers have collected snore data from patients with apnea in hospitals or medical organizations [10]. Snoring audio can also be captured from people who have consented to sleep, naturally or with the aid of sleep-inducing substances, while their snoring activity is recorded [11]. Finally, snoring and related data can be collected and aggregated from different online sources [7]. In this study, we used the third method and collected snoring data directly from the subjects.

Conventional classification methods include SVM, KNN, CNN, RNN, and their hybrids. The SVM was originally developed to classify data into two classes. According to the authors in [12], it is regarded as a technique for classifying both linear and nonlinear datasets. A KNN searches the n-dimensional pattern space of the training data for the k training samples nearest to an unknown sample and classifies that sample based on these neighbors [13]. RNNs are designed to learn from sequential information. One issue with an RNN is the vanishing or exploding gradient problem, which arises with long input-output sequences. Long short-term memory (LSTM) was designed to overcome this problem. LSTM works by allowing the input x(t) at time t to influence the storage or overwriting of memories stored in the cell. CNN models are a type of deep neural network developed for image recognition, in which the model is fed the image pixels as a 2D input. A CNN is based on the convolution of images and the extraction of salient features through filters that are learned by the network during the training phase. A 1D-CNN was described in [14] as having lower computational complexity than a 2D-CNN, having a shallow architecture with the potential to learn complex features, requiring fewer computational resources (using a CPU rather than a GPU), and being well suited for real-time and low-cost applications on hand-held devices.

Figure 1 shows the generic architecture for the snoring sound classification process [7]. In previous studies, snoring signals were converted into spectrogram images, and feature extraction techniques were then applied. Finally, the extracted features were passed into a classifier for classification.

Several approaches have been proposed for classifying snoring audio sounds. Many of these approaches are based on either the application of feature extraction techniques along with classifier(s), or the direct use of classifier(s) alone. A new rapidly developing field called music information retrieval (MIR), which encompasses musicology, digital signal processing, and machine learning (ML) [15], has recently gained the interest of researchers. In the literature, the approaches used in the field of MIR are efficient for extracting salient features from sound signals. These include Mel-frequency cepstral coefficient (MFCC), RollOff, short-time Fourier transform (STFT), spectral features, zero-crossing rate (ZCR), local binary pattern (LBP), histogram of oriented gradient (HOG), and pulse transition time (PTT). In this study, we combined several of these approaches to extract features from our data before training with a CNN. The results obtained prove the efficiency and effectiveness of MIR techniques.

In recent years, deep neural networks such as AlexNet and region-based CNN (R-CNNs) have been proposed to overcome the limitations of classic neural networks [16]. In this study, a new CNN was designed to classify sounds using the extracted feature sets. The CNN architecture used to classify the snoring and non-snoring feature datasets consists of four convolutional layers, a max-pooling layer, a dropout, and a softmax activation function.

### 3.1 Data Acquisition and Feature Extraction

We propose a 1D-CNN model for snoring sound classification based on multi-feature extraction. The essence of the multi-feature extraction process is the extraction of salient features that clearly distinguish sound signals from closely related signals, which may not be achievable using a single extraction technique. In this study, seven features were applied: STFT, root mean square (RMS), spectral centroid, bandwidth, RollOff, ZCR, and MFCC. The overall architecture of the proposed system is shown in Figure 2.

The authors in [7] identified the four most common categories of snoring data acquisition as a publicly available snore sound corpus, snore data provided by a medical organization, snore data created through subjects, and crowdsourced snore data. We adopted the third category by creating snoring data from six individuals [17], all males between 25 and 30 years of age. Two Samsung Galaxy S8 phones were placed 1 m away from the head of the subjects, and recording was conducted during active sleeping hours between 11:30 pm and 5:30 am for two nights. All recorded snoring audio clips were divided into sections with a length of 40 s and sampled at 44.1 kHz; hence, a total of 1,080 snoring episodes were generated. However, owing to the presence of sections without audio, the number of snoring episodes was reduced to 881. A total of 880 segmented non-snoring audio clips, including clock ticks, sleep talk, yawning, pet snoring, door sounds, and bed spring sounds, were added. Snoring audio clips were labeled as 1 and non-snoring audio clips were labeled as 0, forming a total data size of 1,761 [17].
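The dataset bookkeeping described above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the constant and variable names are hypothetical, and the numbers are those stated in the text.

```python
SAMPLE_RATE_HZ = 44_100      # sampling rate stated in the text
CLIP_SECONDS = 40            # length of each audio section

# Number of samples in one 40 s clip.
samples_per_clip = SAMPLE_RATE_HZ * CLIP_SECONDS

n_snoring = 881              # snoring episodes kept after dropping empty sections
n_non_snoring = 880          # segmented non-snoring clips

# Labels as described: 1 for snoring, 0 for non-snoring.
labels = [1] * n_snoring + [0] * n_non_snoring

print(samples_per_clip)      # 1764000
print(len(labels))           # 1761
```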

A graphical overview of the data acquisition is shown in Figure 3. Next, we transformed the audio into spectrogram images and built a 2D-CNN to train on them. However, the results were not encouraging. Hence, we proceeded to extract features from the audio dataset, resulting in a new set of data. This feature set led to the use of a 1D-CNN network because the dataset contained numeric values in one dimension. The Excel chart at the bottom of Figure 3 shows a cross-section of the extracted features.

Seven feature extraction approaches were used to generate the feature dataset, as shown in Figure 3. This feature set was fed into a 1D-CNN for classification. The energy of an audio signal is referred to as the total magnitude or loudness of the signal x(n), and is represented as

$$E = \sum_{n} |x(n)|^{2}.$$

The RMS is used to characterize the average of a continuously varying audio signal, and its value represents the total waveform. The RMS energy for each frame of the audio signal is defined as

$$\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{n} |x(n)|^{2}},$$

where n denotes the number of samples in a frame. The frame length was set to 2048, and the hop length was set to 512. The RMS makes it possible to detect a sudden increase in audio signal energy, which helps to provide a clear distinction between silence or a low signal mode and voiced mode.
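A minimal sketch of the frame-wise RMS computation, using the frame length of 2048 and hop length of 512 given above. The function name and the test signal are hypothetical, and the sketch only uses complete frames without padding, which may differ from the framing applied by the library used in the study.

```python
import math

def frame_rms(signal, frame_length=2048, hop_length=512):
    """Return the RMS energy of each complete frame of `signal`."""
    values = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        energy = sum(x * x for x in frame)          # E = sum of |x(n)|^2
        values.append(math.sqrt(energy / frame_length))
    return values

# Hypothetical input: one second of silence followed by one second
# of a 440 Hz tone with amplitude 0.5, sampled at 44.1 kHz.
sr = 44100
silence = [0.0] * sr
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
rms = frame_rms(silence + tone)
```

Frames that fall in the silent part have an RMS of zero, while frames inside the tone have an RMS close to 0.5/√2, which is the kind of jump that separates silence from a sounding segment.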

The MFCCs of a signal are a set of audio features that provide a descriptive overview of its spectral envelope [7]. The number of MFCCs was set to 20 and the discrete cosine transform (DCT) type was set to 2. The DCT takes the log energies from the filter bank and transforms them from the frequency domain to a time-like (cepstral) domain. In addition, the Mel filter bank estimates the energy in each frequency band, with finer resolution at low frequencies.
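As an illustration of the DCT step, the sketch below applies an unnormalized type-2 DCT to a vector of log filter-bank energies and keeps the first 20 coefficients. The function name, the 40-band input, and the lack of normalization are assumptions made for this example; audio libraries often apply an orthonormal scaling on top of this.

```python
import math

def dct_type2(values, n_coeffs):
    """Unnormalized DCT-II of `values`, keeping the first `n_coeffs` coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]

# Hypothetical log Mel filter-bank energies for a single frame (40 bands).
log_energies = [math.log(1.0 + band) for band in range(40)]
mfcc = dct_type2(log_energies, n_coeffs=20)

# A flat spectrum puts all of its energy in the zeroth coefficient.
flat = dct_type2([2.0] * 8, n_coeffs=4)
```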

The Mel frequency, Mel(f), can be defined as follows:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(\frac{f}{700} + 1\right),$$

where f is the linear scale frequency, and m is the number of MFCCs. Figure 4 shows an MFCC plot with 20 features in the time series. The x-axis in Figure 4 is measured in seconds, whereas the y-axis is measured in decibels.
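The Mel mapping can be written directly from the formula above. The function names are hypothetical, and the inverse mapping is added only for convenience.

```python
import math

def hz_to_mel(freq_hz):
    """Map a linear frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(freq_hz / 700.0 + 1.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (0, 700, 1000, 4000):
    print(f, round(hz_to_mel(f), 1))
```

With these constants, 1000 Hz maps to roughly 1000 Mel, and the scale grows more slowly than the linear frequency above that point.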

According to the authors of [18], the STFT of a signal for time t is computed as

$$\mathrm{STFT}(\omega, \tau) = \int_{-\infty}^{\infty} x(t)\, \omega^{*}(t - \tau)\, e^{-j\omega t}\, dt,$$

where ω(t − τ) is a window function positioned at τ on the time axis and can be expressed in a Gaussian form.
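A discrete-time sketch of the STFT with a Gaussian window is given below. The frame length, hop length, window width, and function names are illustrative assumptions and are not the settings used in the study; the direct summation is written for clarity rather than speed.

```python
import cmath
import math

def gaussian_window(length, sigma=0.4):
    """Gaussian window centered on the frame; sigma is relative to half the length."""
    center = (length - 1) / 2
    return [math.exp(-0.5 * ((n - center) / (sigma * center)) ** 2)
            for n in range(length)]

def stft_magnitudes(signal, frame_length=32, hop_length=8):
    """Return, for each frame, the magnitudes of bins 0 .. frame_length // 2."""
    window = gaussian_window(frame_length)
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        seg = [signal[start + n] * window[n] for n in range(frame_length)]
        mags = []
        for k in range(frame_length // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                      for n in range(frame_length))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# Hypothetical test tone: exactly 4 cycles per 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spec = stft_magnitudes(signal)
peak_bins = [max(range(len(m)), key=m.__getitem__) for m in spec]
```

For this stationary tone every frame peaks at the same frequency bin; for a snoring signal the peak location would change from frame to frame.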

The spectral centroid identifies the point where the center of mass of an audio sound exists by estimating the weighted mean of the frequencies within the audio signal. In the computation of the spectral centroid, each frame magnitude signal is normalized as follows:

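$$\tilde{x}(k) = \frac{x(k)}{\sum_{k} x(k)},$$

where x(k) represents the spectral magnitude of the discrete audio signal at frequency bin k, and f(k) is the frequency at bin k. Hence, the spectral centroid is computed as [1]

$$S_c = \sum_{k} f(k)\, \tilde{x}(k).$$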

The spectral bandwidth expresses the spectral spread of an audio signal. The spectral bandwidth at frame t is calculated as [1]

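$$S_b[t] = \left(\sum_{k} \left(f[k,t] - S_c[t]\right)^{p}\, \tilde{x}[k,t]\right)^{1/p},$$

where $\tilde{x}[k,t]$ is the normalized spectral magnitude in frequency bin k at frame t, $f[k,t]$ is the frequency in bin k, and $S_c[t]$ is the spectral centroid at frame t.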
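The centroid and bandwidth of a single frame can be sketched as follows. The function name and the toy spectrum are hypothetical; the sketch uses the absolute deviation from the centroid so that odd values of p stay well defined, and returns zeros for a silent frame, both of which are assumptions of this example.

```python
def spectral_centroid_and_bandwidth(freqs, magnitudes, p=2):
    """Return (centroid, bandwidth) for one frame of spectral magnitudes."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0, 0.0                      # assumed convention for silent frames
    weights = [m / total for m in magnitudes]  # normalized magnitudes
    centroid = sum(f * w for f, w in zip(freqs, weights))
    spread = sum(w * abs(f - centroid) ** p for f, w in zip(freqs, weights))
    return centroid, spread ** (1.0 / p)

# Toy spectrum: three bins, symmetric around 200 Hz.
freqs = [100.0, 200.0, 300.0]
magnitudes = [1.0, 2.0, 1.0]
centroid, bandwidth = spectral_centroid_and_bandwidth(freqs, magnitudes)
```

For this symmetric spectrum the centroid sits at the middle bin, and the bandwidth is the weighted standard deviation of the bin frequencies around it.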

The spectral RollOff frequency is an audio feature that specifies the frequency below which the majority of the total spectral energy is contained. According to the documentation of Librosa (a Python library for analyzing music and audio), the RollOff frequency of each frame is defined as the center frequency of the spectrogram bin such that at least 85% of the energy of the spectrum in that frame is contained in that bin and the bins below it [19].
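A minimal sketch of the RollOff computation for one frame, assuming the per-bin energies and their center frequencies are already available. The function name and the example values are hypothetical.

```python
def spectral_rolloff(freqs, energies, roll_percent=0.85):
    """Return the lowest bin frequency at which the cumulative energy
    reaches `roll_percent` of the total energy in the frame."""
    threshold = roll_percent * sum(energies)
    cumulative = 0.0
    for freq, energy in zip(freqs, energies):
        cumulative += energy
        if cumulative >= threshold:
            return freq
    return freqs[-1]

# Toy frame: most of the energy sits in the lowest bins.
freqs = [0.0, 100.0, 200.0, 300.0, 400.0]
energies = [4.0, 3.0, 2.0, 1.0, 0.0]
rolloff = spectral_rolloff(freqs, energies)
```

Here the total energy is 10, so the 85% threshold of 8.5 is first reached in the third bin.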

The ZCR is the rate of change of the signal sign from positive to negative, or vice versa. The ZCR splits a signal into K frames such that {f_i(n) : 1 ≤ i ≤ K}, and for each frame, zcr_i is computed as [1]

$$\mathrm{zcr}_i = \sum_{n=1}^{r-1} \mathrm{sgn}\left[f_i(n) \times f_i(n-1)\right],$$

where r is the number of samples in each frame and

$$\mathrm{sgn}(f_i(n)) = \begin{cases} 1, & \text{if } f_i(n) > 0, \\ 0, & \text{otherwise.} \end{cases}$$
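The sketch below counts zero crossings per frame. It follows the common convention of counting a crossing whenever the product of two consecutive samples is negative, which is an assumption about how the indicator above is meant to be applied; the function names and the division by the frame length are also choices made for this example.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for prev, cur in zip(frame, frame[1:]) if prev * cur < 0)

def zero_crossing_rate(signal, frame_length=2048, hop_length=512):
    """Return the fraction of sign changes in each complete frame."""
    rates = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        rates.append(zero_crossings(frame) / frame_length)
    return rates

# Toy signal that alternates in sign on every sample.
rates = zero_crossing_rate([1.0, -1.0] * 4, frame_length=4, hop_length=2)
```

Noisy or unvoiced segments tend to produce high rates, whereas low-frequency periodic segments produce low rates.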

These seven features are representative features frequently used in signal processing.

### 3.2 Deep Learning Model and Implementation

We designed a new CNN model to classify snoring data based on multi-feature extraction. To achieve this goal, we utilized TensorFlow, an open-source library developed by Google that primarily focuses on deep learning. It has a comprehensive and flexible ecosystem of tools, libraries, and community resources that allows researchers and developers to easily build and deploy state-of-the-art ML-powered applications. TensorFlow offers multiple levels of abstraction for building and training models, including the high-level Keras API, which makes it easy to get started with TensorFlow and ML. Together with TensorFlow 2.0, Keras has been adopted more widely than any other deep-learning solution. We used the Keras and TensorFlow libraries for the 1D-CNN and 2D-CNN experiments, respectively. The 2D-CNN and 1D-CNN experiments were conducted on different hardware configurations. We utilized a system with an Intel Core i7 @ 3.60 GHz CPU and 32 GB RAM for the 1D-CNN, and another system with a single NVIDIA GTX 1080Ti 16 GB GPU, an Intel Core i7 @ 3.60 GHz CPU, and 32 GB RAM for the 2D-CNN. Figure 5 shows conceptual diagrams of the 1D-CNN and 2D-CNN models used in this study.

### 4.1 Experiments and Results

Initially, we experimented by converting our audio dataset into spectrogram images. The resulting images were then sent to a 2D-CNN classifier. The architecture of this 2D-CNN classifier is shown in Figure 6. The input size of the classifier was 32×32×3, and the three convolution layers had ReLU activations. In front of each layer, a max-pooling layer reduces the spatial dimension. In addition, the classifier had two fully connected layers, and batch normalization was used after each of these layers. We applied a softmax activation function to the output of the second fully connected, batch-normalized layer. The Adam optimizer was applied to minimize the cross-entropy loss. Both the number of epochs and the batch size were 100, and the learning rate was 0.0005. Consequently, an accuracy of 86.80% was achieved, as shown in Figure 7. This performance was unsatisfactory. Therefore, we attempted to improve the results using the feature extraction process.

We applied multi-feature extraction to significantly improve the classification accuracy. At the end of this process, the extracted feature data are forwarded to the 1D-CNN classifier, as shown in Figure 8. The generated feature data consisted of 26 columns and 1,761 rows. Feature data were divided into 80% for training and 20% for testing. In the proposed 1D-CNN shown in Figure 8, the input shape of 26×1 was converted into 13×2. Four convolutional layers were used, with two fully connected (dense) layers. The first convolutional layer (Conv1D 1) had 256 filters, a kernel size of 2, and a ReLU activation function. In the second convolution layer, 128 filters were used, and the activation function and kernel size were identical to those in the first layer. The next layer is a max-pooling layer with a window size of 2 to reduce the spatial dimensions of the input entering the third convolutional layer. The third and fourth convolution layers (Conv1D 3 and Conv1D 4) each used 64 filters, a kernel size of 2, and a ReLU activation function. After the fourth layer, we introduced a dropout mechanism that sets 50% of the active neurons to zero, which helped prevent overfitting during feature data training and increased the classification accuracy. Subsequently, the output was flattened and passed to a fully connected layer (dense 1) with 32 neurons and ReLU activation. The second dense layer is the output layer with two neurons using a softmax activation function. The cross-entropy loss function is defined to calculate the cost using Eq. (10). We then applied the Adam optimizer to minimize the loss. The network batch and epoch sizes were both set to 100.
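To make the layer dimensions concrete, the following sketch traces the output shape and parameter count of each layer described above. It is not the authors' implementation: the function names are hypothetical, and the padding ('valid'), the stride (1), and the row-major reshape of the 26 features into 13×2 are assumptions, since the text does not state them.

```python
def reshape_row(row, steps=13, channels=2):
    """Reshape a flat 26-value feature row into 13 steps of 2 channels (row-major)."""
    if len(row) != steps * channels:
        raise ValueError("unexpected feature count")
    return [row[i * channels:(i + 1) * channels] for i in range(steps)]

def trace_1d_cnn(steps=13, channels=2, kernel=2, pool=2):
    """Return (layer name, output shape, parameter count) for each layer."""
    layers = []
    length, ch = steps, channels

    def conv(name, filters):
        nonlocal length, ch
        params = (kernel * ch + 1) * filters       # weights plus biases
        length, ch = length - kernel + 1, filters  # assumed 'valid' padding, stride 1
        layers.append((name, (length, ch), params))

    conv("Conv1D 1", 256)
    conv("Conv1D 2", 128)
    length //= pool                                # max pooling with window 2
    layers.append(("MaxPooling1D", (length, ch), 0))
    conv("Conv1D 3", 64)
    conv("Conv1D 4", 64)
    layers.append(("Dropout 0.5", (length, ch), 0))
    flat = length * ch
    layers.append(("Flatten", (flat,), 0))
    layers.append(("Dense 1 (ReLU)", (32,), (flat + 1) * 32))
    layers.append(("Dense 2 (softmax)", (2,), (32 + 1) * 2))
    return layers

layers = trace_1d_cnn()
for name, shape, params in layers:
    print(f"{name:18s} {str(shape):10s} {params}")
total_params = sum(p for _, _, p in layers)
print("total parameters:", total_params)
```

Under these assumptions the network has just under 100,000 trainable parameters, which is consistent with the claim that a 1D-CNN of this kind can be trained on a CPU.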

$$\mathrm{cost} = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right],$$

where y_i and ŷ_i represent the target label and the estimated value, respectively.
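The cost above can be evaluated directly for a handful of predictions. In this sketch the function name is hypothetical, and the clipping constant `eps` is an added safeguard against taking the logarithm of zero; it is not part of the equation.

```python
import math

def cross_entropy_cost(targets, predictions, eps=1e-12):
    """Sum of binary cross-entropy terms over all samples."""
    cost = 0.0
    for y, y_hat in zip(targets, predictions):
        y_hat = min(max(y_hat, eps), 1.0 - eps)   # clip to avoid log(0)
        cost -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return cost

# Confident, correct predictions give a lower cost than uncertain ones.
confident = cross_entropy_cost([1, 0], [0.99, 0.01])
uncertain = cross_entropy_cost([1, 0], [0.5, 0.5])
```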

As shown in Figure 7, the proposed network achieved an accuracy of more than 95% after 20 epochs and reached 99.7% accuracy at the end of 100 epochs. The multi-feature approach used to extract the features can differentiate between snoring and non-snoring data and consequently paves the way for 1D-CNN classifiers to achieve such a high classification accuracy. Moreover, the architecture of the classifier, including the parameter settings, also contributes significantly to the results. Based on these results, a deep learning model was created to detect snoring/OSA with a high classification accuracy in the presence of noise disturbances. Table 1 lists the variation in accuracy for different combinations of features. The MFCC had an extremely large impact on the results, whereas the spectral features alone did not significantly affect the classification accuracy. Nonetheless, the combination of all seven features provided the best overall accuracy.

### 4.2 Discussion

Table 2 presents the classification results of the existing research on snoring data [6,9–11,20–23]. From Table 2, it is evident that the multi-feature extraction techniques (ZCR+STFT+MFCC) along with the RNN classifier yielded the best result of 98.8% with one of the largest data sizes [7]. Compared with this highest result of 98.8% in Table 2, the accuracy of the model using the proposed MFCC+ZCR+STFT+Spectral+RMS features and 1D-CNN presented in Table 1 is 99.7%, which is an improvement of 0.9 percentage points. When considering the error rate instead of the accuracy, the 1.2% error rate was reduced to 0.3%, that is, to one-fourth of its previous value. This confirms that the proposed method is highly effective in classifying data on snoring, which is a precursor to a sleep apnea transition. By contrast, the absence of a feature extraction technique may seriously affect the result of a classifier and thus reduce the classification accuracy, as cited in [22].
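The error-rate comparison can be written out explicitly. The subscripts are labels introduced here for the best previous result in Table 2 and the proposed model:

$$
e_{\text{prev}} = 100\% - 98.8\% = 1.2\%, \qquad
e_{\text{prop}} = 100\% - 99.7\% = 0.3\%, \qquad
\frac{e_{\text{prop}}}{e_{\text{prev}}} = \frac{0.3}{1.2} = 0.25 .
$$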

We presented a multi-feature extraction approach for the classification of snoring sounds using a CNN classifier. The features include the STFT, RMS, spectral centroid, bandwidth, RollOff, ZCR, and MFCC. The features extracted from the sound dataset were used to form a feature dataset for training. A 2D-CNN was designed to classify spectrogram images of the sound dataset, and a 1D-CNN was designed to classify the extracted feature sets. The experiment results showed that the proposed 1D-CNN achieved an accuracy of 99.7% for snoring sounds. This shows that the proposed method is useful for identifying snoring sounds. Our study provides a multi-feature extraction process that has the potential to extract essential features that can produce a high classification accuracy from audio data. The snoring CNN model can be adopted as an alternative and inexpensive method for diagnosing or detecting snoring/OSA. In addition, the snoring data acquisition approach can be adopted to capture audio datasets in the fields of signal processing, medicine, and music information retrieval.

In the future, we plan to expand the field of application by applying the proposed method to other datasets, such as speech, environmental, and animal sounds. For example, cat sound classification research has been previously conducted in Korea [24]. However, it is not easy to apply a deep learning approach to the classification of cat sounds owing to data sparsity. In addition, in the non-face-to-face environment caused by coronavirus disease 2019 (COVID-19), the data collection method and the proposed feature extraction method are expected to be used in various ways with sound data in work-from-home settings.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (No. NRF-2020R1F1A104833611).

Fig. 1. A generic architecture for sound classification.

Fig. 2. Overall proposed system architecture for 1D-CNN.

Fig. 3. Acquisition sequence of snoring data.

Fig. 4. A graphical view of MFCC feature from a snore signal.

Fig. 5. Conceptual diagrams of (a) 1D-CNN model and (b) 2D-CNN model.

Fig. 6. A 2D-CNN architecture using spectrogram images.

Fig. 7. Classification accuracy for snoring and non-snoring using multi-feature techniques and spectrogram image.

Fig. 8. A 1D-CNN architecture for snoring sound classification.

Table 1. Variation in accuracy of features using the proposed 1D-CNN on snoring dataset.

| Feature | Accuracy (%) |
|---|---|
| Spectral (Centroid+Bandwidth+RollOff) | 50.18 |
| STFT+RMS+Spectral+ZCR | 60.79 |
| STFT+RMS+Spectral | 89.46 |
| MFCC | 99.00 |
| MFCC+ZCR | 99.00 |
| Spectral+ZCR+MFCC | 99.20 |
| MFCC+ZCR+STFT+Spectral+RMS | 99.70 |

Table 2. Classification results in existing studies on snoring data.

| Study | Feature extraction technique | Classifier | Data size (subjects) | Data size (training) | Test accuracy (%) |
|---|---|---|---|---|---|
| Demir et al. [20] | LBP+HOG | SVM | - | 828 | 72.00 |
| Lim et al. [11] | ZCR+STFT+MFCC | RNN | 8 | 5600 | 98.80 |
| Kang et al. [6] | MFCC | CNN+LSTM | 24 | 24 | 88.28 |
| Arsenali et al. [9] | MFCC | RNN | 20 | 5670 | 95.00 |
| Khan [21] | MFCC | CNN | - | 1000 | 96.00 |
| Wang et al. [22] | - | Dual CNN+GRU | - | 828 | 63.80 |
| Tuncer et al. [10] | PTT signal+AlexNet+VGG16 | SVM+KNN | 100 | 100 | 92.78 |
| Dalal and Triggs [23] | SCAT+GMM+MAP | MLP | 224 | 282 | 67.71 |

GRU, gated recurrent unit; SCAT, deep scattering spectrum; GMM, Gaussian mixture model; MAP, maximum a posteriori; MLP, multilayer perceptron.

1. Li, FF, and Cox, TJ (2019). Digital Signal Processing in Audio and Acoustical Engineering. Boca Raton, FL: CRC Press
2. Khamparia, A, Gupta, D, Nguyen, NG, Khanna, A, Pandey, B, and Tiwari, P (2019). Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access. 7, 7717-7727. https://doi.org/10.1109/ACCESS.2018.2888882
3. Parra, O, Arboix, A, Montserrat, JM, Quinto, L, Bechich, S, and Garcia-Eroles, L (2004). Sleep-related breathing disorders: impact on mortality of cerebrovascular disease. European Respiratory Journal. 24, 267-272. https://doi.org/10.1183/09031936.04.00061503
4. Zhang, X, and Qiu, X (2017). Performance of a snoring noise control system based on an active partition. Applied Acoustics. 116, 283-290. https://doi.org/10.1016/j.apacoust.2016.09.032
5. Okuno, K, Furuhashi, A, Nakamura, S, Suzuki, H, Arisaka, T, and Taga, H (2019). Japanese cross-sectional multicenter survey (JAMS) of oral appliance therapy in the management of obstructive sleep apnea. International Journal of Environmental Research and Public Health. 16. article no. 3288
6. Kang, B, Dang, X, and Wei, R. Snoring and apnea detection based on hybrid neural networks. Proceedings of 2017 International Conference on Orange Technologies (ICOT), 2017, Singapore, pp. 57-60. https://doi.org/10.1109/ICOT.2017.8336088
7. Adesuyi, TA, Kim, BM, and Shin, YS (2020). A brief on snoring data and classification methods. International Journal of Advanced Trends in Computer Science and Engineering. 9, 426-432. https://doi.org/10.30534/ijatcse/2020/59912020
8. Amiriparian, S, Gerczuk, M, Ottl, S, Cummins, N, Freitag, M, Pugachevskiy, S, Baird, A, and Schuller, B. Snore sound classification using image-based deep spectrum features. Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017, Stockholm, Sweden, pp. 3512-3516. https://doi.org/10.21437/interspeech.2017-434
9. Arsenali, B, van Dijk, J, Ouweltjes, O, den Brinker, B, Pevernagie, D, Krijn, R, van Gilst, M, and Overeem, S. Recurrent neural network for classification of snoring and non-snoring sound events. Proceedings of 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2018, Honolulu, HI, pp. 328-331. https://doi.org/10.1109/EMBC.2018.8512251
10. Tuncer, SA, Akılotu, B, and Toraman, S (2019). A deep learning-based decision support system for diagnosis of OSAS using PTT signals. Medical Hypotheses. 127, 15-22. https://doi.org/10.1016/j.mehy.2019.03.026
11. Lim, SJ, Jang, SJ, Lim, JY, and Ko, JH (2019). Classification of snoring sound based on a recurrent neural network. Expert Systems with Applications. 123, 237-245. https://doi.org/10.1016/j.eswa.2019.01.020
12. Arasi, MA, and Babu, S (2019). Survey of machine learning techniques in medical imaging. International Journal of Advanced Trends in Computer Science and Engineering. 8, 2107-2116. https://doi.org/10.30534/ijatcse/2019/39852019
13. Huang, J, Wei, Y, Yi, J, and Liu, M. An improved kNN based on class contribution and feature weighting. Proceedings of 2018 10th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), 2018, Changsha, China, pp. 313-316. https://doi.org/10.1109/ICMTMA.2018.00083
14. Kiranyaz, S, Avci, O, Abdeljaber, O, Ince, T, Gabbouj, M, and Inman, DJ (2019). 1D convolutional neural networks and applications: a survey. [Online]. Available: https://arxiv.org/abs/1905.03554
15. McFee, B, Raffel, C, Liang, D, Ellis, DP, McVicar, M, Battenberg, E, and Nieto, O. librosa: audio and music signal analysis in Python. Proceedings of the 14th Python in Science Conference, 2015, pp. 18-25. https://doi.org/10.25080/majora-7b98e3ed-003
16. Faizy, Jal (2017). 10 Advanced deep learning architectures data scientists should know!. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/08/10-advanced-deep-learning-architectures-data-scientists/
17. Adesuyi, TA (2020). A Convolutional Neural Network Model for Sound Classification Based on Multi-Feature Extraction. PhD dissertation. Department of Software Engineering, Kumoh National University of Technology. Korea.
18. Yu, S, You, X, Ou, W, Jiang, X, Zhao, K, Zhu, Z, Mou, Y, and Zhao, X (2016). STFT-like time frequency representations of nonstationary signal with arbitrary sampling schemes. Neurocomputing. 204, 211-221. https://doi.org/10.1016/j.neucom.2015.08.130
20. Demir, F, Sengur, A, Cummins, N, Amiriparian, S, and Schuller, B. Low level texture features for snore sound discrimination, pp. 413-416. https://doi.org/10.1109/EMBC.2018.8512459
21. Khan, T (2019). A deep learning model for snoring detection and vibration notification using a smart wearable gadget. Electronics. 8. article no. 987
22. Wang, J, Stromfeli, H, and Schuller, BW. A CNN-GRU approach to capture time-frequency pattern interdependence for snore sound classification. Proceedings of 2018 26th European Signal Processing Conference (EUSIPCO), 2018, Rome, Italy, pp. 997-1001. https://doi.org/10.23919/EUSIPCO.2018.8553521
23. Dalal, N, and Triggs, B. Histograms of oriented gradients for human detection. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, San Diego, CA, pp. 886-893. https://doi.org/10.1109/CVPR.2005.177
24. Pandeya, YR, and Lee, J (2018). Domestic cat sound classification using transfer learning. International Journal of Fuzzy Logic and Intelligent Systems. 18, 154-160. https://doi.org/10.5391/IJFIS.2018.18.2.154

Tosin A. Adesuyi received his B.Tech. and M.Tech. degrees in computer science from the Federal University of Technology, Akure, Nigeria, in 2010 and 2014, respectively. He received his Ph.D. in artificial intelligence from the Department of Software Engineering, Kumoh National Institute of Technology, Korea, in 2020. His research work has been published in premier conferences and journals. His research areas include artificial intelligence, computer vision, privacy, e-learning, deep learning, and accelerated optimized AI models. He currently works as a GPU advocate at NVIDIA Corp.

Byeong Man Kim received the B.S. degree from the Department of Computer Engineering, Seoul National University (SNU), Korea, in 1987, and the M.S. and Ph.D. degrees in computer science from Korea Advanced Institute of Science and Technology (KAIST), Korea, in 1989 and 1992, respectively. He has been with Kumoh National Institute of Technology since 1992 as a faculty member of the Computer Software Engineering Department. From 1998–1999, he was a post-doctoral fellow at UC Irvine. From 2005–2006, he was a visiting scholar at the Dept. of Computer Science of Colorado State University, working on the design of a collaborative web agent based on a friend network. His current research areas include artificial intelligence, information filtering, information security, and brain computer interface.

E-mail: bmkim@kumoh.ac.kr

Jongwan Kim received B.S., M.S., and Ph.D. degrees in Department of Computer Engineering from Seoul National University, Korea, in 1987, 1989, and 1994, respectively. He has been with Daegu University since 1995 and is currently a professor. His research interests include artificial intelligence, internet dysfunction, human computer interaction, and IT convergence education.

E-mail: jwkim@daegu.ac.kr

### Article

Keywords: Sound recognition, Snoring sound, CNN, Multi-feature extraction

### 1. Introduction

Sound can be described as pressure caused by the vibration of an object. Sound is an important aspect in human relationships and communication. Recent research on sound recognition has received considerable attention because it is used in various fields such as bio-acoustic monitoring, wild animal intrusion detection, environmental sound, and audio surveillance [1]. The sound-recognition process includes three steps: signal preprocessing, feature extraction, and classification [2]. This study is focused on snoring sound classification because snoring has a significant impact on sleep health.

Sleep is unique to humans and animals. A phenomenon called snoring occurs when there is a sleep disorder. Snoring is classified as primary snoring, which involves no comorbidity, and obstructive sleep apnea (OSA), which poses health risks [3]. Snoring noise is estimated to reach approximately 90 to 100 dB, which can cause hearing loss in people next to the snoring person [4]. OSA is a sleep-related respiratory disorder that repeatedly causes a partial or complete obstruction of the upper airways during sleep [5]. An easy way to diagnose OSA is to use polysomnography; however, this method is expensive [6]. Signal processing and artificial intelligence (AI) are the most promising approaches among the existing methods for diagnosing snoring, as in the case of apnea [7]. In the AI field, classifiers such as the support vector machine (SVM), K-nearest neighbor (KNN), recurrent neural network (RNN), and convolutional neural network (CNN) are appropriate solutions. However, these classifiers rely on the feature extraction techniques used in signal processing to extract features from the snoring audio signal to conduct an accurate classification. In this case, snoring data and feature extraction are the most important factors because they are required to train the classifier. Several approaches to snoring dataset acquisition and frequently used feature extraction methods are described in [7]. According to the dataset made available by the INTERSPEECH 2017 ComParE Snoring Challenge [8], OSA is categorized into four classes based on the source of obstruction that leads to the snoring: velum (V), oropharyngeal (O), tongue (T), and epiglottis (E). Distinguishing real snoring sounds from background noise and from other sleep sounds (e.g., sleep talking), which are also considered noise, is a problem that needs to be addressed. Addressing this challenge using a deep learning model will facilitate a more accurate classification/detection that distinguishes snoring sounds from non-snoring sounds. The authors in [9] stressed that this step is necessary but insufficient and further emphasized the need for the right choice of feature extraction techniques to acquire a desirable result.

In this paper, we present a one-dimensional (1D) CNN for snoring and non-snoring sound classification based on several signal processing approaches for feature extraction. The main contribution of this study is to overcome the challenges mentioned above and provide a deep learning model that serves as an alternative method for distinguishing snoring sounds with an extremely high classification accuracy. The contributions of this study are summarized as follows:

• We generated spectrogram images from our dataset and fed them into the 2D-CNN network; however, the resulting accuracy was poor.

• We extracted unique features from our snoring/non-snoring sound dataset according to their importance in boosting the classification accuracy.

• The extracted features served as a new dataset (numerical data) and were fed into a 1D-CNN network. The resulting improvement in accuracy was significant.

• Finally, we present the research results obtained by comparing and evaluating the proposed method with other existing approaches.

The rest of this paper is organized as follows: Section 2 describes previous related studies, including the feature extraction process and classification methods. Sections 3 and 4 present the proposed approach and experiment results and discussion, respectively. Finally, Section 5 provides some concluding remarks regarding this study.

### 2. Related Work

To achieve this goal, a detailed review of related studies was conducted. First, we describe the snoring data acquisition methods. Based on previous research [7], snoring data sources can be grouped into four categories: snore sound corpora available online, snore data provided by medical organizations, snore data recorded from subjects, and crowdsourced snore data. A typical example of an online snoring sound corpus is the Munich-Passau Snore Sound Corpus, which consists of 828 snore audio samples grouped into four categories: V, O, T, and E [8]. Some researchers have collected snore data from patients with apnea in hospitals or medical organizations [10]. Snoring audio can also be captured from people who have consented to have their snoring recorded during natural sleep or sleep induced by substances [11]. Finally, snoring and related data can be collected and aggregated from different online sources [7]. In this study, we use the third method and collect snoring data directly from the subjects.

Conventional classification methods include SVM, KNN, CNN, RNN, and their hybrids. An SVM was originally developed to classify data into two classes. According to the authors in [12], it is regarded as a technique for classifying both linear and nonlinear datasets. A KNN classifier searches the n-dimensional pattern space of the training data for the k training samples nearest to an unknown sample and classifies the sample based on these nearest neighbors [13]. RNNs are designed to learn from sequential information. One issue with an RNN is the vanishing or exploding gradient problem, which arises with long input-output sequences. Long short-term memory (LSTM) was designed to overcome this problem. LSTM works by allowing the input x(t) at time t to influence the storage or overwriting of memories stored in the cell. CNN models are a type of deep neural network developed for image recognition, such that the model is fed with the image pixels as a 2D input. A CNN is based on the convolution of images and the extraction of salient features through filters that are learned by the network during the training phase. A 1D-CNN was cited in [14] as having lower computational complexity than a 2D-CNN, having a shallow architecture with the potential to learn complex features, requiring fewer computational resources (using a CPU rather than a GPU), and being well suited for real-time and low-cost applications on hand-held devices.

Figure 1 shows the generic architecture for the snoring sound classification process [7]. In previous studies, snoring signals were converted into spectrogram images, and feature extraction techniques were then applied. Finally, the extracted features were passed into a classifier for classification.

Several approaches have been proposed for classifying snoring audio sounds. Many of these approaches are based on either the application of feature extraction techniques along with classifier(s), or the direct use of classifier(s) alone. A new rapidly developing field called music information retrieval (MIR), which encompasses musicology, digital signal processing, and machine learning (ML) [15], has recently gained the interest of researchers. In the literature, the approaches used in the field of MIR are efficient for extracting salient features from sound signals. These include Mel-frequency cepstral coefficient (MFCC), RollOff, short-time Fourier transform (STFT), spectral features, zero-crossing rate (ZCR), local binary pattern (LBP), histogram of oriented gradient (HOG), and pulse transition time (PTT). In this study, we combined several of these approaches to extract features from our data before training with a CNN. The results obtained prove the efficiency and effectiveness of MIR techniques.

In recent years, deep neural networks such as AlexNet and region-based CNN (R-CNNs) have been proposed to overcome the limitations of classic neural networks [16]. In this study, a new CNN was designed to classify sounds using the extracted feature sets. The CNN architecture used to classify the snoring and non-snoring feature datasets consists of four convolutional layers, a max-pooling layer, a dropout, and a softmax activation function.

### 3.1 Data Acquisition and Feature Extraction

We propose a 1D-CNN model for snoring sound classification based on multi-feature extraction. The essence of the multi-feature extraction process is the extraction of salient features that clearly distinguish sound signals from closely related signals, which may not be achievable using a single extraction technique. In this study, seven features were applied: STFT, root mean square (RMS), spectral centroid, bandwidth, RollOff, ZCR, and MFCC. The overall architecture of the proposed system is shown in Figure 2.

The authors in [7] identified the four most common categories of snoring data acquisition as a publicly available snore sound corpus, snore data provided by a medical organization, snore data created through subjects, and crowdsourced snore data. We adopted the third category by creating snoring data from six individuals [17], i.e., males between 25 and 30 years of age. Two Samsung Galaxy S8 phones were placed 1 m away from the head of the subjects, and recording was conducted during active sleeping hours between 11:30 pm and 5:30 am for two nights. All recorded snoring audio clips were divided into sections with a length of 40 s and sampled at 44.1 kHz; hence, a total of 1,080 snoring episodes were generated. However, owing to the presence of sections without audio, the number of snoring episodes was reduced to 881. A total of 880 segmented non-snoring audio clips, including clock ticks, sleep talk, yawning, pet snoring, door sounds, and bed spring sounds, were added. Snoring audio clips were labeled as 1 and non-snoring audio clips as 0, forming a total data size of 1,761 [17].
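The segmentation arithmetic in this paragraph can be sketched in a few lines of Python. This is only an illustration: the function name `segment_bounds` is hypothetical, and the assumption of one continuous six-hour recording per night is ours, since the text does not state how the recordings from the two phones and six subjects were combined.

```python
SAMPLE_RATE_HZ = 44_100   # sampling rate stated in the text
SEGMENT_SECONDS = 40      # section length stated in the text


def segment_bounds(total_samples, sample_rate=SAMPLE_RATE_HZ, seconds=SEGMENT_SECONDS):
    """Return (start, end) sample indices of consecutive full-length sections."""
    seg_len = sample_rate * seconds
    n_full = total_samples // seg_len  # a trailing partial section is dropped
    return [(i * seg_len, (i + 1) * seg_len) for i in range(n_full)]


# Assumption (not from the text): one continuous 6-hour recording per night,
# covering 11:30 pm to 5:30 am.
night_samples = 6 * 3600 * SAMPLE_RATE_HZ
sections_per_night = len(segment_bounds(night_samples))
sections_two_nights = 2 * sections_per_night

# Label convention from the text: snoring = 1, non-snoring = 0.
n_snoring, n_non_snoring = 881, 880
labels = [1] * n_snoring + [0] * n_non_snoring

print(sections_per_night, sections_two_nights, len(labels))
```

Under that assumption, two nights yield the same number of 40 s sections as the count reported above, before silent sections are discarded.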

A graphical image of the data acquisition is shown in Figure 3. Next, we transformed the audio into spectrogram images and built a 2D-CNN to train on them. However, the results were not encouraging. Hence, we proceeded to extract features from the audio dataset, resulting in a new set of data. This feature set led to the use of a 1D-CNN network because the dataset contained numeric values in one dimension. The Excel chart at the bottom of Figure 3 shows a cross-section of the extracted features.

Seven feature extraction approaches were used to generate the feature dataset, as shown in Figure 3. This feature set was fed into a 1D-CNN for classification. The energy of an audio signal is referred to as the total magnitude or loudness of the signal x(n), and is represented as

$E=\sum_{n}|x(n)|^{2}.$

The RMS is used to characterize the average of a continuously varying audio signal, and its value represents the total waveform. The RMS energy for each frame of the audio signal is defined as

$RMS=\sqrt{\frac{1}{n}\sum_{n}|x(n)|^{2}},$

where n denotes the number of samples in a frame. The frame length was set to 2048, and the hop length was set to 512. The RMS enables the detection of a sudden increase in audio signal energy, which helps to provide a clear distinction between silence or a low-signal mode and a voiced mode.
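As a minimal sketch of the frame-wise RMS defined above, the following pure-Python function uses the stated frame length (2048) and hop length (512). The function name `frame_rms` and the toy signal are illustrative assumptions. Unlike Librosa's default behavior, this sketch does not pad the signal, so only full frames are evaluated.

```python
import math

FRAME_LENGTH = 2048  # frame length stated in the text
HOP_LENGTH = 512     # hop length stated in the text


def frame_rms(signal, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH):
    """RMS of each full frame: square root of the mean squared amplitude."""
    values = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        energy = sum(x * x for x in frame)  # E = sum of |x(n)|^2 over the frame
        values.append(math.sqrt(energy / frame_length))
    return values


# Toy signal (hypothetical): near-silence followed by a louder burst.
quiet = [0.01] * 4096
loud = [0.5] * 4096
rms = frame_rms(quiet + loud)
print(len(rms), rms[0], rms[-1])
```

The RMS rises across the frames that straddle the boundary between the quiet and loud parts, which is the sudden-increase behavior that the feature is meant to capture.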

The MFCC of a signal is a set of audio signal features that provide a descriptive overview of the spectral envelope [7]. The number of MFCCs was set to 20, and the discrete cosine transform (DCT) type was set to 2. The DCT is applied to the log energies from the filter bank and transforms them from the frequency domain to the time domain. In addition, the Mel filter bank estimates the amount of energy that exists at low frequencies.

The Mel frequency, Mel(f), can be defined as follows:

$Mel(f)=2595\log_{10}\left(\frac{f}{700}+1\right),$

where f is the linear scale frequency, and m is the number of MFCCs. Figure 4 shows an MFCC plot with 20 features in the time series. The x-axis in Figure 4 is measured in seconds, whereas the y-axis is measured in decibels.
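The Mel mapping above is easy to check numerically. The sketch below implements the formula as written, together with its algebraic inverse; the function names are illustrative. Note that Librosa's own `hz_to_mel` uses a different (Slaney-style) scale unless `htk=True` is passed, so values from this formula correspond to the HTK variant.

```python
import math


def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(f / 700 + 1)."""
    return 2595.0 * math.log10(f_hz / 700.0 + 1.0)


def mel_to_hz(mel):
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz shrink on the Mel scale as the frequency grows.
for f in (0, 700, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))
```

The scale is nearly linear below about 1 kHz and compressive above it, which is why a Mel filter bank resolves low frequencies more finely than high ones.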

According to the authors of [18], the STFT of a signal for time t is computed as

$STFT(\omega,\tau)=\int_{-\infty}^{\infty}x(t)\,\omega^{*}(t-\tau)\,e^{-j\omega\tau}\,dt,$

where ω(t − τ) is a window function positioned at τ on the time axis and can be expressed in a Gaussian form.
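In practice the STFT is computed on sampled audio, frame by frame. The following sketch is a discrete, finite-frame analogue under stated assumptions: a Gaussian window whose width (`sigma_ratio`) is a hypothetical choice, a naive DFT in place of an FFT, and a small frame length so that the pure-Python loops stay fast. It is not the implementation used in the study.

```python
import cmath
import math


def gaussian_window(length, sigma_ratio=0.25):
    """Gaussian window centred on the frame; sigma_ratio is a hypothetical choice."""
    centre = (length - 1) / 2.0
    sigma = sigma_ratio * length
    return [math.exp(-0.5 * ((n - centre) / sigma) ** 2) for n in range(length)]


def stft(signal, frame_length=64, hop_length=16):
    """Return one list of complex DFT bins (0..frame_length//2) per frame."""
    window = gaussian_window(frame_length)
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        windowed = [signal[start + n] * window[n] for n in range(frame_length)]
        bins = []
        for k in range(frame_length // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                      for n in range(frame_length))
            bins.append(acc)
        frames.append(bins)
    return frames


# Toy input: a sinusoid completing 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spectrum = stft(tone)
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in spectrum]
print(len(spectrum), peak_bins)
```

For the stationary test tone, every frame peaks at the same frequency bin; for a snoring recording, the peak location and spread would vary from frame to frame.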

The spectral centroid identifies the point where the center of mass of an audio sound exists by estimating the weighted mean of the frequencies within the audio signal. In the computation of the spectral centroid, each frame magnitude signal is normalized as follows:

$\tilde{x}(k)=\frac{|x(k)|}{\sum_{k}|x(k)|},$

where x(k) represents the discrete audio signal, and f(k) is the frequency at bin k. Hence, the spectral centroid is computed as [1]

$S_c=\sum_{k}f(k)\tilde{x}(k).$

The spectral bandwidth expresses the spectral spread of an audio signal. The spectral bandwidth at frame t is calculated as [1]

$S_b[t]=\left(\sum_{k}\left(f[k,t]-S_c[t]\right)^{p}\tilde{x}[k,t]\right)^{1/p},$

where x̃[k, t] is the normalized spectral magnitude in frequency bin k at frame t, f[k, t] is the frequency in bin k, and Sc[t] is the spectral centroid at frame t.

The spectral RollOff frequency is an audio feature that specifies the frequency below which the majority of the total spectral energy is contained. According to the documentation of Librosa (a Python library for analyzing music and audio), the RollOff frequency is defined for each frame as the center frequency of the spectrogram bin such that at least 85% of the energy of the spectrum in that frame is contained in that bin and the bins below it [19].
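The three spectral features (centroid, bandwidth, and RollOff) can be sketched for a single frame as follows. The function name, the default exponent p = 2, and the toy spectrum are illustrative assumptions; the RollOff here accumulates the magnitudes that are passed in, so whether it reflects magnitude or power depends on the spectrum supplied.

```python
def spectral_features(freqs, mags, p=2, roll_percent=0.85):
    """Centroid, bandwidth and roll-off of one frame's magnitude spectrum."""
    total = sum(mags)
    if total == 0:
        return 0.0, 0.0, 0.0  # silent frame: nothing to weight
    norm = [m / total for m in mags]                    # normalised magnitudes
    centroid = sum(f * w for f, w in zip(freqs, norm))  # weighted mean frequency
    bandwidth = sum(w * abs(f - centroid) ** p
                    for f, w in zip(freqs, norm)) ** (1.0 / p)
    # Roll-off: first bin whose cumulative magnitude reaches roll_percent of the total.
    threshold = roll_percent * total
    cumulative = 0.0
    rolloff = freqs[-1]
    for f, m in zip(freqs, mags):
        cumulative += m
        if cumulative >= threshold:
            rolloff = f
            break
    return centroid, bandwidth, rolloff


# Toy spectrum (hypothetical): all energy split evenly between 100 Hz and 200 Hz.
freqs = [0.0, 100.0, 200.0, 300.0]
mags = [0.0, 1.0, 1.0, 0.0]
centroid, bandwidth, rolloff = spectral_features(freqs, mags)
print(centroid, bandwidth, rolloff)
```

For this spectrum the centroid falls midway between the two active bins, the bandwidth equals their distance from the centroid, and the RollOff lands on the upper active bin.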

The ZCR is the rate of change of the signal sign from positive to negative, or vice versa. The ZCR splits a signal into K frames such that {f_i(n) : 1 ≤ i ≤ K}, and for each frame, zcr_i is computed as [1]

$zcr_i=\sum_{n=1}^{r-1}sgn[f_i(n)\times f_i(n-1)],$

where r is the number of samples in each frame and

$sgn(f_i(n))=\begin{cases}1, & \text{if } f_i(n)>0,\\ 0, & \text{otherwise.}\end{cases}$
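A short sketch of a frame-wise ZCR is given below. It follows the common reading of the feature, counting consecutive sample pairs whose signs differ and dividing by the number of pairs, rather than reproducing the indicator form above term by term. Treating zero as positive, the function names, and reusing the RMS frame settings (2048 and 512) as defaults are assumptions made for illustration.

```python
def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ (zero counts as positive)."""
    if len(frame) < 2:
        return 0.0
    signs = [x >= 0 for x in frame]
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return crossings / (len(frame) - 1)


def frame_zcr(signal, frame_length=2048, hop_length=512):
    """ZCR of each full frame of the signal."""
    return [zero_crossing_rate(signal[s:s + frame_length])
            for s in range(0, len(signal) - frame_length + 1, hop_length)]


print(zero_crossing_rate([1, -1, 1, -1]))   # alternating signs
print(zero_crossing_rate([1, 2, 3]))        # no sign change
```

Noisy or unvoiced segments change sign often and give a high rate, whereas low-frequency periodic segments give a low one.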

These seven features are representative features frequently used in signal processing.

### 3.2 Deep Learning Model and Implementation

We designed a new CNN model to classify snoring data based on multi-feature extraction. To achieve this goal, we utilized TensorFlow, an open-source library developed by Google that primarily focuses on deep learning. It has a comprehensive and flexible ecosystem of tools, libraries, and community resources that allows researchers and developers to easily build and deploy state-of-the-art ML-powered applications. TensorFlow offers multiple levels of abstraction for building and training models, including the high-level Keras API, which makes it easy to get started with TensorFlow and ML. Together with TensorFlow 2.0, Keras has been adopted more widely than any other deep-learning solution. We used the Keras and TensorFlow libraries for the 1D-CNN and 2D-CNN experiments, respectively. In this study, experiments were conducted using the 2D-CNN and 1D-CNN training approaches with different hardware configurations. For the 1D-CNN, we used an Intel Core i7 CPU at 3.60 GHz with 32 GB of RAM; for the 2D-CNN, we used another system with a single NVIDIA GTX 1080Ti 16 GB GPU, an Intel Core i7 CPU at 3.60 GHz, and 32 GB of RAM. Figure 5 shows conceptual diagrams of the 1D-CNN and 2D-CNN models used in this study.

### 4.1 Experiments and Results

Initially, we experimented by converting our audio dataset into spectrogram images. The resulting images were then sent to a 2D-CNN classifier. A typical architecture of the proposed 2D-CNN classifier is shown in Figure 6. The input size of the classifier was 32×32×3, and the three convolution layers had ReLU activations. In front of each layer, a max-pooling layer reduces the spatial dimension. In addition, the classifier had two fully connected layers, and batch normalization was used after each of these layers. We applied a softmax activation function to the output of the second fully connected, batch-normalized layer. The Adam optimizer was applied to minimize the cross-entropy loss. The number of epochs and the batch size were both 100, and the learning rate was 0.0005. Consequently, an accuracy of 86.80% was achieved, as shown in Figure 7. However, this performance was unsatisfactory. Therefore, we attempted to improve the results using the feature extraction process.

We applied multi-feature extraction to significantly improve the classification accuracy. At the end of this process, the extracted feature data are forwarded to the 1D-CNN classifier, as shown in Figure 8. The generated feature data consisted of 26 columns and 1,761 rows. The feature data were divided into 80% for training and 20% for testing. In the proposed 1D-CNN shown in Figure 8, the input shape of 26×1 was converted into 13×2. Four convolutional layers were used, with two fully connected (dense) layers. The first convolutional layer (Conv1D 1) had 256 filters, a kernel size of 2, and a ReLU activation function. In the second convolutional layer, 128 filters were used, and the activation function and kernel size were identical to those in the first layer. The second convolutional layer is followed by a max-pooling layer with a window size of 2, which reduces the spatial dimensions of the input entering the third convolutional layer. The third and fourth convolutional layers (Conv1D 3 and Conv1D 4) each used 64 filters, a kernel size of 2, and a ReLU activation function. After the fourth layer, we introduced a dropout mechanism that sets 50% of the active neurons to zero, which helped prevent overfitting during training and increased the classification accuracy. Subsequently, the output was flattened and passed to a fully connected layer (dense 1) with 32 neurons and ReLU activation. The second dense layer is the output layer with two neurons using a softmax activation function. The cross-entropy loss function defined in Eq. (10) was used to calculate the cost, and we applied the Adam optimizer to minimize the loss. The network batch and epoch sizes were both set to 100.
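To make the layer stack easier to follow, the sketch below traces the tensor shape and the trainable-parameter count through the described 1D-CNN. It assumes a stride of 1 and no padding for the convolutions and a pooling stride equal to the window size, which are the Keras defaults but are not stated in the text; the helper names are hypothetical.

```python
KERNEL = 2  # kernel size stated for all four convolutional layers


def conv1d_out(length, kernel, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel) // stride + 1


def conv1d_params(in_channels, filters, kernel):
    """Trainable weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


length, channels = 13, 2  # 26 features reshaped to 13 x 2
shapes = [(length, channels)]
n_params = 0

# Layer order from the text: two convolutions, max pooling, two convolutions.
layers = [("conv", 256), ("conv", 128), ("pool", 2), ("conv", 64), ("conv", 64)]
for kind, value in layers:
    if kind == "conv":
        n_params += conv1d_params(channels, value, KERNEL)
        length, channels = conv1d_out(length, KERNEL), value
    else:
        length = length // value  # max pooling with window = stride = value
    shapes.append((length, channels))

flat = length * channels          # dropout keeps the shape; flatten follows
n_params += flat * 32 + 32        # dense 1 (32 units)
n_params += 32 * 2 + 2            # output layer (2 units, softmax)

print(shapes, flat, n_params)
```

Under these assumptions the network stays below one hundred thousand trainable parameters, which is consistent with the text's point that the 1D-CNN can be trained on a CPU.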

$cost=-\sum_{i=1}^{n}\left[y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i)\right],$

where y_i and ŷ_i represent the target label and the estimated value, respectively.
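The cost above can be evaluated directly for a handful of predictions. The sketch below is a plain-Python illustration; the clipping constant `eps` is an added numerical safeguard that is not part of the equation, and the example predictions are invented.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Summed cross-entropy cost; eps (an added safeguard) avoids log(0)."""
    cost = 0.0
    for y, y_hat in zip(targets, predictions):
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        cost -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return cost


targets = [1, 0, 1, 0]                  # 1 = snoring, 0 = non-snoring
confident = [0.99, 0.01, 0.98, 0.02]    # hypothetical well-separated outputs
uncertain = [0.60, 0.40, 0.55, 0.45]    # hypothetical hesitant outputs

print(binary_cross_entropy(targets, confident))
print(binary_cross_entropy(targets, uncertain))
```

Confident correct predictions give a cost close to zero, whereas hesitant ones are penalized more heavily, which is the signal that the optimizer uses to update the weights.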

As shown in Figure 7, the proposed network achieved an accuracy of more than 95% after 20 epochs and reached 99.7% at the end of 100 epochs. The multi-feature approach extracts features that differentiate between snoring and non-snoring data and consequently paves the way for the 1D-CNN classifier to achieve such a high classification accuracy. Moreover, the architecture of the classifier, including the parameter settings, also contributes significantly to the results. Based on these results, a deep learning model was created to detect snoring/OSA with a high classification accuracy in the presence of noise disturbances. Table 1 lists the variation in accuracy for different feature combinations. The MFCC had an extremely large impact on the results achieved, whereas the spectral features did not significantly affect the classification accuracy. Nonetheless, the combination of all seven features provided the best overall accuracy.

### 4.2 Discussion

Table 2 presents the classification results of the existing research on snoring data [6,9–11,20–23]. From Table 2, it is evident that the multi-feature extraction techniques (ZCR+STFT+MFCC) along with the RNN classifier yielded the best result of 98.8% with one of the largest data sizes [11]. Compared with this highest result of 98.8% in Table 2, the accuracy of the proposed model using MFCC+ZCR+STFT+Spectral+RMS and the 1D-CNN, presented in Table 1, is 99.7%, which is an improvement of 0.9 percentage points. When considering the error rate instead of the accuracy, the error rate was reduced from 1.2% to 0.3%, i.e., to only one-fourth. This confirms that the proposed method is highly effective in classifying data on snoring, which is a precursor to a sleep apnea transition. By contrast, the absence of a feature extraction technique may seriously affect the result of a classifier and thus reduce the classification accuracy, as cited in [22].
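The comparison in this paragraph can be written out explicitly. The accuracies are those quoted from Table 1 and Table 2; only the arithmetic is added here:

$\Delta_{acc}=99.7-98.8=0.9,\qquad \frac{100-99.7}{100-98.8}=\frac{0.3}{1.2}=0.25.$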

### 5. Conclusion

We presented a multi-feature extraction approach for the classification of snoring sounds using a CNN classifier. The features include the STFT, RMS, spectral centroid, bandwidth, RollOff, ZCR, and MFCC. The features extracted from the sound dataset were used to form a feature dataset for training. A 2D-CNN trained on spectrogram images and a 1D-CNN trained on the extracted feature set were designed to classify the sound dataset. The experimental results showed that the proposed 1D-CNN achieved an accuracy of 99.7% for snoring sounds. This indicates that the proposed method is useful for identifying snoring sounds. Our study provides a multi-feature extraction process that has the potential to extract essential features that can produce a high classification accuracy from audio data. The snoring CNN model can be adopted as an alternative and inexpensive method for diagnosing or detecting snoring/OSA. In addition, the snoring data acquisition approach can be adopted to capture audio datasets in the fields of signal processing, medicine, and music information retrieval.

In the future, we plan to expand the field of application by applying the proposed method to other datasets, such as speech, environmental, and animal sounds. For example, cat sound classification research has been previously conducted in Korea [24]. However, it is not easy to apply a deep learning approach to the classification of cat sounds owing to data sparsity. In addition, with the shift to non-face-to-face settings caused by coronavirus disease 2019 (COVID-19), the data collection method and the proposed feature extraction method are expected to be useful in various ways for work-from-home applications involving sound data.

### Fig 1.

Figure 1. A generic architecture for sound classification.

### Fig 2.

Figure 2. Overall proposed system architecture for 1D-CNN.

### Fig 3.

Figure 3. Acquisition sequence of snoring data.

### Fig 4.

Figure 4. A graphical view of MFCC feature from a snore signal.

### Fig 5.

Figure 5. Conceptual diagrams of (a) 1D-CNN model and (b) 2D-CNN model.

### Fig 6.

Figure 6. A 2D-CNN architecture using spectrogram images.

### Fig 7.

Figure 7. Classification accuracy for snoring and non-snoring using multi-feature techniques and spectrogram image.

### Fig 8.

Figure 8. A 1D-CNN architecture for snoring sound classification.

Table 1. Variation in accuracy of features using the proposed 1D-CNN on snoring dataset.

| Feature | Accuracy (%) |
| --- | --- |
| Spectral (Centroid+Bandwidth+RollOff) | 50.18 |
| STFT+RMS+Spectral+ZCR | 60.79 |
| STFT+RMS+Spectral | 89.46 |
| MFCC | 99.00 |
| MFCC+ZCR | 99.00 |
| Spectral+ZCR+MFCC | 99.20 |
| MFCC+ZCR+STFT+Spectral+RMS | 99.70 |

Table 2. Classification results in existing studies on snoring data.

| Study | Feature extraction technique | Classifier | Data size (subject) | Data size (training) | Test accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| Demir et al. [20] | LBP+HOG | SVM | - | 828 | 72.00 |
| Lim et al. [11] | ZCR+STFT+MFCC | RNN | 8 | 5600 | 98.80 |
| Kang et al. [6] | MFCC | CNN+LSTM | 24 | 24 | 88.28 |
| Arsenali et al. [9] | MFCC | RNN | 20 | 5670 | 95.00 |
| Khan [21] | MFCC | CNN | - | 1000 | 96.00 |
| Wang et al. [22] | - | Dual CNN+GRU | - | 828 | 63.80 |
| Tuncer et al. [10] | PTT signal+AlexNet+VGG16 | SVM+KNN | 100 | 100 | 92.78 |
| Dalal and Triggs [23] | SCAT+GMM+MAP | MLP | 224 | 282 | 67.71 |

GRU, gated recurrent unit; SCAT, deep scattering spectrum; GMM, Gaussian mixture model; MAP, maximum a posteriori; MLP, multilayer perceptron.

### References

1. Li, FF, and Cox, TJ (2019). Digital Signal Processing in Audio and Acoustical Engineering. Boca Raton, FL: CRC Press.
2. Khamparia, A, Gupta, D, Nguyen, NG, Khanna, A, Pandey, B, and Tiwari, P (2019). Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access. 7, 7717-7727. https://doi.org/10.1109/ACCESS.2018.2888882
3. Parra, O, Arboix, A, Montserrat, JM, Quinto, L, Bechich, S, and Garcia-Eroles, L (2004). Sleep-related breathing disorders: impact on mortality of cerebrovascular disease. European Respiratory Journal. 24, 267-272. https://doi.org/10.1183/09031936.04.00061503
4. Zhang, X, and Qiu, X (2017). Performance of a snoring noise control system based on an active partition. Applied Acoustics. 116, 283-290. https://doi.org/10.1016/j.apacoust.2016.09.032
5. Okuno, K, Furuhashi, A, Nakamura, S, Suzuki, H, Arisaka, T, and Taga, H (2019). Japanese cross-sectional multicenter survey (JAMS) of oral appliance therapy in the management of obstructive sleep apnea. International Journal of Environmental Research and Public Health. 16, article no. 3288.
6. Kang, B, Dang, X, and Wei, R. Snoring and apnea detection based on hybrid neural networks. Proceedings of 2017 International Conference on Orange Technologies (ICOT), 2017, Singapore, pp. 57-60. https://doi.org/10.1109/ICOT.2017.8336088
7. Adesuyi, TA, Kim, BM, and Shin, YS (2020). A brief on snoring data and classification methods. International Journal of Advanced Trends in Computer Science and Engineering. 9, 426-432. https://doi.org/10.30534/ijatcse/2020/59912020
8. Amiriparian, S, Gerczuk, M, Ottl, S, Cummins, N, Freitag, M, Pugachevskiy, S, Baird, A, and Schuller, B. Snore sound classification using image-based deep spectrum features. Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017, Stockholm, Sweden, pp. 3512-3516. https://doi.org/10.21437/interspeech.2017-434
9. Arsenali, B, van Dijk, J, Ouweltjes, O, den Brinker, B, Pevernagie, D, Krijn, R, van Gilst, M, and Overeem, S. Recurrent neural network for classification of snoring and non-snoring sound events. Proceedings of 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2018, Honolulu, HI, pp. 328-331. https://doi.org/10.1109/EMBC.2018.8512251
10. Tuncer, SA, Akılotu, B, and Toraman, S (2019). A deep learning-based decision support system for diagnosis of OSAS using PTT signals. Medical Hypotheses. 127, 15-22. https://doi.org/10.1016/j.mehy.2019.03.026
11. Lim, SJ, Jang, SJ, Lim, JY, and Ko, JH (2019). Classification of snoring sound based on a recurrent neural network. Expert Systems with Applications. 123, 237-245. https://doi.org/10.1016/j.eswa.2019.01.020
12. Arasi, MA, and Babu, S (2019). Survey of machine learning techniques in medical imaging. International Journal of Advanced Trends in Computer Science and Engineering. 8, 2107-2116. https://doi.org/10.30534/ijatcse/2019/39852019
13. Huang, J, Wei, Y, Yi, J, and Liu, M. An improved kNN based on class contribution and feature weighting. Proceedings of 2018 10th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), 2018, Changsha, China, pp. 313-316. https://doi.org/10.1109/ICMTMA.2018.00083
14. Kiranyaz, S, Avci, O, Abdeljaber, O, Ince, T, Gabbouj, M, and Inman, DJ (2019). 1D convolutional neural networks and applications: a survey. [Online]. Available: https://arxiv.org/abs/1905.03554
15. McFee, B, Raffel, C, Liang, D, Ellis, DP, McVicar, M, Battenberg, E, and Nieto, O. librosa: audio and music signal analysis in Python. Proceedings of the 14th Python in Science Conference, 2015, pp. 18-25. https://doi.org/10.25080/majora-7b98e3ed-003
16. Faizy, Jal (2017). 10 Advanced deep learning architectures data scientists should know! [Online]. Available: https://www.analyticsvidhya.com/blog/2017/08/10-advanced-deep-learning-architectures-data-scientists/
17. Adesuyi, TA (2020). A Convolutional Neural Network Model for Sound Classification Based on Multi-Feature Extraction. PhD dissertation. Department of Software Engineering, Kumoh National University of Technology, Korea.
18. Yu, S, You, X, Ou, W, Jiang, X, Zhao, K, Zhu, Z, Mou, Y, and Zhao, X (2016). STFT-like time frequency representations of nonstationary signal with arbitrary sampling schemes. Neurocomputing. 204, 211-221. https://doi.org/10.1016/j.neucom.2015.08.130
19. Demir, F, Sengur, A, Cummins, N, Amiriparian, S, and Schuller, B. Low level texture features for snore sound discrimination, pp. 413-416. https://doi.org/10.1109/EMBC.2018.8512459
20. Khan, T (2019). A deep learning model for snoring detection and vibration notification using a smart wearable gadget. Electronics. 8, article no. 987.
21. Wang, J, Stromfeli, H, and Schuller, BW. A CNN-GRU approach to capture time-frequency pattern interdependence for snore sound classification. Proceedings of 2018 26th European Signal Processing Conference (EUSIPCO), 2018, Rome, Italy, pp. 997-1001. https://doi.org/10.23919/EUSIPCO.2018.8553521
22. Dalal, N, and Triggs, B. Histograms of oriented gradients for human detection. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, San Diego, CA, pp. 886-893. https://doi.org/10.1109/CVPR.2005.177
23. Pandeya, YR, and Lee, J (2018). Domestic cat sound classification using transfer learning. International Journal of Fuzzy Logic and Intelligent Systems. 18, 154-160. https://doi.org/10.5391/IJFIS.2018.18.2.154