Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(4): 399-408

Published online December 25, 2023

https://doi.org/10.5391/IJFIS.2023.23.4.399

© The Korean Institute of Intelligent Systems

WSOLA for Reconstruction of Prolonged Speech Signal

K. B. Drakshayini1, M. A. Anusuya2, and H. Y. Vani3

1Visvesvaraya Technological University (VTU), Belgaum, India
2Department of Computer Science and Engineering, JSS Science and Technological University, Mysuru, India
3Department of Information Science and Engineering, JSS Science and Technological University, Mysuru, India

Correspondence to:
K. B. Drakshayini (drakshakb@gmail.com)

Received: July 5, 2023; Accepted: September 1, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Stuttering is one of the most common fluency disorders across all age groups. In this work, we propose a novel approach for reconstructing speech signals after the prolongation detection and correction phase by applying waveform similarity overlap-add (WSOLA). Further processing of the speech signal after prolongation detection and correction ensures sufficient signal continuity at segment joins by requiring maximal similarity to the natural continuity of the input signal. This work makes a major contribution towards improving the quality of speech signals after prolongation detection and correction, with further processing of the speech signals for feature extraction followed by classification and clustering. The results of the WSOLA phase are analysed by comparing the signals before and after reconstruction using metrics such as the signal-to-noise ratio and total harmonic distortion, and with respect to different parameters such as the window time and overlap ratio. Moreover, the proposed method is applied to a wide variety of signals derived from the University College London Archive of Stuttered Speech (UCLASS) dataset. A recognition accuracy of 95% was observed for output signals processed using WSOLA and applied over the K-means, fuzzy C-means, and support vector machine classifiers.

Keywords: WSOLA, Prolongation detection and correction, Stuttered speech recognition, Cross correlation

1. Introduction

The efficiency of speech communication depends on the fluency of speech delivery. Speech disorders are common communication impediments, defined as any condition in which a person has difficulty forming speech sounds during communication [1, 2]. Among the various speech disorders, the cause of stuttering is unclear, and theories generally highlight factors such as genetic disposition, psychological factors, biological explanations, and family issues [3, 4]. For stuttered speech recognition systems operating on prolonged speech signals, the quality of the speech signal needs to be enhanced further even after prolongation detection and correction [5–10]. In this work, waveform similarity overlap-add (WSOLA) was applied before the classification/clustering methods for stuttered speech. WSOLA adjusts the level of the audio segments to be overlapped and adds them so as to maintain a consistent audio signal level. The reconstructed signal was then passed through Mel-frequency cepstral coefficient (MFCC) feature extraction and different decision-making algorithms, namely K-means, fuzzy C-means (FCM), and support vector machine (SVM).

The rest of this paper is organized as follows: Section 2 describes the methodology adopted, Section 3 discusses the validation of the proposed method, Section 4 discusses the decision-making phase, and Section 5 discusses the dataset. The results and observations are discussed in Section 6, Section 7 discusses the challenges, and Section 8 concludes the work.

1.1 Waveform Similarity Overlap-Add (WSOLA)

Algorithms for high-quality time-scale modification of speech are important for the further smoothing of stuttered speech signals, where the ability to control the apparent speaking rate is a desirable feature. The WSOLA algorithm produces high-quality speech output, is algorithmically and computationally efficient, and is robust. The principle of WSOLA relies on waveform similarity, and it operates in the time domain [11].

WSOLA was designed based on the OLA waveform editing techniques [12–15]. If τ(n) represents the desired time-warping function, the basic OLA strategy consists of excising segments at time instants τ−1(Lk) from the input signal x(n), shifting them to time instants Lk, and adding them together to form the output signal y(n).

However, when constructing the output signal in this manner, the individual segments are added incoherently. This introduces irregularities and distortions in the time-scaled result.

In contrast, WSOLA introduces a tolerance Δk on the desired time-warping function to ensure signal continuity at segment joins [12].

Considering propagation from left to right in the waveforms of Figure 1, we assume that segment (S1) was last excised from the input signal and added to the output in position (a). We then need to find a segment (S2), located approximately at time instant τ−1(Lk) in the input signal, that will produce a natural continuation of the output signal when added in position (b). Because (S1’) would follow (S1) at position (a) in a natural manner and reconstruct a portion of the original input signal, we select the segment for (b) such that it resembles (S1’) as closely as possible and is located within the prescribed tolerance interval around τ−1(Lk) in the input waveform. The position of this best segment (S2) is identified by maximizing a similarity measure (such as the cross-correlation or the cross-AMDF) between the sample sequence underlying (S1’) and the input signal. After excising (S2) and adding it in position (b), we can proceed to the next output segment, where (S2’) plays the same role as (S1’) in the previous step. A code sketch of this selection step is given below.
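As a concrete illustration, the following is a minimal NumPy sketch of this tolerance-based segment selection using cross-correlation as the similarity measure. The function name wsola, the Hann window, and the default frame length, hop size, and tolerance are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np

def wsola(x, scale, frame_len=512, hop_out=256, tolerance=128):
    """Time-scale a mono signal x by 'scale' (<1 speeds speech up, >1 slows it
    down) using waveform-similarity overlap-add. Illustrative sketch only."""
    window = np.hanning(frame_len)
    hop_in = int(round(hop_out / scale))      # analysis hop implied by the warping
    n_frames = max(1, (len(x) - frame_len - tolerance) // hop_in)
    y = np.zeros(n_frames * hop_out + frame_len)
    norm = np.zeros_like(y)

    pos = 0                                   # start of the last excised segment
    for m in range(n_frames):
        # natural continuation of the previously written segment (S1' in Fig. 1)
        target = x[pos + hop_out : pos + hop_out + frame_len]
        if len(target) < frame_len:
            break
        # search within +/- tolerance around the warped position tau^-1(L_k)
        center = m * hop_in
        best_start, best_corr = center, -np.inf
        for shift in range(-tolerance, tolerance + 1):
            start = center + shift
            if start < 0 or start + frame_len > len(x):
                continue
            corr = np.dot(x[start : start + frame_len], target)   # cross-correlation
            if corr > best_corr:
                best_corr, best_start = corr, start
        # excise the best-matching segment (S2) and overlap-add it at m * hop_out
        out = m * hop_out
        y[out : out + frame_len] += x[best_start : best_start + frame_len] * window
        norm[out : out + frame_len] += window
        pos = best_start
    return y / np.maximum(norm, 1e-8)
```

With these defaults at 8 kHz, the 512-sample frame corresponds to 64 ms; scaling factors below 1 shorten the output (faster speech) and factors above 1 stretch it.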

1.2 Literature Work

Existing works related to time-scale modification (TSM) and its variants in speech recognition, together with their applications, are tabulated in Table 1 [5, 14, 16–19].

2. Methodology

The stuttered speech recognition process is shown in Figure 2. Stuttered speech signals are first pre-processed, then passed through a hybrid method for prolongation detection and correction, reconstructed using WSOLA, and finally processed through feature extraction and decision-making for stuttered speech recognition.

The stuttered speech recognition process, with the intermediate prolongation detection, correction, and WSOLA stages, proceeds as follows:

Step 1: Read the stuttered speech signal (sampled at 8 kHz)

Step 2: Pre-emphasis is performed by filtering the speech signal with a first-order finite impulse response (FIR) filter, as sketched below
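For reference, first-order FIR pre-emphasis is commonly implemented as y[n] = x[n] − αx[n−1]; the coefficient α = 0.97 below is a typical default and an assumption here, since the paper does not state its value.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```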

Step 3: Prolongation detection and correction using spectral parameters

  • The signal is divided into frames of 200-ms duration.
  • Short-term energy, zero-crossing rate (ZCR), spectral entropy, and spectral centroid parameters are computed for each frame over the complete signal.
  • The autocorrelation function is applied to compute the similarity between adjacent frames for all the parameters.
  • A threshold value is identified and fixed for every parameter.
  • The average autocorrelation values for the parameters are computed.
  • If the autocorrelation value between adjacent frames is higher than the threshold, the frames are retained as normal speech segments; otherwise, the frame is identified as a prolonged frame.
  • The frames retained in the previous steps are treated as un-prolonged frames and used for reconstructing the speech signal (a sketch of this thresholding step follows the list).
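The following is a minimal sketch of this frame-wise thresholding, assuming a mono NumPy signal; only the energy and ZCR parameters are shown, and the normalised-correlation similarity and the 0.9 threshold are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=200):
    """Split x into non-overlapping frames of frame_ms milliseconds."""
    n = int(fs * frame_ms / 1000)
    n_frames = len(x) // n
    return x[: n_frames * n].reshape(n_frames, n)

def frame_parameters(frames):
    """Short-term energy and zero-crossing rate per frame (spectral entropy
    and centroid are omitted here for brevity)."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])

def prolonged_frame_mask(params, threshold=0.9):
    """Flag a frame as prolonged when the similarity of its parameter vector
    to the previous frame's does not exceed the threshold, following the
    retention rule stated in Step 3 (above threshold -> normal, retained)."""
    prev, curr = params[:-1], params[1:]
    num = np.sum(prev * curr, axis=1)
    den = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1) + 1e-12
    similarity = num / den
    mask = np.zeros(len(params), dtype=bool)
    mask[1:] = similarity <= threshold
    return mask

# usage: keep only the un-prolonged frames for reconstruction
# frames = frame_signal(x, fs=8000)
# mask = prolonged_frame_mask(frame_parameters(frames))
# kept = frames[~mask].reshape(-1)
```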

Step 4: Further reconstruction of the signal using WSOLA

  • The input audio signal x is divided into analysis frames xm, and the output signal y is constructed iteratively.
  • A Hann window function w is applied to the analysis frame xm, resulting in the synthesis frame ym.
  • The next analysis frame xm+1 is taken at a specified analysis hop size Ha of 0.04 from xm, and the adjusted frame is retrieved as the frame whose waveform is most similar to xm.
  • The similarity is calculated using the autocorrelation function.
  • Overlap-add is performed using the specified synthesis hop size Hs of 0.01 and an overlap ratio of 0.5 (cf. Table 4); a parameter-mapping sketch follows this list.
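The sketch below maps these parameters to sample counts and calls the wsola function sketched in Section 1.1. Interpreting the window time and hop sizes as values in seconds is an assumption (at 8 kHz a hop quoted in fractions of a millisecond would be sub-sample), and the input file name is hypothetical.

```python
from scipy.io import wavfile

# Window time and hop sizes interpreted in seconds (assumption, see note above)
WIN_TIME = 0.025   # analysis window length, cf. Table 4
HA = 0.04          # analysis hop size Ha
HS = 0.01          # synthesis hop size Hs

# "prolongation_corrected.wav" is a hypothetical file holding the Step 3 output
fs, x = wavfile.read("prolongation_corrected.wav")   # fs expected to be 8000

frame_len = int(WIN_TIME * fs)    # 200 samples at 8 kHz
hop_out = int(HS * fs)            # 80 samples
scale = HS / HA                   # implied time-scaling factor (0.25)

y = wsola(x.astype(float), scale=scale, frame_len=frame_len, hop_out=hop_out)
```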

Step 5: Feature extraction is performed to convert speech signals to parametric representations for further analysis using MFCC [20, 21].

  • The signal is divided into frames of 20 ms.
  • Multi-taper windowing is applied (number of tapers: 5).
  • For each frame, the DFT is applied to convert the signal from the time domain to the frequency domain.
  • Mel-scale filter bank analysis is performed.
  • The logarithm is taken to compress the components of the signal.
  • The DCT is applied to extract the cepstral features (a sketch of this pipeline follows the list).
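A compact sketch of this pipeline using NumPy/SciPy is given below. The paper specifies only the 20-ms frames and the five tapers; the non-overlapping framing, FFT size, DPSS taper bandwidth, filter count, and number of cepstral coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.signal.windows import dpss
from scipy.fftpack import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular mel filters spanning 0 .. fs/2."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def multitaper_mfcc(x, fs=8000, frame_ms=20, n_tapers=5, n_filters=26, n_ceps=13):
    """MFCCs with a multitaper (DPSS) spectrum estimate, per Step 5."""
    frame_len = int(fs * frame_ms / 1000)
    n_fft = 512
    tapers = dpss(frame_len, NW=3, Kmax=n_tapers)     # 5 Slepian tapers
    fb = mel_filterbank(n_filters, n_fft, fs)
    feats = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        # multitaper power spectrum: average of the tapered periodograms
        spec = np.mean([np.abs(np.fft.rfft(frame * t, n_fft)) ** 2 for t in tapers],
                       axis=0)
        log_mel = np.log(fb @ spec + 1e-10)            # mel filter bank + log
        feats.append(dct(log_mel, type=2, norm="ortho")[:n_ceps])   # DCT -> cepstra
    return np.array(feats)
```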

Step 6: Clustering using K-means/FCM/SVM

3. Parameters used for Validation of WSOLA

  • Signal-to-noise ratio (SNR): A low SNR decreases how accurately the system can recognize speech. It is the ratio of the summed squared magnitude of the signal to the summed squared magnitude of the noise [22]:

    SNR = μ²/σ²,

    where μ is the signal mean (expected value) and σ is the standard deviation of the noise.

  • Total harmonic distortion (THD): The THD is a measure of the harmonic distortion present in a signal and is defined as the ratio of the sum of the powers of all harmonic components to the power of the fundamental frequency [23]:

    THD = √(V2² + V3² + V4² + ⋯) / V1,

    where Vn is the root mean square (RMS) value of the nth harmonic voltage and V1 is the RMS value of the fundamental component. A computation sketch for both metrics follows.
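A sketch of how these two metrics can be computed is shown below. The SNR follows the stated μ²/σ² definition; the dB/dBc conversions, the windowed FFT, and the simple bin-picking around the fundamental f0 are assumptions, since the paper does not give implementation details.

```python
import numpy as np

def snr_db(x, noise):
    """SNR = mu^2 / sigma^2 (stated definition), returned in dB."""
    mu = np.mean(x)
    sigma = np.std(noise)
    return 10.0 * np.log10(mu ** 2 / (sigma ** 2 + 1e-12))

def thd_dbc(x, fs, f0, n_harmonics=4):
    """THD = sqrt(V2^2 + V3^2 + ...) / V1, estimated from FFT magnitudes at
    multiples of the fundamental f0, returned in dB relative to the carrier."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)

    def mag_at(f):
        return spec[np.argmin(np.abs(freqs - f))]     # nearest-bin magnitude

    v1 = mag_at(f0)
    harmonics = np.array([mag_at(k * f0) for k in range(2, n_harmonics + 1)])
    thd = np.sqrt(np.sum(harmonics ** 2)) / (v1 + 1e-12)
    return 20.0 * np.log10(thd + 1e-12)
```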

4. Decision Making

We considered two clustering methods and one classifier to evaluate the recognition performance on the reconstructed signal obtained from the proposed hybrid method. The MFCC features were modelled using K-means, FCM, and SVM. These are briefly described below, and the parameters considered are tabulated in Table 2.

K-means: This is a method of vector quantization that aims to partition N observations into K clusters such that each observation belongs to the cluster with the nearest mean. Here, the K-means algorithm is used to generate a vector quantization codebook for data compression. Each cluster is defined by its central vector, or centroid. Using the Euclidean distance function, the K-means algorithm partitions the data into K groups, assigning each object to its closest cluster [10].
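A minimal sketch using scikit-learn with the Table 2 setting of K = 10 is shown below; representing each utterance by its codeword-occupancy histogram is an assumption, since the paper does not describe how cluster assignments are turned into a recognition decision.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_codebook(mfcc_frames, k=10, seed=0):
    """Cluster MFCC frames into a K-entry codebook (Table 2: K = 10,
    Euclidean distance)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(mfcc_frames)
    return km   # km.cluster_centers_ is the codebook

def codeword_histogram(km, mfcc_frames):
    """Fixed-length utterance descriptor: normalised codeword occupancy."""
    labels = km.predict(mfcc_frames)
    return np.bincount(labels, minlength=km.n_clusters) / len(labels)
```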

FCM: This method introduces a fuzzification parameter that determines the degree of fuzziness of the clusters. The algorithm works by assigning a membership value to each data point for each cluster centre on the basis of the distance between the cluster centre and the data point [24]. The performance of the system depends on different inputs and settings, such as the initial centroids, the fuzziness parameter m, the termination criterion, the norm of the matrix, and the partition matrix U. The different m values considered are 1.5, 1.75, and 2.
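A from-scratch sketch of the standard FCM update equations is given below; the cluster count, tolerance, and iteration limit are illustrative, while m is the fuzzifier experimented with in this work (1.5, 1.75, and 2).

```python
import numpy as np

def fuzzy_c_means(X, c=10, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy C-means: returns membership matrix U (n x c) and centroids
    V (c x d). m is the fuzzifier controlling the degree of fuzziness."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                 # memberships sum to 1 per point
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]      # membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-10
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)),
                             axis=2)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U, V
```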

SVM: The SVM is a powerful machine learning tool that attempts to find a good separating hyperplane between two classes in a higher-dimensional space. It is a predominant technique for estimating the basic parameters of speech. Speech samples can be approximated as a linear combination of samples by minimizing the sum of squared differences between the actual speech samples and the predicted values; thus, a unique set of parameters, or predictor coefficients, can be determined. These coefficients form the basis for the linear prediction of speech samples. The importance of this method lies in its ability to provide extremely accurate estimates of speech parameters from data samples, and it may have applications in understanding the nonlinear phase characteristics of stuttered speech [25–28].
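For the classification step itself, the sketch below configures a scikit-learn SVM with the Table 2 settings (C = 1.0, polynomial kernel, degree 3, gamma = 'auto'); the feature standardisation and the cross-validated accuracy used as a stand-in for the recognition rate are assumptions.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# SVM configured as in Table 2; X is an (n_samples x n_features) matrix of
# utterance-level MFCC features and y holds the corresponding word labels.
clf = make_pipeline(StandardScaler(),
                    SVC(C=1.0, kernel="poly", degree=3, gamma="auto"))

def recognition_rate(X, y, folds=5):
    """Mean cross-validated accuracy (%), a stand-in for the recognition rate."""
    return cross_val_score(clf, X, y, cv=folds).mean() * 100.0
```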

5. Dataset

The experiments were conducted on the University College London Archive of Stuttered Speech (UCLASS) dataset. This repository comprises monologue, reading, and conversational recordings. For our simulation, 80 word-level samples were derived from the sentence recordings of the repository. These include 22 words from female speakers and 58 words from male speakers, with ages ranging from 11 to 20 years; the samples were selected to cover different ages and genders [27]. Through listening perception, vocalized and non-vocalized prolongations were identified and derived manually. The dataset considered for analysis is shown in Table 3.

6. Results and Discussion

WSOLA was applied to a speech signal uttered as ‘mmmm/ooo/thers d/aaaaaaa/y’ (actual words: mother’s day) and reconstructed using different scaling factors from 0.1 to 2.0. As shown in Figure 3, decreasing the scaling factor increases the speech rate, whereas increasing the scaling factor slows it down; the corresponding waveforms are shown in Figure 4, and a short sketch of this sweep is given below. As per the observations in Table 4, varying the window time of the overlap function, which serves as the reference for natural signal continuity, leads to better performance with reduced SNR and THD.
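Assuming the wsola sketch from Section 1.1, the scaling-factor sweep behind Figure 3 can be reproduced roughly as follows; the synthetic test tone stands in for the corrected utterance, which is not distributed with the paper.

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 150 * t)        # sustained voiced-like tone as a stand-in

for s in (0.5, 1.0, 1.5):              # subset of the 0.1-2.0 range in the text
    y = wsola(x, scale=s)              # wsola as sketched in Section 1.1
    print(f"scale {s:.1f}: {len(y) / fs:.2f} s")   # <1 shortens, >1 stretches
```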

The WSOLA reconstruction efficiency was analysed for artifacts in three categories of speech: vowel-like, consonant-like, and plosive sounds; the results are shown in Table 5. For different age groups and sexes, WSOLA performs better on vowel and consonant sounds than on plosive sounds, because it increases the sharpness of the harmonics and eliminates discontinuities in the signal by tapering the beginning and end of each frame to zero. It also reduces the spectral distortion due to the overlap.

After this further smoothening, the speech signal is passed through MFCC feature extraction and the decision-making algorithms (K-means, FCM, and SVM) to analyse the effect of WSOLA on the performance of stuttered speech recognition. An improved recognition rate compared with that obtained without WSOLA was observed, with the highest recognition rate of 95% achieved using SVM, as shown in Table 6 and Figure 5.

7. Challenges

Selecting the proper scaling rate for prolonged, highly polyphonic speech signals is a challenge. Different types of speech signals require WSOLA to be processed with different scaling factors for better performance. The values fixed for the reconstruction parameters also vary depending on the length and type of artifacts present in the speech signal.

8. Conclusion and Future Enhancement

This work proposes a method for smoothening speech signals after prolongation detection and correction using WSOLA. The results demonstrated a 95% recognition rate using SVM applied after WSOLA signal smoothening. This work can be extended to improve WSOLA utilization by identifying proper methods for selecting better scaling rates that apply to all prolongations with mixed phonemes.

Fig. 1. WSOLA process illustration.

Fig. 2. Stuttered speech recognition process with WSOLA.

Fig. 3. Reconstructed speech signal ‘mmmm/ooo/thers d/aaaaaaa/y’ using WSOLA for different scaling factors.

Fig. 4. Waveforms of speech signal after prolongation detection, correction, and applying WSOLA: (a) original speech signal, (b) speech signal without prolongation, and (c) speech signal after WSOLA.

Fig. 5. Analysis of speech recognition accuracy with and without WSOLA.


Table 1. Existing work on TSM and its variants in speech recognition and their application.

Study | Variant of TSM | Application | Dataset | Result
[17] | OLA, SOLA, WSOLA | Time-scale modification of speech | Speech signal | High-quality speech signal
[18] | WSOLA | Audio packet loss concealment | Speech signal | Increased sound quality
[14] | WSOLA | Computer-assisted language learning applications | Speech signal | Analyse speech in different time scales for desired applications
[16] | OLA, SOLA | Enhancement of speech recognition in case of diminished time-resolution hearing impairment | Speech signal | Low complexity and high quality of the processed signal
[5] | PSOLA | Reconstruction of smooth speech signals from stuttered speech | Stuttered speech signal | Increased sound quality
[19] | WSOLA | Time-scale modification of music signals | Music signal | Analysis of artifacts and importance of parameter choices for desired applications

Table 2. Parameter setting for decision-making and its variants.

Parameter | Value
K-means
 Number of clusters | K = 10
 Distance measure | Euclidean
FCM
 Fuzzifier | 1.39 (experimented for values from 1.5–4)
 Distance measure | Euclidean
SVM
 Regularization parameter (C) | 1.0
 Kernel | Poly (experimented for linear, poly, rbf, sigmoid, precomputed)
 Degree | 3
 Gamma | Auto

Table 3. Dataset considered for experiment.

SL. No | Actual word | Pronounced word | Age (yr) | Sex | Type of sound | Prolongation length (mm) | Derived from (.wav files)
D1 | Ball | B/aaaa/ll | 11 | F | Vowel | 20 | F_0142_11y3m_1
D2 | Different | D/iiiiiiiiii/ffernt | 11 | F | Vowel | 10 | F_0142_11y7m_1
D3 | climbed | Cl/iiiiiiiii/mbed | 8 | M | Vowel | 10 | M_0016_1_08y7m_1
D4 | Department | Depaaaaaartment | 14 | F | Vowel | 30 | F_0101_14y8m_2
D5 | Tuesday | Tueeeeesday | 16 | M | Vowel | 10 | M_0104_17y1m_1
D6 | just | Ju/ssssssss/t | 11 | F | Consonant | 20 | F_0142_11y7m_1
D7 | Every | E/vvvv/ery | 16 | M | Consonant | 20 | M_0016_07y11m_1
D8 | understanding | Unders/tttttt/anding | 14 | F | Consonant | 30 | F_0101_14y8m_2
D9 | money | Mo/nnnnn/ey | 11 | F | Consonant | 30 | F_0142_11y7m_1
D10 | Finding | Fin/dddddd/ing | 12 | M | Consonant | 20 | M_1206_12y3m_1
D11 | Moment | Momen/tttttt/ | 11 | F | Plosive | 10 | F_0142_11y3m_1
D12 | started | Starte/dddd/ | 11 | F | Plosive | 30 | F_0142_11y3m_1
D13 | step | Ste/pppppppp/ | 10 | M | Plosive | 30 | M_1202_10y11m_1
D14 | part | Par/tttt/ | 7 | M | Plosive | 10 | M_0016_07y11m_1
D15 | called | Calle/ddddd/ | 8 | M | Plosive | 10 | M_0016_08y3m_1

Table 4. Analysis of WSOLA for the signal ‘mmmm/ooo/thers d/aaaaaaa/y’ with varying window times and overlap ratios.

Win time | Overlap ratio | SNR (dB) | THD (dBc)
0.025 | 0.5 | −4.438380959 | −9.138215347
0.026 | | −3.005498556 | −8.735852625
0.027 | | −3.370656333 | −7.979020044
0.028 | | −3.222105032 | −10.80162906
0.03 | | −3.508334592 | −9.41019937
0.031 | | −2.34569927 | −9.221703967
 | | −1.957414329 | −8.942617797
0.035 | | −3.119797024 | −6.998573144
0.036 | | −3.363501115 | −5.087157297
0.038 | | −4.137424842 | −6.805353737

Table 5. Analysis of WSOLA for different types of speech.

Category of speech | Actual word | Pronounced word | Age (yr) | Sex | SNR before WSOLA (dB) | THD before WSOLA (dBc) | SNR after WSOLA (dB) | THD after WSOLA (dBc)
Vowel sound | Kid | /k/iiiiiii/d | 9 | Female | −4.4384 | −9.1382 | −3.5083 | −9.4102
Consonant sound | Fine | Fi/nnnnn/e | 9 | Female | −5.0055 | −8.7359 | −2.4386 | −6.3295
Plosive sound | Bold | Bol/dddddd/ | 9 | Female | −6.3707 | −7.9790 | −6.0148 | −6.8973

Vowel sound | Kid | /k/iiiiiii/d | 7 | Male | −5.4384 | −9.1382 | −4.5083 | −9.4102
Consonant sound | Fine | Fi/nnnnn/e | 7 | Male | −6.0055 | −8.7359 | −2.4386 | −6.3295
Plosive sound | Bold | Bol/dddddd/ | 7 | Male | −6.3707 | −7.9790 | −5.0148 | −6.8973

Vowel sound | Kid | /k/iiiiiii/d | 35 | Male | −8.4384 | −9.1382 | −7.5083 | −9.4102
Consonant sound | Fine | Fi/nnnnn/e | 35 | Male | −5.0055 | −8.7359 | −2.4386 | −6.3295
Plosive sound | Bold | Bol/dddddd/ | 35 | Male | −6.3707 | −7.9790 | −6.0148 | −6.8973

Vowel sound | Kid | /k/iiiiiii/d | 25 | Female | −4.4384 | −9.1382 | −3.5083 | −9.4102
Consonant sound | Fine | Fi/nnnnn/e | 25 | Female | −5.0055 | −8.7359 | −2.4386 | −6.3295
Plosive sound | Bold | Bol/dddddd/ | 25 | Female | −6.3707 | −7.9790 | −6.0148 | −6.8973

Table 6. Effect of WSOLA on decision-making.

Decision-making | Category | Recognition rate without WSOLA (%) | Recognition rate with WSOLA (%)
K-means | Vowel sound | 78.13 | 85.14
K-means | Consonant sound | 79.15 | 86.12
K-means | Plosive sound | 76.19 | 84.15
FCM | Vowel sound | 85.35 | 88.34
FCM | Consonant sound | 87.12 | 88.34
FCM | Plosive sound | 86.14 | 87.45
SVM | Vowel sound | 93.12 | 95.37
SVM | Consonant sound | 94.12 | 95.17
SVM | Plosive sound | 93.14 | 96.12

References

  1. Hermansky, H (2011). Speech recognition from spectral dynamics. Sadhana. 36, 729-744. https://doi.org/10.1007/s12046-011-0044-2
  2. Vashisht, V, Pandey, AK, and Yadav, SP (2021). Speech recognition using machine learning. IEIE Transactions on Smart Processing & Computing. 10, 233-239. https://doi.org/10.5573/IEIESPC.2021.10.3.233
  3. Erickson, S, and Block, S (2013). The social and communication impact of stuttering on adolescents and their families. Journal of Fluency Disorders. 38, 311-324. https://doi.org/10.1016/j.jfludis.2013.09.003
  4. Dash, A, Subramani, N, Manjunath, T, Yaragarala, V, and Tripathi, S . Speech recognition and correction of a stuttered speech., Proceedings of 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 2018, pp.1757-1760. https://doi.org/10.1109/ICACCI.2018.8554455
  5. Deshmukh, OD, Sheth, SS, and Verma, A. Reconstruction of a smooth speech signal from a stuttered speech signal. US Patent No. 8,600,758, Dec 13, 2013. https://patents.google.com/patent/US8600758B2/en
  6. Barczewska, K, and Igras, M (2013). Detection of disfluencies in speech signal. Challenges of Modern Technology. 4, 3-10.
  7. Manjula, G, Kumar, MS, Geetha, YV, and Kasar, T (2017). Identification and validation of repetitions/prolongations in stuttering speech using epoch features. International Journal of Applied Engineering Research. 12, 11976-11980.
  8. Alharbi, S, Hasan, M, Simons, AJ, Brumfitt, S, and Green, P . A lightly supervised approach to detect stuttering in children’s speech., Proceedings of Interspeech 2018, Hyderabad, India, 2018, pp.3433-3437. https://doi.org/10.21437/Interspeech.2018-2155
  9. Suszynski, W, Kuniszyk-Jozkowiak, W, Smołka, E, and Dzienkowski, M (2003). Prolongation detection with application of fuzzy logic. Annales UMCS Informatica AI–Informatica. 1, 1-8. https://journals.umcs.pl/ai/article/view/2929
  10. Drakshayini, KB, and Anusuya, MA (2023). Repetition detection using spectral parameters and multi tapering features. Indian Journal of Computer Science and Engineering. 14, 684-696. https://doi.org/10.21817/indjcse/2023/v14i4/231404068
  11. Yeh, JF, Lin, PC, Kuo, MD, and Hsu, ZH (2013). Bilateral waveform similarity overlap-and-add based packet loss concealment for voice over IP. Journal of Applied Research and Technology. 11, 559-567.
  12. Chalamandaris, A, Tsiakoulis, P, Karabetsos, S, and Raptis, S. An efficient and robust pitch marking algorithm on the speech waveform for TD-PSOLA., Proceedings of 2009 IEEE International Conference on Signal and Image Processing Applications, Kuala Lumpur, Malaysia, 2009, pp.397-401. https://doi.org/10.1109/ICSIPA.2009.5478685
  13. Rudresh, S, Vasisht, A, Vijayan, K, and Seelamantula, CS. (2018) . Epoch-synchronous overlap-add (ESOLA) for time-and pitch-scale modification of speech signals. [Online]. Available: https://arxiv.org/abs/1801.06492
  14. Howell, P, Davis, S, and Bartrip, J (2009). The University College London Archive of Stuttered Speech (UCLASS). Journal of Speech, Language, and Hearing Research. 52, 556-569. https://doi.org/10.1044/1092-4388(2009/07-0129)
  15. Demol, M, Struyve, K, Verhelst, W, Paulussen, H, Desmet, P, and Verhoeve, P. Efficient non-uniform time-scaling of speech with WSOLA for CALL applications., Proceedings of InSTIL/ICALL Symposium, Venice, Italy, 2004, pp.1-4. https://www.isca-speech.org/archive/pdfs/icall_2004/demol04_icall.pdf
  16. Kupryjanow, A, and Czyzewski, A. Time-scale modification of speech signals for supporting hearing impaired schoolchildren., Proceedings of the Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 2009, pp.159-162.
  17. Verhelst, W, and Roelands, M . An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech., Proceedings of 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN, USA, 1993, pp.554-557. https://doi.org/10.1109/ICASSP.1993.319366
  18. Stenger, A, Younes, KB, Reng, R, and Girod, B . A new error concealment technique for audio transmission with packet loss., Proceedings of 1996 8th European Signal Processing Conference (EUSIPCO), Trieste, Italy, 1996, pp.1-4.
  19. Driedger, J, and Muller, M (2016). A review of time-scale modification of music signals. Applied Sciences. 6. article no 57. https://doi.org/10.3390/app6020057
  20. Chee, LS, Ai, OC, Hariharan, M, and Yaacob, S . MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA., Proceedings of 2009 IEEE Student Conference on Research and Development (SCOReD), Serdang, Malaysia, 2009, pp.146-149. https://doi.org/10.1109/SCORED.2009.5443210
  21. Ravikumar, K, Rajagopal, R, and Nagaraj, H (2009). An approach for objective assessment of stuttered speech using MFCC. Digital Signal Processing Journal. 9, 19-24. http://www.icgst.com/paper.aspx?pid=P1180852542#
  22. Alencar, PBAD, Lucas, PDA, De Bortoli, E, Bernert, LM, Rodrigues, LP, and Branco-Barreiro, FCA (2020). Acoustically controlled auditory training in children with speech disfluency: a case report. Revista CEFAC. 22. article no. e5420. https://doi.org/10.1590/1982-0216/20202265420
  23. Wang, H. (2020). Measurement of total harmonic distortion (THD) and its related parameters using multi-instrument. [Online]. Available: https://www.researchgate.net/publication/343107118_Measurement_of_Total_Harmonic_Distortion_THD_and_Its_Related_Parameters_using_Multi-Instrument
  24. Solera-Urena, R, Padrell-Sendra, J, Martin-Iglesias, D, Gallardo-Antolin, A, Pelaez-Moreno, C, and Diaz-de-Maria, F. (2007). SVMs for automatic speech recognition: a survey. Progress in Nonlinear Speech Processing, 190-216. https://doi.org/10.1007/978-3-540-71505-4_11
  25. Besbes, S, and Lachiri, Z . Multi-class SVM for stressed speech recognition., Proceedings of 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2016, Monastir, Tunisia, Array, pp.782-787. https://doi.org/10.1109/ATSIP.2016.7523188
  26. Kourkounakis, T, Hajavi, A, and Etemad, A. (2020) . FluentNet: end-to-end detection of speech disfluency with deep learning. [Online]. Available: https://arxiv.org/abs/2009.11394
  27. Garg, S, Mehrotra, U, Krishna, G, and Vuppala, AK (2020). Transfer learning based disfluency detection using stuttered speech. Speech Processing Laboratory, International Institute of Information Technology. Hyderabad, India.
  28. Sheikh, SA, Sahidullah, M, Hirsch, F, and Ouni, S . StutterNet: stuttering detection using time delay neural network., Proceedings of 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021, pp.426-430. https://doi.org/10.23919/EUSIPCO54536.2021.9616063

K. B. Drakshayini is a research scholar at VTU, Belgaum, working under the guidance of Dr. M. A. Anusuya on stuttered speech signal processing. She completed her M.Tech. at NIE, Mysore, and her bachelor's degree at Vidya Vardhaka College of Engineering. She has 15 years of teaching experience at Vidya Vikas Institute of Engineering and has published papers in national/international journals and conferences in her research field. Her areas of interest are speech signal processing, data science, and machine learning.

M. A. Anusuya holds M.Tech. and Ph.D. degrees in Computer Science and Engineering, with specific research interest in speech signal processing. She has 25 years of teaching experience and has published around 60 papers in international/national journals and conferences with special recognitions. She is presently working as an associate professor at JSS Science and Technology University, Mysore. Her areas of interest are pattern recognition, speech signal processing, machine learning, machine translation, and fuzzy-based mathematical modelling.

H. Y. Vani holds M.Tech. and Ph.D. degrees in Computer Science and Engineering, with specific research interest in speech signal processing. She has 25 years of teaching experience and has published around 20 papers in international/national journals and conferences with special recognitions. She is presently working as an associate professor at JSS Science and Technology University, Mysore. Her areas of interest are pattern recognition, speech signal processing, machine learning, and fuzzy logic.
