International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(4): 399-408
Published online December 25, 2023
https://doi.org/10.5391/IJFIS.2023.23.4.399
© The Korean Institute of Intelligent Systems
K. B. Drakshayini1, M. A. Anusuya2, and H. Y. Vani3
1Visvesvaraya Technological University (VTU), Belgaum, India
2Department of Computer Science and Engineering, JSS Science and Technological University, Mysuru, India
3Department of Information Science and Engineering, JSS Science and Technological University, Mysuru, India
Correspondence to: K. B. Drakshayini (drakshakb@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Stuttering is one of the most common fluency disorders across all age groups. In this work, we propose a novel approach for reconstructing speech signals after the prolongation detection and correction phase by applying waveform similarity overlap-add (WSOLA). This further processing of the speech signal ensures sufficient signal continuity at segment joins by requiring maximal similarity to the natural continuity of the input signal. The main contribution of this work is the improvement of speech-signal quality after the prolongation detection and correction phase, with subsequent feature extraction followed by classification and clustering. The results of the WSOLA phase are analysed by comparing signals before and after reconstruction using metrics such as the signal-to-noise ratio (SNR) and total harmonic distortion (THD), and with respect to parameters such as the window time and overlap ratio. The proposed method is evaluated on a wide variety of signals derived from the University College London Archive of Stuttered Speech (UCLASS) dataset. A recognition accuracy of 95% was observed for output signals processed using WSOLA and applied over the K-means, fuzzy C-means, and support vector machine classifiers.
Keywords: WSOLA, Prolongation detection and correction, Stuttered speech recognition, Cross correlation
The efficiency of speech communication depends on the fluency of speech delivery. Speech disorders are common communication impediments, defined as any condition in which a person has difficulty forming speech sounds during communication [1, 2]. Among the various speech disorders, the cause of stuttering remains unclear; theories generally highlight factors such as genetic disposition, psychological factors, biological explanations, and family issues [3, 4]. For stuttered speech recognition systems operating on prolonged speech signals, the quality of the speech signals must be further enhanced even after prolongation detection and correction [5–10]. In this work, waveform similarity overlap-add (WSOLA) was applied before the classification/clustering methods for stuttered speech. WSOLA adjusts the audio segments for overlap and adds them in order to maintain a consistent audio signal level. The reconstructed signal was then passed through Mel-frequency cepstral coefficient (MFCC) feature extraction and different decision-making algorithms, namely K-means, fuzzy C-means (FCM), and the support vector machine (SVM).
The rest of this paper is organized as follows: Section 2 describes the methodology adopted, Section 3 discusses the validation of the proposed method, Section 4 discusses the decision-making phase, and Section 5 discusses the dataset. The results and observations are discussed in Section 6, Section 7 discusses the challenges, and Section 8 concludes the work.
Algorithms for high-quality time-scale modification of speech are important for further smoothing of stuttered speech signals, where the ability to control the apparent speaking rate is a desirable feature. The WSOLA algorithm produces high-quality speech output, is algorithmically and computationally efficient, and is robust. The principle of WSOLA depends on waveform similarity, and it operates in the time domain [11].
WSOLA was designed based on the OLA waveform editing techniques [12–15]. If $x(n)$ denotes the input signal and $w(n)$ a window function, OLA excises windowed segments from $x$ at analysis instants spaced by an analysis hop $H_a$ and overlap-adds them at synthesis instants spaced by a synthesis hop $H_s$, so that the time-scaled output is $y(n) = \sum_k w(n - kH_s)\,x(n - kH_s + kH_a)$; the ratio of the two hop sizes determines the time-scaling factor.
However, when constructing the output signal in this manner, the individual segments are added incoherently. This introduces irregularities and distortions in the time-scaled result.
In contrast, WSOLA introduces a tolerance $\Delta_k \in [-\Delta_{\max}, \Delta_{\max}]$ on the excision position of each segment, $y(n) = \sum_k w(n - kH_s)\,x(n - kH_s + kH_a + \Delta_k)$, where each $\Delta_k$ is chosen so that the excised segment is maximally similar to the natural continuation of the segment that was last added to the output.
Considering propagation in the left-to-right direction in the waveforms in Figure 1, we assume that the segment (S1) was last excised from the input signal and added to the output at position (a). We then need to find a segment (S2), located approximately at the time instant of the next nominal analysis position, whose overlap region best resembles the natural continuation of S1 in the input signal. The optimal shift within the tolerance interval is found by maximizing a cross-correlation (waveform-similarity) measure, after which S2 is overlap-added to the output at position (b).
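As a concrete illustration of this segment search, the following Python snippet picks the shift whose excised segment best matches the natural continuation. This is a minimal sketch, not the authors' implementation; `x`, `template`, `pos`, and `delta_max` are placeholder names.

```python
import numpy as np

def best_shift(x, template, pos, delta_max):
    """Return the shift d in [-delta_max, delta_max] for which the segment of x
    starting at pos + d is most similar to `template` (the natural continuation),
    using normalized cross-correlation as the waveform-similarity measure."""
    n = len(template)
    best_d, best_score = 0, -np.inf
    for d in range(-delta_max, delta_max + 1):
        start = pos + d
        if start < 0 or start + n > len(x):
            continue                              # candidate falls outside the signal
        seg = x[start:start + n]
        score = np.dot(seg, template) / (np.linalg.norm(seg) * np.linalg.norm(template) + 1e-12)
        if score > best_score:
            best_d, best_score = d, score
    return best_d
```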
Existing works related to time-scale modification (TSM) and its variants in speech recognition and their applications are tabulated in Table 1 [5, 14, 16–19].
The stuttered speech recognition process is shown in Figure 2. Stuttered speech signals are fed through pre-processing, then through a hybrid method for prolongation detection and correction, then reconstructed using WSOLA and processed through feature extraction and decision-making for stuttered speech recognition.
Stuttered speech recognition with the intermediate prolongation detection, correction, and WSOLA stages proceeds as follows:
• The signal is divided into frames of 200-ms duration.
• Short-term energy, zero-crossing rate (ZCR), spectral entropy, and spectral centroid parameters are computed for each frame over the complete signal.
• The autocorrelation function is applied to compute the similarity between adjacent frames for all of these parameters.
• A threshold value is identified and fixed for every parameter.
• The average autocorrelation values over the parameters are computed.
• If the autocorrelation value between adjacent frames is higher than the threshold, the frames are retained as normal speech segments; otherwise, the frame is identified as a prolonged frame.
• The frames retained in the previous steps are treated as un-prolonged frames and are used for reconstructing the speech signal (a minimal sketch of these detection steps follows this list).
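The sketch below is one plausible Python rendering of the detection steps, using numpy only. The frame length follows the 200-ms value above; the similarity threshold is a stand-in, and the keep/flag rule mirrors the rule exactly as stated in the list.

```python
import numpy as np

def frame_features(frame, sr):
    """Short-term energy, ZCR, spectral centroid, and spectral entropy
    for a single frame (the four parameters listed above)."""
    energy = np.sum(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    power = np.abs(np.fft.rfft(frame)) ** 2
    p = power / (np.sum(power) + 1e-12)           # normalized power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    centroid = np.sum(freqs * p)
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return np.array([energy, zcr, centroid, entropy])

def detect_prolonged(signal, sr, threshold=0.5, frame_ms=200):
    """Flag frames as prolonged: per the rule stated above, adjacent-frame
    similarity above the threshold keeps the frame as normal speech."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    feats = np.array([frame_features(f, sr) for f in frames])
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
    flags = [False]                               # first frame assumed normal
    for prev, cur in zip(feats[:-1], feats[1:]):
        sim = np.dot(prev, cur) / (np.linalg.norm(prev) * np.linalg.norm(cur) + 1e-12)
        flags.append(not sim > threshold)         # True marks a prolonged frame
    return flags
```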
The WSOLA reconstruction then proceeds as follows (see the sketch after this list):
• The input audio signal $x(n)$ is segmented into overlapping analysis frames.
• A Hann window function $w(n)$ is applied to each frame.
• The next analysis frame is selected within the tolerance interval $[-\Delta_{\max}, \Delta_{\max}]$ around its nominal position.
• The similarity between the candidate frame and the natural continuation of the previous output frame is calculated using the cross-correlation function.
• The best-matching frame is overlap-added using the specified synthesis hop size ($H_s$).
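Putting these steps together, a compact, self-contained WSOLA sketch could look as follows. It is illustrative only, under assumed defaults (25-ms frames, 50% overlap, a tolerance of ±delta_max samples); the scaling convention follows the paper, where a factor below 1 increases the speech rate.

```python
import numpy as np

def wsola(x, scale, sr, frame_ms=25, overlap=0.5, delta_max=200):
    """Minimal WSOLA time-scale modification sketch: scale < 1 shortens the
    signal (faster speech), scale > 1 lengthens it (slower speech)."""
    n = int(sr * frame_ms / 1000)                 # frame length in samples
    hs = int(n * (1 - overlap))                   # synthesis hop size
    ha = max(1, int(round(hs / scale)))           # analysis hop size
    win = np.hanning(n)
    out = np.zeros(int(len(x) * scale) + 2 * n)
    norm = np.zeros_like(out)

    out[:n] += win * x[:n]                        # first frame copied as-is
    norm[:n] += win
    prev, out_pos, k = 0, hs, 1

    while (k * ha + delta_max + n <= len(x) and prev + hs + n <= len(x)
           and out_pos + n <= len(out)):
        template = x[prev + hs : prev + hs + n]   # natural continuation of S1
        target = k * ha                           # nominal analysis position
        best_d, best_s = 0, -np.inf
        for d in range(max(-delta_max, -target), delta_max + 1):
            s = np.dot(x[target + d : target + d + n], template)
            if s > best_s:
                best_d, best_s = d, s
        prev = target + best_d                    # chosen excision point for S2
        out[out_pos : out_pos + n] += win * x[prev : prev + n]
        norm[out_pos : out_pos + n] += win
        out_pos, k = out_pos + hs, k + 1

    return out[:out_pos] / np.maximum(norm[:out_pos], 1e-12)
```

For brevity the similarity measure here is an unnormalized cross-correlation; the normalized form shown earlier behaves more robustly when segment energies differ.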
MFCC feature extraction proceeds as follows (a minimal sketch follows the equation below):
• The signal is divided into frames of 20 ms.
• Multi-taper windowing is applied (number of tapers: 5).
• For each frame, the DFT is applied to convert the signal from the time domain to the frequency domain.
• Mel-scale filter bank analysis is performed.
• The logarithm is taken to compress the dynamic range of the filter bank energies.
• The DCT is applied to extract the cepstral features.
Here, the cepstral coefficients are obtained as $c_n = \sum_{m=1}^{M} \log(S_m)\cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right]$, $n = 1, \ldots, N$, where $S_m$ is the energy output of the $m$-th mel-scale filter, $M$ is the number of filters, and $N$ is the number of cepstral coefficients retained.
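A Python sketch of these MFCC steps is given below; for brevity it applies a standard Hann window per frame instead of the 5-taper multitapering mentioned above, and borrows librosa's mel filter bank. The frame length, filter count, and number of coefficients are assumed values.

```python
import numpy as np
import librosa                      # used here only for its mel filter bank
from scipy.fftpack import dct

def mfcc_features(signal, sr=16000, frame_ms=20, n_mels=26, n_ceps=13):
    """Frame -> window -> DFT -> mel filter bank -> log -> DCT (the steps above)."""
    n = int(sr * frame_ms / 1000)
    hop = n // 2
    n_fft = int(2 ** np.ceil(np.log2(n)))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    win = np.hanning(n)
    ceps = []
    for i in range(0, len(signal) - n + 1, hop):
        frame = signal[i:i + n] * win
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2    # DFT -> power spectrum
        mel_energies = mel_fb @ power                     # mel-scale filter bank
        log_mel = np.log(mel_energies + 1e-12)            # log compression
        ceps.append(dct(log_mel, norm='ortho')[:n_ceps])  # DCT -> cepstral coeffs
    return np.array(ceps)                                 # shape: (frames, n_ceps)
```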
We considered two clustering methods and one classifier to calculate the recognition performance of the reconstructed signal obtained from the proposed hybrid method. The MFCC features were modelled using K-means, FCM, and SVM. These are briefly described below, and the parameters considered are tabulated in Table 2.
K-means: This is a method of vector quantization that aims to partition ‘n’ observations into ‘k’ clusters such that each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
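A hypothetical usage sketch with scikit-learn follows; the cluster count of 15 (one per word class in Table 3) and the random stand-in features are assumptions, not the paper's setup. Euclidean distance, as in Table 2, is the KMeans default.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((400, 13))   # stand-in for real MFCC vectors
km = KMeans(n_clusters=15, n_init=10, random_state=0).fit(train_feats)

test_frame = rng.standard_normal((1, 13))
print(km.predict(test_frame))                  # index of the nearest cluster centre
```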
FCM: This method introduces a fuzzification parameter that determines the degree of fuzziness of the clusters. The algorithm works by assigning a membership to each data point for each cluster centre on the basis of the distance between the cluster centre and the data point [24]. The performance of the system depends on different inputs and settings, such as the centroids, the fuzziness parameter m, the termination criterion, the norm of the matrix, and the partition matrix U. The different m values considered were 1.5, 1.75, and 2.
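A minimal self-contained FCM sketch (numpy only) showing the standard membership and centre updates; the fuzzifier m, iteration limit, and tolerance below are illustrative defaults rather than the paper's settings.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Minimal fuzzy C-means: membership u[i, k] of point i in cluster k depends
    on its distance to each centre, raised to the power 2/(m-1) (m = fuzzifier)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                    # memberships sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted centre update
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)))               # u ~ d^(-2/(m-1)), then
        U_new /= U_new.sum(axis=1, keepdims=True)        # row-normalized
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return centres, U
```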
SVM: The SVM is a powerful machine learning tool that attempts to find a good separating hyperplane between two classes in a higher-dimensional space. It is a predominant technique for estimating the basic parameters of speech. Speech samples can be approximated as a linear combination of samples by minimizing the sum of squared differences between the actual and predicted speech samples; a unique set of predictor coefficients can thus be determined, and these coefficients form the basis for the linear prediction of speech samples. The importance of this method lies in its ability to provide highly accurate estimates of speech parameters from data samples, and it may have applications in understanding the nonlinear phase characteristics of stuttered speech [25–28].
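A hypothetical training sketch using the SVM settings from Table 2 (polynomial kernel, degree 3, C = 1.0, gamma = 'auto'); the features and labels below are random placeholders for the real per-word MFCC data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 13))              # 80 word-level samples (cf. Section 5)
y = rng.integers(0, 15, size=80)               # 15 word classes, as in Table 3

clf = SVC(C=1.0, kernel='poly', degree=3, gamma='auto').fit(X, y)
print(clf.predict(X[:5]))                      # predicted word-class indices
```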
The experiments were conducted on the University College London Archive of Stuttered Speech (UCLASS) dataset, a repository comprising monologue, reading, and conversational recordings. For our simulation, 80 word-level samples were derived from the sentence recordings of the repository: 22 words from female speakers and 58 words from male speakers, with ages ranging from 11 to 20 years. The samples were selected to cover different ages and genders [27]. Through listening perception, vocalized and non-vocalized prolongations were identified and extracted manually. The dataset considered for analysis is shown in Table 3.
WSOLA was applied over a speech signal uttered as ‘mmmm/ooo/thers d/aaaaaaa/y’ (actual word: ‘mother’s day’) and reconstructed using different scaling factors from 0.1 to 2.0. As observed in Figure 3, decreasing the scaling factor increases the speech rate, whereas increasing it slows the speech rate; the corresponding waveforms are analysed in Figure 4. As per the observations in Table 4, varying the window time changes how well the overlap serves as the reference for natural signal continuity, and suitable settings lead to better performance in terms of SNR and THD.
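The paper does not spell out how its SNR and THD figures are computed; the sketch below shows one plausible way to obtain comparable numbers (an equal-length waveform SNR and a spectral-peak harmonic ratio for THD, both assumptions on our part).

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between a reference and a reconstructed signal,
    truncated to a common length."""
    n = min(len(reference), len(estimate))
    noise = reference[:n] - estimate[:n]
    return 10 * np.log10(np.sum(reference[:n] ** 2) / (np.sum(noise ** 2) + 1e-12))

def thd_dbc(signal, n_harmonics=5):
    """Rough THD in dBc: total magnitude at harmonics of the strongest
    spectral peak, relative to that fundamental peak."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    f0_bin = int(np.argmax(spec[1:])) + 1                # fundamental peak bin
    harmonics = [spec[min(k * f0_bin, len(spec) - 1)] for k in range(2, n_harmonics + 2)]
    return 20 * np.log10(np.sqrt(np.sum(np.square(harmonics))) / (spec[f0_bin] + 1e-12))
```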
The WSOLA reconstruction efficiency was analysed for artifacts in three categories of speech, vowel-like, consonant-like, and plosive sounds, and the results are shown in Table 5. Across different age groups and sexes, WSOLA performs better on vowel and consonant sounds than on plosive sounds, because it increases the sharpness of the harmonics and eliminates discontinuities in the signal by tapering the beginning and end of each frame to zero. It also reduces the spectral distortion due to the overlap.
After this further smoothing, the speech signal passes through MFCC feature extraction and the decision-making algorithms (K-means, FCM, and SVM) to analyse the effect of WSOLA on the performance of stuttered speech recognition. An improved recognition rate compared to processing without WSOLA was observed, with the highest recognition rate of 95% obtained using SVM, as shown in Table 6 and Figure 5.
Selecting the proper scaling rate for prolonged, highly polyphonic speech signals is a challenge: different types of speech signals require WSOLA to be applied with different scaling factors for the best performance. The parameter values used for reconstruction also vary depending on the length and type of artifacts present in the speech signal.
This work proposes a method for smoothing speech signals after prolongation detection and correction using WSOLA. The results demonstrated a 95% recognition rate using SVM applied after WSOLA signal smoothing. This work can be extended by identifying methods for selecting better scaling rates that apply to all prolongations with mixed phonemes.
No potential conflict of interest relevant to this article was reported.
Figure 1. WSOLA process illustration.
Figure 2. Stuttered speech recognition process with WSOLA.
Figure 3. Reconstructed speech signal ‘mmmm/ooo/thers d/aaaaaaa/y’ using WSOLA for different scaling factors.
Figure 4. Waveforms of the speech signal after prolongation detection, correction, and applying WSOLA: (a) original speech signal, (b) speech signal without prolongation, and (c) speech signal after WSOLA.
Figure 5. Analysis of speech recognition accuracy with and without WSOLA.
Table 1. Existing work on TSM and its variants in speech recognition and their application.
Study | Variant of TSM | Application | Dataset | Result |
---|---|---|---|---|
[17] | OLA, SOLA, WSOLA | Time-scale modification of speech | Speech signal | High-quality speech signal |
[18] | WSOLA | Audio packet loss concealment | Speech signal | Increased sound quality |
[14] | WSOLA | Computer assisted language learning applications | Speech signal | Analyse speech in different time scales for desired applications |
[16] | OLA, SOLA | Enhancement of speech recognition in case of diminished time-resolution hearing impairment | Speech signal | Low complexity and high quality of the processed signal |
[5] | PSOLA | Reconstruction of smooth speech signals from stuttered speech | Stuttered speech signal | Increased sound quality |
[19] | WSOLA | Time-scale modification of music signals | Music signal | Analysis of artifacts and importance of parameter choices for desired applications |
Table 2. Parameter setting for decision-making and its variants.
Parameter | Value |
---|---|
K-means | |
Number of clusters | |
Distance measure | Euclidean |
FCM | |
Fuzzifier | 1.39 (experimented for values from 1.5–4) |
Distance measure | Euclidean |
SVM | |
Regularization parameter (C) | 1.0 |
Kernel | Poly (experimented for linear, poly, rbf, sigmoid, precomputed) |
Degree | 3 |
Gamma | Auto |
Table 3. Dataset considered for experiment.
Sl. no. | Actual word | Pronounced word | Age (yr) | Sex | Type of sound | Prolongation length (mm) | Derived from (.wav files) |
---|---|---|---|---|---|---|---|
D1 | Ball | B/aaaa/ll | 11 | F | Vowel | 20 | F_0142_11y3m_1 |
D2 | Different | D/iiiiiiiiii/ffernt | 11 | F | 10 | F_0142_11y7m_1 | |
D3 | climbed | Cl/iiiiiiiii/mbed | 8 | M | 10 | M_0016_1_08y7m_1 | |
D4 | Department | Depaaaaaartment | 14 | F | 30 | F_0101_14y8m_2 | |
D5 | Tuesday | Tueeeeesday | 16 | M | 10 | M_0104_17y1m_1 | |
D6 | just | Ju/ssssssss/t | 11 | F | Consonant | 20 | F_0142_11y7m_1 |
D7 | Every | E/vvvv/ery | 16 | M | 20 | M_0016_07y11m_1 | |
D8 | understanding | Unders/tttttt/anding | 14 | F | 30 | F_0101_14y8m_2 | |
D9 | money | Mo/nnnnn/ey | 11 | F | 30 | F_0142_11y7m_1 | |
D10 | Finding | Fin/dddddd/ing | 12 | M | 20 | M_1206_12y3m_1 | |
D11 | Moment | Momen/tttttt/ | 11 | F | Plosive | 10 | F_0142_11y3m_1 |
D12 | started | Starte/dddd/ | 11 | F | 30 | F_0142_11y3m_1 | |
D13 | step | Ste/pppppppp/ | 10 | M | 30 | M_1202_10y11m_1 | |
D14 | part | Par/tttt/ | 7 | M | 10 | M_0016_07y11m_1 | |
D15 | called | Calle/ddddd/ | 8 | M | 10 | M_0016_08y3m_1 |
Table 4. Analysis of WSOLA for the signal ‘mmmm/ooo/thers d/aaaaaaa/y’ with varying window times and overlap ratios.
Win time | Overlap_ratio | SNR (dB) | THD (dBc) |
---|---|---|---|
0.025 | 0.5 | −4.438380959 | −9.138215347 |
0.026 | −3.005498556 | −8.735852625 | |
0.027 | −3.370656333 | −7.979020044 | |
0.028 | −3.222105032 | −10.80162906 | |
0.03 | −3.508334592 | −9.41019937 | |
0.031 | −2.34569927 | −9.221703967 | |
−1.957414329 | −8.942617797 | ||
0.035 | −3.119797024 | −6.998573144 | |
0.036 | −3.363501115 | −5.087157297 | |
0.038 | −4.137424842 | −6.805353737 |
Table 5. Analysis of WSOLA for different types of speech.
Categories of speech | Actual word | Pronounced word | Age (yr) | Sex | Before WSOLA | After WSOLA | ||
---|---|---|---|---|---|---|---|---|
SNR (dB) | THD (dBc) | SNR (dB) | THD (dBc) | |||||
Vowel sound | Kid | /k/iiiiiii/d | 9 | Female | −4.4384 | −9.1382 | −3.5083 | −9.4102 |
Consonant sound | Fine | Fi/nnnnn/e | −5.0055 | −8.7359 | −2.4386 | −6.3295 | ||
Plosive sound | Bold | Bol/dddddd | −6.3707 | −7.9790 | −6.0148 | −6.8973 | ||
Vowel sound | Kid | /k/iiiiiii/d | 7 | Male | −5.4384 | −9.1382 | −4.5083 | −9.4102 |
Consonant sound | Fine | Fi/nnnnn/e | −6.0055 | −8.7359 | −2.4386 | −6.3295 | ||
Plosive sound | Bold | Bol/dddddd | −6.3707 | −7.9790 | −5.0148 | −6.8973 | ||
Vowel sound | Kid | /k/iiiiiii/d | 35 | Male | −8.4384 | −9.1382 | −7.5083 | −9.4102 |
Consonant sound | Fine | Fi/nnnnn/e | −5.0055 | −8.7359 | −2.4386 | −6.3295 | ||
Plosive sound | Bold | Bol/dddddd | −6.3707 | −7.9790 | −6.0148 | −6.8973 | |
Vowel sound | Kid | /k/iiiiiii/d | 25 | Female | −4.4384 | −9.1382 | −3.5083 | −9.4102 |
Consonant sound | Fine | Fi/nnnnn/e | −5.0055 | −8.7359 | −2.4386 | −6.3295 | ||
Plosive sound | Bold | Bol/dddddd | −6.3707 | −7.9790 | −6.0148 | −6.8973 |
Table 6. Effect of WSOLA on decision-making.
Decision-making | Categories | Recognition rate (%) | |
---|---|---|---|
Without WSOLA | With WSOLA | ||
K-Means | Vowel sound | 78.13 | 85.14 |
Consonant sound | 79.15 | 86.12 | |
Plosive sound | 76.19 | 84.15 | |
FCM | Vowel sound | 85.35 | 88.34 |
Consonant sound | 87.12 | 88.34 | |
Plosive sound | 86.14 | 87.45 | |
SVM | Vowel sound | 93.12 | 95.37 |
Consonant sound | 94.12 | 95.17 | |
Plosive sound | 93.14 | 96.12 |