A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. The localization of phoneme boundaries is particularly useful in speech recognition: it enables more accurate acoustic models, since phoneme lengths provide additional information for parametrization. Our method relies on the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios that are typical for phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function, which represents the change of the energy distribution in the frequency domain; its exact definition is given in the paper. The final decision on the localization of boundaries is made by analyzing the event function, so boundaries are extracted using information from all subbands. The method was developed on a small set of hand-segmented Polish words and tested on another large corpus containing 16 425 utterances. Recall and precision measures specifically designed to assess the quality of speech segmentation were adapted using fuzzy sets, yielding an F-score of 72.49%.
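As an illustration, a minimal sketch of this idea could look as follows. This is not the authors' exact algorithm: the event function below simply pools normalized first-derivative magnitudes across the six detail subbands, and the wavelet choice, smoothing window, and peak-picking thresholds are all assumptions.

```python
# Illustrative sketch: candidate phoneme boundaries from DWT subband
# power envelopes and their first derivatives (parameters are assumptions).
import numpy as np
import pywt
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks

def boundary_candidates(signal, wavelet="db4", levels=6, smooth=25):
    # Decompose into `levels` detail subbands plus one approximation band.
    coeffs = pywt.wavedec(signal, wavelet, level=levels)
    event = np.zeros(len(signal))
    for band in coeffs[1:]:                           # six detail subbands
        power = uniform_filter1d(band ** 2, smooth)   # smoothed power envelope
        upsampled = np.interp(np.linspace(0, 1, len(signal)),
                              np.linspace(0, 1, len(power)), power)
        d = np.abs(np.gradient(upsampled))            # first-derivative magnitude
        event += d / (d.max() + 1e-12)                # normalize, pool across bands
    # Peaks of the pooled "event function" mark candidate boundaries.
    peaks, _ = find_peaks(event, distance=200, height=1.0)
    return peaks, event
```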
The human voice is one of the basic means of communication, and it also easily conveys the speaker's emotional state. This paper presents experiments on emotion recognition in human speech based on the fundamental frequency. The AGH Emotional Speech Corpus was used; this database consists of audio samples of seven emotions acted by 12 speakers (6 female and 6 male). We explored phrases of all the emotions, both all together and in various combinations. The fast Fourier transform and magnitude spectrum analysis were applied to extract the fundamental tone from the speech audio samples. After extracting several statistical features of the fundamental frequency, we studied whether they carry information on the emotional state of the speaker by applying different AI methods. The data were analyzed with the following classifiers from the WEKA data mining toolkit: K-Nearest Neighbours with local induction, Random Forest, Bagging, JRip, and the Random Subspace Method. The results show that the fundamental frequency is a promising choice for further experiments.
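As an illustration of the extraction step, the following sketch picks a per-frame F0 from the FFT magnitude spectrum and computes a few statistical descriptors. The frame length, search band, and feature names are assumptions, and taking the strongest bin can occasionally land on a harmonic.

```python
# Minimal sketch: per-frame F0 from the FFT magnitude spectrum, then
# simple statistics of the F0 contour as candidate emotion features.
import numpy as np

def f0_features(x, sr, frame=2048, hop=512, fmin=60.0, fmax=400.0):
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)          # plausible F0 range
    f0 = []
    for start in range(0, len(x) - frame, hop):
        spec = np.abs(np.fft.rfft(x[start:start + frame] * np.hanning(frame)))
        if spec[band].max() > 1e-6:                   # skip near-silent frames
            f0.append(freqs[band][np.argmax(spec[band])])
    if not f0:
        return None                                   # no voiced frames found
    f0 = np.asarray(f0)
    # Statistical descriptors of the F0 contour (descriptor set is ours).
    return {"mean": f0.mean(), "std": f0.std(),
            "range": f0.max() - f0.min(), "median": np.median(f0)}
```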
Speakers' emotional states are recognized from speech signals corrupted by additive white Gaussian noise (AWGN). The influence of white noise on a typical emotion recognition system is studied. The emotion classifier is implemented with a Gaussian mixture model (GMM). A Chinese speech emotion database is used for training and testing; it includes nine emotion classes (happiness, sadness, anger, surprise, fear, anxiety, hesitation, confidence, and the neutral state). Two speech enhancement algorithms are introduced for improved emotion classification. In the experiments, the Gaussian mixture model is trained on clean speech data and tested under AWGN at various signal-to-noise ratios (SNRs). Both the emotion class model and the dimensional space model are adopted for the evaluation of the emotion recognition system. In the emotion class model, the nine emotion classes are classified directly. In the dimensional space model, the arousal and valence dimensions are classified into positive or negative regions. The experimental results show that the speech enhancement algorithms consistently improve the performance of our emotion recognition system across SNRs, and that positive emotions are more likely to be misclassified as negative emotions in a white-noise environment.
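A minimal sketch of this evaluation protocol, assuming one diagonal-covariance GMM per emotion class and scikit-learn as the toolkit (the component count and the feature extraction front end are assumptions and are left out):

```python
# Sketch: per-class GMMs trained on clean features, tested on speech
# corrupted by AWGN at a chosen SNR.
import numpy as np
from sklearn.mixture import GaussianMixture

def add_awgn(x, snr_db):
    # Scale white noise so the signal-to-noise ratio equals snr_db.
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10.0))
    return x + np.random.randn(len(x)) * np.sqrt(p_noise)

def train_gmms(features_by_class, n_components=16):
    # One diagonal-covariance GMM per emotion class (component count assumed).
    return {label: GaussianMixture(n_components, covariance_type="diag").fit(feats)
            for label, feats in features_by_class.items()}

def classify(gmms, utterance_features):
    # Pick the class whose model yields the highest average log-likelihood.
    return max(gmms, key=lambda c: gmms[c].score(utterance_features))
```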
Speech emotion recognition is an important part of human-machine interaction studies. The acoustic analysis method is used for emotion recognition through speech. An emotion does not cause changes in all acoustic parameters; rather, the acoustic parameters affected by emotion vary depending on the emotion type. In this context, the emotion-based variability of acoustic parameters is still an active field of study. The purpose of this study is to investigate which acoustic parameters fear affects and the extent of their influence. For this purpose, various acoustic parameters were obtained from speech recordings containing fear and neutral emotions. The variation of these parameters across emotional states was analyzed using statistical methods, and the parameters affected by fear, together with the degree of influence, were determined. According to the results obtained, the majority of acoustic parameters affected by fear vary with the data used. However, it has been demonstrated that formant frequencies, mel-frequency cepstral coefficients, and jitter parameters can characterize the fear emotion independently of the data used.
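A sketch of the per-parameter analysis might look as follows; the specific statistical test is our illustrative choice (Mann-Whitney U), not necessarily the one used in the study:

```python
# Sketch: test whether one acoustic parameter differs between fear and
# neutral recordings, and report a crude standardized effect size.
import numpy as np
from scipy import stats

def compare_parameter(fear_values, neutral_values, alpha=0.05):
    fear = np.asarray(fear_values)
    neutral = np.asarray(neutral_values)
    # Mann-Whitney U makes no normality assumption about the parameter.
    stat, p = stats.mannwhitneyu(fear, neutral, alternative="two-sided")
    effect = (fear.mean() - neutral.mean()) / (neutral.std() + 1e-12)
    return {"p_value": p, "significant": p < alpha, "effect": effect}
```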
This study examined whether differences in reverberation time (RT) between typical sound field test rooms used in audiology clinics have an effect on speech recognition in multi-talker environments. Separate groups of participants listened to target speech sentences presented simultaneously with 0 to 3 competing sentences through four spatially separated loudspeakers in two sound field test rooms having RT = 0.6 s (Site 1: N = 16) and RT = 0.4 s (Site 2: N = 12). Speech recognition scores (SRSs) for the Synchronized Sentence Set (S3) test and subjective estimates of perceived task difficulty were recorded. The results indicate that the change in room RT from 0.4 to 0.6 s did not significantly influence SRSs in quiet or in the presence of one competing sentence. However, this small change in RT affected SRSs when 2 and 3 competing sentences were present, resulting in mean SRSs that were about 8-10% better in the room with RT = 0.4 s. Perceived task difficulty ratings increased as the complexity of the task increased, with average ratings similar across test sites for each level of sentence competition. These results suggest that site-specific normative data must be collected for sound field rooms if clinicians wish to use two or more directional speech maskers during routine sound field testing.
The aim of this work was to measure subjective speech intelligibility in an enclosure with a long reverberation time and to compare the results with objective parameters. Impulse responses (IRs) were first measured with a dummy head at different measurement points in the enclosure. The following objective parameters were calculated with the Dirac 4.1 software: reverberation time (RT), early decay time (EDT), weighted clarity (C50), and the Speech Transmission Index (STI). For the chosen measurement points, the IRs were convolved with the Polish Sentence Test (PST) and logatome tests. The PST was presented against a background of babble noise, and the speech reception threshold (SRT, i.e. the SNR yielding 50% speech intelligibility) was evaluated for those points. The relationship of sentence and logatome recognition to STI was determined. The final SRT data were found to correlate well with the STI and can be described by a psychometric function. The difference between the SRT determined without reverberation and under reverberant conditions proved to be a good measure of the effect of reverberation on speech intelligibility in a room. In addition, speech intelligibility with and without the sound amplification system installed in the enclosure was compared.
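The auralization step can be illustrated by the following sketch, which convolves dry test material with a measured IR and mixes in babble noise at a prescribed SNR (function and variable names are ours, and the babble track is assumed to be at least as long as the speech):

```python
# Sketch: auralize dry speech through a measured room impulse response,
# then add babble noise scaled to a target SNR.
import numpy as np
from scipy.signal import fftconvolve

def auralize(dry_speech, impulse_response, babble, snr_db):
    reverberant = fftconvolve(dry_speech, impulse_response)[:len(dry_speech)]
    p_s = np.mean(reverberant ** 2)                   # speech power after IR
    p_n = np.mean(babble[:len(reverberant)] ** 2)     # babble power
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return reverberant + gain * babble[:len(reverberant)]
```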
This paper describes the research behind a Large-Vocabulary Continuous Speech Recognition (LVCSR) system for the transcription of Senate speeches in the Polish language. The system comprises several components: a phonetic transcription system, language and acoustic model training systems, a Voice Activity Detector (VAD), an LVCSR decoder, and a subtitle generation and presentation system. Some of the modules relied on already available tools and some had to be built from scratch, but the authors ensured that they used the most advanced techniques available to them at the time. Finally, several experiments were performed to compare the performance of more modern and more conventional technologies.
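Purely as a hypothetical illustration of how such components could be chained (the names and interfaces below are ours, not the system's actual API):

```python
# Hypothetical glue code for the described pipeline; every interface here
# is an assumption made for illustration only.
def transcribe(audio, vad, decoder, subtitle_gen):
    segments = vad.detect_speech(audio)               # Voice Activity Detector
    hypotheses = [decoder.decode(seg) for seg in segments]  # LVCSR decoder
    return subtitle_gen.render(hypotheses)            # subtitle generation step
```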
In this paper, a new feature-extraction method is proposed to improve the robustness of speech recognition systems. This method combines the benefits of phase autocorrelation (PAC) with the bark wavelet transform. PAC uses the angle to measure correlation instead of the traditional autocorrelation measure, whereas the bark wavelet transform is a special type of wavelet transform particularly designed for speech signals. The features extracted by this combined method are called phase autocorrelation bark wavelet transform (PACWT) features. The speech recognition performance of the PACWT features is evaluated and compared to conventional mel-frequency cepstral coefficients (MFCC) on the TI-Digits database under different noise types and noise levels. This database has been divided into male and female data. The results show that the word recognition rate using the PACWT features for noisy male data (white noise at 0 dB SNR) is 60%, whereas it is 41.35% for the MFCC features under identical conditions.
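A sketch of the PAC idea, under our reading of phase autocorrelation (the angle between a frame and its circularly shifted copy replaces the raw dot product; the bark wavelet stage of the PACWT front end is not reproduced here):

```python
# Sketch: angle-based "correlation" sequence for one speech frame.
import numpy as np

def phase_autocorrelation(frame, max_lag):
    energy = np.dot(frame, frame) + 1e-12        # circular shift preserves the norm
    pac = np.empty(max_lag)
    for k in range(max_lag):
        shifted = np.roll(frame, k)
        cos_angle = np.clip(np.dot(frame, shifted) / energy, -1.0, 1.0)
        pac[k] = np.arccos(cos_angle)            # angle replaces the dot product
    return pac
```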
Affective computing studies and develops systems capable of detecting human affect. The search for universal, well-performing features for speech-based emotion recognition is ongoing. In this paper, a small set of features with support vector machines as the classifier is evaluated on the Surrey Audio-Visual Expressed Emotion database, the Berlin Database of Emotional Speech, the Polish Emotional Speech database, and the Serbian emotional speech database. It is shown that a set of 87 features can offer results on par with the state of the art, yielding average emotion recognition rates of 80.21%, 88.6%, 75.42%, and 93.41%, respectively. In addition, an experiment is conducted to explore the significance of gender in emotion recognition using random forests. Two models, trained on the first and second database, respectively, and four speakers were used to determine the effects. The feature set used in this work performs well for both male and female speakers, yielding approximately 27% average emotion recognition in both models. In addition, the emotions of female speakers were recognized 18% of the time in the first model and 29% in the second. A similar effect is seen with male speakers: the first model yields a 36%, the second a 28% average emotion recognition rate. This illustrates the relationship between the constitution of the training data and emotion recognition accuracy.
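A minimal sketch of such a classification setup, assuming a standardized feature vector fed to an RBF-kernel SVM evaluated with cross-validation (the kernel, C, and fold count are assumptions):

```python
# Sketch: SVM emotion classifier over a fixed-size feature vector
# (87 features in the paper), scored by cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def emotion_recognition_rate(features, labels):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    # Average recognition rate over the cross-validation folds.
    return cross_val_score(clf, features, labels, cv=5).mean()
```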
This paper describes a hybrid of a Deep Belief Neural Network (DBNN) and a Bidirectional Long Short-Term Memory (BLSTM) network used as an acoustic model for speech recognition. It has been demonstrated by many independent researchers that DBNNs exhibit superior performance to other known machine learning frameworks in terms of speech recognition accuracy; their superiority comes from the fact that these are deep learning networks. However, a trained DBNN is simply a feed-forward network with no internal memory, unlike Recurrent Neural Networks (RNNs), which are Turing complete and do possess internal memory, allowing them to make use of longer context. In this paper, an experiment is performed to build a hybrid of a DBNN with an advanced bidirectional RNN used to process its output. Results show that using the new DBNN-BLSTM hybrid as the acoustic model for Large Vocabulary Continuous Speech Recognition (LVCSR) increases word recognition accuracy. However, the new model has many parameters, and in some cases it may suffer performance issues in real-time applications.
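A rough sketch of the hybrid's topology in PyTorch could look as follows; the layer sizes, activations, and the plain feed-forward stack standing in for the trained DBNN are all assumptions:

```python
# Sketch: a feed-forward stack (stand-in for the trained DBNN) whose
# per-frame outputs are post-processed by a bidirectional LSTM.
import torch.nn as nn

class DBNN_BLSTM(nn.Module):
    def __init__(self, n_features, n_hidden, n_states):
        super().__init__()
        self.dbnn = nn.Sequential(                # trained DBNN behaves as a
            nn.Linear(n_features, n_hidden),      # plain feed-forward network
            nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden),
            nn.Sigmoid())
        self.blstm = nn.LSTM(n_hidden, n_hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_states)

    def forward(self, frames):                    # frames: (batch, time, feat)
        h = self.dbnn(frames)
        h, _ = self.blstm(h)                      # adds bidirectional context
        return self.out(h)                        # per-frame state scores
```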
Laughter is one of the most important paralinguistic events, and it has specific roles in human conversation. The automatic detection of laughter occurrences in human speech can aid automatic speech recognition systems as well as paralinguistic tasks such as emotion detection. In this study we apply Deep Neural Networks (DNNs) for laughter detection, as this technology is nowadays considered state-of-the-art in similar tasks such as phoneme identification. We carry out our experiments using two corpora containing spontaneous speech in two languages (Hungarian and English). Also, as we find it reasonable that not all frequency regions are required for efficient laughter detection, we perform feature selection to find a sufficient feature subset.
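As a stand-in illustration of the two ingredients (the study's actual DNN architecture and selection procedure are not reproduced), a filter-style band selection step followed by a frame-level neural classifier might be sketched as:

```python
# Sketch: keep only the most informative frequency-band features, then
# classify frames as laughter/non-laughter with a small neural network.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def laughter_detector(n_bands_kept=20):
    return make_pipeline(
        SelectKBest(f_classif, k=n_bands_kept),        # band selection
        MLPClassifier(hidden_layer_sizes=(256, 256)))  # stand-in for the DNN
```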
The same speech sounds (phones) produced by different speakers can sometimes exhibit significant differences. Therefore, it is essential to use algorithms compensating for these differences in ASR systems. Speaker clustering is an attractive solution to the compensation problem, as it requires neither long utterances nor high computational effort at the recognition stage. This report proposes a clustering method based solely on the adaptation of UBM model weights. The solution has turned out to be effective even when a very short utterance is used. The obtained improvement in frame recognition quality, measured by means of the frame error rate, is over 5%. It is noteworthy that this improvement concerns all vowels, even though the clustering discussed in this report was based only on the phoneme /a/. This indicates a strong correlation between the articulation of different vowels, which is probably related to the size of the vocal tract.
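A sketch of weight-only MAP adaptation followed by clustering, with the relevance factor and the clustering algorithm (k-means) as our assumptions:

```python
# Sketch: re-estimate only the UBM mixture weights from a short utterance,
# then cluster the adapted weight vectors across speakers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def adapt_weights(ubm: GaussianMixture, features, relevance=16.0):
    post = ubm.predict_proba(features)            # frame-level posteriors
    counts = post.sum(axis=0)                     # soft counts per component
    alpha = counts / (counts + relevance)         # MAP interpolation factor
    w = alpha * counts / len(features) + (1 - alpha) * ubm.weights_
    return w / w.sum()                            # renormalized weight vector

def cluster_speakers(weight_vectors, n_clusters):
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        np.vstack(weight_vectors))
```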
Speech emotion recognition is a meaningful yet challenging problem in a number of domains, including sentiment analysis, computer science, and pedagogy. In this study, we investigate speech emotion recognition based on the sparse partial least squares regression (SPLSR) approach in depth. We use sparse partial least squares regression to perform feature selection and dimensionality reduction on the whole set of acquired speech emotion features. With the SPLSR method, the weights of redundant and uninformative speech emotion features are shrunk to zero, while the useful and informative features are retained and passed to the following classification step. A number of tests on the Berlin database reveal that the recognition rate of the SPLSR method reaches 79.23%, which is superior to the other dimensionality reduction methods compared.
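A heavily simplified, one-component illustration of the sparse-PLS idea (the paper's full SPLSR formulation is not reproduced, and the thresholding rule here is ours):

```python
# Sketch: soft-threshold a PLS weight vector so redundant features get
# exactly zero weight, keeping only the features with nonzero weights.
import numpy as np

def sparse_pls_select(X, y, threshold=0.1):
    Xc = X - X.mean(axis=0)                       # center features and target
    yc = y - y.mean()
    w = Xc.T @ yc                                 # one PLS direction (X'y)
    w = np.sign(w) * np.maximum(np.abs(w) - threshold * np.abs(w).max(), 0.0)
    selected = np.flatnonzero(w)                  # surviving feature indices
    return selected, w / (np.linalg.norm(w) + 1e-12)
```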