In this article the authors investigated and presented the experiments on the sentence boundaries annotation from Polish speech using acoustic cues as a source of information. The main result of the investigation is an algorithm for detection of the syntactic boundaries appearing in the places of punctuation marks. In the first stage, the algorithm detects pauses and divides a speech signal into segments. In the second stage, it verifies the configuration of acoustic features and puts hypotheses of the positions of punctuation marks. Classification is performed with parameters describing phone duration and energy, speaking rate, fundamental frequency contours and frequency bands. The best results were achieved for Naive Bayes classifier. The efficiency of the algorithm is 52% precision and 98% recall. Another significant outcome of the research is statistical models of acoustic cues correlated with punctuation in spoken Polish.
This article presents an efficient method of modelling acoustic phenomena for real-time applications such as computer games. Simplified models of reflections, transmission, and medium attenuation are described along with assessments conducted by a professional sound designer. The article introduces representation of sound phenomena using digital filters for further digital audio processing.
A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. The localization of phoneme boundaries is particularly useful in speech recognition. It enables one to use more accurate acoustic models since the length of phonemes provide more information for parametrization. Our method relies on the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios that are typical for phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function, which represent the change of the energy distribution in the frequency domain. The exact definition of this method is described in the paper. The final decision on localization of boundaries is taken by analysis of the event function. Boundaries are, therefore, extracted using information from all subbands. The method was developed on a small set of Polish hand segmented words and tested on another large corpus containing 16 425 utterances. A recall and precision measure specifically designed to measure the quality of speech segmentation was adapted by using fuzzy sets. From this, results with F-score equal to 72.49% were obtained.
Reverberation is a common problem for many speech technologies, such as automatic speech recognition (ASR) systems. This paper investigates the novel combination of precedence, binaural and statistical independence cues for enhancing reverberant speech, prior to ASR, under these adverse acoustical conditions when two microphone signals are available. Results of the enhancement are evaluated in terms of relevant signal measures and accuracy for both English and Polish ASR tasks. These show inconsistencies between the signal and recognition measures, although in recognition the proposed method consistently outperforms all other combinations and the spectral-subtraction baseline.