Abstract
Background: Emotional speech synthesis is the process of adding emotion to neutral speech (potentially generated by a text-to-speech system) to make artificial human-machine interaction more human-like. It typically involves the analysis and modification of speech parameters. Existing work on speech synthesis that modifies prosody parameters does so at the sentence, word, and syllable level. However, finer-grained modification at the vowel level has not yet been explored, which motivates our work.
Objective: To explore prosody parameters at vowel level for emotion synthesis.
Methods: Our work modifies prosody features (duration, pitch, and intensity) for emotion synthesis. Specifically, it modifies the duration parameter of vowel-like and pause regions, and the pitch and intensity parameters of vowel-like regions only. The modification is gender-specific: it uses emotional speech templates stored in a database and is carried out with the Pitch Synchronous Overlap and Add (PSOLA) method.
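The region-wise rule described in the Methods can be sketched as a small planning step that decides which parameters to change for each region before any PSOLA resynthesis. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Region`, `Factors`, and `plan_modification` are hypothetical, the scale factors are illustrative placeholders rather than values from the paper, and in the described method the factors would come from the gender-specific emotional speech templates.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str     # "vowel", "pause", or "consonant" (hypothetical labels)
    start: float  # region start time in seconds
    end: float    # region end time in seconds


@dataclass
class Factors:
    # Hypothetical scale factors; assumed to be derived from a
    # gender-specific emotion template in the described method.
    duration_vowel: float
    duration_pause: float
    pitch: float
    intensity: float


def plan_modification(regions, factors):
    """Return per-region targets: duration is scaled for vowel-like and
    pause regions; pitch and intensity are scaled for vowel-like regions
    only; consonant regions are left unchanged."""
    plan = []
    for r in regions:
        duration = r.end - r.start
        pitch_scale = 1.0
        intensity_scale = 1.0
        if r.kind == "vowel":
            duration *= factors.duration_vowel
            pitch_scale = factors.pitch
            intensity_scale = factors.intensity
        elif r.kind == "pause":
            duration *= factors.duration_pause
        plan.append({
            "kind": r.kind,
            "duration": duration,
            "pitch_scale": pitch_scale,
            "intensity_scale": intensity_scale,
        })
    return plan


# Illustrative usage with placeholder values (not taken from the paper).
regions = [
    Region("consonant", 0.0, 0.1),
    Region("vowel", 0.1, 0.3),
    Region("pause", 0.3, 0.5),
]
factors = Factors(duration_vowel=0.5, duration_pause=0.5, pitch=1.2, intensity=1.5)
plan = plan_modification(regions, factors)
```

The resulting plan would then be handed to a PSOLA-based modifier, which is outside the scope of this sketch.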
Results: We compared our method with existing work on prosody modification at the sentence, word, and syllable level on the IITKGP-SEHSC database. Improvements of 8.14%, 13.56%, and 2.80% in the relative mean opinion score were obtained for the emotions angry, happy, and fear, respectively. We attribute this to two factors: (1) prosody modification at the vowel level is more fine-grained than at the sentence, word, or syllable level, and (2) prosody patterns are not generated for consonant regions, because the vocal cords do not vibrate during consonant production.
Conclusion: Our work shows that emotional speech generated using prosody modification at the vowel level is more convincing than speech generated using prosody modification at the sentence, word, and syllable level.
Keywords: Duration, emotional speech, intensity, pitch, PSOLA, prosody modification, vowel onset-offset points.
Graphical Abstract