
Recent Advances in Computer Science and Communications

ISSN (Print): 2666-2558
ISSN (Online): 2666-2566

Research Article

Synthesis of Emotional Speech by Prosody Modification of Vowel Segments of Neutral Speech

Author(s): Md Shah Fahad*, Shreya Singh, Shruti Gupta, Akshay Deepak and Abhinav

Volume 14, Issue 4, 2021

Published on: 12 November, 2019

Pages: 1226-1235 (10 pages)

DOI: 10.2174/2213275912666191112144014

Abstract

Background: Emotional speech synthesis is the process of adding emotion to neutral speech (potentially generated by a text-to-speech system) so that artificial human-machine interaction sounds more human-like. It typically involves the analysis and modification of speech parameters. Existing work on emotional speech synthesis modifies prosody parameters at the sentence, word, and syllable levels. Finer-grained modification at the vowel level has not yet been explored, which motivates our work.

Objective: To explore prosody parameters at vowel level for emotion synthesis.

Methods: Our work modifies prosody features (duration, pitch, and intensity) for emotion synthesis. Specifically, it modifies the duration parameter of vowel-like and pause regions, and the pitch and intensity parameters of vowel-like regions only. The modification is gender-specific: it uses emotional speech templates stored in a database and is carried out with the Pitch Synchronous Overlap and Add (PSOLA) method.
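
The region-specific rule described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the Segment and Template types, the plan_targets function, and all numeric factors are hypothetical placeholders, and the sketch only computes target prosody values. The actual resynthesis step (PSOLA in the paper) is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    label: str           # "vowel", "pause", or "consonant"
    duration_s: float    # segment duration in seconds
    pitch_hz: float      # mean F0; 0.0 where pitch is undefined
    intensity_db: float  # mean intensity in dB


@dataclass(frozen=True)
class Template:
    vowel_duration: float     # multiplicative factor for vowel duration
    pause_duration: float     # multiplicative factor for pause duration
    pitch: float              # multiplicative factor for vowel pitch
    intensity_gain_db: float  # additive gain for vowel intensity


# Hypothetical gender- and emotion-specific templates.
# The numbers are placeholders, NOT values from the paper.
TEMPLATES = {
    ("female", "angry"): Template(0.9, 0.8, 1.2, 3.0),
    ("male", "angry"): Template(0.9, 0.8, 1.15, 3.0),
}


def plan_targets(segments, gender, emotion):
    """Return target prosody values for each segment.

    Vowel-like regions: duration, pitch, and intensity are modified.
    Pause regions: only duration is modified.
    Consonant regions: left untouched.
    """
    t = TEMPLATES[(gender, emotion)]
    targets = []
    for s in segments:
        if s.label == "vowel":
            targets.append(Segment(
                s.label,
                s.duration_s * t.vowel_duration,
                s.pitch_hz * t.pitch,
                s.intensity_db + t.intensity_gain_db,
            ))
        elif s.label == "pause":
            targets.append(Segment(
                s.label,
                s.duration_s * t.pause_duration,
                s.pitch_hz,
                s.intensity_db,
            ))
        else:
            targets.append(s)  # consonants keep their original prosody
    return targets
```

In such a pipeline, the returned targets would be handed to a pitch-synchronous resynthesis step that applies them to the waveform. The segment labels themselves would come from vowel onset-offset detection, which is outside this sketch.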

Results: The proposed method was compared with existing work on prosody modification at the sentence, word, and syllable levels on the IITKGP-SEHSC database. The relative mean opinion score improved by 8.14%, 13.56%, and 2.80% for the emotions angry, happy, and fear, respectively. The improvement is attributed to two factors: (1) prosody modification at the vowel level is more fine-grained than at the sentence, word, or syllable level, and (2) prosody patterns are not generated for consonant regions, because the vocal cords do not vibrate during consonant production.

Conclusion: Our work shows that emotional speech generated by prosody modification at the vowel level is more convincing than emotional speech generated by prosody modification at the sentence, word, or syllable level.

Keywords: Duration, emotional speech, intensity, pitch, PSOLA, prosody modification, vowel onset-offset points.

© 2024 Bentham Science Publishers