Deep Learning-Based Methods for Speech Synthesis and Conversion
In text-to-speech (TTS) synthesis, it is desirable that the synthesized speech signals be natural and intelligible. Great advances have already been achieved, and solutions capable of synthesizing expressive speech, especially speech conveying emotion, have been proposed. However, converting speech that carries paralinguistic information such as emotion remains difficult: synthesized emotions are often confused with one another, and it is hard to transfer an emotion without also transferring the source speaker's personal characteristics. Li T. et al. (2020) recently proposed a solution to these problems aimed at improving the ability to differentiate emotions. However, that work used a proprietary database, which hinders the reproducibility of studies on this topic. Furthermore, it requires training a synthesis model from scratch, since style information cannot be inserted into already-trained models. Therefore, this work presents analyses of that solution using publicly available databases, RAVDESS and CREMA-D, together with a pretrained Mellotron model, which can exploit explicit speech information (such as pitch and rhythm) that may be useful for the analyzed tasks.
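As a concrete illustration of the "explicit speech information" mentioned above, the sketch below extracts a fundamental-frequency (F0) contour from a recorded utterance; pitch contours are one of the explicit conditioning signals Mellotron accepts. This is a minimal sketch, not the paper's pipeline: the function name, the hop size, and the RAVDESS file path are illustrative assumptions, and only standard librosa/NumPy calls are used.

```python
import librosa
import numpy as np

def extract_pitch_contour(wav_path, sr=22050, hop_length=256):
    """Extract an explicit F0 (pitch) contour, frame by frame.

    Unvoiced frames are returned as 0 Hz so the contour can be fed
    to a model as a dense conditioning sequence.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz lower bound for speech
        fmax=librosa.note_to_hz("C6"),  # ~1 kHz upper bound
        sr=sr,
        hop_length=hop_length,
    )
    return np.nan_to_num(f0)  # pyin marks unvoiced frames as NaN

# Hypothetical path following the RAVDESS file-naming convention
contour = extract_pitch_contour("RAVDESS/Actor_01/03-01-03-01-01-01-01.wav")
print(contour.shape, contour[voiced_start] if (voiced_start := int(np.argmax(contour > 0))) else contour[:5])
```

Because such a contour is computed directly from the audio rather than learned, it can be supplied to an already-trained model at inference time, which is what motivates using a pretrained Mellotron here instead of retraining a synthesizer for each style.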