Twee: A Novel Text-To-Speech Engine
Keywords: Artificial Intelligence, Natural Language Processing, Digital Signal Processing, Phoneme, Emotion

Abstract
With advances in technology and the widespread adoption of smart devices, networking and connectivity have expanded dramatically. One prominent line of research in this digital era is the development of Text-to-Speech (TTS) engines, which enable richer interactivity with prevalent smart devices. Various TTS engines are currently available, but they lack the expressive qualities of the human voice: for example, they fail to convey the sentiment, mood, or emotional state of the speaker. At present, no comprehensive TTS engine can replicate human behaviour and mannerisms with high precision and accuracy. This paper proposes a novel Text-to-Speech engine named 'Twee', whose pronunciation works in sync with real-world human intelligence. The proposed system is an interdisciplinary application that combines Natural Language Processing, Artificial Intelligence, and Digital Signal Processing to perform sentiment analysis on text through the processing of phonemes. The system works in both mono-channel and stereo modes and can generate varied effects on a voice depending on the type of communication.
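The abstract describes a pipeline in which sentiment analysis on the input text drives the expressive qualities of the synthesised voice. The paper does not publish Twee's implementation, so the following is only a minimal Python sketch of that general idea: a toy lexicon-based polarity score is mapped to hypothetical prosody parameters (pitch and speaking rate) that a synthesiser back end could consume. The word lists, scoring rule, and parameter mapping are all illustrative assumptions, not Twee's actual method.

```python
# Illustrative sketch only -- not Twee's published implementation.
# Shows the general shape of a sentiment-to-prosody pipeline:
# text -> polarity score -> prosody parameters for a synthesiser.
# The lexicons and the mapping constants below are assumptions.

POSITIVE = {"great", "happy", "love", "wonderful", "good"}
NEGATIVE = {"sad", "terrible", "hate", "bad", "awful"}


def sentiment_score(text: str) -> float:
    """Crude lexicon-based polarity in [-1.0, 1.0]."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def prosody_params(score: float) -> dict:
    """Map polarity to hypothetical pitch/rate multipliers."""
    return {
        "pitch_factor": 1.0 + 0.2 * score,  # raise pitch for positive text
        "rate_factor": 1.0 + 0.1 * score,   # speak slightly faster when positive
    }


params = prosody_params(sentiment_score("What a wonderful, happy day!"))
```

A real system would replace the lexicon lookup with a trained sentiment model and feed the resulting parameters into the phoneme-level signal-processing stage, but the data flow sketched here matches the interdisciplinary structure the abstract outlines.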
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
