Twee: A Novel Text-To-Speech Engine

Authors

  • Das D, Dept. of Computer Science & Engineering, University Institute of Technology, The University of Burdwan, Golapbag (North), Burdwan-713104, West Bengal, India
  • Hassan H, Dept. of Computer Science & Engineering, University Institute of Technology, The University of Burdwan, Golapbag (North), Burdwan-713104, West Bengal, India
  • Gupta S, Dept. of Computer Science & Engineering, University Institute of Technology, The University of Burdwan, Golapbag (North), Burdwan-713104, West Bengal, India

Keywords:

Artificial Intelligence, Natural Language Processing, Digital Signal Processing, Phoneme, Emotion

Abstract

With the advancement of technology and the widespread use of smart devices, the horizon of networking and connectivity has broadened to an unprecedented level. One prominent line of research in this digital era is the development of Text-to-Speech (TTS) engines, which offer greater interactivity with prevalent smart devices. Various TTS engines are currently available in the market, but they lack the expressive qualities of the human voice: they fail to provide credible indications of the speaker's sentiment, mood, or emotional state. Moreover, no complete TTS engine presently exists that can replicate human behaviour and mannerisms with high precision and accuracy. This paper proposes a novel Text-to-Speech engine named 'Twee' whose pronunciation works in sync with real-world human intelligence. The proposed system is an application of interdisciplinary research in which Natural Language Processing, Artificial Intelligence, and Digital Signal Processing are combined to perform sentiment analysis on text through the processing of phonemes. The system works in both mono-channel and stereo modes and is capable of generating varied voice effects depending on the type of communication.
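The pipeline the abstract describes, sentiment analysis on the input text driving the voice effects applied at synthesis time, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the toy word lexicon, the function names, and the specific sentiment-to-prosody mappings (pitch shift, speaking rate) are all hypothetical stand-ins for the NLP/AI and DSP stages the paper combines.

```python
# Toy lexicon standing in for the paper's NLP/AI sentiment-analysis stage.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "angry", "hate"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]: negative = sad, positive = happy."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return max(-1.0, min(1.0, score / len(words)))

def prosody_for(text: str) -> dict:
    """Map sentiment to prosody parameters a synthesiser back end
    could apply per phoneme: happier text -> higher pitch, faster rate."""
    s = sentiment_score(text)
    return {
        "pitch_shift_semitones": round(2.0 * s, 2),  # up to +/-2 semitones
        "rate_factor": round(1.0 + 0.15 * s, 3),     # up to +/-15% rate
        "channels": 2,                               # stereo mode
    }

if __name__ == "__main__":
    print(prosody_for("what a great happy day"))
```

The point of the sketch is only the shape of the data flow: a scalar sentiment estimate is computed first, then translated into signal-level parameters, so the same text can be rendered with different emotional colouring.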

References

[1] A. Drahota, A. Costall, V. Reddy, “The Vocal Communication of Different Kinds of Smile”, Speech Communication, Vol. 50, Issue.4, pp.278-287, 2007. doi: 10.1016/j.specom.2007.10.001

[2] W.Y. Wang, K. Georgila, “Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis”, In the Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa, HI, USA, pp.289-294, 2011.

[3] R.E. Remez, P.E. Rubin, D.B. Pisoni, T.D. Carrell, “Speech Perception without Traditional Speech Cues”, Science, New Series, Vol.212, Issue.4497, pp. 947-950, 1981. doi:10.1126/science.7233191

[4] J. Zhang, “Language Generation and Speech Synthesis in Dialogues for Language Learning”, Massachusetts Institute of Technology, pp.1-68, 2004.

[5] S. Lemmetty, “Review of Speech Synthesis Technology”, Helsinki University of Technology, pp.1-113, 1999.

[6] I.G. Mattingly, “Speech Synthesis for Phonetic and Phonological Models”, Current Trends in Linguistics, Mouton, The Hague, Vol.12, pp.2451–2487, 1974.

[7] FFmpeg Git, “FFmpeg 4.0 ‘Wu’”, last accessed 2018-07-18.

[8] Takanishi Lab Webpage, “Anthropomorphic Talking Robot Waseda Talker Series”, Retrieved from http://www.takanishi.mech.waseda.ac.jp/top/research/voice/index.htm, last accessed 2018-10-10.

[9] DeepMind Webpage, “WaveNet: A Generative Model for Raw Audio”, Retrieved from https://deepmind.com/blog/wavenet-generative-model-raw-audio/, last accessed 2018-09-08.

Published

2025-11-24

How to Cite

[1]
D. Das, H. Hassan, and S. Gupta, “Twee: A Novel Text-To-Speech Engine”, Int. J. Comp. Sci. Eng., vol. 7, no. 1, pp. 67–70, Nov. 2025.