As the accuracy of acoustic modeling has increased following the
deep neural network revolution, the synthesis quality of text-to-speech
(TTS) systems has improved significantly. However, these systems still have a
major shortcoming: a large amount of training data is required to learn the
complex nature of speech production. Typically, a conventional unit-selection
TTS system requires more than 100 hours of human recordings to faithfully
provide human-like voices to real-world applications such as AI speakers and
audiobooks.
At NAVER Corp., one of the biggest IT companies in Korea, Eunwoo
Song first introduced a speaker-adaptive training method for the neural TTS
system. His contribution significantly reduced the minimum amount of voice
recordings required from 100 hours to between 1 and 4 hours, which enabled
mass production of TTS voices. As a result, NAVER could provide high-quality
synthetic voices to customers of various NAVER services such as AI speakers, GPS
navigation, and Papago translation. For instance, the Clova Dubbing service can
combine more than 100 TTS voices with video clips and has assisted teachers in
preparing online classes during the pandemic. Nowadays, it is even possible to
create a personalized TTS voice from smartphone recordings. Specifically, he and
his team members launched a campaign called “Mother’s Voice” that will create
and provide personalized voices for 100 users who have heartwarming stories
about their family members.
Besides his work at NAVER Corp., he also shares his insights on
adaptive TTS techniques. Through lectures (Seoul National
University, KAIST, Yonsei University, Korea University, etc.) and paper
presentations (more than 30), he contributes to shared
growth with speech researchers from both industry and academia.