Twelve LINE Papers Accepted for World's Largest Speech Processing Conference INTERSPEECH 2022

2022.07.21 Technology

● Reviewers' recognition of LINE's wide-ranging research on speech recognition and synthesis technologies led to twice as many accepted papers as last year

 

TOKYO – July 5, 2022 – LINE Corporation ("LINE") is pleased to announce that twelve of its research papers have been accepted by INTERSPEECH 2022, the world's largest conference on spoken language processing.

 

Hosted by the International Speech Communication Association (ISCA), INTERSPEECH is the world's largest conference devoted to speech processing. The papers' authors will present their work at this year's 23rd edition, held September 18–22 in Incheon, Korea.

 

For INTERSPEECH 2022, LINE doubled its number of accepted papers from last year's six. Of this year's twelve papers, seven were authored by LINE researchers and the other five were co-authored with researchers from NAVER, universities, and other institutions.

 

 

LINE's R&D focus on speech processing technologies

Having positioned AI as one of its strategic businesses, LINE has continued to create new AI-driven services, such as those under its AI tech brand LINE CLOVA. The company has also been pouring effort into AI technologies and R&D in a bid to forge never-before-seen technologies and accelerate business development. Aiming to shorten the time from research and development to commercialization, particularly in the field of media processing (including speech, language, and image processing), its teams have gone beyond their own businesses and domains to cooperate on machine learning-focused R&D.

In the field of speech processing, LINE has presented influential research at top conferences on the speech recognition and synthesis technologies used in its many services. Examples of cutting-edge technologies developed by LINE researchers include Parallel WaveGAN,*1 which can quickly produce high-quality speech, and self-conditioned CTC,*2 which has been demonstrated to be the most accurate among non-autoregressive automatic speech recognition*3 models (a class of high-speed speech recognition methods). In the field of environmental audio analysis, LINE researchers won first place at the international DCASE2020 competition.

 

 

Twelve papers accepted, with subjects including augmentation methods for self-conditioned CTC and a method for building an emotional speech synthesis model from neutral data

At INTERSPEECH 2022, reviewers praised LINE's wide-ranging research on speech recognition and speech synthesis.

For research into speech recognition, papers [1] and [2] on augmenting self-conditioned CTC were accepted. Paper [1] demonstrates how to train robust speech recognition models by conditioning them on "noisy" intermediate layer predictions. Paper [2] proposes combining intermediate layer predictions with external language models and hypothesis search to enable better conditioning and improve final performance. For research into audio source separation and multi-channel signal processing, paper [3] demonstrates how high-performing speech recognition can be achieved by introducing a neural source model into time-decorrelation ISS (T-ISS), a method that uses multiple microphones to separate sound, and training the model to perform joint dereverberation and separation. In the field of neural network-based multi-channel audio source separation, paper [4] takes cues from spatial information about the audio sources obtained with classic signal processing techniques (rather than relying on the clean sound sources traditionally required by supervised learning) and demonstrates a high-performing model.
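As a rough illustration of the self-conditioned CTC idea that papers [1] and [2] build on, the following PyTorch sketch shows an encoder stack whose intermediate CTC predictions are projected back and added to the hidden states before the next layer. The class, layer sizes, and vocabulary size are illustrative assumptions, not taken from the papers.

```python
import torch
import torch.nn as nn

class SelfConditionedCTCEncoder(nn.Module):
    """Minimal sketch of self-conditioned CTC: intermediate layer
    predictions are fed back into the encoder as conditioning."""

    def __init__(self, num_layers=6, d_model=256, vocab_size=500):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        self.to_vocab = nn.Linear(d_model, vocab_size)    # shared CTC prediction head
        self.from_vocab = nn.Linear(vocab_size, d_model)  # conditioning path

    def forward(self, x):
        # x: (batch, time, d_model) acoustic features already projected to d_model
        intermediate_logits = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:                  # condition all but the last layer
                logits = self.to_vocab(x)                 # intermediate CTC prediction
                intermediate_logits.append(logits)
                x = x + self.from_vocab(logits.softmax(dim=-1))  # feed prediction back
        return self.to_vocab(x), intermediate_logits

enc = SelfConditionedCTCEncoder()
final_logits, inter_logits = enc(torch.randn(2, 100, 256))
```

In training, CTC losses on both the final and intermediate logits would typically be combined. Loosely speaking, paper [1]'s augmentation perturbs the intermediate predictions before they are fed back, while paper [2] refines them with external language models and hypothesis search; the sketch above shows only the basic conditioning loop.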

When it comes to speech synthesis research, paper [5] explores how to build an expressive text-to-speech (TTS) model when only a small amount of neutral recordings is available for a target speaker. The proposed method builds an expressive TTS model by applying pitch-shift augmentation to the target speaker's neutral recordings and an expressive speaker's recordings, allowing it to generate an emotional speaking style for the target speaker. Paper [6] examines an accent estimation method for Japanese that aims to achieve natural prosody. The method handles two different elements of the Japanese accent: accent phrases, which mark the boundaries within which accent changes occur, and accent nuclei, which mark the point in a phrase where the accent changes. Unlike the conventional two-stage approach that trains separate models to predict accent phrase boundaries and accent nucleus positions, the proposed method uses a multi-task learning framework to make both predictions simultaneously and demonstrates significant improvements in estimation accuracy and prosody naturalness. Finally, paper [7] proposes a method for training a model that can synthesize clean speech from recordings containing noise and reverberation. This method uses a regularization technique that disentangles the recording environment of the training data from its linguistic content and speaker information.
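To make the pitch-shift data augmentation mentioned for paper [5] concrete, here is a minimal librosa-based sketch; the function name, shift amounts, and sample rate are illustrative assumptions, not the paper's actual settings.

```python
import librosa

def pitch_shift_augment(wav_path, n_steps_list=(-2, -1, 1, 2), sr=22050):
    """Generate pitch-shifted copies of a recording for TTS data augmentation.

    The shift amounts (in semitones) are illustrative; the paper's actual
    augmentation settings may differ.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    return [librosa.effects.pitch_shift(y, sr=sr, n_steps=n) for n in n_steps_list]
```

Each shifted copy keeps the original linguistic content while changing the pitch contour, which is what lets a model trained on such data cover a wider range of speaking styles than the neutral recordings alone.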

 

*1 Parallel WaveGAN (PWG): A non-autoregressive speech waveform generation model based on a generative adversarial network (GAN), a type of generative machine learning model in which two neural networks are trained against each other to generate new pseudo data from input data.

*2 Self-conditioned CTC: A type of end-to-end speech recognition model that references text predicted in a neural network's intermediate layers to form a final prediction.

*3 Non-autoregressive automatic speech recognition: A method of recognizing speech at each point in time without depending on previously generated text.
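To make the non-autoregressive idea in notes *2 and *3 concrete, here is a minimal sketch of CTC greedy decoding: every frame's label is predicted independently of previously generated text, and the output is obtained by collapsing repeats and removing blanks. The token IDs and blank index below are illustrative assumptions.

```python
def ctc_greedy_decode(frame_token_ids, blank_id=0):
    """Collapse per-frame predictions into an output sequence.

    Each frame's token is predicted without looking at previously
    generated text, which is what makes the method non-autoregressive.
    """
    output, prev = [], None
    for token in frame_token_ids:
        if token != blank_id and token != prev:
            output.append(token)
        prev = token
    return output

# e.g. per-frame predictions "h h _ e _ l l _ l o" collapse to "h e l l o"
print(ctc_greedy_decode([8, 8, 0, 5, 0, 12, 12, 0, 12, 15]))  # [8, 5, 12, 12, 15]
```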

 


Accepted papers

1.      Yu Nakagome, Tatsuya Komatsu, Yusuke Fujita, Shuta Ichimura, Yusuke Kida, "InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR"

2.      Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida, "Better Intermediates Improve CTC Inference"

3.      Kohei Saijo, Robin Scheibler, "Independence-based Joint Dereverberation and Separation with Neural Source Model"

4.      Kohei Saijo, Robin Scheibler, "Spatial Loss for Unsupervised Multi-channel Source Separation"

5.      Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana, "Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation"

6.      Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana, "A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech"

7.      Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto, "DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning"

8.      Hyunwook Yoon, Ohsung Kwon, Hoyeon Lee, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim, and Min-Jae Hwang, "Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems"

9.      Eunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan-Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin-Seob Kim, Jae-Min Kim, "TTS-by-TTS 2: Data-selective Augmentation for Neural Speech Synthesis Using Ranking Support Vector Machine with Variational Autoencoder"

10.    Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, and Hiroshi Saruwatari, "STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent"

11.    Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, and Hiroshi Saruwatari, "Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History"

12.    Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe, "ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding"

Note: Papers 1–7 were authored by LINE researchers, while 8–12 were co-authored with other institutions.

 

 

About the company's AI tech brand LINE CLOVA

LINE's AI tech brand LINE CLOVA aims to create a more convenient and enriching world, improving social functions and everyday living by resolving the hidden difficulties of daily life and business with an array of AI technologies and services. Currently, LINE CLOVA offers CLOVA Speech (speech recognition) and CLOVA Voice (speech synthesis), as well as solutions that combine these technologies; LINE AiCall and CLOVA Note are two such products. The former is an AI-driven telephone answering service that can interact naturally with users and help them find what they are looking for, while the latter is an AI speech recognition app that can accurately recognize spontaneous speech in meetings and interviews and transcribe and manage the resulting text.

 

LINE CLOVA will continue striving to both enhance the quality of its existing offerings and create new features/services by proactively advancing basic research on AI tech.