Four Research Papers Selected for the World's Largest Conference on Spoken Language Processing, INTERSPEECH 2023

2023.07.20 Technology

Recognized for research on speech and environmental sound recognition

TOKYO — June 30, 2023 — LINE Corporation is pleased to announce that four of its research papers have been selected for presentation at INTERSPEECH 2023 (Dublin, Ireland), the renowned international conference on spoken language processing. The conference will be held from August 20 to 24.

Hosted by the International Speech Communication Association (ISCA), INTERSPEECH is the world's largest conference devoted to speech processing and will be held for the 24th time this year. Of the four papers, two were lead-authored by LINE researchers and the other two were co-authored with the University of Tokyo. All four papers will be presented at the upcoming conference.


Papers recognized for improving accuracy in speech recognition and environmental sound recognition

The first paper [1] proposes a speech recognition model that enhances the accuracy of keyword detection in a system combining speech recognition and keyword detection. In conventional pipelines, the speech recognition model is difficult to fine-tune specifically for keywords, so a recognition error causes the system to miss the intended keyword. The proposed method allows simultaneous tuning of both speech recognition and keyword detection by decomposing the training data into keyword and non-keyword sequences (see Figure 1). This simultaneous tuning approach has demonstrated higher overall accuracy than the conventional method in tasks such as detecting katakana nouns and numeral sequences.


(a) Detects keywords after speech recognition (b) Simultaneously fine-tunes speech recognition and keyword detection
Figure 1: Comparison of keyword detection methods
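The decomposition idea behind paper [1] can be illustrated with a small sketch. This is a hedged, illustrative example, not LINE's implementation: the keyword vocabulary, token format, and the `<kw>` placeholder are all assumptions made here for demonstration.

```python
# Illustrative sketch: splitting a teacher token sequence into keyword and
# non-keyword subsequences, so each can serve as a separate training target.
# The keyword set below is hypothetical.
KEYWORDS = {"0", "1", "2"}

def decompose(tokens):
    """Return (keyword subsequence, non-keyword subsequence).

    Keyword tokens are collected into their own stream; in the
    non-keyword stream they are replaced by a placeholder token.
    """
    keyword_seq, other_seq = [], []
    for tok in tokens:
        if tok in KEYWORDS:
            keyword_seq.append(tok)
            other_seq.append("<kw>")  # marks where a keyword occurred
        else:
            other_seq.append(tok)
    return keyword_seq, other_seq

kw, rest = decompose(["call", "me", "at", "0", "1", "2"])
print(kw)    # keyword subsequence
print(rest)  # non-keyword subsequence with placeholders
```

In a multi-task setup, the two subsequences would supervise the keyword-detection and speech-recognition branches respectively, so that both are tuned from the same utterance.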

The second paper [2] proposes a novel method for separating moving sound sources in the context of speech and sound event recognition. Conventional methods typically segment the input signal into short windows within which the sources are assumed static; this windowing degrades separation quality, making it difficult to recognize speech or, for example, the sound of footsteps. Instead, the proposed method introduces a self-attention mechanism into the separation module that uses the full length of the input signal, significantly improving separation performance (see Figure 2). In particular, when separating two moving sources, the proposed method substantially improves the word error rate of automatic speech recognition and the sound event recognition performance compared to the conventional method.


Figure 2: Comparison of self-attention mechanisms that follow the moving sources
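To make the full-sequence idea concrete, the following is a minimal, hedged sketch of scaled dot-product self-attention applied across every time frame of a signal at once, which is what lets attention track a source as it moves, rather than a short static window. The single-head setup, shapes, and random weights are illustrative assumptions, not the paper's model.

```python
import numpy as np

def self_attention(frames, wq, wk, wv):
    """frames: (T, d) time-frame features; returns (T, d) attended features.

    Every frame attends to all T frames, so context spans the whole signal.
    """
    q, k, v = frames @ wq, frames @ wk, frames @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])           # (T, T) frame-to-frame scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over all frames
    return weights @ v

rng = np.random.default_rng(0)
T, d = 100, 16                       # 100 frames covering the full signal
frames = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(frames, wq, wk, wv)
print(out.shape)  # (100, 16)
```

Because the (T, T) attention matrix relates every frame to every other frame, the separation module is not forced to assume the sources stay still inside a short window.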


Accepted papers

1."Target Vocabulary Recognition Based on Multi-Task Learning with Decomposed Teacher Sequences", Aoi Ito, Tatsuya Komatsu, Yusuke Fujita, Yusuke Kida

2."Multi-channel separation of dynamic speech and sound events", Takuya Fujimura, Robin Scheibler

3."CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center", Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, and Hiroshi Saruwatari

4."ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings", Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, and Hiroshi Saruwatari


Note: Papers 1–2 were lead-authored by LINE researchers, while papers 3–4 were co-authored with the University of Tokyo.


While developing new AI-driven services, LINE has also been focusing its efforts on AI research and development. In particular, the company has presented influential research, mostly on speech recognition and synthesis, at top conferences in the field of speech processing. Examples of cutting-edge technologies developed by LINE researchers include Parallel WaveGAN*1, capable of quickly producing high-quality speech, and self-conditioned CTC*3, demonstrated to be the most accurate among non-autoregressive automatic speech recognition*2 models, a family of high-speed speech recognition methods. In the field of environmental sound analysis, LINE researchers also won first place at the international DCASE2020 competition.

Going forward, LINE will continue to enhance the quality of its services as well as create new features and services by proactively advancing basic research on AI.


*1 Parallel WaveGAN (PWG): A non-autoregressive speech waveform generation model built on a generative adversarial network (GAN), a machine learning framework in which two neural networks are trained adversarially: one generates pseudo data from input data while the other learns to distinguish it from real data.

*2 Non-autoregressive automatic speech recognition: A method of recognizing speech at each point in time without depending on previously generated text.

*3 Self-conditioned CTC: A type of end-to-end speech recognition model that references text predicted in a neural network's intermediate layers to form a final prediction.