ISCA Archive Interspeech 2020

Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation

Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

Recently, source separation performance has been greatly improved by time-domain audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for long sequential data. While DPRNN is quite efficient at modeling sequential data of the length of an utterance, i.e., about 5 to 10 seconds of data, it is harder to apply to longer sequences such as whole conversations consisting of multiple utterances. This is simply because, in such a case, the number of time steps consumed by its internal module, called the inter-chunk RNN, becomes extremely large. To mitigate this problem, this paper proposes a multi-path RNN (MPRNN), a generalized version of DPRNN that models the input data in a hierarchical manner. In the MPRNN framework, the input data is represented at several (≥3) time resolutions, each of which is modeled by a specific RNN sub-module. For example, the RNN sub-module that deals with the finest resolution may model temporal relationships only within a phoneme, while the RNN sub-module handling the coarsest resolution may capture only the relationships between utterances, such as speaker information. We perform experiments using simulated dialogue-like mixtures and show that MPRNN has greater model capacity and outperforms the current state-of-the-art DPRNN framework, especially in online processing scenarios.
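
The scaling argument above can be illustrated with a small sketch. It counts how many time steps each RNN path would see when a long frame sequence is split into chunks once (dual-path) or twice (three paths). The frame count and chunk sizes are hypothetical values chosen for illustration, not figures from the paper, and the helper `path_lengths` is an assumed name. The sketch also assumes non-overlapping chunks with padding, which may differ from the segmentation actually used.

```python
import math


def path_lengths(num_frames, chunk_sizes):
    """Return the sequence length seen by each RNN path, finest first.

    chunk_sizes lists the chunk size used at each level of the hierarchy.
    The last entry of the result is the length left for the coarsest path.
    Assumption: non-overlapping chunks, with the last chunk padded.
    """
    lengths = []
    remaining = num_frames
    for size in chunk_sizes:
        lengths.append(size)                      # steps inside one chunk
        remaining = math.ceil(remaining / size)   # number of chunks at this level
    lengths.append(remaining)
    return lengths


num_frames = 240_000  # hypothetical length of a long recording, in frames

dual = path_lengths(num_frames, [100])        # two paths: intra-chunk, inter-chunk
multi = path_lengths(num_frames, [100, 60])   # three paths: one extra level

print("dual-path lengths: ", dual)    # [100, 2400]
print("three-path lengths:", multi)   # [100, 60, 40]
```

With a single level of chunking, the coarsest path has to process 2400 steps in this example. Adding one more level brings the longest sequence any path sees down to 100 steps, which is the kind of reduction the hierarchical representation is meant to provide.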


doi: 10.21437/Interspeech.2020-2388

Cite as: Kinoshita, K., Neumann, T.v., Delcroix, M., Nakatani, T., Haeb-Umbach, R. (2020) Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. Proc. Interspeech 2020, 2652-2656, doi: 10.21437/Interspeech.2020-2388

@inproceedings{kinoshita20_interspeech,
  author={Keisuke Kinoshita and Thilo von Neumann and Marc Delcroix and Tomohiro Nakatani and Reinhold Haeb-Umbach},
  title={{Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2652--2656},
  doi={10.21437/Interspeech.2020-2388}
}