This paper proposes X-net, a jointly learned scale-down and scale-up architecture for data pre- and post-processing in voice calls, as a means of bandwidth extension over band-limited channels. The scale-down and scale-up modules are deployed separately on the transmitter and receiver to perform down- and upsampling. Separate supervision is applied to each submodule so that X-net still works properly when one submodule is missing. A two-stage training method is used to learn X-net for improved perceptual quality. Results show that the jointly learned X-net achieves a promising improvement over blind audio super-resolution on both objective and subjective metrics, even in a lightweight implementation with only 1k parameters.
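For illustration, the sketch below outlines how a jointly learned scale-down/scale-up pair with separate supervision on each submodule might be set up in PyTorch. The layer choices, channel counts, names (ScaleDown, ScaleUp, joint_loss), and equal loss weighting are assumptions made for this example and are not taken from the paper.

# Minimal sketch of a jointly learned scale-down / scale-up pair in PyTorch.
# Layer types, channel counts, and loss weights are illustrative assumptions,
# not the architecture or training recipe described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleDown(nn.Module):
    """Transmitter-side module: learned 2x downsampling of wideband speech."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel_size=9, padding=4)
        self.down = nn.Conv1d(channels, 1, kernel_size=8, stride=2, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T) wideband waveform -> (batch, 1, T // 2) narrowband
        return self.down(F.relu(self.conv(x)))


class ScaleUp(nn.Module):
    """Receiver-side module: learned 2x upsampling (bandwidth extension)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.up = nn.ConvTranspose1d(1, channels, kernel_size=8, stride=2, padding=3)
        self.conv = nn.Conv1d(channels, 1, kernel_size=9, padding=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T // 2) narrowband -> (batch, 1, T) reconstructed wideband
        return self.conv(F.relu(self.up(x)))


def joint_loss(wideband, scale_down, scale_up, narrowband_ref):
    """Separate supervision on each submodule plus an end-to-end term,
    so either half can still operate when deployed without the other."""
    nb_pred = scale_down(wideband)
    wb_from_pred = scale_up(nb_pred)
    wb_from_ref = scale_up(narrowband_ref)
    loss_down = F.l1_loss(nb_pred, narrowband_ref)    # scale-down vs. a reference downsampling
    loss_up = F.l1_loss(wb_from_ref, wideband)        # scale-up as blind bandwidth extension
    loss_joint = F.l1_loss(wb_from_pred, wideband)    # full transmit/receive chain
    return loss_down + loss_up + loss_joint


if __name__ == "__main__":
    sd, su = ScaleDown(), ScaleUp()
    wb = torch.randn(4, 1, 1024)                      # dummy wideband frames
    nb_ref = F.avg_pool1d(wb, kernel_size=2)          # stand-in reference narrowband
    loss = joint_loss(wb, sd, su, nb_ref)
    loss.backward()
    print(loss.item())

Because each submodule has its own supervision term, the trained ScaleUp can also be run alone on conventionally downsampled audio, which mirrors the paper's claim that X-net still works when one submodule is missing.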
Cite as: Wen, L., Wang, L., Wen, X., Zheng, Y., Park, Y., Choi, K.P. (2021) X-net: A Joint Scale Down and Scale Up Method for Voice Call. Proc. Interspeech 2021, 1644-1648, doi: 10.21437/Interspeech.2021-812
@inproceedings{wen21_interspeech,
  author={Liang Wen and Lizhong Wang and Xue Wen and Yuxing Zheng and Youngo Park and Kwang Pyo Choi},
  title={{X-net: A Joint Scale Down and Scale Up Method for Voice Call}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={1644--1648},
  doi={10.21437/Interspeech.2021-812}
}