ISCA Archive SSW 2023

MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module

Ondřej Plátek, Ondrej Dusek

We present MooseNet, a trainable speech metric that predicts the listeners’ Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained only on 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
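The general idea of fitting a small generative model on top of fixed embeddings can be illustrated with a minimal sketch. This is not the authors' implementation and not PLDA itself: it swaps in a much simpler Gaussian classifier with per-class means and one shared diagonal variance. Everything concrete here is an assumption made for illustration: the synthetic 2-D vectors standing in for SSL embeddings, the integer scores 1 to 5 used as classes, the posterior-weighted mean used as the predicted MOS, and the hypothetical names `fit_shared_gaussian` and `predict_mos`.

```python
import math
import random


def fit_shared_gaussian(embeddings, labels):
    """Fit per-class means and one pooled diagonal variance.

    Simplified stand-in for PLDA (an assumption, not the paper's model).
    """
    dim = len(embeddings[0])
    classes = sorted(set(labels))
    means = {}
    for c in classes:
        rows = [e for e, y in zip(embeddings, labels) if y == c]
        means[c] = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    # Pooled within-class variance per dimension, with a small floor.
    var = [1e-6] * dim
    for e, y in zip(embeddings, labels):
        for d in range(dim):
            var[d] += (e[d] - means[y][d]) ** 2 / len(embeddings)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    return means, var, priors


def predict_mos(embedding, model):
    """Return the posterior-weighted mean of the class scores."""
    means, var, priors = model
    log_post = {}
    for c, mu in means.items():
        # The variance is shared, so the Gaussian normalizer cancels out.
        ll = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(embedding, mu, var))
        log_post[c] = math.log(priors[c]) + ll
    top = max(log_post.values())  # subtract the max for numerical stability
    weights = {c: math.exp(lp - top) for c, lp in log_post.items()}
    total = sum(weights.values())
    return sum(c * w for c, w in weights.items()) / total


# Synthetic stand-in data: 20 noisy 2-D "embeddings" per score class.
rng = random.Random(0)
train_x, train_y = [], []
for score in range(1, 6):
    for _ in range(20):
        train_x.append([score + rng.gauss(0, 0.3), -score + rng.gauss(0, 0.3)])
        train_y.append(score)

model = fit_shared_gaussian(train_x, train_y)
low = predict_mos([1.0, -1.0], model)
high = predict_mos([5.0, -5.0], model)
print(f"predicted MOS near class 1: {low:.2f}, near class 5: {high:.2f}")
```

Fitting a model of this kind only requires class means and a pooled variance, so it is cheap to train on a small labelled set. That is the property the abstract emphasizes for PLDA in the low-resource setting.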


doi: 10.21437/SSW.2023-8

Cite as: Plátek, O., Dusek, O. (2023) MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 48-54, doi: 10.21437/SSW.2023-8

@inproceedings{platek23_ssw,
  author={Ondřej Plátek and Ondrej Dusek},
  title={{MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={48--54},
  doi={10.21437/SSW.2023-8}
}