Access to large amounts of annotated data is a challenge for companies that develop high-quality AI-based services. Crowdsourcing platforms offer a collaborative way to solve the data collection problem by drawing on many people, communities, groups, and resources. However, as the need for data grows, it becomes increasingly hard to source the right crowd and maintain quality. In the particular case of speech data, a critical aspect of data quality is verifying that crowd participants are native speakers of a specific language, dialect, or variety.
In this work, we propose to use automatic nativeness classification (NC) to tackle this problem. NC can be regarded as a particular case of spoken language recognition, a field that has benefited from recent breakthroughs based on deep neural network methods, with performance beginning to exceed human-level capabilities. In our case, we aim to develop a variant-sensitive nativeness classifier to be used for quality control of crowdsourced data (replacing the more traditional human validation step). This work focuses on Portuguese (European and Brazilian variants) and English (American, British, and Indian). In particular, we explore how NC can be integrated into a crowdsourcing speech data collection pipeline, where nativeness is computed as data is collected (potentially blocking non-native speakers from continuing to contribute). We test three different speaker-embedding-based frameworks: i-vectors, x-vectors, and h-vectors. Experimental results show that the proposed x-vector-based system outperforms the baseline with a 9% relative improvement.
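To make the pipeline integration concrete, the following is a minimal sketch of the quality-control gate described above: a nativeness classifier scores each incoming recording, and a running decision blocks a contributor once the accumulated evidence suggests they are not a native speaker of the target variety. All names, the averaging rule, and the thresholds are illustrative assumptions, not the paper's implementation; `score_fn` stands in for the actual embedding-based classifier (e.g. an x-vector model).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ContributorGate:
    """Hypothetical per-contributor gate for a crowdsourcing pipeline."""

    # score_fn maps a recording to an estimated P(native) in [0, 1];
    # in practice this would wrap a trained nativeness classifier.
    score_fn: Callable[[bytes], float]
    threshold: float = 0.5      # decision boundary on the mean score (assumed)
    min_recordings: int = 3     # don't decide on too little evidence (assumed)
    scores: List[float] = field(default_factory=list)

    def submit(self, recording: bytes) -> bool:
        """Score one recording; return False once the contributor is blocked."""
        self.scores.append(self.score_fn(recording))
        if len(self.scores) < self.min_recordings:
            return True  # not enough evidence yet; keep collecting
        mean_score = sum(self.scores) / len(self.scores)
        return mean_score >= self.threshold
```

A contributor whose recordings consistently score low would pass the first few submissions (while evidence accumulates) and then be blocked, which matches the "potentially blocking non-native speakers from continuing to contribute" behaviour described above.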