Fast Derivation of Cross-lingual Document Vectors from Self-attentive Neural Machine Translation Model

Wei Li, Brian Mak


A universal cross-lingual representation of documents that captures their underlying semantics is useful in many natural language processing tasks. In this paper, we develop a new document vectorization method that uses the self-attention mechanism of a neural machine translation (NMT) model to select the most salient sequential patterns from the input and build document vectors from them. The model used by our method can be trained on parallel corpora that are unrelated to the task at hand. At test time, our method takes a monolingual document and converts it into a “Neural machine Translation framework based cross-lingual Document Vector” (NTDV). NTDV has two advantages. First, an NTDV is produced by the forward pass of the NMT encoder, so the process is very fast and requires no training or optimization. Second, our model can be conveniently adapted from a pair of existing attention-based NMT models, which significantly reduces the amount of parallel data needed for training. In a cross-lingual document classification task, our NTDV embeddings surpass the previous state-of-the-art performance in the English-to-German classification test and, to the best of our knowledge, achieve the best performance among the fast decoding methods in the German-to-English classification test.
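
The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea of attention-based pooling: a variable-length sequence of encoder states is collapsed into one fixed-size vector by a softmax-weighted average. The names `attentive_pool`, `hidden_states`, and `query` are hypothetical; the hidden states stand in for NMT encoder outputs and the query stands in for a learned parameter. This is not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(hidden_states: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Collapse a sequence of state vectors into one fixed-size vector.

    Assumption (illustrative only): each state is scored by its dot
    product with a single query vector, and the document vector is the
    softmax-weighted average of the states.
    """
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    # One relevance score per position in the sequence.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted average, computed dimension by dimension.
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]


# Toy stand-ins for encoder outputs of a three-token document.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_vector = attentive_pool(states, query=[2.0, 0.0])
```

Because the pooling needs only a forward computation over states that are already available, producing a vector for a new document involves no gradient steps, which matches the speed argument made in the abstract.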


DOI: 10.21437/Interspeech.2018-1459

Cite as: Li, W., Mak, B. (2018) Fast Derivation of Cross-lingual Document Vectors from Self-attentive Neural Machine Translation Model. Proc. Interspeech 2018, 107-111, DOI: 10.21437/Interspeech.2018-1459.


@inproceedings{Li2018,
  author={Wei Li and Brian Mak},
  title={Fast Derivation of Cross-lingual Document Vectors from Self-attentive Neural Machine Translation Model},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={107--111},
  doi={10.21437/Interspeech.2018-1459},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1459}
}