12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

An Efficient Unified Extraction Algorithm for Bilingual Data

Christoph Tillmann (1), Sanjika Hewavitharana (2)

(1) IBM T.J. Watson Research Center, USA
(2) Carnegie Mellon University, USA

The paper presents a unified algorithm for aligning sentences with their translations in bilingual data. The sentence alignment problem is treated as a large-scale pattern recognition problem, similar to the task of finding the word sequence that corresponds to an acoustic input signal in connected-word automatic speech recognition (ASR). The algorithm gains efficiency from related work on dynamic programming (DP) search for speech recognition [1]. A stack-based search is parametrized in a novel way, so that the unified algorithm can be applied to types of data that were previously handled by separate implementations: the extracted text chunk pairs can be sub-sentential pairs, one-to-one sentence-level pairs, or many-to-many sentence-level pairs. The one-stage search is carried out in a single run over the data. With unified beam-search candidate pruning, the algorithm is efficient enough to avoid any document-level pre-filtering and to use less restrictive sentence-level filtering. Results are presented on a Russian-English and a Spanish-English extraction task. Using a simple word-based scoring model, text chunk pairs are extracted from several trillion candidates.
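
The general idea of a single-pass DP alignment with beam pruning can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the chunk shapes in BEADS, the skip penalty, the coverage-based cost, and the per-diagonal pruning rule are all assumptions chosen for illustration, and the cost is deliberately simplistic (it only checks source-side coverage).

```python
# Hypothetical toy lexicon: source word -> set of acceptable target words.
LEXICON = {
    "la": {"the"}, "el": {"the"}, "casa": {"house"},
    "roja": {"red"}, "gato": {"cat"}, "negro": {"black"},
}

# Assumed chunk shapes: (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]
SKIP_COST = 1.0  # assumed penalty for leaving a sentence unaligned


def chunk_cost(src_sents, tgt_sents):
    """Word-based cost: fraction of source words without a lexicon match."""
    src_words = [w for s in src_sents for w in s.split()]
    tgt_words = {w for s in tgt_sents for w in s.split()}
    if not src_words or not tgt_words:
        return SKIP_COST
    covered = sum(1 for w in src_words if LEXICON.get(w, set()) & tgt_words)
    return 1.0 - covered / len(src_words)


def align(src, tgt, beam=2.0):
    """Monotone DP over grid points (i, j), pruned per diagonal i + j."""
    n, m = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}  # state -> (cost, backpointer)
    # Every bead advances i + j by at least 1, so all states on diagonal d
    # are final by the time diagonal d is expanded.
    for d in range(n + m):
        active = {s: c for s, (c, _) in best.items() if sum(s) == d}
        if not active:
            continue
        threshold = min(active.values()) + beam
        for (i, j), cost in active.items():
            if cost > threshold:
                continue  # beam pruning: do not expand weak hypotheses
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                new_cost = cost + chunk_cost(src[i:ni], tgt[j:nj])
                if (ni, nj) not in best or new_cost < best[(ni, nj)][0]:
                    best[(ni, nj)] = (new_cost, (i, j))
    if (n, m) not in best:
        return None  # the beam was too narrow to reach the end
    # Follow backpointers from the final state to recover the chunk pairs.
    pairs, state = [], (n, m)
    while best[state][1] is not None:
        prev = best[state][1]
        pairs.append((src[prev[0]:state[0]], tgt[prev[1]:state[1]]))
        state = prev
    return pairs[::-1]


if __name__ == "__main__":
    src = ["la casa roja", "el gato negro"]
    tgt = ["the red house", "the black cat"]
    for s_chunk, t_chunk in align(src, tgt):
        print(s_chunk, "<->", t_chunk)
```

In this sketch a smaller beam value prunes more hypotheses per diagonal and trades accuracy for speed. Changing BEADS switches between one-to-one and many-to-many extraction without touching the search loop, which loosely mirrors the parametrization idea described in the abstract.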


  1. H. Ney, “The Use of a One-stage Dynamic Programming Algorithm for Connected Word Recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 263–271, 1984.


Bibliographic reference.  Tillmann, Christoph / Hewavitharana, Sanjika (2011): "An efficient unified extraction algorithm for bilingual data", In INTERSPEECH-2011, 2093-2096.