SIGIR 2020 Tutorial: Searching the Web for Cross-lingual Web Data
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of natural language processing tasks, with applications in machine translation, cross-lingual information retrieval, and document classification, as well as in learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We frame obtaining parallel text as a retrieval problem in which the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences, as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the web can improve machine translation models and facilitate cross-lingual NLP.
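As a first taste of that core tool, the minimal sketch below scores candidate sentence pairs by cosine similarity between their embeddings. The encoder is abstracted away: the random arrays are stand-ins for real multilingual sentence embeddings such as LASER's, and all names here are illustrative.

```python
import numpy as np

# Stand-ins for real multilingual sentence embeddings (e.g., LASER
# produces 1024-dimensional vectors); one row per sentence.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5, 1024))  # source-language sentences
tgt_emb = rng.normal(size=(5, 1024))  # target-language sentences

# L2-normalize rows so dot products equal cosine similarities.
src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)

# Cosine similarity between every source/target pair; high-scoring
# pairs are candidate translations.
sim = src_emb @ tgt_emb.T
best_match = sim.argmax(axis=1)  # nearest target for each source
```

Raw cosine similarity is only a starting point; the tutorial covers margin-based criteria that correct for hubs in the embedding space (see the retrieval sketch after the outline below).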
Outline
- Preliminaries of mining the web for parallel data
- Scarcity of parallel text and abundance of large heterogeneous corpora
- Parallel sentences for training machine translation
- Aligned multilingual documents and sentences
- Web crawling and multilingual corpora
- ParaCrawl: Web-scale parallel corpora for the languages of the EU
- Wikipedia web data
- CommonCrawl: open repository of web crawl data
- Cross-lingual representations
- MUSE: Multilingual unsupervised and supervised embeddings
- LASER: Language Agnostic Sentence Representations
- XLM and XLM-R: Transformer-based cross-lingual representations
- Parallel Document Retrieval
- Metadata-based approaches and URL alignment
- Translation-based approaches with tf/idf retrieval
- Embedding-based retrieval
- Document Level Parallel Sentence Retrieval
- HunAlign: Dictionary-based Sentence Alignment
- Gargantua: unsupervised sentence alignment
- BleuAlign: MT-based sentence alignment
- VecAlign: Order-aware Sentence Alignment in Linear Time and Space (see the alignment sketch after this outline)
- Global Parallel Sentence Retrieval
- FAISS: Billion-scale similarity search with GPUs
- WikiMatrix: LASER with margin retrieval on Wikipedia
- CCMatrix: LASER with margin retrieval on CommonCrawl (see the margin-based retrieval sketch after this outline)
- Parallel Sentence Filtering
- Zipporah: a fast and scalable data cleaning algorithm
- BiCleaner: Feature-based filtering
- Low-resource corpus filtering with multilingual sentence embeddings
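For the document-level sentence alignment topics above, the toy dynamic program below illustrates the core idea behind embedding-based aligners: find the best monotonic path through a sentence-similarity matrix. This is a simplified 1-1 sketch, not VecAlign's actual algorithm, which also scores one-to-many blocks and reaches linear time and space through a coarse-to-fine approximation.

```python
import numpy as np

def align_1to1(sim):
    """Monotonic 1-1 sentence alignment by dynamic programming over a
    similarity matrix sim (n source x m target sentences). A toy
    stand-in for VecAlign-style alignment."""
    n, m = sim.shape
    score = np.zeros((n + 1, m + 1))
    back = np.zeros((n + 1, m + 1), dtype=int)  # 0: skip src, 1: skip tgt, 2: match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cand = (score[i - 1, j],                          # leave source i-1 unaligned
                    score[i, j - 1],                          # leave target j-1 unaligned
                    score[i - 1, j - 1] + sim[i - 1, j - 1])  # align the pair
            back[i, j] = int(np.argmax(cand))
            score[i, j] = cand[back[i, j]]
    # Trace the best path back to recover aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if back[i, j] == 2:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Illustrative usage: sim would normally hold cosine similarities between
# normalized sentence embeddings of the two documents.
sim = np.random.default_rng(0).random((4, 6))
print(align_1to1(sim))
```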
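For the global retrieval and filtering topics, here is a hedged sketch of margin-based mining in the spirit of WikiMatrix and CCMatrix: embeddings are indexed with FAISS, and each candidate pair is scored by its cosine similarity divided by the average similarity to its k nearest neighbors in both directions (the ratio margin of Artetxe and Schwenk). The array contents, k, and the acceptance threshold are all illustrative.

```python
import numpy as np
import faiss

def margin_mine(src, tgt, k=4):
    """Score the nearest target for each source sentence with a ratio
    margin. src, tgt: L2-normalized float32 arrays of shape (n, d)."""
    d = src.shape[1]
    tgt_index = faiss.IndexFlatIP(d)  # inner product = cosine on unit vectors
    tgt_index.add(tgt)
    src_index = faiss.IndexFlatIP(d)
    src_index.add(src)

    sim_fwd, nn_fwd = tgt_index.search(src, k)  # src -> tgt neighbors
    sim_bwd, _ = src_index.search(tgt, k)       # tgt -> src neighbors

    avg_src = sim_fwd.mean(axis=1)  # average neighborhood similarity per source
    avg_tgt = sim_bwd.mean(axis=1)  # average neighborhood similarity per target

    best = nn_fwd[:, 0]             # top target candidate per source
    cos = sim_fwd[:, 0]
    margin = cos / ((avg_src + avg_tgt[best]) / 2.0)
    return best, margin

# Illustrative usage with random stand-in embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 1024)).astype("float32")
tgt = rng.normal(size=(1000, 1024)).astype("float32")
faiss.normalize_L2(src)
faiss.normalize_L2(tgt)
best, margin = margin_mine(src, tgt)
keep = margin > 1.06  # threshold is illustrative; tuned per corpus in practice
```

At CCMatrix scale (billions of sentences), the exact IndexFlatIP would be replaced with a compressed, sharded FAISS index; that trade-off is exactly what the FAISS topic above addresses.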
Tutorial
[Paper], [Slides]
Code
[FAISS], [LASER], [BiCleaner], [MUSE]
Data
[WikiMatrix]
Publications
- Schwenk, H. et al. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
- El-Kishky, A. et al. A Massive Collection of Cross-lingual Web Documents
- Schwenk, H. et al. CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
- Chaudhary, V. et al. Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
- El-Kishky, A. et al. Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance
- Johnson, J. et al. Billion-scale Similarity Search with GPUs
- Thompson, B. et al. Vecalign: Improved Sentence Alignment in Linear Time and Space
- Xu, H. et al. Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
Presenters
Ahmed El-Kishky is a Research Scientist at Facebook AI, where he works on developing automated methods for obtaining machine translation training data. Before that, he received his PhD from the University of Illinois at Urbana-Champaign, where he was supported by the National Science Foundation Graduate Research Fellowship (NSF-GRF) and the National Defense Science and Engineering Graduate (NDSEG) Fellowship. Over his career, El-Kishky has published papers and given tutorials at venues such as KDD, VLDB, SIGMOD, WWW, WSDM, ICDM, and IEEE BigData.
Philipp Koehn is a professor in the Department of Computer Science at Johns Hopkins University. He is recognized worldwide for his leading research on developing and understanding data-driven methods that address long-standing, real-world challenges in machine translation and machine learning. Koehn authored the textbooks Statistical Machine Translation and Neural Machine Translation. He serves on the editorial boards of multiple journals, among them Transactions of the Association for Computational Linguistics; Machine Translation Journal; Artificial Intelligence Review; Computation, Corpora, Cognition; and ACM Transactions on Asian and Low-Resource Language Information Processing. Koehn is president of the ACL Special Interest Group on Machine Translation, which has organized the WMT series of ACL workshops and conferences on machine translation since 2005.
Holger Schwenk is a research scientist at Facebook Artificial Intelligence Research (FAIR) in Paris. He received his PhD in Computer Science from the University of Paris in 1996, and prior to joining Facebook in 2015, he was a professor of computer science at the University of Le Mans, where he led a large group working on statistical machine translation. Over his career, Schwenk has authored papers in top machine learning and natural language processing venues such as ACL, NAACL, EMNLP, and NeurIPS.