SIGIR 2020 Tutorial: Searching the Web for Cross-lingual Web Data
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of natural language processing tasks, with applications in machine translation, cross-lingual information retrieval, and document classification, as well as in learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We frame obtaining parallel text as a retrieval problem in which the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences, as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the web can improve machine translation models and facilitate cross-lingual NLP.
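As a first taste of that core tool, the minimal sketch below scores candidate sentence pairs by cosine similarity between their embeddings. The encoder is abstracted away: the random arrays are stand-ins for real multilingual sentence embeddings such as LASER's, and all names here are illustrative.

```python
import numpy as np

# Stand-ins for real multilingual sentence embeddings (e.g., LASER
# produces 1024-dimensional vectors); one row per sentence.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5, 1024))  # source-language sentences
tgt_emb = rng.normal(size=(5, 1024))  # target-language sentences

# L2-normalize rows so dot products equal cosine similarities.
src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)

# Cosine similarity between every source/target pair; high-scoring
# pairs are candidate translations.
sim = src_emb @ tgt_emb.T
best_match = sim.argmax(axis=1)  # nearest target for each source
```

Raw cosine similarity is only a starting point; the tutorial covers margin-based criteria that correct for hubs in the embedding space (see the retrieval sketch after the outline below).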
Outline
- Preliminaries of mining the web for parallel data
- Scarcity of parallel text and abundance of large heterogeneous corpora
- Parallel sentences for training machine translation
- Aligned multilingual documents and sentences
- Web crawling and multilingual corpora
- ParaCrawl: Web-scale parallel corpora for the languages of the EU
- Wikipedia web data
- CommonCrawl: open repository of web crawl data
- Cross-lingual representations
- MUSE: Multilingual unsupervised and supervised embeddings
- LASER: Language Agnostic Sentence Representations
- XLM and XLM-R: Transformer-based cross-lingual representations
- Parallel Document Retrieval
- Metadata-based approaches and URL alignment
- Translation-based approaches with tf/idf retrieval
- Embedding-based retrieval
- Document Level Parallel Sentence Retrieval
- HunAlign: Dictionary-based Sentence Alignment
- Gargantua: unsupervised sentence alignment
- BleuAlign: MT-based sentence alignment
- VecAlign: Order-aware Sentence Alignment in Linear Time and Space (see the alignment sketch after this outline)
- Global Parallel Sentence Retrieval
- FAISS: Billion-scale similarity search with GPUs
- WikiMatrix: LASER with margin retrieval on Wikipedia
- CCMatrix: LASER with margin retrieval on CommonCrawl (see the margin-based retrieval sketch after this outline)
- Parallel Sentence Filtering
- Zipporah: a fast and scalable data cleaning algorithm
- BiCleaner: Feature-based filtering
- Low-resource corpus filtering with multilingual sentence embeddings
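For the document-level sentence alignment topics above, the toy dynamic program below illustrates the core idea behind embedding-based aligners: find the best monotonic path through a sentence-similarity matrix. This is a simplified 1-1 sketch, not VecAlign's actual algorithm, which also scores one-to-many blocks and reaches linear time and space through a coarse-to-fine approximation.

```python
import numpy as np

def align_1to1(sim):
    """Monotonic 1-1 sentence alignment by dynamic programming over a
    similarity matrix sim (n source x m target sentences). A toy
    stand-in for VecAlign-style alignment."""
    n, m = sim.shape
    score = np.zeros((n + 1, m + 1))
    back = np.zeros((n + 1, m + 1), dtype=int)  # 0: skip src, 1: skip tgt, 2: match
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cand = (score[i - 1, j],                          # leave source i-1 unaligned
                    score[i, j - 1],                          # leave target j-1 unaligned
                    score[i - 1, j - 1] + sim[i - 1, j - 1])  # align the pair
            back[i, j] = int(np.argmax(cand))
            score[i, j] = cand[back[i, j]]
    # Trace the best path back to recover aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if back[i, j] == 2:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Illustrative usage: sim would normally hold cosine similarities between
# normalized sentence embeddings of the two documents.
sim = np.random.default_rng(0).random((4, 6))
print(align_1to1(sim))
```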
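For the global retrieval and filtering topics, here is a hedged sketch of margin-based mining in the spirit of WikiMatrix and CCMatrix: embeddings are indexed with FAISS, and each candidate pair is scored by its cosine similarity divided by the average similarity to its k nearest neighbors in both directions (the ratio margin of Artetxe and Schwenk). The array contents, k, and the acceptance threshold are all illustrative.

```python
import numpy as np
import faiss

def margin_mine(src, tgt, k=4):
    """Score the nearest target for each source sentence with a ratio
    margin. src, tgt: L2-normalized float32 arrays of shape (n, d)."""
    d = src.shape[1]
    tgt_index = faiss.IndexFlatIP(d)  # inner product = cosine on unit vectors
    tgt_index.add(tgt)
    src_index = faiss.IndexFlatIP(d)
    src_index.add(src)

    sim_fwd, nn_fwd = tgt_index.search(src, k)  # src -> tgt neighbors
    sim_bwd, _ = src_index.search(tgt, k)       # tgt -> src neighbors

    avg_src = sim_fwd.mean(axis=1)  # average neighborhood similarity per source
    avg_tgt = sim_bwd.mean(axis=1)  # average neighborhood similarity per target

    best = nn_fwd[:, 0]             # top target candidate per source
    cos = sim_fwd[:, 0]
    margin = cos / ((avg_src + avg_tgt[best]) / 2.0)
    return best, margin

# Illustrative usage with random stand-in embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 1024)).astype("float32")
tgt = rng.normal(size=(1000, 1024)).astype("float32")
faiss.normalize_L2(src)
faiss.normalize_L2(tgt)
best, margin = margin_mine(src, tgt)
keep = margin > 1.06  # threshold is illustrative; tuned per corpus in practice
```

At CCMatrix scale (billions of sentences), the exact IndexFlatIP would be replaced with a compressed, sharded FAISS index; that trade-off is exactly what the FAISS topic above addresses.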
Tutorial
[Paper], [Slides]
Code
[FAISS], [LASER], [BiCleaner], [MUSE]
Data
[WikiMatrix]
Publications
- Schwenk, H. et al. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
- El-Kishky, A. et al. A Massive Collection of Cross-lingual Web Documents
- Schwenk, H. et al. CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
- Chaudhary, V. et al. Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
- El-Kishky, A. et al. Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance
- Johnson, J. et al. Billion-scale Similarity Search with GPUs
- Thompson, B. et al. Vecalign: Improved Sentence Alignment in Linear Time and Space
- Xu, H. et al. Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
Presenters
Ahmed El-Kishky is a Research Scientist at Facebook AI, where he works on developing automated methods for obtaining machine translation training data. Before that, he received his PhD from the University of Illinois at Urbana-Champaign, where he was supported by the National Science Foundation Graduate Research Fellowship (NSF-GRF) and the National Defense Science and Engineering Graduate (NDSEG) Fellowship. Over his career, El-Kishky has published papers and given tutorials at venues such as KDD, VLDB, SIGMOD, WWW, WSDM, ICDM, and IEEE BigData.
Philipp Koehn is a professor in the Department of Computer Science at Johns Hopkins University. He is recognized worldwide for his leading research on developing and understanding data-driven methods that address long-standing, real-world challenges in machine translation and machine learning. Koehn authored the textbooks Statistical Machine Translation and Neural Machine Translation. He serves on the editorial boards of multiple journals, among them Transactions of the Association for Computational Linguistics; Machine Translation Journal; Artificial Intelligence Review; Computation, Corpora, Cognition; and ACM Transactions on Asian and Low-Resource Language Information Processing. Koehn is president of the ACL Special Interest Group on Machine Translation, which has organized the WMT series of ACL workshops and conferences on machine translation since 2005.
Holger Schwenk is a research scientist at Facebook Artificial Intelligence Research (FAIR) in Paris. He received his PhD in Computer Science from the University of Paris in 1996, and prior to joining Facebook in 2015, he was a professor of computer science at the University of Le Mans, where he led a large group working on statistical machine translation. Over his career, Schwenk has authored papers in top machine learning and natural language processing venues such as ACL, NAACL, EMNLP, and NeurIPS.