MTData for WMT22

This document provides instructions for downloading the WMT22 General MT task datasets for the constrained track using mtdata.

1. Setup

pip install mtdata==0.3.7
# pip install https://github.com/thammegowda/mtdata/archive/develop.zip  # Install from develop branch

2. Get Recipes File

Config file for the CONSTRAINED track (it omits two datasets that are behind registration: CzEng2.0 and CCMT):

wget https://www.statmt.org/wmt22/mtdata/mtdata.recipes.wmt22-constrained.yml

By default, the recipe file must be in the current directory (where mtdata is invoked), and its name must match the glob mtdata.recipes*.yml. To keep all your recipe YML files in a specific directory instead, run export MTDATA_RECIPES=/path/to/dir
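As a minimal sketch of the second option, the snippet below moves any recipe files from the current directory into a dedicated directory and points MTDATA_RECIPES at it. The directory name recipes is an arbitrary choice for illustration.

```shell
# Hypothetical location for recipe files; change it to suit your setup.
RECIPES_DIR="$PWD/recipes"
mkdir -p "$RECIPES_DIR"

# Move any recipe files found in the current directory.
for f in mtdata.recipes*.yml; do
  if [ -e "$f" ]; then
    mv "$f" "$RECIPES_DIR"/
  fi
done

# Tell mtdata where to look for recipe files.
export MTDATA_RECIPES="$RECIPES_DIR"
```

Add the export line to your shell profile if the setting should persist across sessions.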

If you plan to participate in the UNCONSTRAINED track, any data is allowed. For example, you may use the following config file, which contains a larger set of corpora.

wget https://www.statmt.org/wmt22/mtdata/mtdata.recipes.wmt22-unconstrained.yml

3. List Available Recipes

$ mtdata list-recipe | cut -f1 | grep wmt22
wmt22-csen
wmt22-deen
wmt22-jaen
wmt22-ruen
wmt22-zhen
wmt22-frde
wmt22-hren
wmt22-liven
wmt22-uken
wmt22-ukcs
wmt22-sahru

All wmt22* ids are loaded from the mtdata.recipes.wmt22*.yml files.
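If an expected id is missing from the list, one possible cause is a recipe file whose name does not match the glob. The helper below is an illustrative check and not part of mtdata: it approximates the mtdata.recipes*.yml rule with a shell pattern, and the function name is made up for this example.

```shell
# Hypothetical helper: succeeds if the file name matches mtdata.recipes*.yml
is_recipe_filename() {
  case "${1##*/}" in          # strip any leading directory components
    mtdata.recipes*.yml) return 0 ;;
    *) return 1 ;;
  esac
}

is_recipe_filename "mtdata.recipes.wmt22-constrained.yml" && echo "recognized"
is_recipe_filename "wmt22-constrained.yml" || echo "not recognized"
```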

4. Download Recipes

Download a Recipe

# example: wmt22-csen
mtdata get-recipe -ri wmt22-csen -o wmt22-csen

Download All Recipes

for ri in wmt22-{csen,deen,jaen,ruen,zhen,frde,hren,liven,uken,ukcs,sahru}; do
  mtdata get-recipe -ri "$ri" -o "$ri"
done

Limitations:

Two datasets listed on the WMT22 page, CzEng2.0 and CCMT, require login and cannot be downloaded with this tool.
Newstest 2021 is not supported yet. See #116 for the current status.

Usage: mtdata get-recipe

$ mtdata get-recipe -h
usage: mtdata get-recipe [-h] -ri RECIPE_ID [-f] [-j N_JOBS] [--merge | --no-merge] [--compress] [-dd] [-dt] -o OUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  -ri RECIPE_ID, --recipe-id RECIPE_ID
                        Recipe ID (default: None)
  -f, --fail-on-error   Fail on error (default: False)
  -j N_JOBS, --n-jobs N_JOBS
                        Number of worker jobs (processes) (default: 1)
  --merge               Merge train into a single file (default: True)
  --no-merge            Do not Merge train into a single file (default: False)
  --compress            Keep the files compressed (default: False)
  -dd, --dedupe, --drop-dupes
                        Remove duplicate (src, tgt) pairs in training (if any); valid when --merge. Not recommended for large datasets. (default: False)
  -dt, --drop-tests     Remove dev/test sentences from training sets (if any); valid when --merge (default: False)
  -o OUT_DIR, --out OUT_DIR
                        Output directory name (default: None)
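
The options above can be combined. The sketch below assembles one possible invocation using only flags from the help text; the recipe id and job count are example values. It prints the command and runs it only when RUN=1 is set, which is a convention of this example rather than of mtdata.

```shell
recipe_id="wmt22-deen"   # any id shown by `mtdata list-recipe`
n_jobs=4                 # example value; tune it to your machine

# -j: number of worker processes
# --drop-tests: remove dev/test sentences from the training sets
# --compress: keep the files compressed
set -- get-recipe -ri "$recipe_id" -o "$recipe_id" -j "$n_jobs" --drop-tests --compress
cmd_preview="mtdata $*"
echo "$cmd_preview"

# RUN=1 is a convention of this sketch, not an mtdata feature.
if [ "${RUN:-0}" = "1" ]; then
  mtdata "$@"
fi
```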

5. Add/Customize a Recipe

Here is an example recipe:

- id: wmt22-deen  # (1)
  langs: deu-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:  # (2)
    - Statmt-newstest_deen-2020-deu-eng
    - Statmt-newstest_ende-2020-eng-deu
  test:  # (2)
    #- Statmt-newstest_deen-2021-deu-eng
    #- Statmt-newstest_ende-2021-eng-deu
  train:  # (3)
    - Statmt-europarl-10-deu-eng
    - ParaCrawl-paracrawl-9-eng-deu
    - Statmt-commoncrawl_wmt13-1-deu-eng
    - Statmt-news_commentary-16-deu-eng
    - Statmt-wikititles-3-deu-eng
    - Tilde-rapid-2019-deu-eng # - Tilde-rapid-2016-deu-eng
    - Facebook-wikimatrix-1-deu-eng

(1) id must be unique.
(2) dev and test are optional. Each can be a single dataset (a string) or a list of datasets (a list of strings).
(3) train is required.
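
Because ids must be unique, it can help to check for duplicates before adding a recipe. The function below is a rough text-level check, not a YAML parser and not part of mtdata: it assumes each recipe starts with a line of the form "- id: NAME", as in the example above, and its name is made up for illustration.

```shell
# Hypothetical helper: print recipe ids that occur more than once in the given files.
duplicate_recipe_ids() {
  grep -hE '^[[:space:]]*-[[:space:]]*id:' "$@" \
    | sed -E 's/^[[:space:]]*-[[:space:]]*id:[[:space:]]*//; s/[[:space:]]*(#.*)?$//' \
    | sort | uniq -d
}

# Usage (prints nothing when all ids are unique):
#   duplicate_recipe_ids mtdata.recipes*.yml
```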

6. Issues / Bugs

Please report them via GitHub issues at github.com/thammegowda/mtdata.