This document provides instructions for downloading the WMT22 General MT task datasets for the constrained track using mtdata.
1. Setup
pip install mtdata==0.3.7
# pip install https://github.com/thammegowda/mtdata/archive/develop.zip # Install from develop branch
2. Get Recipes File
Config file for the CONSTRAINED track (it omits the datasets that require registration: CzEng2.0 and CCMT):
wget https://www.statmt.org/wmt22/mtdata/mtdata.recipes.wmt22-constrained.yml
By default, the recipe file has to be in the current directory (where mtdata is invoked) and its name has to match the mtdata.recipes*.yml glob. If you would like to place all your recipe YML files in a specific directory, then export MTDATA_RECIPES=/path/to/dir
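As a minimal sketch of the second option, assuming a hypothetical directory $HOME/mtdata-recipes (any path works) and that the recipe files were already downloaded into the current directory:

```shell
# Hypothetical location for recipe files; any directory works.
recipes_dir="$HOME/mtdata-recipes"
mkdir -p "$recipes_dir"

# Move any recipe files from the current directory into it.
for f in mtdata.recipes*.yml; do
  if [ -e "$f" ]; then
    mv "$f" "$recipes_dir/"
  fi
done

# Point mtdata at the directory for this shell session.
export MTDATA_RECIPES="$recipes_dir"
```

The export only lasts for the current shell session; adding it to a shell startup file would make it permanent.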
If you are considering participating in the UNCONSTRAINED track, any data is allowed. For example, you may use the following config file, which contains a larger set of corpora.
wget https://www.statmt.org/wmt22/mtdata/mtdata.recipes.wmt22-unconstrained.yml
3. List Available Recipes
$ mtdata list-recipe | cut -f1 | grep wmt22
wmt22-csen
wmt22-deen
wmt22-jaen
wmt22-ruen
wmt22-zhen
wmt22-frde
wmt22-hren
wmt22-liven
wmt22-uken
wmt22-ukcs
wmt22-sahru
All wmt22* IDs are loaded from the mtdata.recipes.wmt22*.yml file.
4. Download Recipes
# example: wmt22-csen
mtdata get-recipe -ri wmt22-csen -o wmt22-csen
for ri in wmt22-{csen,deen,jaen,ruen,zhen,frde,hren,liven,uken,ukcs,sahru}; do
mtdata get-recipe -ri $ri -o $ri
done
- Two datasets listed on the WMT22 page, CzEng2.0 and CCMT, require login and will not be downloaded by this tool.
- Newstest 2021 is not supported yet. See current status (#116)
mtdata get-recipe
$ mtdata get-recipe -h
usage: mtdata get-recipe [-h] -ri RECIPE_ID [-f] [-j N_JOBS] [--merge | --no-merge] [--compress] [-dd] [-dt] -o OUT_DIR
optional arguments:
-h, --help show this help message and exit
-ri RECIPE_ID, --recipe-id RECIPE_ID
Recipe ID (default: None)
-f, --fail-on-error Fail on error (default: False)
-j N_JOBS, --n-jobs N_JOBS
Number of worker jobs (processes) (default: 1)
--merge Merge train into a single file (default: True)
--no-merge Do not Merge train into a single file (default: False)
--compress Keep the files compressed (default: False)
-dd, --dedupe, --drop-dupes
Remove duplicate (src, tgt) pairs in training (if any); valid when --merge. Not recommended for large datasets. (default: False)
-dt, --drop-tests Remove dev/test sentences from training sets (if any); valid when --merge (default: False)
-o OUT_DIR, --out OUT_DIR
Output directory name (default: None)
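Going by the help text above, the options can be combined in a small wrapper. The following is only a sketch: download_recipe and the DRY_RUN variable are hypothetical names introduced here, not part of mtdata, and the default of 4 workers is an arbitrary choice. The flags themselves (-ri, -o, -j, --drop-tests) are the ones listed in the help output.

```shell
# download_recipe RECIPE_ID [N_JOBS]
# Hypothetical helper: fetches one recipe into a directory named after it,
# using N_JOBS worker processes (default 4) and removing dev/test sentences
# from the training set. Set DRY_RUN=1 to print the command instead of running it.
download_recipe() {
  ri="$1"
  jobs="${2:-4}"
  # Build the command as the function's positional parameters.
  set -- mtdata get-recipe -ri "$ri" -o "$ri" -j "$jobs" --drop-tests
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, download_recipe wmt22-deen 8 would run the download with eight workers, while DRY_RUN=1 download_recipe wmt22-deen only prints the command. Since --merge is the default, --drop-tests applies without further flags.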
5. Add/Customize a Recipe
Here is an example recipe:
- id: wmt22-deen  # (1)
  langs: deu-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:  # (2)
  - Statmt-newstest_deen-2020-deu-eng
  - Statmt-newstest_ende-2020-eng-deu
  test:  # (2)
  #- Statmt-newstest_deen-2021-deu-eng
  #- Statmt-newstest_ende-2021-eng-deu
  train:  # (3)
  - Statmt-europarl-10-deu-eng
  - ParaCrawl-paracrawl-9-eng-deu
  - Statmt-commoncrawl_wmt13-1-deu-eng
  - Statmt-news_commentary-16-deu-eng
  - Statmt-wikititles-3-deu-eng
  - Tilde-rapid-2019-deu-eng # - Tilde-rapid-2016-deu-eng
  - Facebook-wikimatrix-1-deu-eng
(1) id has to be unique.
(2) dev and test are optional. Each can be a single dataset (i.e. a string) or a list of datasets (i.e. a list of strings).
(3) train is required.
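The rules above can be tried out with a small recipe of your own. This is a sketch under assumptions: the file name mtdata.recipes.custom.yml and the id custom-deen-small are hypothetical, chosen only so that the file matches the mtdata.recipes*.yml glob and the id does not clash with the wmt22 recipes. The dataset IDs are taken from the example above. Whether desc and url may be omitted is not stated here, so both are kept.

```shell
# Write a small custom recipe into the current directory.
# File name and recipe id are hypothetical; dataset IDs come from the wmt22-deen example.
cat > mtdata.recipes.custom.yml <<'EOF'
- id: custom-deen-small
  langs: deu-eng
  desc: Small German-English recipe for experimentation
  url: https://www.statmt.org/wmt22/translation-task.html
  dev: Statmt-newstest_deen-2020-deu-eng
  train:
  - Statmt-europarl-10-deu-eng
  - Statmt-news_commentary-16-deu-eng
EOF

# With mtdata installed, the new id should then show up in the listing:
# mtdata list-recipe | cut -f1 | grep custom-deen-small
```

Here dev uses the single-string form, train uses the list form, and test is left out because it is optional.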
6. Issues / Bugs
Please report them using GitHub issues at github.com/thammegowda/mtdata .