The efficiency task measures latency, throughput, memory consumption, and size of machine translation on CPUs and GPUs. Participants provide their own code and models using standardized data and hardware. This is a continuation of the WMT 2021 Efficiency Shared Task.
We listened to the survey results: the task is to translate English to German under the constrained condition of the WMT 2021 news task. Your system must be distilled from the provided teacher.
The teacher model is an ensemble of 4 transformer-big models, each with 6 encoder and 6 decoder layers. The teacher systems all use the same joint vocabulary: a 32k-token SentencePiece model with extra symbols defined for tagging purposes. The available symbols are: <blank>, <mask>, <sep>, <type:backtr>, <type:nat>, <type:unk>, <domain:news>, <domain:other>, <domain:unk>, <lang:en>, <lang:de>, <lang:unk>, <misc0>, <misc1>, <misc2>, <misc3>, <misc4>, <misc5>, <misc6>, <misc7>, <misc8>, <misc9>. This vocabulary documentation is for information only; your student may use the same vocabulary or define its own.
For convenience, we provide cleaned versions of the parallel (de+en) and monolingual (en) datasets, as well as data distilled by the teacher from the cleaned parallel and monolingual English text. See the README for more information.
You may distill other constrained data from the WMT 2021 news task, clean the data a different way, change how distillation from the teacher is performed, and of course build your own student. You may not use the data to build a better teacher; while that is possible, the survey results indicated that participants prefer to explore how to make the best system from a given teacher.
There are GPU and CPU conditions. The GPU is one NVIDIA A100 on an Oracle Cloud BM.GPU4.8 instance (we will limit your Docker container to one GPU) and the CPU is an Intel Ice Lake processor on an Oracle Cloud BM.Optimized3.36 instance.
Oracle Cloud provides $1000 in credits for research purposes, including use of GPUs. We may also be able to provide you with a machine over SSH for limited amounts of time.
Participants can choose to submit for throughput (unlimited batch size), latency (batch size 1), or ideally both. The following conditions are open for submissions:
In the throughput setting, your program is given 1 million lines of input and must output 1 million lines of translation, with total time measured. You can use one A100 GPU in the GPU setting and all 36 cores in the CPU setting. Batch size is unlimited.
In the latency setting, the test harness will provide your system with one sentence on standard input and flush, then wait for your system to print a translation on its standard output (and flush) before providing the next sentence. The latency script is an example harness, though in practice we use C++. You can use one A100 GPU in the GPU setting or one CPU core in the CPU setting. Note that Docker buffers I/O by default, so it's easiest to run the wrapper inside Docker.
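To illustrate the lockstep I/O this setting requires, here is a minimal shell sketch of such a harness. It is not the official latency script: cat stands in for a real translator invocation (e.g. /run.sh CPU-1 latency), and the real harness additionally times each sentence.

```shell
# Minimal latency-harness sketch. "cat" is a placeholder for a real
# translator command such as "/run.sh CPU-1 latency".
printf 'Hello world.\nA second sentence.\n' > input.txt

rm -f req resp
mkfifo req resp
cat <req >resp &            # placeholder translator: reads req, writes resp
exec 3>req 4<resp           # hold both pipe ends open across sentences

: > output.txt
while IFS= read -r sentence; do
  printf '%s\n' "$sentence" >&3     # send exactly one sentence, flushed
  IFS= read -r translation <&4      # wait for its translation before the next
  printf '%s\n' "$translation" >> output.txt
done < input.txt

exec 3>&- 4<&-              # close our end: EOF to the translator
wait
rm -f req resp input.txt
```

The key point is that the next sentence is only written after the previous translation has been read back, so batching across sentences is impossible.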
Results will be reported in a table showing all metrics. The presentation will include a series of Pareto frontiers comparing quality with each of the efficiency metrics.
You may perform initialization, such as decompressing models, as part of your docker start script (which will not have access to the input). The clock starts when input is provided. The large input is intended to amortize any lazy loading, which will not be subtracted.
Competitors should submit a Docker image with all of the software and model files necessary to perform translation.
The image should contain:

- /model with all the model files as defined above.
- Nothing under paths beginning with /wmt, which are reserved by the evaluation system.
- /run.sh as described below.

/run.sh $hardware $task <input >output

runs translation. The $hardware argument will be either "GPU", "CPU-1" (single CPU thread, no hyperthreads), or "CPU-ALL" (all CPU cores). The $task argument will be "latency" or "throughput". The input and output files, which will not necessarily have those names, are UTF-8 plain text separated by UNIX newlines. Each line of input should be translated to one line of output. For the latency task, we will actually run

/wmt/latency.py /run.sh CPU-1 latency <input >output

(or the same with GPU instead).
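As a sketch of how /run.sh might dispatch on these arguments: the translator binary and its flags below (/model/translate, --device, --cpu-threads, --batch-size) are hypothetical placeholders, not a required interface.

```shell
#!/bin/sh
# Hypothetical /run.sh sketch. The translator command and all flag names
# are placeholders; only the $hardware/$task argument contract is real.
hardware="${1:-CPU-ALL}"
task="${2:-throughput}"

case "$hardware" in
  GPU)     device_args="--device gpu0" ;;
  CPU-1)   device_args="--cpu-threads 1" ;;
  CPU-ALL) device_args="--cpu-threads $(nproc)" ;;
  *)       echo "unknown hardware: $hardware" >&2; exit 1 ;;
esac

# Latency mode must translate one sentence at a time; throughput mode
# is free to batch as aggressively as it likes.
if [ "$task" = "latency" ]; then
  batch_args="--batch-size 1"
else
  batch_args="--batch-size 512"
fi

# A real script would exec the translator here, e.g.:
#   exec /model/translate --model /model/student.bin $device_args $batch_args
echo "would run: /model/translate $device_args $batch_args" >&2
```

Note that standard output must carry only the translations, one line per input line; any diagnostics belong on standard error.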
As an example, here is the all-CPU throughput condition:

image_name="$(docker load -i ${image_file_path} | cut -d " " -f 3)"
container_id="$(docker run -itd ${opt_memory} --memory-swap=0 ${image_name} /bin/sh)"
(time docker exec -i "${container_id}" /run.sh CPU-ALL throughput) <input.txt >${result_directory}/run.stdout 2>${result_directory}/run.stderr
In the CPU-ALL condition, your Docker container will be able to control CPU affinity, so numactl and taskset will work (provided, of course, you include them in your container).
Multiple submissions are encouraged. You can submit multiple Docker containers and indicate which conditions to run each under. Please include your team name in the name of the Docker image file.
Post your Docker image online and send a sha512sum of the file to wmt at kheafield.com. If you need a place to upload it instead, contact us.
Submissions are due August 31, 2022 Anywhere on Earth. We follow the general paper deadlines for WMT 2022. Participants should submit system descriptions.
Kenneth Heafield
wmt at kheafield dot com