The efficiency task measures latency, throughput, memory consumption, and size of machine translation on CPUs and GPUs. Participants provide their own code and models using standardized data and hardware. This is a continuation of the WMT 2021 Efficiency Shared Task.
We listened to the survey results: the task is to translate English to German under the constrained condition of the WMT 2021 news task. Your system must be distilled from the provided teacher.
The teacher model is an ensemble of 4 transformer-big models, each with 6 encoder and 6 decoder layers. The teacher systems all use the same joint vocabulary: a 32k-token SentencePiece model with extra symbols defined for tagging purposes. The available symbols are: <blank>, <mask>, <sep>, <type:backtr>, <type:nat>, <type:unk>, <domain:news>, <domain:other>, <domain:unk>, <lang:en>, <lang:de>, <lang:unk>, <misc0>, <misc1>, <misc2>, <misc3>, <misc4>, <misc5>, <misc6>, <misc7>, <misc8>, <misc9>. This vocabulary documentation is for information only; your student may use the same vocabulary or define its own.
For convenience, we provide cleaned versions of the parallel (de+en) and monolingual (en) datasets, as well as data distilled by the teacher from the cleaned parallel and monolingual English text. See the README for more information.
You may distill other constrained data from the WMT 2021 news task, clean the data a different way, change how distillation from the teacher is performed, and of course build your own student. You may not use the data to build a better teacher; while that is possible, the survey results indicated that participants prefer to explore how to make the best system from a given teacher.
There are GPU and CPU conditions. The GPU is one NVIDIA A100 on an Oracle Cloud BM.GPU4.8 instance (we will limit your Docker container to one GPU) and the CPU is an Intel Ice Lake processor on an Oracle Cloud BM.Optimized3.36 instance.
Oracle Cloud provides $1000 in credits for research purposes, including use of GPUs. We may also be able to provide you with a machine over SSH for limited amounts of time.
Participants can choose to submit for throughput (unlimited batch size), latency (batch size 1), or ideally both. The following conditions are open for submissions:
In the throughput setting, your program is given 1 million lines of input and must output 1 million lines of translation, with total time measured. You can use one A100 GPU in the GPU setting and all 36 cores in the CPU setting. Batch size is unlimited.
In the latency setting, the test harness will provide your system with one sentence on standard input and flush, then wait for your system to print a translation on its standard output (and flush) before providing the next sentence. The latency script is an example harness, though in practice we use C++. You can use one A100 GPU in the GPU setting or one CPU core in the CPU setting. Note that Docker buffers I/O by default, so it's easiest to run the wrapper inside Docker.
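To illustrate the lockstep I/O this setting requires, here is a minimal shell sketch of such a harness. It is not the official latency script: cat stands in for a real translator invocation (e.g. /run.sh CPU-1 latency), and the real harness additionally times each sentence.

```shell
# Minimal latency-harness sketch. "cat" is a placeholder for a real
# translator command such as "/run.sh CPU-1 latency".
printf 'Hello world.\nA second sentence.\n' > input.txt

rm -f req resp
mkfifo req resp
cat <req >resp &            # placeholder translator: reads req, writes resp
exec 3>req 4<resp           # hold both pipe ends open across sentences

: > output.txt
while IFS= read -r sentence; do
  printf '%s\n' "$sentence" >&3     # send exactly one sentence, flushed
  IFS= read -r translation <&4      # wait for its translation before the next
  printf '%s\n' "$translation" >> output.txt
done < input.txt

exec 3>&- 4<&-              # close our end: EOF to the translator
wait
rm -f req resp input.txt
```

The key point is that the next sentence is only written after the previous translation has been read back, so batching across sentences is impossible.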
Results will be reported in a table showing all metrics. The presentation will include a series of Pareto frontiers comparing quality with each of the efficiency metrics.
You may perform initialization, such as decompressing models, as part of your docker start script (which will not have access to the input). The clock starts when input is provided. The large input is intended to amortize any lazy loading, which will not be subtracted.
Competitors should submit a Docker image with all of the software and model files necessary to perform translation.
The image should contain:

- /model with all the model files as defined above.
- Nothing under paths beginning with /wmt, which are reserved by the evaluation system.
- /run.sh as described below.

/run.sh $hardware $task <input >output

runs translation. The $hardware argument will be either "GPU", "CPU-1" (single CPU thread, no hyperthreads), or "CPU-ALL" (all CPU cores). The $task argument will be "latency" or "throughput". The input and output files, which will not necessarily have those names, are UTF-8 plain text separated by UNIX newlines. Each line of input should be translated to one line of output. For the latency task, we will actually run

/wmt/latency.py /run.sh CPU-1 latency <input >output

(or the same with GPU instead).
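As a sketch of how /run.sh might dispatch on these arguments: the translator binary and its flags below (/model/translate, --device, --cpu-threads, --batch-size) are hypothetical placeholders, not a required interface.

```shell
#!/bin/sh
# Hypothetical /run.sh sketch. The translator command and all flag names
# are placeholders; only the $hardware/$task argument contract is real.
hardware="${1:-CPU-ALL}"
task="${2:-throughput}"

case "$hardware" in
  GPU)     device_args="--device gpu0" ;;
  CPU-1)   device_args="--cpu-threads 1" ;;
  CPU-ALL) device_args="--cpu-threads $(nproc)" ;;
  *)       echo "unknown hardware: $hardware" >&2; exit 1 ;;
esac

# Latency mode must translate one sentence at a time; throughput mode
# is free to batch as aggressively as it likes.
if [ "$task" = "latency" ]; then
  batch_args="--batch-size 1"
else
  batch_args="--batch-size 512"
fi

# A real script would exec the translator here, e.g.:
#   exec /model/translate --model /model/student.bin $device_args $batch_args
echo "would run: /model/translate $device_args $batch_args" >&2
```

Note that standard output must carry only the translations, one line per input line; any diagnostics belong on standard error.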
As an example, here is the all-CPU throughput condition:

image_name="$(docker load -i ${image_file_path} | cut -d " " -f 3)"
container_id="$(docker run -itd ${opt_memory} --memory-swap=0 ${image_name} /bin/sh)"
(time docker exec -i "${container_id}" /run.sh CPU-ALL throughput) <input.txt >${result_directory}/run.stdout 2>${result_directory}/run.stderr
In the CPU-ALL condition, your Docker container will be able to control CPU affinity, so numactl and taskset will work (provided, of course, you include them in your container).
Multiple submissions are encouraged. You can submit multiple Docker containers and indicate which conditions to run each under. Please include your team name in the name of the Docker image file.
Post your Docker image online and send a sha512sum of the file to wmt at kheafield.com. If you need a place to upload it instead, contact us.
Submissions are due August 31, 2022 Anywhere on Earth. We follow the general paper deadlines for WMT 2022. Participants should submit system descriptions.
Kenneth Heafield
wmt at kheafield dot com