The efficiency task measures the latency, throughput, memory consumption, and size of machine translation systems on CPUs and GPUs. Participants provide their own code and models using standardized data and hardware. This is a continuation of the WNGT 2020 Efficiency Shared Task.
Systems should translate from English to German following the constrained condition of the 2021 news task. Using another group's constrained model is permissible with citation.
There are GPU and CPU conditions. The GPU is one A100 via Oracle Cloud BM.GPU4.8 (but we will limit your Docker container to one GPU) and the CPU is an Intel Ice Lake via Oracle Cloud BM.Optimized3.36. We went with Oracle Cloud because Ice Lake machines are generally available there, while they are still in preview on other providers.
Participants can choose to submit for latency, throughput, or ideally both.
Latency will be measured on the full GPU or one CPU core (with the rest of the CPU idle and an affinity constraint to limit to one core). The test harness will provide your system with one sentence on standard input and flush then wait for your system to provide a translation on its standard output (and flush) before providing the next sentence. The latency script is an example harness, though if systems are fast enough, we may rewrite it in C++.
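As a minimal sketch of this interaction, the harness can be imagined as a bash coprocess driver: it writes one sentence, flushes, and blocks until the system answers one line before sending the next. This is illustrative only, not the official latency script; the function and variable names are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the latency protocol, NOT the official harness:
# send one sentence, block until its translation arrives, send the next.
latency_harness() {
  local system_cmd="$1" input_file="$2" src hyp
  coproc SYS { $system_cmd; }           # launch the system under test
  while IFS= read -r src; do
    printf '%s\n' "$src" >&"${SYS[1]}"  # one sentence in (printf flushes)
    IFS= read -r hyp <&"${SYS[0]}"      # wait for exactly one line out
    printf '%s\n' "$hyp"
  done < "$input_file"
  eval "exec ${SYS[1]}>&-"              # EOF on stdin lets the system exit
  wait "$SYS_PID" 2>/dev/null
}
```

Because each read blocks until the system responds, a system that buffers its output instead of flushing per line will deadlock under this protocol.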
Throughput will be measured on the GPU or entire 36-core CPU machine. If there is interest in a single-core CPU throughput task, we can also run that.
Model size will be measured as the size of the directory /model/. Participants may not use the WMT21 source or target sentences in preparing their submission (though if you submitted to the news task, your quota of 7 Ocelot submissions is allowed).
Results will be reported in a table showing all metrics. The presentation will include a series of Pareto frontiers comparing quality with each of the efficiency metrics. We welcome participants optimizing any of the metrics.
We will not be subtracting loading time from run times. The large input is intended to amortize loading time.
Competitors should submit a Docker image with all of the software and model files necessary to perform translation.
The image should contain:
- /model with all the model files as defined above.
- Nothing under /wmt; these paths are reserved by the evaluation system.
- /run.sh as described below.

/run.sh $hardware $task <input >output

runs translation. The $hardware argument will be one of "GPU", "CPU-1" (single CPU thread, no hyperthreads), or "CPU-ALL" (all CPU cores). The $task argument will be "latency" or "throughput". The input and output files, which will not necessarily have those names, are UTF-8 plain text separated by UNIX newlines. Each line of input should be translated to one line of output. For the latency task, we will actually run

/wmt/latency.py /run.sh CPU-1 latency <input >output

(or the same with GPU instead).
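To make the calling convention concrete, here is a hypothetical skeleton of /run.sh, written as a shell function so it can be exercised directly. The argument checks follow the description above; the decoder invocation in the comment is a placeholder, not a real tool.

```shell
# Hypothetical skeleton of /run.sh (illustrative only).
run_sh() {
  hardware="$1"  # GPU, CPU-1, or CPU-ALL
  task="$2"      # latency or throughput
  case "$hardware" in
    GPU|CPU-1|CPU-ALL) ;;
    *) echo "unknown hardware: $hardware" >&2; return 1 ;;
  esac
  case "$task" in
    latency|throughput) ;;
    *) echo "unknown task: $task" >&2; return 1 ;;
  esac
  # A real submission would exec its decoder here, reading UTF-8 lines on
  # stdin and writing one translated line per input line, e.g.:
  #   exec /model/decode --device "$hardware" --mode "$task"
  # This stub just copies input to output so the plumbing is runnable.
  cat
}
```

A real /run.sh would typically branch on $hardware to select device flags and on $task to choose batching behavior (no batching across sentences in the latency condition).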
As an example, here is the single-CPU throughput condition:

image_name="$(docker load -i ${image_file_path} |cut -d " " -f 3)"
container_id="$(docker run -itd --cpuset-cpus=0 ${opt_memory} --memory-swap=0 ${image_name} /bin/sh)"
(time docker exec -i "${container_id}" /run.sh CPU-1 throughput) <input.txt >${result_directory}/run.stdout 2>${result_directory}/run.stderr
In the CPU-ALL condition, your Docker container will be able to control CPU affinity, so numactl and taskset will work (provided, of course, you include them in your container).
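For instance, a submission's own scripts could pin work to particular cores. This sketch only probes for the tool and pins a trivial command; the core number is illustrative, and the helper name is hypothetical.

```shell
# Hypothetical affinity check for the CPU-ALL condition. taskset is only
# present if your image installs util-linux (likewise numactl for NUMA).
pin_to_core0() {
  if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 sh -c 'echo pinned to core 0'
  else
    echo 'taskset not installed; add util-linux to the image'
  fi
}
pin_to_core0
```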
Participants in past editions should note the arguments to /run.sh
have changed.
Multiple submissions are encouraged. You can submit multiple Docker containers and indicate which conditions to run each under. Please include your team name in the name of the Docker image file.
Post your Docker image online and send a sha512sum of the file to wmt at kheafield.com. If you need a place to upload to instead, contact us.
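The checksum can be produced and verified with coreutils. The file names below are stand-ins for your actual image file, and the first line only fabricates a dummy file so the example is self-contained:

```shell
# Illustrative only: compute and verify a sha512 of the file you upload.
printf 'stand-in image contents\n' > image.tar  # stand-in for your image
sha512sum image.tar > image.tar.sha512          # send this checksum by email
sha512sum -c image.tar.sha512                   # downloaders verify the copy
```

Running sha512sum -c against the recorded checksum after download confirms the uploaded file arrived intact.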
Kenneth Heafield
wmt at kheafield dot com