Bandit Learning for MT is a framework for training and improving MT systems by learning from weak or partial feedback: instead of a gold-standard human-generated translation, the learner only receives feedback on a single proposed translation (this is why the feedback is called partial), in the form of a translation quality judgement (which can be as weak as a binary acceptance/rejection decision).
Amazon and the University of Heidelberg organize this shared task with the goal of encouraging researchers to investigate algorithms for learning from weak user feedback instead of from human references or post-edits, which require skilled translators. We are interested in systems that learn efficiently and effectively from this type of feedback, i.e. that learn fast and achieve high translation quality. Developing such algorithms is interesting for interactive machine learning and for learning from human feedback in NLP in general.
In the WMT task setup, the user feedback will be simulated by a service hosted on Amazon Web Services (AWS), where participants can submit translations, receive feedback, and use this feedback to train an MT model. Reference translations will not be revealed at any point; evaluations are also done via the service.
Please find all details about setup, infrastructure, baselines, and final results in the 2017 shared task description paper.
@inproceedings{wmt_bandit_learning_task_2017,
  author    = {Artem Sokolov and Julia Kreutzer and Kellen Sunderland and Pavel Danchenko and Witold Szymaniak and Hagen F\"{u}rstenau and Stefan Riezler},
  title     = {A Shared Task on Bandit Learning for Machine Translation},
  booktitle = {Proceedings of the 2nd Conference on Machine Translation {(WMT)}},
  address   = {Copenhagen, Denmark},
  month     = sep,
  year      = 2017
}
All dates are preliminary.
| Event | Date |
| --- | --- |
| Registration via e-mail | |
| Access to mock service | |
| Access to development service | |
| Leaderboard is available | |
| Online learning starts | |
| Notification of evaluation results | |
| Paper submission deadline | |
| Camera-ready deadline | |
The name bandit is inherited from a model where in each round a gambler in a casino pulls the arm of a different slot machine, called a "one-armed bandit", with the goal of maximizing their reward relative to the maximal possible reward, without a priori knowledge of the optimal slot machine. In MT, pulling an arm corresponds to proposing a translation; rewards correspond to user feedback on translation quality. Bandit learners can be seen as one-state Markov Decision Processes (MDPs), which connects them to reinforcement learning, where proposing a translation corresponds to choosing an action.
Bandit learning follows an online learning protocol, where on each of a sequence of iterations the learner receives a source sentence, predicts a translation, and receives a reward in the form of a task loss evaluation of the predicted translation. The learner does not know what the correct prediction looks like, nor what would have happened if it had predicted differently.
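The task does not prescribe a particular learning algorithm. As a sketch only (this formulation is an assumption, not part of the task definition), a common approach in bandit structured prediction minimizes the expected task loss under the model's own output distribution and estimates its gradient from the single sampled translation and the scalar feedback it received:

```latex
% Expected-loss objective for a parametrized translation model p_w(y | x),
% with task loss \Delta(y) derived from the feedback:
J(w) = \mathbb{E}_{x \sim p(x)}\,\mathbb{E}_{y \sim p_w(y \mid x)}\!\left[\Delta(y)\right]

% Stochastic (score-function) gradient from one sampled translation \tilde{y}_t,
% using only the observed feedback \Delta(\tilde{y}_t), no reference translation:
s_t = \Delta(\tilde{y}_t)\,\nabla_w \log p_w(\tilde{y}_t \mid x_t),
\qquad
w_{t+1} = w_t - \gamma_t\, s_t
```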
For t = 1, ..., T do:

1. Receive a source sentence.
2. Predict a translation.
3. Receive feedback (reward) for the predicted translation.
4. Update the model parameters.

Online interaction is done by accessing an AWS-hosted service that provides source sentences to the learner (step 1) and provides feedback (step 3) to the translation predicted by the learner (step 2). The learner updates its parameters using the feedback (step 4) and continues to the next example.
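As a minimal sketch of a participant's client-side loop, the snippet below mirrors steps 1-4. The service methods and the model interface are hypothetical placeholders chosen for illustration, not the actual WMT/AWS service API:

```python
# Sketch of the online bandit learning protocol (steps 1-4).
# `service.get_source`, `service.send_translation`, and the model methods
# are illustrative placeholders, not the real service interface.

def bandit_learning_loop(service, model, learning_rate=0.001, num_iterations=1000):
    for t in range(num_iterations):
        source = service.get_source()                    # step 1: receive source sentence
        translation = model.sample(source)               # step 2: predict (sample) a translation
        reward = service.send_translation(translation)   # step 3: receive feedback in [0, 1]
        # step 4: policy-gradient-style update (gradient ascent on expected reward),
        # using only the scalar feedback and the model's own sample
        gradient = model.log_prob_gradient(source, translation)
        model.update(step=learning_rate * reward * gradient)
```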
For training seed systems, out-of-domain parallel data shall be restricted to the German-English Europarl, News Commentary, CommonCrawl, and Rapid data from the News Translation (constrained) task; monolingual English data from the constrained task is allowed. Tuning of the out-of-domain system should be done on the 'newstest2016-deen' development set. It is recommended to use the same pre-processing as for the in-domain data (see below).
The in-domain data for online learning will be a sequence of sentences from the e-commerce domain, provided by Amazon and pre-processed with Moses scripts (removing non-printing characters, replacing and normalizing unicode punctuation, lowercasing, pre-tokenizing and tokenizing). Since the data comes from a substantially different domain, expect a large number of out-of-vocabulary terms. These data can only be accessed via the service; no reference translations will be revealed, only feedback to submitted translations is returned by the service.
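The official pipeline uses the original Moses perl scripts; as a rough Python equivalent for matching this pre-processing on your own data, one could use the sacremoses package (this substitution is an assumption, not the official setup):

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer

# Approximate the described pre-processing for German input:
# normalize/replace unicode punctuation, drop control characters,
# lowercase, and tokenize. Not the official Moses perl pipeline.
normalizer = MosesPunctNormalizer(
    lang="de",
    pre_replace_unicode_punct=True,
    post_remove_control_chars=True,
)
tokenizer = MosesTokenizer(lang="de")

def preprocess(line: str) -> str:
    line = normalizer.normalize(line)
    line = line.lower()
    return tokenizer.tokenize(line, return_str=True)

print(preprocess("Dies ist ein Beispiel – mit „Anführungszeichen“."))
```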
Simulated reward-type real-valued feedback will be based on a combination of several quality models, including automatic measures with respect to human references (pre-processed in the same way), and will be normalized to the range [0,1] ('very bad' to 'excellent'). Feedback can only be accessed via the service, and only one feedback request per source sentence is allowed.
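The exact combination of quality models is not disclosed to participants. Purely for local testing, a stand-in reward could be a sentence-level automatic metric against a held-out reference, scaled to [0,1]; the sketch below uses sacrebleu and is an assumption, not the service's actual scoring:

```python
import sacrebleu

# Illustrative stand-in for the simulated feedback: smoothed sentence-level
# BLEU against a reference, scaled from [0, 100] to [0, 1]. The real service
# combines several quality models and never reveals the reference.
def simulated_feedback(hypothesis: str, reference: str) -> float:
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference])
    return bleu.score / 100.0

print(simulated_feedback("the red shoes are on sale", "red shoes are on sale now"))
```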
The respective data samples will be the same for all participants.
The following main evaluation metrics will be used: