ACL 2005 WORKSHOP ON
BUILDING AND USING PARALLEL TEXTS:
DATA-DRIVEN
MACHINE TRANSLATION AND BEYOND
June 29-30, 2005
http://www.statmt.org/wpt05/
Workshop Program
The goal of this workshop is to provide a forum for researchers
working on problems related to the creation and use of parallel
text. Recent events have demonstrated once again the importance of
inter-language communication across a broad range of languages. This
reinforces the need for advances in machine translation (MT) and
multi-lingual processing tools, especially for languages with scarce
resources.
This is a two-day workshop featuring two tracks:
- Building and Using Parallel Texts for Languages
with Scarce Resources (day 1)
- Exploiting Parallel Texts for Statistical Machine
Translation (day 2)
Each track features a shared task that allows participants to
compare their results on a common benchmark. Although participation
is not required, we encourage authors to take part in the shared
tasks for benchmarking purposes.
TRACK DESCRIPTIONS
1. BUILDING AND USING PARALLEL TEXTS FOR LANGUAGES WITH SCARCE RESOURCES
The aim of this track is to bring together researchers involved in the
study of creating and using parallel corpora for minority
languages. The track will therefore be centered around issues related
to manual/automatic collection of parallel corpora, studies in the
"import" of knowledge from a well-studied language via parallel
alignments, and evaluations of the quality of collected corpora or the
quality of the tools derived from these corpora.
We invite submissions of papers addressing any of the following issues:
- Construction of parallel corpora, including the automatic
identification and harvesting of parallel corpora from the Web
- Tools for processing parallel corpora, including automatic
sentence alignment, word alignment, phrase alignment, detection of
omissions and gaps in translations, and others
- Methods to evaluate the quality of parallel corpora and word alignments
- Using parallel corpora for the derivation of language processing tools
in new languages
- Using parallel corpora for automatic corpus annotation (e.g. word
sense disambiguation)
- Using parallel corpora for cross-language information retrieval and
extraction
- The quality of language resources and systems that can be constructed
with small amounts of parallel text, and how these scale up with the
amount of text available
- The role of external knowledge sources (e.g. bilingual dictionaries)
in building resources and systems relying on parallel texts
- Machine learning techniques for building and exploiting parallel texts
(e.g. using small amounts of human-aligned parallel text to bootstrap
large aligned corpora; active selection of data based on usefulness
for different tasks)
While we invite submissions addressing any of the above topics, or related
issues, we particularly welcome work involving parallel corpora addressing
languages with scarce resources.
Shared task
In addition to regular paper presentations, the track will also include
a shared task for the evaluation of various word alignment techniques.
Word alignment represents an important step in exploiting parallel corpora,
and yet there is no common evaluation framework for such systems. This
shared task follows on the success of the word alignment task that took
place as part of the NAACL 2003 workshop on parallel text. This year's edition will be
distinct in that it will focus on Inuktitut-English and Romanian-English
alignment. This fits into the theme of our track, since neither Inuktitut
nor Romanian is a widely studied language, and there are relatively few online
resources and tools available.
Teams that participate in the alignment exercise will be provided with
training data for each language pair and development data taken from the
gold standard data in order to build their systems. Thereafter they will
be provided with the unaligned gold standard data and asked to submit their
proposed alignments in a short time frame. There will be two tracks
for each language pair, one for teams that augment the training data with
additional resources, and another for those that only use the training
data. The resulting alignments will be evaluated relative to the previously
mentioned gold standard data prior to the workshop. Short papers describing
systems participating in this shared task and all evaluation methodologies
employed will constitute a separate section in the workshop proceedings.
A more detailed description; training, development, and test data; and
a number of other related resources will be made available from
http://www.cs.unt.edu/~rada/wpt05.
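
To make concrete what evaluating proposed alignments against gold
standard data can involve, here is a minimal sketch in Python. This
announcement does not state which measures will be used; the sketch
assumes precision, recall, and alignment error rate (AER) computed over
"sure" and "probable" gold links, which are measures commonly reported
in word alignment evaluation. The function name alignment_scores and
the representation of a link as a pair of word positions are
hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple

# A link pairs a source word position with a target word position.
# This representation is an assumption made for illustration.
Link = Tuple[int, int]


def alignment_scores(
    proposed: Iterable[Link],
    sure: Iterable[Link],
    probable: Iterable[Link] = (),
) -> Dict[str, float]:
    """Score proposed links against gold sure/probable links.

    Assumed measures (not taken from the announcement):
      precision = |A & P| / |A|
      recall    = |A & S| / |S|
      AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    where A is the proposed set, S the sure set, and P the probable
    set, with every sure link also counted as probable.
    """
    a = set(proposed)
    s = set(sure)
    p = set(probable) | s  # sure links are treated as probable too
    if not a or not s:
        raise ValueError("proposed and sure link sets must be non-empty")

    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return {"precision": precision, "recall": recall, "aer": aer}


if __name__ == "__main__":
    # Toy sentence pair: three sure gold links and one probable link.
    gold_sure = {(1, 1), (2, 2), (3, 3)}
    gold_probable = {(3, 4)}
    system_output = {(1, 1), (2, 2), (3, 4), (4, 4)}
    print(alignment_scores(system_output, gold_sure, gold_probable))
```

In this toy case three of the four proposed links are at least
probable, and two of the three sure links are recovered. Scoring a
whole test set would pool the counts over all sentence pairs before
taking the ratios; the sketch handles a single pair only.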
2. EXPLOITING PARALLEL TEXTS FOR STATISTICAL MACHINE TRANSLATION
The focus of this track is to use parallel corpora for machine
translation.
Translating documents from foreign languages into English (or between
any two languages) by computer is one of the oldest goals in
computational linguistics. Now, armed with vast amounts of digitally
available translated text and powerful computers, we are witnessing
significant progress toward achieving that goal. Statistical methods
allow the analysis of parallel text corpora and the automatic
construction of machine translation systems. Already, for some
language pairs such as Chinese-English or Arabic-English, statistical
machine translation (SMT) systems built at research labs outperform
commercial systems.
Recent experimentation has shown that the performance of SMT systems
varies greatly with the source language. In this workshop we would
like to encourage researchers to investigate ways to improve the
performance of SMT systems for diverse languages, including
morphologically complex languages (e.g., Finnish) and languages with
partially free word order (e.g., German). These issues lie on the border
of linguistic analysis and statistical modeling, and the ACL
conference is the most appropriate forum to investigate them, as ACL
has a long tradition of hosting high-quality research in both areas.
Topics of interest include, but are not limited to:
- word-based, chunk-based, phrase-based, syntax-based SMT
- using comparable corpora for SMT
- using morphological and POS information for SMT
- integration of rule-based MT and statistical MT
- decoding
- error analysis
In addition to submissions on the topics listed above, this track of
the workshop features a shared task, and we encourage participants to
evaluate their approaches on that task. The shared task is to evaluate
your approach to machine translation (see the list of topics of
interest above) on the Europarl corpus.
A more detailed description of the shared task, the test and training
corpora, a freely available MT system, and a number of other resources
are available from
http://www.statmt.org/wpt05/mt-shared-task/
SUBMISSION INFORMATION
Submissions will consist of regular full papers of max. 8 pages,
formatted following the ACL 2005 guidelines. Authors of regular
full papers will be required to indicate a track for their submission.
In addition, teams participating in the shared tasks will be invited
to submit short papers (max. 4 pages) describing their systems.
Both submission and review processes will be handled electronically.
IMPORTANT DATES
Regular paper submissions                  April 10
(shared task) Results submissions          April 10
(shared task) Short paper submissions      April 17
Notification (short and regular papers)    May 4
Camera-ready papers                        May 15
ORGANIZERS
Philipp Koehn (University of Edinburgh)
Joel Martin (National Research Council of Canada)
Rada Mihalcea (University of North Texas)
Christof Monz (University of Maryland)
Ted Pedersen (University of Minnesota, Duluth)
CONTACT
For questions, comments, etc. please send email to
wpt05@umiacs.umd.edu.
PROGRAM COMMITTEE
Lars Ahrenberg (Linkoping University)
Bill Byrne (University of Cambridge)
Chris Callison-Burch (University of Edinburgh)
Nicoletta Calzolari (University of Pisa)
Francisco Casacuberta (University of Valencia)
David Chiang (University of Maryland)
Mona Diab (Columbia University)
George Foster (Canada National Research Council)
Alexander Fraser (ISI/University of Southern California)
Pascale Fung (Hong Kong University of Science and Technology)
Rob Gaizauskas (University of Sheffield)
Ulrich Germann (University of Toronto)
Dan Gildea (University of Rochester)
Jan Hajic (Charles University)
Andrew Hardie (University of Lancaster)
Rebecca Hwa (University of Pittsburgh)
Nancy Ide (Vassar College)
Kevin Knight (ISI/University of Southern California)
Greg Kondrak (University of Alberta)
Roland Kuhn (Canada National Research Council)
Shankar Kumar (Johns Hopkins University)
Philippe Langlais (University of Montreal)
Alon Lavie (Carnegie Mellon University)
Lori Levin (Carnegie Mellon University)
Daniel Marcu (ISI/University of Southern California)
Tony McEnery (University of Lancaster)
Bridget McInnes (University of Minnesota)
Magnus Merkel (Linkoping University)
Bob Moore (Microsoft Research)
Hermann Ney (RWTH Aachen)
Maria das Gracas Volpe Nunes (University of Sao Paulo)
Franz-Josef Och (Google)
Kemal Oflazer (Sabanci University)
Miles Osborne (University of Edinburgh)
Andrei Popescu-Belis (University of Geneva)
Katharina Probst (CMU)
Amruta Purandare (University of Pittsburgh)
Florence Reeder (MITRE)
Philip Resnik (University of Maryland)
Antonio Ribeiro (European Commission Joint Research Council)
Michel Simard (Xerox)
Kevin Scannell (St. Louis University)
Libin Shen (University of Pennsylvania)
Eiichiro Sumita (ATR Spoken Language Translation Research Lab)
Joerg Tiedemann (University of Groningen)
Christoph Tillmann (IBM)
Hajime Tsukada (NTT Communication Science Laboratories)
Dan Tufis (Research Institute for AI of the Romanian Academy)
Jean Veronis (Universite de Provence)
Michelle Vanni (Army Research Lab)
Stephan Vogel (Carnegie Mellon University)
Clare Voss (Army Research Lab)
Taro Watanabe (ATR Spoken Language Translation Research Laboratories)
Dekai Wu (Hong Kong University of Science and Technology)