Using the open source Moses system, it is possible to build a baseline system that is competitive with results from last year's workshop. Step-by-step instructions follow. The list may look long at first glance, but it should make it straightforward to build a machine translation system and all its components, and it should make the process of tuning, testing, and evaluating the system transparent.
Note: The build and install instructions for Moses are out-of-date. Please refer to the Moses website for an updated version.
cd giza-pp
make
mkdir -p bin
cp GIZA++-v2/GIZA++ bin/
cp GIZA++-v2/snt2cooc.out bin/
cp mkcls-v2/mkcls bin/
mkdir -p moses
svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses
cd moses
./regenerate-makefiles.sh
./configure --with-srilm=/path-to-srilm
make
mkdir -p bin/moses-scripts
###Edit moses/scripts/Makefile
TARGETDIR=/full-path-to-workspace/bin/moses-scripts
BINDIR=/full-path-to-workspace/bin
###
cd moses/scripts/
make release
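Running make release creates a time-stamped directory, bin/moses-scripts/scripts-YYYYMMDD-HHMM, with released versions of all the scripts. You will call these versions when training/tuning Moses. The output of make release should indicate this directory. Set the SCRIPTS_ROOTDIR environment variable to it: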
export SCRIPTS_ROOTDIR=/full-path-to-workspace/bin/moses-scripts/scripts-YYYYMMDD-HHMM
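Because the release directory name carries a timestamp, it changes every time make release is run. The following is a minimal sketch, not part of Moses, of how the newest release directory could be located automatically. The function name latest_scripts_dir is hypothetical, and the sketch assumes the scripts-YYYYMMDD-HHMM naming shown above, which sorts chronologically when sorted as text.

```shell
# Hypothetical helper (not part of Moses): print the newest
# scripts-YYYYMMDD-HHMM directory found under the given base directory.
# Assumes the timestamp format sorts lexicographically, so the last
# entry after sorting is the most recent release.
latest_scripts_dir() {
    base=$1
    for d in "$base"/scripts-*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] && printf '%s\n' "$d"
    done | sort | tail -n 1
}

# Example use (only meaningful after make release has been run):
# export SCRIPTS_ROOTDIR="$(latest_scripts_dir /full-path-to-workspace/bin/moses-scripts)"
```

If no release directory exists yet, the function prints nothing, so it is worth checking that SCRIPTS_ROOTDIR is non-empty before starting a long training run.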
tar xzf scripts.tgz
scripts/tokenizer.perl
scripts/lowercase.perl
scripts/wrap-xml.perl
wget ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
mkdir -p working-dir/corpus
scripts/tokenizer.perl -l fr < wmt08/training/europarl-v3.fr-en.fr > working-dir/corpus/europarl.tok.fr
scripts/tokenizer.perl -l en < wmt08/training/europarl-v3.fr-en.en > working-dir/corpus/europarl.tok.en
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl working-dir/corpus/europarl.tok fr en working-dir/corpus/europarl.clean 1 40
scripts/lowercase.perl < working-dir/corpus/europarl.clean.fr > working-dir/corpus/europarl.lowercased.fr
scripts/lowercase.perl < working-dir/corpus/europarl.clean.en > working-dir/corpus/europarl.lowercased.en
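The training step treats line N of the French file and line N of the English file as a sentence pair, so the two sides of the corpus must stay aligned line by line through tokenizing, cleaning, and lowercasing. As a cheap sanity check before training, something like the following sketch could be used. The function name same_line_count is hypothetical and only standard tools are assumed. It compares line counts only, so it cannot detect lines that are shifted but equal in number.

```shell
# Hypothetical helper (not part of Moses): succeed only if both files
# are readable and contain the same number of lines.
# Exit status: 0 = same count, 1 = different count, 2 = unreadable file.
same_line_count() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Some wc implementations pad the count with spaces; strip them.
    n1=$(wc -l < "$1" | tr -d '[:space:]')
    n2=$(wc -l < "$2" | tr -d '[:space:]')
    [ "$n1" -eq "$n2" ]
}

# Example use with the files produced above:
# same_line_count working-dir/corpus/europarl.lowercased.fr \
#                 working-dir/corpus/europarl.lowercased.en \
#     || echo "corpus sides differ in length" >&2
```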
mkdir -p working-dir/lm
scripts/tokenizer.perl -l en < wmt08/training/europarl-v3.en > working-dir/lm/europarl.tok
scripts/lowercase.perl < working-dir/lm/europarl.tok > working-dir/lm/europarl.lowercased
/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm
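The language model written by ngram-count is a text file in ARPA format, whose header lists the number of n-grams for each order on lines of the form "ngram N=count". A quick way to confirm that the model really contains 5-grams is to read that header. The sketch below assumes the standard ARPA layout and an uncompressed file. The function name arpa_max_order is hypothetical.

```shell
# Hypothetical helper (not part of SRILM or Moses): print the highest
# n-gram order declared in the header of an ARPA-format language model.
# Reads only the header and stops at the first "\N-grams:" section.
arpa_max_order() {
    awk '
        $1 == "ngram" && $2 ~ /^[0-9]+=[0-9]+$/ {
            split($2, kv, "=")
            if (kv[1] + 0 > max) max = kv[1] + 0
        }
        /-grams:$/ { exit }       # header is over, skip the n-gram lists
        END { print max + 0 }     # prints 0 if no header lines were found
    ' "$1"
}

# Example use with the model built above (expected to report 5 for -order 5):
# arpa_max_order working-dir/lm/europarl.lm
```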
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0
mkdir -p working-dir/tuning
scripts/tokenizer.perl -l fr < wmt08/dev/dev2006.fr > working-dir/tuning/input.tok
scripts/tokenizer.perl -l en < wmt08/dev/dev2006.en > working-dir/tuning/reference.tok
scripts/lowercase.perl < working-dir/tuning/input.tok > working-dir/tuning/input
scripts/lowercase.perl < working-dir/tuning/reference.tok > working-dir/tuning/reference
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl working-dir/tuning/input working-dir/tuning/reference moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir working-dir/tuning --rootdir bin/moses-scripts/scripts-YYYYMMDD-HHMM
scripts/reuse-weights.perl working-dir/tuning/moses.ini < working-dir/model/moses.ini > working-dir/tuning/moses.weight-reused.ini
mkdir -p working-dir/evaluation
scripts/tokenizer.perl -l fr < wmt08/devtest/devtest2006.fr > working-dir/evaluation/devtest2006.input.tok
scripts/tokenizer.perl -l en < wmt08/devtest/devtest2006.en > working-dir/evaluation/devtest2006.reference.tok
scripts/lowercase.perl < working-dir/evaluation/devtest2006.input.tok > working-dir/evaluation/devtest2006.input
scripts/lowercase.perl < working-dir/evaluation/devtest2006.reference.tok > working-dir/evaluation/devtest2006.reference
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/filter-model-given-input.pl working-dir/evaluation/filtered.devtest2006 working-dir/tuning/moses.weight-reused.ini working-dir/evaluation/devtest2006.input
moses/moses-cmd/src/moses -config working-dir/evaluation/filtered.devtest2006/moses.ini -input-file working-dir/evaluation/devtest2006.input > working-dir/evaluation/devtest2006.output
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/train-recaser.perl -train-script bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -ngram-count /path-to-srilm/bin/i686/ngram-count -corpus working-dir/lm/europarl.tok -dir recaser
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/recase.perl -model recaser/moses.ini -in working-dir/evaluation/devtest2006.output -moses moses/moses-cmd/src/moses > working-dir/evaluation/devtest2006.output.recased
scripts/detokenizer.perl -l en < working-dir/evaluation/devtest2006.output.recased > working-dir/evaluation/devtest2006.output.detokenized
scripts/wrap-xml.perl wmt08/devtest/devtest2006-ref.en.sgm en < working-dir/evaluation/devtest2006.output.detokenized > working-dir/evaluation/devtest2006.output.sgm
perl mteval-v11b.pl -r wmt08/devtest/devtest2006-ref.en.sgm -t working-dir/evaluation/devtest2006.output.sgm -s wmt08/devtest/devtest2006-src.fr.sgm -c
Supported by the EuroMatrixPlus project (P7-IST-231720-STP), funded by the European Commission under Framework Programme 7.