Using the open-source Moses system, it is possible to build a baseline system that is competitive with results from last year's workshop. Step-by-step instructions follow. The list may look long at first glance, but it should make it straightforward to build a machine translation system and all its components, and it should make the process of tuning, testing, and evaluating it transparent.
Note: The build and install instructions for Moses are out-of-date. Please refer to the Moses website for an updated version.
cd giza-pp
make
mkdir -p bin
cp GIZA++-v2/GIZA++ bin/
cp GIZA++-v2/snt2cooc.out bin/
cp mkcls-v2/mkcls bin/
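Before moving on, it may be worth confirming that the three binaries actually landed in bin/. The following is only a minimal sketch: missing_tools is a hypothetical helper name, not part of GIZA++ or Moses, and the usage line assumes the bin directory from the commands above.

```shell
# Hypothetical helper (not part of GIZA++ or Moses):
# print the name of every expected tool that is missing
# or not executable in the given directory.
missing_tools() {
    dir=$1
    shift
    for name in "$@"; do
        [ -x "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Usage with the directory and binary names from the steps above;
# no output means all three binaries are in place:
#   missing_tools bin GIZA++ snt2cooc.out mkcls
```

Printing the missing names instead of stopping at the first failure shows in one pass everything that still has to be copied.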
mkdir -p moses
svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses
cd moses
./regenerate-makefiles.sh
./configure --with-srilm=/path-to-srilm
make
mkdir -p bin/moses-scripts
###Edit moses/scripts/Makefile
TARGETDIR=/full-path-to-workspace/bin/moses-scripts
BINDIR=/full-path-to-workspace/bin
###
cd moses/scripts/
make release
Running make release creates a directory bin/moses-scripts/scripts-YYYYMMDD-HHMM with released versions of all the scripts. You will call these versions when training and tuning Moses. The output of make release should indicate this.
export SCRIPTS_ROOTDIR=/full-path-to-workspace/bin/moses-scripts/scripts-YYYYMMDD-HHMM
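The YYYYMMDD-HHMM part of the path has to be replaced by the timestamp of the actual release directory. If several releases exist, the newest one can be picked automatically. The sketch below assumes the directory layout shown above; latest_release_dir is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: print the newest scripts-YYYYMMDD-HHMM directory under $1.
# The timestamp format sorts lexicographically in chronological order,
# so the last glob match is the most recent release.
latest_release_dir() {
    base=$1
    newest=
    for d in "$base"/scripts-*/; do
        [ -d "$d" ] || continue   # skips the unexpanded pattern when nothing matches
        newest=${d%/}
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Usage (the workspace path is a placeholder, as in the text above):
#   SCRIPTS_ROOTDIR=$(latest_release_dir /full-path-to-workspace/bin/moses-scripts) \
#       && export SCRIPTS_ROOTDIR
```

The function returns a non-zero status when no release directory is found, so the export in the usage line is skipped instead of setting an empty variable.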
tar xzf scripts.tgz
scripts/tokenizer.perl
scripts/lowercase.perl
scripts/wrap-xml.perl
wget ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
mkdir -p working-dir/corpus
scripts/tokenizer.perl -l fr < training/europarl-v6.fr-en.fr > working-dir/corpus/europarl.tok.fr
scripts/tokenizer.perl -l en < training/europarl-v6.fr-en.en > working-dir/corpus/europarl.tok.en
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl working-dir/corpus/europarl.tok fr en working-dir/corpus/europarl.clean 1 40
scripts/lowercase.perl < working-dir/corpus/europarl.clean.fr > working-dir/corpus/europarl.lowercased.fr
scripts/lowercase.perl < working-dir/corpus/europarl.clean.en > working-dir/corpus/europarl.lowercased.en
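The trailing arguments 1 40 passed to clean-corpus-n.perl are the minimum and maximum sentence length in tokens. As a rough illustration of this kind of filter, the sketch below keeps a sentence pair only if both sides fall within the limits. It is not the Moses script itself, which may apply further checks. The name filter_pairs and the .src/.tgt output suffixes are assumptions made for this example, and the sketch assumes the text contains no tab characters.

```shell
# Sketch of a length filter for a parallel corpus (not the Moses script).
# Arguments: source file, target file, output prefix, min tokens, max tokens.
# Writes the surviving pairs to <prefix>.src and <prefix>.tgt.
filter_pairs() {
    src=$1; tgt=$2; out=$3; min=$4; max=$5
    # paste joins line i of both files with a tab, so the pair stays aligned.
    paste "$src" "$tgt" |
    awk -F '\t' -v min="$min" -v max="$max" \
        -v osrc="$out.src" -v otgt="$out.tgt" '
        {
            ns = split($1, a, " ")   # token count, source side
            nt = split($2, b, " ")   # token count, target side
            if (ns >= min && ns <= max && nt >= min && nt <= max) {
                print $1 > osrc
                print $2 > otgt
            }
        }'
}
```

Filtering both sides in a single pass matters here: dropping a line from only one file would shift every later sentence pair out of alignment.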
mkdir -p working-dir/lm
scripts/tokenizer.perl -l en < training-monolingual/europarl-v6.en > working-dir/lm/europarl.tok
scripts/lowercase.perl < working-dir/lm/europarl.tok > working-dir/lm/europarl.lowercased
/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm
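The language model is written in ARPA text format, whose header lists the number of n-grams per order. A quick look at that header can confirm that the model was built with the intended order (5 in the command above). The following is a small sketch; arpa_orders is a hypothetical helper name, not an SRILM tool.

```shell
# Hypothetical helper: print the "ngram N=count" header lines
# of an ARPA-format language model file.
arpa_orders() {
    awk '/^\\data\\/                      { in_header = 1; next }
         in_header && /^ngram [0-9]+=/    { print; next }
         in_header && /^\\[0-9]+-grams:/  { exit }' "$1"
}

# Usage with the file name from the step above:
#   arpa_orders working-dir/lm/europarl.lm
```

The helper stops reading at the first n-gram section, so it returns quickly even on a large model file.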
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-model.perl -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0
mkdir -p working-dir/tuning
perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < dev/newstest2009-src.fr.sgm > dev/newstest2009.fr
perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < dev/newstest2009-ref.en.sgm > dev/newstest2009.en
scripts/tokenizer.perl -l fr < dev/newstest2009.fr > working-dir/tuning/input.tok
scripts/tokenizer.perl -l en < dev/newstest2009.en > working-dir/tuning/reference.tok
scripts/lowercase.perl < working-dir/tuning/input.tok > working-dir/tuning/input
scripts/lowercase.perl < working-dir/tuning/reference.tok > working-dir/tuning/reference
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl working-dir/tuning/input working-dir/tuning/reference moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir working-dir/tuning --rootdir bin/moses-scripts/scripts-YYYYMMDD-HHMM
scripts/reuse-weights.perl working-dir/tuning/moses.ini < working-dir/model/moses.ini > working-dir/tuning/moses.weight-reused.ini
mkdir -p working-dir/evaluation
perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < dev/newstest2010-src.fr.sgm > dev/newstest2010.fr
perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < dev/newstest2010-ref.en.sgm > dev/newstest2010.en
scripts/tokenizer.perl -l fr < dev/newstest2010.fr > working-dir/evaluation/newstest2010.input.tok
scripts/tokenizer.perl -l en < dev/newstest2010.en > working-dir/evaluation/newstest2010.reference.tok
scripts/lowercase.perl < working-dir/evaluation/newstest2010.input.tok > working-dir/evaluation/newstest2010.input
scripts/lowercase.perl < working-dir/evaluation/newstest2010.reference.tok > working-dir/evaluation/newstest2010.reference
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/filter-model-given-input.pl working-dir/evaluation/filtered.newstest2010 working-dir/tuning/moses.weight-reused.ini working-dir/evaluation/newstest2010.input
moses/moses-cmd/src/moses -config working-dir/evaluation/filtered.newstest2010/moses.ini -input-file working-dir/evaluation/newstest2010.input > working-dir/evaluation/newstest2010.output
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/train-recaser.perl -train-script bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-model.perl -ngram-count /path-to-srilm/bin/i686/ngram-count -corpus working-dir/lm/europarl.tok -dir recaser
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/recase.perl -model recaser/moses.ini -in working-dir/evaluation/newstest2010.output -moses moses/moses-cmd/src/moses > working-dir/evaluation/newstest2010.output.recased
scripts/detokenizer.perl -l en < working-dir/evaluation/newstest2010.output.recased > working-dir/evaluation/newstest2010.output.detokenized
scripts/wrap-xml.perl dev/newstest2010-ref.en.sgm en < working-dir/evaluation/newstest2010.output.detokenized > working-dir/evaluation/newstest2010.output.sgm
perl mteval-v11b.pl -r dev/newstest2010-ref.en.sgm -t working-dir/evaluation/newstest2010.output.sgm -s dev/newstest2010-src.fr.sgm -c
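Scoring compares the system output with the reference segment by segment, so the decoder output should contain exactly one line per input line. A mismatch is easy to miss until the very end of the pipeline. The check below is a minimal sketch; same_line_count is a hypothetical helper name, and the usage line assumes the file names from the evaluation steps above.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
# Returns 2 if either file cannot be read.
same_line_count() {
    a=$(wc -l < "$1") || return 2
    b=$(wc -l < "$2") || return 2
    # Arithmetic expansion strips the padding some wc implementations add.
    a=$((a))
    b=$((b))
    [ "$a" -eq "$b" ]
}

# Usage with the file names from the evaluation steps above:
#   same_line_count working-dir/evaluation/newstest2010.input \
#                   working-dir/evaluation/newstest2010.output \
#       || echo "line count mismatch" >&2
```

The same check can be applied to any file pair that must stay aligned, such as the two sides of the cleaned training corpus.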
Supported by the EuroMatrixPlus project (P7-IST-231720-STP), funded by the European Commission under Framework Programme 7.