Statistical Machine Translation

Commented Bibliography

The following is a comprehensive bibliography of research publications in the field of statistical machine translation. The current version of this page is automatically converted from the LaTeX sources of the "Further Readings" sections of the book. We hope to extend it with more recent references (2009 onward) to keep it up to date.

Some citations have links to pdf files. This web page allows adding such links — any help with this is very much appeciated. You may also add cluster ids for Google Scholar.

Chapter 1: Introduction

1.1. General introduction
For non-statistical methods to machine translation, refer to the books by Arnold et al (1994) and by Hutchins and Somers (1992). For the history on machine translation, Hutchins (2007) gives a concise overview. Gaspari and Hutchins (2007) reports on the recent rise of online machine translation services and usage patterns. See also the famous alpac report (Pierce and Carroll, 1966). A recent survey of work in statistical machine translation is presented by Lopez (2008).

1.2. Statistical machine translation and other methods
Statistical machine translation is related to other data-driven methods in machine translation, such as the earlier work on example-based machine translation (Somers, 1999). Contrast this to systems that are based on hand-crafted rules. The distinctions between these categories are getting blurred. Borrowing from different approaches is called the hybrid approach to machine translation. Statistical methods, especially the use of the language model, may be integrated in rule-based systems (Knight et al, 1994; Habash and Dorr, 2002). Parsing and translation decisions may be learned from data (Hermjakob and Mooney, 1997). Multiple scoring functions may decide between the alternatives generated by the transfer-based system (Carl, 2007). Statistical methods may be used to learn rules for traditional rule-based systems (Llitjós and Vogel, 2007). Conversely, translations from rule-based systems may be used as additional phrase translations in statistical systems (Chen et al, 2007). Rule-based systems may be used to generate training data for statistical methods (Hu et al, 2007), essentially having the statistical method relearn the rule-based system (Dugast et al, 2008). Often statistical machine translation is used as a fall-back for methods that frequently fail to produce output, but are more accurate when they do Chai et al (2006). Statistical machine translation models may be also used to automatically post-edit the output of interlingual (Seneff et al, 2006) or rule-based systems (Simard et al, 2007). Additional markup from the rule-based system may be exploited for tighter integration (Ueffing et al, 2008). Such post-editing may alleviate the need to customize rule-based systems to a specific domain (Isabelle et al, 2007). Related to this, refer to the further readings on system combination in Chapter 9. Labaka et al (2007) compare different machine translation approaches for the Basque–Spanish language pair.

1.3. Available software
Popular implementations of statistical machine translations are Pharaoh (Koehn, 2004), which was succeeded by the open source Moses (Koehn et al, 2007). Phramer (Olteanu et al, 2006) is a Java implementation of a phrase-based statistical system. Thot (Ortiz-Martínez et al, 2005) is a toolkit to train phrase-based models for these decoders. Available implementations of tree-based decoders are Hiero (Chiang et al, 2005) and SAMT (Zollmann et al, 2007). GIZA++ (Och and Ney, 2003) and MTTK (Deng and Byrne, 2006) are tools for word alignment, Hunalign ( (Varga et al, 2005) is a popular tool for sentence alignment. Cattoni et al (2006) present a web demo for their phrase-based system. An online tool for machine translation evaluation is presented by Eck et al (2006). Yawat is a web-based tool to view and create word alignments (Germann, 2007; Germann, 2008).

1.4. Competitions
The development of statistical machine translation systems is driven by translation competitions, most notable annual competitions organized by DARPA on Arabic–English and Chinese–English in an open news domain (since 2001). The International Workshop on Spoken Language Translation (IWSLT) (Akiba et al, 2004; Eck and Hori, 2005; Paul, 2006; Fordyce, 2007) is an annual competition on translating Asian languages and Arabic into English in a more limited speech travel domain. The WMT campaign is a competition on European languages mostly using the European Parliament proceedings (Koehn and Monz, 2005; Koehn and Monz, 2006; Callison-Burch et al, 2007; Callison-Burch et al, 2008).

1.5. Research groups
Statistical machine translation models have been developed by a number of research groups, such as:

1.6. Speech translation
There has been increasing interest in integrating speech recognition and machine translation (Zhang et al, 2004). In speech-to-speech translation systems, a speech generator needs to be added as final step (Sumita et al, 2007). Vilar et al (2005) gives an overview of the work of integrating speech recognition and machine translation in the TC-Star project. Often n-best lists are passed from the speech recognizer to the translation system (Lee et al, 2007). Interfacing the output of a speech recognizer and a machine translation system raises problems such as sentence segmentation and punctuation prediction (Matusov et al, 2006), adding this to the pipeline may require extension of the lattice that is passed to the machine translation component (Bertoldi et al, 2007). For limited domains, real-time speech-to-speech systems have been implemented and deployed in the field, such as for force protection (Bach et al, 2007). It is not too far of a step to simultaneously translate speech into multiple languages (Pérez et al, 2007; Pérez et al, 2007).

1.7. Computer aided translation
By giving human translators access to the inner workings of machine translation system, they may fix errors at various stages, such as changing the source sentences or its linguistic analysis (Varga and Yokoyama, 2007). The TransType project (Langlais et al, 2000; Foster et al, 2002; Bender et al, 2005) developed an interactive translation tool which predicts the most appropriate extension of a partial translation by quick re-translation based on user input (Tomás and Casacuberta, 2006). Word graphs allow for quicker re-translations (Och et al, 2003; Civera et al, 2004; Civera et al, 2006) and confidence metrics indicate how much should be presented to the user as reliable prediction (Ueffing and Ney, 2005; Ueffing and Ney, 2007). Macklovitch (2004) describes how users interacted with the TransType tool. Conversely, the input to a translation system may be automatically examined for phrases that are difficult to translate (Mohit and Hwa, 2007). A user of a translation tool would like to have types of information, such as suggested translations for idioms, unknown words, and names (Abekawa and Kageura, 2007). Macklovitch et al (2005) presents a tool that visualizes the user interactions. Human post-editing data may be mined to improve the performance of machine translation, as shown for transfer-based systems (Llitjos et al, 2007). Machine translation may also be used for interactive tutoring tools for foreign language learners (Wang and Seneff, 2007). Macklovitch (1994) shows how alignment methods may be used to spot error in human translations.

Chapter 2: Words, Sentences, Corpora

2.1. General background
There are several textbooks on natural language processing that may serve as background to the material presented here. Good general introductions are given by Manning and Schütze (1999) as well as Jurafsky and Martin (2008).

2.2. Collecting parallel corpora
The web is one of the main sources for parallel corpora today. Resnik (1999) describes a method to automatically find such data. Fukushima et al (2006) use a dictionary to detect parallel documents, while Li and Liu (2008) use a number of criteria such as similarity of the URL and page content. Acquiring parallel corpora, however, typically requires some manual involvement (Martin et al, 2003; Koehn, 2005), including the matching of documents (Utiyama and Isahara, 2003). Uchiyama and Isahara (2007) report on the efforts to build a Japanese–English patent corpus and Macken et al (2007) on efforts on a braod-based Dutch–English corpus. A discussion of the pitfalls during the construction of parallel corpora is given by Kaalep and Veskis (2007). Parallel corpora may also be built by translating text for this purpose (Germann, 2001). It may be useful to focus on the most relevant new sentences, using methods such active learning (Majithia et al, 2005). Translation memories may also be a useful training resource (Langlais and Simard, 2002). Other methods focus on fishing the web for the translation of particular terms (Nagata et al, 2001) or noun phrases (Cao and Li, 2002). It is not clear, if it matters in which translation direction the parallel corpus was constructed, of if both sides were translated from a third language. Halteren (2008) shows that it is possible to reliably detect the source language in English texts from the European Parliament proceedings, so the original source language does have some effect.

2.3. Pivot languages
Machine translation systems for a language pair may also be generated when resources in connection to a pivot (or bridge) language exist, for instance parallel corpora from each language into the pivot language (Callison-Burch and Osborne, 2003). Pivoting works better if the bridge language is closely related to the other two languages (Babych et al, 2007). It may be better to combine translation tables than simply tying machine translation systems together (Utiyama and Isahara, 2007). By using pivot languages in triangulation, existing phrase translation tables may be enriched, leading to better translation quality (Wu and Wang, 2007), especially when using multiple pivot languages (Cohn and Lapata, 2007). When constructing translation dictionaries via a bridge language (Tsunakawa et al, 2008), additional resources such as ontologies may help (Varga and Yokoyama, 2007).

2.4. Sentence alignment
Sentence alignment was a very active field of research in the early days of statistical machine translation. We present in this book a method based on sentence length, measured in words (Brown et al, 1991; Gale and Church, 1991; Gale and Church, 1993) or characters (Church, 1993). Other methods may use alignment chains (Melamed, 1996; Melamed, 1999), model omissions (Melamed, 1996), distinguish between large-scale segmentation of text an detailed sentence alignment (Simard and Plamondon, 1996), apply line detection method from image processing to detect large-scale alignment patterns (Chang and Chen, 1997; Melamed, 1997). Kay and Röscheisen (1993) propose an iterative algorithm that uses spelling similarity and word co-occurrences to drive sentence alignment. Several researchers proposed including lexical information (Chen, 1993; Dagan et al, 1993; Utsuro et al, 1994; Wu, 1994; Haruno and Yamazaki, 1996; Chuang and Chang, 2002; Kueng and Su, 2002; Moore, 2002; Nightingale and Tanaka, 2003; Aswani and Gaizauskas, 2005), content words (Papageorgiou et al, 1994), numbers and n-grams (Davis et al, 1995). Sentence alignment may also be improved by a third language in multilingual corpora (Simard, 1999). More effort is needed to align very noisy corpora (Zhao et al, 2003). Different sentence alignment methods are compared by Singh and Husain (2005). Xu et al (2006) propose a method that iteratively performs binary splits of a document to obtain a sentence alignment. Enright and Kondrak (2007) use a simple and fast method for document alignment that relies of overlap of rare but identically spelled words, which are mostly cognates, names, and numbers.

2.5. Comparable corpora
Parallel sentences may also be mined from comparable corpora (Munteanu and Marcu, 2005), such as news stories written on the same topic in different languages. Methods have been proposed to extract matching phrases (Tanaka, 2002) or web pages (Smith, 2002) from such large collections. Munteanu and Marcu (2002) uses suffix trees, and in later work log-likelyhood ratios (Munteanu et al, 2004), to detect parallel sentences. Instead of full sentences, parallel sentence fragments may be extracted from comparable corpora (Munteanu and Marcu, 2006). Quirk et al (2007) propose a generative model for the same task. Extraction of paraphrased sentences from monolingual data may be done using similar methods, such as word overlap and tf/idf (Nelken and Shieber, 2006).

2.6. Dictionary
For language pairs for which no parallel corpora are available but translation dictionaries exist, basic gloss translation may be improved by glossing source corpora and using frequency statistics to restructure the gloss, thus creating a parallel corpus from which a translation model may be learned (Pytlik and Yarowsky, 2006).

2.7. Corpus cleaning
Statistical machine translation models are generally assumed to be fairly robust to noisy data, such as data that includes misalignments. However, data cleaning has been shown to help (Vogel, 2003). Often, for instance in the case of news reports that are rewritten for a different audience during translation, documents are not very parallel, so the task of sentence alignment becomes more of a task of sentence extraction (Fung and Cheung, 2004; Fung and Cheung, 2004). For good performance it has proven crucial, especially when only small amounts of training data are available, to exploit all of the data, may it be by augmenting phrase translation tables to include all words or breaking up sentences that are too long (Mermer et al, 2007).

2.8. Domain adaptation
Machine translation systems are often built for very specific domains, such as movie and television subtitles (Volk and Harder, 2007). A translation model may be trained only from sentences in a parallel corpus that are similar to the input sentence (Hildebrand et al, 2005), which may be determined by language model perplexity (Yasuda et al, 2008). Also, a domain-specific language model may be obtained by including only sentences that are similar to the ones in the target domain (Sethy et al, 2006). We may detect such sentences using the tf/idf method common in information retrieval and then boost the count of these sentences (et al, 2007). In a more graded approach, sentences may be weighted by how much match the target domain (Bach et al, 2008). Instead of discarding some of the training data, the mixture-model approach weights sub-models trained on data from different domains and combines them (Foster and Kuhn, 2007). Mixture models may also used to automatically cluster domains (Civera and Juan, 2007). Such sub-models may be combined as components in the standard log-linear model (Koehn and Schroeder, 2007). When dealing with multiple domain-specific translation models, input sentences need to be classified to the correct domain, which may done using a language model or information retrieval approach (Xu et al, 2007). If the classifier returns proportions on how well the input falls into some of the classes, this may be used to dynamically weight the domain models (Finch and Sumita, 2008). Wu et al (2008) present methods that exploit domain-specific monolingual corpora and dictionaries. Related to domain adaptation is the problem of adapting machine translation systems to different regional language uses (Cheng et al, 2004).

2.9. Truecasing
Truecasing with an HMM is discussed by Lita et al (2003), who mainly relies on a language model. A machine learning approach allows the integration of additional features, such as properties of the source language (Wang et al, 2006).

Chapter 3: Probability Theory

A good introduction into probability theory and information is given by Cover and Thomas (1991). For an application of probabilistic methods to the related field of speech recognition, see the book by Jelinek (1998).

Chapter 4: Word-Based Models

4.1. Word alignment based on co-occurence
Early work in word alignment focused on co-occurence statistics to find evidence for word associations (Kaji and Aizono, 1996). These methods do not model a word alignment, and hence may find evidence for the alignment of a word to multiple translations, a problem called indirect association, which may be overcome with enforcing one-to-one alignments (Melamed, 1996). Kumano and Hirakawa (1994) augment this method with an existing bilingual dictinary. Sato and Nakanishi (1998) use a maximum entropy model for word associations. Ker and Chang (1996) groups words together into sense classes from a thesaurus to improve word alignment accuracy. Co-occurence counts may also be used for phrase alignment, although this typically requires more efficient data for storing all phrases (Cromieres, 2006). Chatterjee and Agrawal (2006) extends a recency vector approach (Fung and McKeown, 1994) with additional constraints. Lardilleux and Lepage (2008) iteratively match the longest common subsequences from sentence pairs and align the remainder.

4.2. IBM models
Our presentation of the IBM models follows the description by Brown et al (1993), who originally presented the statistical machine translation approach in earlier papers (Brown et al, 1988; Brown et al, 1990). Wu and Xia (1994) applies IBM Model 1 to a Chinese–English corpus. For an introduction see also introductions by Knight (1997); Knight (1999). Note that our presentation downplays the noisy channel model to give a simpler presentation. During a 1999 Johns Hopkins University Workshop, the IBM models were implemented in a toolkit called GIZA (Al-Onaizan et al, 1999), later refined into GIZA++ by Och and Ney (2000). GIZA++ is open source and widely used. Instead of hill-climbing to the Viterbi alignment, algorithms such as Estimation of Distributions may be employed (Rodríguez et al, 2006). The stochastic modelling approach for translation is described by Ney (2001). Several reports show how statistical machine translation allows for rapid development with limited resources (Foster et al, 2003; Oard and Och, 2003).

4.3. Symmetrization
Symmetrizing IBM model alignments was first proposed by Och and Ney (2003) and may be improved by already symmetrizing during the IBM Model training (Matusov et al, 2004), or by explicitly modeling the agreement between the two alignments and optimizing it during with EM training (Liang et al, 2006). Different word alignments obtained with various IBM Models and symmetrization methods may also be combined using a maximum entropy approach (Ayan and Dorr, 2006; Ganchev et al, 2008). Crego and Habash (2008) use constraints over syntactic chunks to guide symmetrization.

4.4. Extensions to IBM models
A variation on the IBM models is the HMM model which uses relative distortion but not fertility (Vogel et al, 1996). This model was extended by treating jumps to other source words differently from repeated translations of the same source word (Toutanova et al, 2002), and conditioning jumps on the source word (He, 2007). IBM models have been extended using maximum entropy models (Foster, 2000) to include position (Foster, 2000), part-of-speech tag information (Kim et al, 2000), even in the EM training algorithm (García-Varea et al, 2002; García-Varea et al, 2002). Improvements have also been obtained by adding bilingual dictionaries (Wu and Wang, 2004) and context vectors estimated from monolingual corpora (Wang and Zhou, 2004), lemmatizing words (Dejean et al, 2003; Popovic and Ney, 2004; Pianta and Bentivogli, 2004), interpolating lemma and word aligment models (Zhang and Sumita, 2007), as well as smoothing (Moore, 2004). Mixture models for word translation probabilities have been explored to automatically learn topic-dependent translation models (Zhao and Xing, 2006; Civera and Juan, 2006). Packing words that typically occur in many-to-one alignments into a single token may improve alignment quality (Ma et al, 2007).

4.5. Other iterative methods for word alignment, use of linguistic annotation
Word alignment methods have evolved into iterative algorithms, for instance the competitive linking algorithm by Melamed (1995); Melamed (1996); Melamed (1997); Melamed (2000) or bilingual bracketing (Wu, 1997). Tufiş (2002) extends a simple co-occurence method to align words. These methods have been extended to exploit part-of-speech information (Chang and Chen, 1994; Tiedemann, 2003) in constraint methods (Tiedemann, 2004), translation divergences (Dorr et al, 2002), compositionality constraints (Simard and Langlais, 2003), and syntactic constraints (Cherry and Lin, 2003; Lin and Cherry, 2003; Zhao and Vogel, 2003). Fraser and Marcu (2005) improve word alignments by stemming words in input and output language, thus generalizing over morphological variants. Syntactic constraints may derive from formal criteria of obtaining parallel tree structures, such as the ITG constraint, or from syntactic relationships between words on either side (Cherry and Lin, 2006). Such constraints may be modeled as priors in the generative model (Deng and Gao, 2007).

4.6. Targeting word alignments to syntax-based models
Syntax-based machine translation models may have different requirements for word alignment. DeNero and Klein (2007) consider the affect of word alignment on rule extraction in a HMM alignment model. Rules from a syntax-based machine translation model may be also used to improve word alignments using a EM realignment method (May and Knight, 2007).

4.7. Discriminative approaches to word alignment
The machine learning community has discovered word alignment as an interesting structured prediction problem. Training translation models from sentences annotated with correct word alignments, results are typically much better that the unsupervised approach (Callison-Burch et al, 2004), but such data does not exist in large quantities. Discriminative word alignment methods typically generate statistics over a large unlabeled corpus which may have been aligned with some baseline method such as the IBM models, which form the basis for features that are optimized during machine learning over a much smaller labeled corpus. Discriminative approaches may be use the perceptron algorithm (Moore, 2005; Moore et al, 2006), maximum entropy models (Ittycheriah and Roukos, 2005), neural networks (Ayan et al, 2005), max-margin methods (Taskar et al, 2005), boosting (Wu and Wang, 2005; Wu et al, 2006), support vector machines (Cherry and Lin, 2006), conditional random fields (Blunsom and Cohn, 2006; Niehues and Vogel, 2008) or MIRA (Venkatapathy and Joshi, 2007). Such methods allow the integration of features such as a more flexible fertility model and interactions between consecutive words (Lacoste-Julien et al, 2006). Especially smaller parallel corpora benefit from more attention to less frequent words (Zhang et al, 2005). Discriminative models open a path to add additional features such as ITG constraint (Chao and Li, 2007). Related to the discriminative approach, posterior methods use agreement in the n-best alignments to adjust alignment points (Kumar and Byrne, 2002).

4.8. Discriminative symmetrization
Starting an word alignment resulting from IBM models, additional features may be defined to assess each alignment points. Fraser and Marcu (2006) use additional features during symmetrization, either to be used in re-ranking or integrated into the search. Such features may form the basis for a classifier that adds one alignment points at at time (Ren et al, 2007), possibly based on a skeleton of highly likely alignment points (Ma et al, 2008), or deletes alignment points one at a time from the symmetrized union alignment (Fossum et al, 2008). Fraser and Marcu (2007) present a generative model that allows many-to-many alignments which is trained from unlabeled data, but may be improved with small amounts of labeled data.

4.9. Combining word aligners
Tufiş et al (2006) combine different word aligners with heuristics. Schrader (2006) combines statistical methods with manual rules. Elming and Habash (2007) combine word aligners that operate under different morphological analysis of one the languages — Arabic. Huang et al (2005) discuss the issue of interpolating word alignment models trained on data from multiple domains. Word alignment in a specific domain may also be improved with a dictionary obtained from a general domain (Wu and Wang, 2004). Combining different word aligners my also improve performance (Ayan et al, 2004). the log-linear modeling approach may be used for the combination of word alignment models with simpler models, for instance based on part-of-speech tags (Liu et al, 2005).

4.10. Evaluation of word alignment
To better understand the word alignment problem, parallel corpora have been annotated with word alignments for language pairs such as German–English, French–English, and Romanian–English, etc. These have been the basis for competitions on word alignment (Mihalcea and Pedersen, 2003; Martin et al, 2005). The relationship between alignment quality and machine translation performance is under some discussion (Langlais et al, 1998; Fraser and Marcu, 2007). Vilar et al (2006) point to mismatches between alignment error rate and machine translation performance. New measures have been proposed to overcome the weakness of alignment error rate (Carl and Fissaha, 2003). Giving less weight to alignment points that connect multiple aligned words improves correlation (Davis et al, 2007). Lopez and Resnik (2006) shows impact of word alignment quality on phrase-based models. Ayan and Dorr (2006) compare alignment quality and machine translation performance and also stress the interaction with the phrase extraction method. Albeit computationally very expensive, word alignment quality may also be directly optimized on machine translation performance (Lambert et al, 2007).

4.11. Pivot languages
Also, pivot or bridge languages have been used to improve word alignments, such as in the case of Chinese–Japanese, where only large parallel corpora of these languages paired with English exist (Wang et al, 2006). Bridge languages were also explored by Kumar et al (2007).

4.12. Aligning longer units
Instead of tackling the full word alignment problem, more targeted work focuses on terminology extraction, for instance the extraction of domain-specific lexicons (Resnik and Melamed, 1997), noun phrases (Kupiec, 1993; Eijk, 1993; Fung, 1995), collocations (Smadja et al, 1996; Echizen-ya et al, 2003; Orliac and Dillinger, 2003), non-compositional compounds (Melamed, 1997), named entities (Moore, 2003), technical terms (Macken et al, 2008), or other word sequences (Kitamura and Matsumoto, 1996; Ahrenberg et al, 1998; Martinez et al, 1999; Sun et al, 2000; Moore, 2001; Yamamoto et al, 2001; Baobao et al, 2002; Wang and Zhou, 2002). Translation for noun phrases may be learned by checking automatically translated candidate translations against frequency counts on the web (Robitaille et al, 2006; Tonoike et al, 2006).

4.13. Learning bilingual dictionaries from comparable corpora
It is also possible to extract terminologies from non-parallel comparable corpora. If only monolingual corpora are available, the first task is to find words that are translations of each other. Often, a seed lexicon is needed, which may be identically spelled words or cognates (Koehn and Knight, 2002), although attempts have been made without such a seed but purely relying on co-occurrence statistics (Rapp, 1995) or generative models based on canonical correlation analysis (Haghighi et al, 2008). Several criteria may be used to find matching words, such as co-occurrence vectors based on mutual information (Fung, 1997; Kaji, 2004) or id/idf (Fung and Yee, 1998; Chiao and Zweigenbaum, 2002), co-occurence vectors with considerations of ambiguous words (Tanaka and Iwasaki, 1996) or reduced using latent semantic analysis (Kim et al, 2002) or Fisher kernels (Gaussier et al, 2004), heterogeneity of the word context (Fung, 1995), distributional similarity (Rapp, 1999), semantic relationship vectors (Diab and Finch, 2000), spelling similarity (Schulz et al, 2004), automatically constructed thesauri (Babych et al, 2007), syntactic templates for the word context (Gamallo~Otero, 2007). Other resources such as Wikipedia or WordNet may help with the construction of dictionaries (Ramiírez et al, 2008). These methods may also be applied for collocations, not just single words (Lü and Zhou, 2004; Daille and Morin, 2008). If monolingual corpora and bilingual dictionaries are available, the task is to find word senses or sets of mutually translated words across multiple languages. Kikui (1998) use context vectors to disambiguate words, then adding an initial clustering step (Kikui, 1999). Koehn and Knight (2000) use a language model to learn translation probabilities for ambiguous word using the EM algorithm. Sammer and Soderland (2007) use point-wise mutual information to match contexts in which words occur to separate out different sense of a word. Li and Li (2004) apply the Yarowski algorithm (Yarowsky, 1994) to bootstrap bilingual word translation models for ambiguous words. The use of bridge languages has been shown to be useful (Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002). Otero (2005) extends the use of the DICE coefficient with local context to better deal with polysemous words. Comparable corpora also allow the construction of translation models between a formal language and a dialect (Hwa et al, 2006).

4.14. Lexical choice
Related to the problem of translating words is the problem of word sense disambiguation. When a word has several senses, these senses may have different translation. Already early work (Brown et al, 1991) tried to integrate word sense disambiguation methods in statistical machine translation. These methods include local (García-Varea et al, 2001) or broader context (Vickrey et al, 2005), or dictionary information (Lee and Kim, 2002) in the lexical translation decision, although straight-forward application of word sense disambiguation methods does not always lead to improvements (Carpuat and Wu, 2005).

4.15. Machine translation without word alignments
While word alignment is generally assumed to be an essential element of statistical translation models, methods have been proposed that first generate a bag of words from the source bag of words and then apply an ordering model (Venkatapathy and Bangalore, 2007; Bangalore et al, 2007).

Chapter 5: Phrase-Based Models

5.1. Introduction
The modern statistical phrase-based models are rooted in work by Och and Weber (1998); Och et al (1999); Och (2002); Och and Ney (2004) on alignment template models. These models defined phrases over word classes that were then instantiated with words. Translating with the use of phrases in a statistical framework was also proposed by Melamed (1997); Wang and Waibel (1998); Venugopal et al (2003); Watanabe et al (2003). Marcu (2001) propose the use of phrases within word-based model decoding. The use of log-linear models was proposed by Och and Ney (2002). Our description follows Koehn et al (2003), which is similar to the model by Zens et al (2002); Zens and Ney (2004). Tribble et al (2003) suggest the use overlapping phrases. Lopez and Resnik (2006) shows the contribution of the different components of a phrase-based model.

5.2. Learning translation models
Several methods to extract phrases from a parallel corpus have been proposed. Most make the use of word alignments (Tillmann, 2003; Zhang et al, 2003; Zhao and Vogel, 2005; Zhang and Vogel, 2005; Setiawan et al, 2005). One may restrict extraction of phrase pairs to the smallest phrases that cover the sentence (Mariño et al, 2005). Lambert and Banchs (2005) compare this restrictive method with the method described in this book and proposes some refinements. Phrase alignment may be done directly from sentence-aligned corpora using a probabilistic model (Shin et al, 1996), pattern mining methods (Yamamoto et al, 2003), or using matrix factorization (Goutte et al, 2004). IBM Model 1 probabilities may be used to separate word aligned to each phrase against words outside it (Vogel, 2005) — a method also used for splitting long sentences (Xu et al, 2005). Zhao et al (2004) use a measure based on the td-idf score from information retrieval to score phrase translations. Additional feature scores may be also used during the parameter tuning of the decoder to determine which phrase pairs should be discarded (Deng et al, 2008). Kim and Vogel (2007) use an iterative method that adds extracted phrases to the parallel corpus to bootstrap better alignments and extract better phrases. Turchi et al (2008) give an overall analysis of the learning problem for phrase-based machine translation.

5.3. EM training of phrase models
The joint phrase model we described is taken from Marcu and Wong (2002), who also propose an improved model initialization than presented here. The joint model may also be improved by constraining it with alignment points from the intersection of IBM Model alignments (Birch et al, 2006; Birch et al, 2006) or by not strictly requiring a unique phrase alignment (Moore and Quirk, 2007). DeNero et al (2006) point to some problems when using EM training with conditional probabilities. Cherry and Lin (2007) show that the ITG constraint helps the joint phrase model approach, partly by enabling a faster algorithm with less search errors. The phrase alignment problem is NP-complete (DeNero and Klein, 2008).

5.4. Refinements of phrase model
Usually, the segmentation of the source is not modeled, or only a phrase count feature is used, but adding a source phrase segmentation model may be beneficial (Blackwood et al, 2008). Models may allow word insertion to account for spurious function words (Xu, 2005), or allow for words to be dropped by translating them into the empty phrase (Li et al, 2008).

5.5. Context features in translation
Phrase translation may be informed by additional context features, for instance by applying methods used in word sense disambiguation (Carpuat et al, 2006). Such features may be integrated using a maximum entropy model (Bangalore et al, 2006), support vector machines (Giménez and Màrquez, 2007), or by directly integrating more complex word sense disambiguation components, such as an ensemble of different machine learning methods (Carpuat and Wu, 2007; Carpuat and Wu, 2007). Ittycheriah and Roukos (2007) propose a maximum entropy model for phrase translation. Syntactic context dependencies may be added to phrase translations in the phrase-based approach, for instance verb-argument relationships (Hwang and Sasaki, 2005), or the syntactic structure underlying each phrase translation (Sun et al, 2007). Gimpel and Smith (2008) add features around the context of a source phrase into a probabilistic back-off model.

5.6. Smoothing
Smoothing phrase translation probability using discounting methods that are common in language modeling have been shown to improve performance (Foster et al, 2006). Continuous space models implemented as neural networks allow smoothing of phrase translation table probabilities (Schwenk et al, 2007).

5.7. Paraphrasing
More robust translation models may be generated by paraphrasing phrase translation entries (Callison-Burch et al, 2006) or the parallel corpus (Nakov and Hearst, 2007). Paraphrases may be extracted from a parallel corpus (Banrd and Callison-Burch, 2005), for more accuracy dependency structure may be exploited (Hwang et al, 2008).

5.8. Use of dictionaries
Existing bilingual dictionaries may be simply added as additional parallel data to the training data. This may, however, miss the right context in which these words occur. Okuma et al (2007) propose to insert phrases into the phrase tables that adapt existing entries with a very similar word to the dictionary word by replacing it with the dictionary word.

5.9. Pruning large translation models
Quirk and Menezes (2006) argue that extracting only minimal phrases, i.e. the smallest phrase pairs that map each entire sentence pair, does not hurt performance. This is also the basis of the n-gram translation model (Mariño et al, 2006; Costa-jussà et al, 2007), a variant of the phrase-based model. Discarding unlikely phrase pairs based on significance tests on their more-than-random occurrence reduces the phrase table drastically and may even yield increases in performance (Johnson et al, 2007). Wu and Wang (2007) propose a method for filtering the noise in the phrase translation table based on a log likelihood ratio. Kutsumi et al (2005) uses a support vector machine for cleaning phrase tables. Such considerations may also be taking into account in second pass phrase extraction stage that does not extract bad phrase pairs (Zettlemoyer and Moore, 2007). When faced with porting phrase-based models to small devices such as PDAs (Zhang and Vogel, 2007), the translation table has to be reduced to fit a fixed amount of memory. Eck et al (2007); Eck et al (2007) prune the translation table based on how often a phrase pair was considered during decoding and how often it was used in the best translation.

5.10. Suffix arrays for storing translation models
With the increasing size of available parallel corpora and translation models, efficient use of working memory becomes an issue, motivating the development of parallel infrastructures for training such as Google's MapReduce (Dyer et al, 2008). Alternatively, the translation table may be represented in a suffix array as proposed for a searchable translation memory (Callison-Burch et al, 2005) and integrated into the decoder (Zhang and Vogel, 2005). Callison-Burch et al (2005) propose a suffix-tree structure to keep corpora in memory and extract phrase-translations on the fly. Hierarchical phrase-based model (see Chapter 11) may also be stored in such a way (Lopez, 2007) and allow for much bigger models (Lopez, 2008). Suffix arrays may also be used to quickly learn phrase alignments from a parallel corpus without the use of a word alignment (McNamee and Mayfield, 2006). Related to this is the idea of prefix data structures for the translation which allow quicker access and storing the model on disk for on-demand retrieval of applicable translation options (Zens and Ney, 2007).

5.11. Lexicalized reordering
The lexicalized reordering model was first proposed by Tillmann (2004), our description follows Koehn et al (2005). A similar model was proposed by Ohashi et al (2005). These models have been integrated in finite state implementations (Kumar and Byrne, 2005). Nagata et al (2006) extend lexicalized reordering models with better generalization by conditioning on words instead of phrases and clustering words into classes. Similarly, Xiong et al (2006); Zens and Ney (2006) use a maximum entropy classifier to learn better reordering probabilities. Features over the source syntax tree may be included in such reordering models (Xiong et al, 2008) as well as in the translation model (Xiong et al, 2008). They may also predict reordering distance (Al-Onaizan and Papineni, 2006).

5.12. Phrase-based models and example-based translation
Phrase-based SMT is related to example-based machine translation (Somers, 1999). Some recent systems blur the distinction between the two fields (Groves and Way, 2005; Paul et al, 2005; Tinsley et al, 2008). Various combinations of methods from SMT and EBMT are explored by Groves and Way (2006). Similar convergence takes place when combining statistical machine translation with translation memory, for instance by looking for similar sentences in the training data and replacing the mismatch with translation chosen with statistical translation methods (Hewavitharana et al, 2005). Along these lines, phrase-based models may be improved with dynamically constructing translations for unknown phrases by using similar phrases that differ in a word or two and inserting lexical translations for the mismatched words (He et al, 2008). Statistical machine translation models may be used to select the best translation from several example-based systems (Paul and Sumita, 2006).

Chapter 6: Decoding

6.1. Stack decoding
The stack decoding algorithm presented here has it roots in work by Tillmann et al (1997), who describe a dynamic programming algorithm, similar to techniques for hidden Markov models, for monotone decoding of word-based models. Allowing for reordering makes decoding more complex. Efficient A* search makes the translation of short sentences possible (Och et al, 2001). However, longer sentences require pruning Wang and Waibel (1997); Nießen et al (1998) or restrictions on reordering (Tillmann and Ney, 2000) or both (Tillmann and Ney, 2003). Ortiz-Martínez et al (2006) compare different type of stack decoding with varying number of stacks. Delaney et al (2006) speed up stack decoding with A* pruning. Moore and Quirk (2007) stress the importance of threshold pruning and the avoidance of expanding doomed hypotheses (Moore and Quirk, 2007). Some efficiency may be gained by collapsing contexts that are invariant to the language model (Li and Khudanpur, 2008).

6.2. Reordering constraints
Matusov et al (2005) constrain reordering when the input word sequence was consistently translated monotone in the training data. Zens and Ney (2003); Zens et al (2004); Kanthak et al (2005) compare different reordering constraints and their effect on translation performance, including the formal grammar ITG constraint, which may be further restricted by insisting on a match to source side syntax (Yamamoto et al, 2008). Similarly, reordering may be restricted to syntactically cohesion (Cherry, 2008). Ge et al (2008) integrate other linguistically inspired reordering models into a phrase-based decoder. Dreyer et al (2007) compare reordering constraints in terms of oracle BLEU, i.e., the maximum possible BLEU score.

6.3. Decoding for word-based models
Several decoding methods for word-based models are compared by Germann et al (2001), who introduce a greedy search (Germann, 2003) and integer programming search method. Search errors of the greedy decoder may be reduces by a better initialization, for instance using an example-based machine translation system for seeding the search (Paul et al, 2004). A decoding algorithm based on alternately optimizing alignment (given translation) and translation (given alignment) is proposed by Udupa et al (2004).

6.4. Decoding with finite state toolkits
Instead of devising a dedicated decoding algorithm for statistical machine translation, finite state tools may be used, both for word-based (Bangalore and Riccardi, 2000; Bangalore and Riccardi, 2001; Tsukada and Nagata, 2004; Casacuberta and Vidal, 2004), alignment template (Kumar and Byrne, 2003) and phrase-based models. The use of finite state toolkits also allows for the training of word-based and phrase-based models. The implementation by Deng and Byrne (2005) is available as the MTTK toolkit (Deng and Byrne, 2006). Similarly, the IBM models may be implemented using graphical model toolkits (Filali and Bilmes, 2007). Pérez et al (2007) compare finite state implementation of word and phrase-based models.

6.5. Decoding complexity
Knight (1999) showed that the decoding problem is NP-complete. Udupa and Maji (2006) provide further complexity analysis for training and decoding with IBM Models.

Chapter 7: Language Models

7.1. Introduction
The discount methods we present in this chapter are proposed by Good (1953) — see also the description by Gale and Sampson (1995) —, Witten and Bell (1991), as well as Kneser and Ney (1995). A good introduction to the topic of language modelling is given by Chen and Goodman (1998). Instead of training language models, large corpora can be also exploited by checking if potential translations occur in them as sentences (Soricut et al, 2002).

7.2. Targeted language models
Zhao et al (2004) describe a method to extract training data for targeted language models based on similarity to the n-best output of a first decoding stage. Hasan and Ney (2005) clusters sentence according to syntactic types (i.e. questions vs. statements), trains a separate language model for each cluster and interpolates it with the full language model for coverage.

7.3. Use of morphology in language models
Factored language models were also used in statistical machine translation systems (Kirchhoff and Yang, 2005). For morphologically rich languages, better language models may predict individual morphemes. Sarikaya and Deng (2007) propose a joint morphological-lexical language model that uses maximum entropy classifiers to predict each morpheme.

7.4. Using very large language models
Using very language models that require more space than the available working memory may be distributed over a cluster of machines, but may not need sophisticated smoothing methods (Brants et al, 2007). Alternatively, storing the language model on disk using memory mapping is an option (Federico and Cettolo, 2007). The methods for quantizing language model probabilities are presented by Federico and Bertoldi (2006), who also examine this for translation model probabilities. Alternatively, lossy data structures such as bloom filters may be used to store very large language models efficiently (Talbot and Osborne, 2007; Talbot and Osborne, 2007). Schwenk et al (2006) introduce continuous space language model that are training using neural networks. The use of very large language models is often reduced to a reranking stage (Olteanu et al, 2006).

Chapter 8: Evaluation

8.1. Evaluation campaigns
The first machine translation evaluation campaign, in which both statistical and traditional rule-based machine translation systems participated was organized by ARPA in the early 1990s. White et al (1994) present results and discuss in detail the experience with different evaluation strategies. The straight-forward application of metrics for human translators proved difficult, and was abandoned along with measurements of productivity of human assisted translation, with hinges to a large degree on the quality of the support tools and the level of expertise of the translator. Ultimately, only reading comprehension test with multiple choice questions and adequacy and fluency judgments were used in the final evaluation. Hamon et al (2007) discuss the evaluation of a speech-to-speech translation system.

8.2. Manual metrics
King et al (2003) present a large range of evaluation metrics for machine translation systems that go well beyond the translation quality measures who devoted the bulk of this chapter to. Miller and Vanni (2005) propose clarity and coherence as manual metrics. Reeder (2004) shows the correlation between fluency and the number of words it takes to distinguish between human and machine translations. Grading standards for essays from foreign language learners may be used for machine translation evaluation. Using these standards reveals that machine translation has trouble with basic levels, but scores relatively high on advanced categories (Reeder, 2006). A manual metric that can be automated is to ask for specific translation errors — the questions may be based on past errors (Uchimoto et al, 2007). Vilar et al (2007) argue for pairwise system comparisons as metric, which leads to higher inter and intra annotator agreement (Callison-Burch et al, 2007).

8.3. Task-based metrics
Task-based evaluation of machine translation tests the usefulness of machine translation directly, for instance the ability of human consumers of machine translation output to answer who, when, and where questions (Voss and Tate, 2006). Jones et al (2006) use a foreign language learner test for measuring speech-to-speech machine translation performance. The adaptation of such a test (the Defense Language Proficiency Test, DLPT) for machine translation revealed that Arabic–English machine translation performance achieves a passing grade up to level 2+, but has a relatively weak performance on basic levels.

8.4. Word error rate
Word error rate was first used for the evaluation of statistical machine translation by Tillmann et al (1997), who also introduce position-independent error rate. Allowing block movement (Leusch et al, 2003) leads to the definition of the Cder metric (Leusch et al, 2006). ter allows for arbitrary block movements (Snover et al, 2006). MaxSim uses an efficient polynomial matching algorithm that also uses lemma and part-of-speech tag matches (Chan and Ng, 2008). The way automatic evaluation metrics work also depends on the tokenization of system output and reference translations (Leusch et al, 2005).

8.5. N-gram matching metrics
The Bleu evaluation metric is based on n-grams not words (Papineni et al, 2001). Several variants of n-gram matching have been proposed: weighting n-grams based on their frequency (Babych and Hartley, 2004), or other complexity metrics (Babych et al, 2004). GTM is based on precision and recall (Melamed et al, 2003; Turian et al, 2003). Echizen-ya and Araki (2007) propose impact, which is more sensitive to the longest matching n-grams. A metric may benefit from using an explicit alignment of system output and reference while maintaining the advantages of n-gram based methods such as bleu (Liu and Gildea, 2006) and by training such a metric to correlate to human judgment (Liu and Gildea, 2007). Lavie et al (2004) emphasize the importance of recall and stemmed matches in evaluation, which led to the development of the Meteor metric (Banerjee and Lavie, 2005; Lavie and Agarwal, 2007). Partial credit for stemmed matches may also be applied to Bleu and Ter (Agarwal and Lavie, 2008).

8.6. Syntax-based metrics
Metrics may be sensitive to syntactic information (Liu and Gildea, 2005). A more syntactic approach to machine translation may consider the matching of dependency structure of reference and system output (Owczarzak et al, 2007), possibly including dependency labels (Owczarzak et al, 2007). A wide range of linguistic features for evaluation is explored by Giménez and Màrquez (2007). A method based on latent semantic analysis has be proposed for machine translation evaluation (Reeder, 2006).

8.7. Analytical metrics
Automatic metrics may also help in pinpointing the type of errors machine translation systems make, for instance the relationship of wer and per indicates the severity of reordering problems, and the relationship of bleu on stemmed words and surface forms indicates the need for better morphological generation (Popovic et al, 2006). We may also gain insight into the types of errors by examining word error rates for different part-of-speech (Popovic and Ney, 2007).

8.8. Evaluation without reference
Some researchers investigate methods that allow automatic evaluation without the use of a reference translation. Gamon et al (2005) train a model that combines the language model probability of the output with syntactic and semantic features. Albrecht and Hwa (2007); Albrecht and Hwa (2008) propose to learn a metric from human judgments and pseudo-reference translations, which are generated by using other machine translation systems that are not necessarily better.

8.9. Correlation of automatic and manual metrics
The credibility of automatic evaluation metrics rests on their correlating with reliable human judgments. Coughlin (2003) finds evidence in support of Bleu in a total of 124 evaluations for many European language pairs. Other evaluation campaigns continuously assess correlation of human and automatic metrics, such as in the CESTA campaign (Surcin et al, 2005; Hamon et al, 2007). Yasuda et al (2003); Finch et al (2004) investigate the required number of reference translations. The number of reference translations may be increased by paraphrasing (Finch et al, 2004; Owczarzak et al, 2006). The same idea is behind changing the reference translation by paraphrasing to make it more similar to the reference (Kauchak and Barzilay, 2006), or to attempt to paraphrase unmatched words in the system output (Zhou et al, 2006). Hamon and Mostefa (2008) find that the quality of the reference translations is not very important. Akiba et al (2003) show strong correlation for Bleu, only if systems are of similar type. Bleu tends to correlate less when comparing human translators with machine translation systems (Culy and Riehemann, 2003; Popescu-Belis, 2003), or when comparing statistical and rule-based systems (Callison-Burch et al, 2006). Amigó et al (2006) find that the relationship between manual metrics that measure human acceptability and the automatic metrics that check the similarity of system output with human translations is a bit more complex.

8.10. Trained metrics
Albrecht and Hwa (2007) argue for the general advantages of learning evaluation metrics from a large number of features, although Sun et al (2008) point out that carefully designed features may be more important. Jones and Rusk (2000) propose a method that learns automatically to distinguish human translations from machine translations. Since in practice the purpose of evaluation is to distinguish good translations from bad translations, it may be beneficial to view evaluation as a ranking task (Ye et al, 2007; Duh, 2008). Lin and Och (2004) propose a metric for the evaluation of evaluation metrics, which does not require human judgment data for correlation. The metric is based on the rank given to a reference translation among machine translations. Multiple metrics may be combined uniformly (Giménez and Màrquez, 2008), or by adding metrics greedily until no improvement is seen (Giménez and Màrquez, 2008).

8.11. Difficulty to translate
The difficulty of translating a text may depend on many factors, such as source, genre, or dialect, some of which may be determined automatically (Kirchhoff et al, 2007). Uchimoto et al (2005) suggests to use back-translation to assess which input words will cause problems, and then prompt the user of an interactive machine translation system to rephrase the input.

8.12. Statistical significance
Our description of the bootstrap resampling method (Efron and Tibshirani, 1993) for estimating statistical significance follows the description by Koehn (2004). For further comments on this technique, see work by Riezler and Maxwell (2005). Estrella et al (2007) examine how big the test set needs to be for a reliable comparison of different machine translation systems.

Chapter 9: Discriminative Training

9.1. Word graphs
The generation of word graphs from the search graph of the decoding process is first described by Ueffing et al (2002), Zens and Ney (2005) describes additional pruning methods to achieve a more compact graph. Word graphs may converted into confusion networks or mined for n-best lists. Alternatively, the n-list may be extended with additional candidate translation generated with a language model that is trained on the n-best list (Chen et al, 2007) or from overlapping n-grams in the original n-best list (Chen et al, 2008).

9.2. Parameter tuning
Our description of minimum error rate training follows the original proposal by Och (2003). Properties of this method are discussed by Moore and Quirk (2008). Cer et al (2008) discuss regularizing methods that average BLEU scores over a small window to smooth over the curves. Many researchers also report on using the Simplex algorithm (Nelder and Mead, 1965). The implementation of these methods is described by (Press et al, 1997). Tuning may use multiple reference translations, or even additional reference translations generated by paraphrasing the existing reference (Madnani et al, 2007).

9.3. Re-ranking
Minimum error rate training is used for re-ranking (Och et al, 2004), other proposed methods are based on ordinal regression to separate good translations from bad ones (Shen et al, 2004) and SPSA (Lambert and Banchs, 2006). Duh and Kirchhoff (2008) use boosting to improve over the log-linear model without any additional features. Hasan et al (2007) examine how big n-best lists are useful, both considering Oracle-BLEU and actual re-ranking performance, and see gains with n-best lists of up to 10,000. The use of a log-linear model imposes certain restrictions on features that may be relaxed using other machine learning approaches such as kernel methods or Gaussian mixture models (Nguyen et al, 2007). See work by Chen et al (2007) and Patry et al (2007) for some recent examples of the type of features used in re-ranking.

9.4. Large-scale discriminative training
Large-scale discriminative training methods that optimize millions of features over the entire training corpus have emerged more recently. Tillmann and Zhang (2005) add a binary feature for each phrase translation table entry and train feature weights using a stochastic gradient descent method. Kernel regression methods may be applied to the same task (Wang et al, 2007; Wang and Shawe-Taylor, 2008). Wellington et al (2006) applies discriminative training to a tree translation model. Large scale discriminative training may also use the perceptron algorithm (Liang et al, 2006) or variations thereof (Tillmann and Zhang, 2006) to directly optimize on error metrics such as bleu. Arun and Koehn (2007) compare MIRA and the Perceptron algorithm and point out some of the problems on the road to large-scale discriminative training. This approach has also been applied to a variant of the hierarchical phrase model (Watanabe et al, 2007; Watanabe et al, 2007). Blunsom et al (2008) argue the importance to perform feature updates on all derivations of translation, not just the most likely one, to address spurious ambiguity. Machine translation may be framed as a structured prediction problem, which is a current strain of machine learning research. Zhang et al (2008) frame ITG decoding in such a way and propose a discriminative training method following the SEARN algorithm (Daumé III et al, 2006).

9.5. Minimum Bayes risk
Minimum Bayes risk decoding was introduced initially for n-best-list re-ranking (Kumar and Byrne, 2004), and has been shown to be beneficial for many translation tasks (Ehling et al, 2007). Optimizing for the expected error, not the actual error, may also be done in parameter tuning (Smith and Eisner, 2006). Related to minimum Bayes risk is the use of n-gram posterior probabilities in re-ranking (Zens and Ney, 2006; Alabau et al, 2007; Alabau et al, 2007).

9.6. Confidence measures
Our description of confidence measures is largely based on the work by Ueffing et al (2003); Ueffing and Ney (2007). Other researchers extended this work extend this work using machine learning methods (Blatz et al, 2004) or to multiple, especially non-statistical, systems (Akiba et al, 2004). Sanchis et al (2007) combine a number of features in a smoothed naive Bayes model. Confidence measures have been used to for computer aided translation tools (Ueffing and Ney, 2005). Confidence measures have also been used in a self-training method: By translating additional input language text, we bootstrap a parallel corpus to train an additional translation table, but it helps if we filter the bootstrapped corpus using confidence metrics (Ueffing, 2006). This may be also done iteratively, by re-training the model and adding more sentences each time (Ueffing et al, 2007). Alternative, the original training data may be discarded and the phrase table (and other components) only trained from the n-best lists (Chen et al, 2008).

9.7. System combination
System combination is also called Multi-Engine Machine Translation (Nomoto, 2003), and may take the form of a classifier that decides which system's output to trust (Zwarts and Dras, 2008). The prevailing idea is to combine the output of different systems without deeper integration of their architectures, which has been very successful in speech recognition (Fiscus, 1997) and follows a machine learning paradigm called ensemble learning. Bangalore et al (2002) use a confusion network representation for the computation of consensus translations of different off-the-shelf systems, while Zaanen and Somers (2005) use a word graph and Jayaraman and Lavie (2005) use a decoder to search through possible combinations, possibly augmented with a phrase translation table (Eisele et al, 2008). Eisele (2005) compares a statistical and a heuristic method. Our description follows work by Rosti et al (2007); Rosti et al (2007) who propose the construction of confusion networks by aligning the different system outputs, which may be improved by a incremental alignment process (Rosti et al, 2008). Matusov et al (2006) use IBM Models to align the words in the different system translations, while Karakos et al (2008) suggest to limit alignments with the ITG constraint. Word alignment may be aided by using information about synonyms, which may be provided by Wordnet synsets (Ayan et al, 2008). Huang and Papineni (2007) use some of the internals of the translation systems, such as phrase segmentation, to guide combination. Macherey and Och (2007) look at various properties of system combination methods, such as the relative quality of the system and how well they correlate, and its effect on success. System combination may be also used to combine systems that use the same methodology, but are trained on corpora with different morphological preprocessing (Lee, 2006), or different word alignment methods (Kumar et al, 2007).

Chapter 10: Integrating Linguistic Information

10.1. Unknown words
Even with large training corpora, unknown words in the test data will appear. Habash (2008) points out that these come in different types, which require different solutions: unknown names need to be transliterated, unknown morphological variants need to be matched to known ones, and spelling errors need to be corrected. Another case are abbreviation, which may be expanded with a corpus-driven method (Karakos et al, 2008).

10.2. Transliteration with finite state machines
Early work in machine transliteration of names uses finite state machines. Our description follows the work by Knight and Graehl (1997). Such models may be extended by using a larger Markov window during mappings, i.e. using a larger context (Jung et al, 2000). Schafer (2006) compares a number of different finite state transducer architectures. For closely related language pairs, such as Hindi–Urdu, deterministic finite state machines may suffice (Malik et al, 2008). Transliteration may use either phonetic representation to match characters of different writing systems (Knight and Graehl, 1997) or map characters directly (Zhang et al, 2004). Phoneme and grapheme information may be combined (Bilac and Tanaka, 2004). Given small training corpora, using phonetic representations may be more robust (Yoon et al, 2007).

10.3. Transliteration with other methods
Improvements have been shown when including context information in a maximum entropy model (Goto et al, 2003; Goto et al, 2004). Other machine learning methods such as discriminative training with the perceptron algorithm have been applied to transliteration (Zelenko and Aone, 2006). Oh and Choi (2002) use a pipeline of different approaches for the various mapping stages. Lin and Chen (2002) present a gradient descent method to map phoneme sequences. Using phoneme and letter chunks may lead to better performance (Kang and Kim, 2000). Li et al (2004) propose a method similar to the joint model for phrase-based machine translation. Such a substring mapping model has also been explored by others (Ekbal et al, 2006), and may be implemented as bi-stream HMM (Zhao et al, 2007) or adapting a monotone phrase-based translation model (Sherif and Kondrak, 2007). For different scripts for the same language, rule-based approaches typically suffice (Malik, 2006). Ensemble learning, i.e., the combination of multiple transliteration engines, has been shown to be successful (Oh and Isahara, 2007).

10.4. Forward transliteration
The transliteration of Japanese names into English, or conversely the translation of English names into Japanese is an ambiguous task, often no standard transliterations exist. Normalization of corpora may be useful (Ohtake et al, 2004). Another interesting problem, which has been studied for English–Chinese, is that the foreign name should not be rendered in a way that suggest a negative impression by containing characters that have negative meaning (Xu et al, 2006). Also, when transliterating names into Chinese, the original language of the name and its gender matters, which may be incorporated into a transliteration model (Li et al, 2007).

10.5. Training data for transliteration
Training data may be collected from parallel corpora (Lee and Chang, 2003; Lee et al, 2004), or by mining comparable data such as news streams (Klementiev and Roth, 2006; Klementiev and Roth, 2006). Training data for transliteration may also be obtained from monolingual text where the spelling of a foreign name is followed by its native form in parenthesis (Lin et al, 2004; Chen and Chen, 2006; Lin et al, 2008), which is common for instance for unusual English names in Chinese text. Such an acquisition may be improved by bootstrapping — iteratively extracting high-confidence pairs and improving the matching model (Sherif and Kondrak, 2007). Sproat et al (2006) fish for name transliteration in comparable corpora, also using phonetic correspondences. Tao et al (2006) exploit additionally temporal distributions of name mentions, and Yoon et al (2007) use a Winnow algorithm and a classifier to bootstrap the acquisition process. Cao et al (2007) use various features, including that a Chinese character is part of a transliteration a priori in a perceptron classifier. Large monolingual corpus resources such as the web are used for validation (Al-Onaizan and Knight, 2002; Al-Onaizan and Knight, 2002; Qu and Grefenstette, 2004; Kuo et al, 2006; Yang et al, 2008). Of course, training data may also be manually created, possibly aided by an active learning component that suggests the most valuable new examples (Goldwasser and Roth, 2008).

10.6. Integrating transliteration with machine translation
Kashani et al (2007) present methods to integrate transliteration into the main machine translation application. Hermjakob et al (2008) give a detailed discussion about integration issues, including a tagger for when to transliterate and when to translate.

10.7. Morphology
Statistical models for morphology as part of a statistical machine translation system have been developed for inflected languages (Nießen and Ney, 2001; Nießen and Ney, 2004). Morphological features may be replaced by pseudo-words (Goldwater and McClosky, 2005). Morphological annotation is especially useful for small training corpora (Popović et al, 2005). For highly agglutinative languages such as Arabic, the main focus is splitting off affixes with various schemes (Habash and Sadat, 2006), or by combination of such schemes (Sadat and Habash, 2006). A straight-forward application of the morpheme splitting idea may not always lead to performance gains (Virpioja et al, 2007). Yang and Kirchhoff (2006) present a back-off method that resolves unknown words by increasingly aggressive morphological stemming and compound splitting. Denoual (2007) uses spelling similarity to find the translation of unknown words by analogy. Talbot and Osborne (2006) motivate similar work by reducing redundancy in the input language, again mostly by morphological stemming. Lemmatizing words may improve word alignment performance (Corston-Oliver and Gamon, 2004). Using the frequency of stems in the corpus in a finite state approach may guide when to split (Isbihani et al, 2006), potentially guided by a small lexicon (Riesa and Yarowsky, 2006). Splitting off affixes may also be a viable strategy when translating into morphologically rich languages (El-Kahlout and Oflazer, 2006). Different morphological analyses may be encoded in a confusion network as input to the translation system to avoid hard choices (Dyer, 2007; Dyer, 2007). This approach may be extended to lattice decoding and tree-based models (Dyer et al, 2008).

10.8. Generating rich morphology
Generating rich morphology poses different challenges. Often relevant information is distributed widely over the input sentence or miss altogether. Minkov et al (2007) use a maximum entropy model to generate rich Russian morphology and show improved performance over using the standard approach of relying on the language model. Such a model may be used for statistical machine translation by adjusting the inflections in a postprocessing stage (Toutanova et al, 2008). Translation between related morphologically rich related languages may model the lexical translation step as a morphological analysis, transfer and generation process using finite state tools (Tantug et al, 2007). But also splitting words into stem and morphemes is a valid strategy for translating into a language with rich morphology as demonstrated for English–Turkish (Oflazer and El-Kahlout, 2007) and English–Arabic (Badr et al, 2008), and also for translating between two highly inflected languages as in the case of Turkman–Turkish language pairs (Tantug et al, 2007). Translating unknown morphological variants may be learned by analogy to other morphological spelling variations (Langlais and Patry, 2007). For very closely related languages such as Catalan and Spanish translating not chunks of words but chunks of letters in a phrase-based approach achieves decent results, and addresses very well the problem of unknown words (Vilar et al, 2007).

10.9. Compounds
Languages like German where noun compounds are merged into single words, require compound splitting (Brown, 2002; Koehn and Knight, 2003) for translating unknown compounds to improve machine translation performance (Holmqvist et al, 2007). When translating into these languages, compounds may be merged into single words (Stymne et al, 2008).

10.10. Word segmentation
Segmenting words is a problem for languages where the writing system does not include spaces between words, such as many Asian languages. Zhang and Sumita (2008); Zhang et al (2008) discuss different granularities for Chinese words and suggest a back-off approach. Bai et al (2008) aim for Chinese word segmentation in the training data to match English words one-to-one, while Chang et al (2008) adjust the average word length to optimize translation performance. Xu et al (2008) also use the correspondence to English words in their Bayesian approach.

10.11. Spelling correction
Accents in Spanish, umlauts in German, or diacritics in Arabic may not always be properly included in electronic texts. Machine translation performance may be improved, when training and test data is consistently corrected for such deficiencies (Diab et al, 2007).

10.12. Translating tense, case, and markers
Schiehlen (1998) analyses the translation of tense across languages and warns against a simplistic view of the problem. Murata et al (2001) propose a machine learning method using support vector machines to predict target language tense. Ye et al (2006) propose to use additional features in a conditional random field classifier to determine verb tenses when translating from Chinese to English. Ueffing and Ney (2003) use a pre-processing method to transform the English verb complex to match its Spanish translation more closely. A similar problem is the prediction of case markers in Japanese (Suzuki and Toutanova, 2006), which may be done using a maximum entropy model as part of a treelet translation system (Toutanova and Suzuki, 2007), or the prediction of aspect markers in Chinese, which may framed as classification problem and models with conditional random fields (Ye et al, 2007). Another example of a syntactic markers that are more common in Asian than European languages are numeral classifiers (as in three sheets of paper). Paul et al (2002) present a corpus-based method to generate them for Japanese and Zhang et al (2008) present a method for generating Chinese measure words. Dorr (1994) presents an overview of linguistic differences between languages, called divergences (Gupta and Chatterjee, 2003).

10.13. Translation of chunks and clauses
To approach the problem of the translation of sentences, one strategy is to break up the problem along syntactic lines, be it clauses (Kashioka et al, 2003) or syntactic chunks (Koehn and Knight, 2002; Schafer and Yarowsky, 2003; Schafer and Yarowsky, 2003). This strategy also allows for special components for the translation of more basic syntactic elements such as noun phrases (Koehn and Knight, 2003) or noun compounds (Tanaka and Baldwin, 2003). Owczarzak et al (2006); Mellebeek et al (2006) break up sentence into chunks, translating the chunks separately and combine the translations into a transformed skeleton. Hewavitharana et al (2007) augment a phrase-based model with translations for noun phrases that are translated separately.

10.14. Syntactic pre-reordering
Given the limitations of the dominating phrase-based statistical especially with long-distance reordering for syntactic reasons, it may be better to treat reordering in pre-processing by a hand-crafted component, which has been explored for German–English (Collins et al, 2005), Japanese–English (Komachi et al, 2006), Chinese–English (Wang et al, 2007), and English–Hindi (Ramanathan et al, 2008). Zwarts and Dras (2007) point out that translation improvements are due to both a reduction of reordering needed during decoding and the increased learning of phrases of syntactic dependents. Nguyen and Shimazu (2006) also use manual rules for syntactic transformation in a preprocessing step. Such a reordering component may also be learned automatically from parsed training data, as shown for French–English (Xia and McCord, 2004), Arabic–English (Habash, 2007), and Chinese–English (Crego and Mariño, 2007) — the latter work encodes different orderings in a input lattice to the decoder. Li et al (2007) propose a maximum entropy pre-reordering model based on syntactic parse trees in the source language. It may be beneficial to train different such pre-reordering models for different sentence types (questions etc.) (Zhang et al, 2008). Preprocessing the input to a machine translation system may also include splitting it up into smaller sentences (Lee et al, 2008).

10.15. POS and chunk-based pre-reordering
Reordering patterns may also be learned over part-of-speech tags, allowing the input to be converted into a reordering graph (Crego and Mariño, 2006) or enabling a rescoring approach with the patterns as features (Chen et al, 2006). The reordering rules may also be integrated into an otherwise monotone decoder (Tillmann, 2008). Such rules may also be used in a separate reordering model. Such rules may be based on automatic word classes (Costa-jussà and Fonollosa, 2006; Crego et al, 2006), which was shown to outperform part-of-speech tags (Costa-jussà and Fonollosa, 2007), or they may be based on syntactic chunks (Zhang et al, 2007; Zhang et al, 2007; Crego and Habash, 2008). Scoring for rule applications may be encoded in the reordering graph, or done once the target word order is established which allows for rewarding reorderings that happened due to phrase-internal reordering (Elming, 2008; Elming, 2008).

10.16. Syntactic re-ranking
Linguistic features may be added to an n-best of candidate translation to be exploited by a re-ranking approach. This was explored by Och et al (2004) and by Koehn and Knight (2003) for noun phrases. Sequence models trained on part-of-speech tags (Maynard et al, 2007) or CCG supertags (Hassan et al, 2007) may be also used in re-ranking. To check if the dependency structure is preserved during translation, the number of preserved dependency links or paths may be a useful feature (Nikoulina and Dymetman, 2008).

10.17. Integrated syntactic features
If syntactic properties can be formulated as features that fit into the framework of the beam search decoding algorithm, they may be integrated into model. Gupta et al (2007) propose reordering features for different part-of-speech types, that flag different reordering behavior for nouns, verbs, and adjectives.

10.18. Factored translation
Factored translation models (Koehn and Hoang, 2007) allow the integration of syntactic features into the translation and reordering models. Already earlier work augmented phrase translation tables with part-of-speech information (Lioma and Ounis, 2005). The approach has been shown to be successful for integrating part-of-speech tags, word class factors (Shen et al, 2006), CCG super-tags (Birch et al, 2007), and morphological tags (Badr et al, 2008) for language modeling. Complementary, the source side may be enriched with additional markup, for instance to better predict the right inflections in a morphologically richer output language (Avramidis and Koehn, 2008). More complex factored models for translating morphology have been explored for English–Czech translation (Bojar, 2007).

Chapter 11: Tree-Based Models

11.1. Tree-based without syntax
Rooted in a finite state machine approach, head automata have been developed that allow for tree representation during the translation process (Alshawi, 1996; Alshawi et al, 1997; Alshawi et al, 1998; Alshawi et al, 2000). Wang and Waibel (1998) also induce hierarchical structures in their statistical machine translation model. The use of synchronous grammars for statistical machine translation has been pioneered by Wu (1995); Wu (1996); Wu (1997); Wu and Wong (1998) with their work on inversion transduction grammars (ITG) and stochastic bracketing transduction grammars (SBTG). A different formal model are multitext grammars (Melamed, 2003; Melamed, 2004; Melamed et al, 2004) and range concatenation grammars (Søgaard, 2008). Zhang and Gildea (2005) extend the ITG formalism by lexicalizing grammar rules and apply it to a small-scale word alignment task. Armed with an efficient A* algorithm for this formalism (Zhang and Gildea, 2006), they extend the formalism with rules that are lexicalized in both input and output (Zhang and Gildea, 2006). Word alignments generated with ITG may also be used to extract phrase for phrase-based models (Sánchez and Benedí, 2006; Sánchez and Benedí, 2006). Zhang et al (2008) present an efficient ITG phrase alignment algorithm that uses Bayesian learning with priors.

11.2. Generative models
Inspired by the IBM models, Yamada and Knight (2001) presents a generative tree-based model that is trained using the EM algorithm, thus aligning the words in the parallel corpus while extracting syntactic transfer rules. Syntax trees are provided by automatically parsing the English side of the corpus in a pre-processing step. They also present a chart parsing algorithm for their model (Yamada and Knight, 2002). This model allows the integration of a syntactic language model (Charniak et al, 2003). Gildea (2003) introduce a clone operation to the model and extend it to dependency trees (Gildea, 2004).

11.3. Aligning subtrees
The use of syntactic structure in machine translation is rooted in earlier transfer and example-based approaches. There are many methods to extract subtrees from a parallel corpus, aided either by a word-aligned corpus or a bilingual lexicon and a heuristic to disambiguate alignment points. For instance, such efforts can be traced back to work on the alignment of dependency structures by Matsumoto et al (1993). Related to this are efforts to align syntactic phrases (Yamamoto and Matsumoto, 2000; Imamura, 2001; Imamura et al, 2003; Imamura et al, 2004), hierarchical syntactic phrases (Watanabe and Sumita, 2002; Watanabe et al, 2002), and phrase structure tree fragments (Groves et al, 2004) as well as methods to extract transfer rules, as used in traditional rule-based machine translation systems (Lavoie et al, 2001). The degree to which alignments are consistent with the syntactic structure may be measured by distance in the dependency tree (Nakazawa et al, 2007). Tinsley et al (2007) use a greedy algorithm that uses a probabilistic lexicon trained with the IBM models to align subtrees in a parallel corpus parsed on both sides. Zhechev and Way (2008) compare it against a similar algorithm. Lavie et al (2008) use symmetrized IBM model alignments for the same purpose and discuss effects of alignment and parse quality.

11.4. Hierarchical phrase models
Chiang (2005); Chiang (2007) combines the ideas of phrase-based models and tree structure and proposes an efficient decoding method based on chart parsing. His hierarchical phrase-based model Hiero (Chiang et al, 2005) makes no use of explicit annotation. Hierarchical rules may be extracted from a sentence pairs in linear time (Zhang et al, 2008). Some of the properties of such models and their relationship to phrase-based models are discussed by Zollmann et al (2008). Watanabe et al (2006) propose a model somewhat between traditional phrase models and hierarchical models, which allow for discontinuities in source phrases, but follow a tradition left-to-right search algorithm. They show competitive performance with phrase-based models (Watanabe et al, 2006; Watanabe et al, 2006). Another adaptation of the hierarchical model approach only allows function words (or the most frequent words in the corpus) to occur in rules with nonterminals on the right hand side, which allows a lexicalized but still compact grammar Setiawan et al (2007). As with phrase-based models, instead of setting rule application probability by maximum likelihood estimation, we may train classifiers to include additional features (Subotin, 2008).

11.5. String to tree
Adding syntactic categories to target-side nonterminals in hierarchical phrase models leads to syntax-augmented models (Zollmann et al, 2006). Zollmann and Venugopal (2006) present a chart parse-decoding method for this approach. Galley et al (2004) build translation rules that map input phrases to output tree fragments. Contextually richer rules and learning rule probabilities with the EM algorithm may lead to better performance (Galley et al, 2006). But also adjusting the parse trees to be able to extract rules for all lexical matches may be important — which requires the introduction of additional nonterminal symbols (Marcu et al, 2006) or rules with multiple head nodes (Liu et al, 2007). Instead of using standard Penn treebank labels for nonterminals, relabeling the constituents may lead to the acquisiton of better rules (Huang and Knight, 2006). Since syntactic structure prohibits some phrase pairs that may be learned as syntactic translation rules, leading to less coverage, this may be alleviated by adjusting the rule extraction algorithm (DeNeefe et al, 2007). DeNeefe et al (2005) present an interactive tool to inspect the workings of such syntactic translation models.

11.6. Tree to string
Tree-to-string models use a rich input language representation to translate into word sequences in the output language. As a soft constraint, similarity between the source syntax tree and the derivation tree during decoding may be used (Zhou et al, 2008) or nonterminals in rule applications that match the source syntax tree could be flagged in a feature (Marton and Resnik, 2008). Syntax-direction translation parses the input first and uses the parse structure to guide the output phrase generation (Huang et al, 2006; Huang et al, 2006). Liu and Gildea (2008) extend this work by adding semantic role labels and changing parameter estimation. A simpler method follows the phrase-based approach but informs phrase selection with shallow source trees (Langlais and Gotti, 2006). Given the segmentation into phrases and a source syntax tree, both the structural changes to the tree to fit the phrases and reordering of the restructured phrase tree may inform better translations (Nguyen et al, 2008). Building on the hierarchical phrase based model, Zhang et al (2007) use syntactic annotation of the input to make a distinction between linguistic and non-linguistic phrases. Hopkins and Kuhn (2007) present a tree-to-string model with a best-first search.

11.7. Tree to tree
As for the models mentioned above, rules for tree-to-tree models may be learned from word-aligned parallel corpora. To maximize the number of rules, alignment points from the intersection and union of GIZA++ alignments may be treated differently (Imamura et al, 2005). Probabilistic synchronous tree-insertion grammars (STIG) (Nesson et al, 2006) may be automatically learned without any provided syntactic annotations. Synchronous tree substitution grammars (STSG) map tree fragments in the source to tree fragments in the target (Zhang et al, 2007). Shieber (2007) argue for the use of synchronous tree adjoining grammars (S-TAG) as following the structure of printed bilingual dictionaries. This idea was already proposed by (Shieber and Schabes, 1990). Relaxing the isomorphism between input and output trees leads to the idea of quasi-synchronous grammars (QG) (Smith and Eisner, 2006), which have shown to produce better word alignment quality than IBM models, but not symmetrized IBM models. A similar relaxation is allowing multiple neighboring head nodes in the rules (Zhang et al, 2008; Zhang et al, 2008). Instead of using the 1-best or n-best syntactic parses of the source sentence, a forest of parse trees may be used during translation (Mi et al, 2008).

11.8. Dependency structure
One example for the statistical machine translation models based on dependency structures is the treelet approach (Menezes and Richardson, 2001; Menezes and Quirk, 2005), and other researchers found this a promising direction (Lin, 2004). Tectogrammatical models are also based on dependency trees, but also include morphological analysis and generation (Čmejrek et al, 2003; Eisner, 2003). By mapping arbitrary connected fragments of the dependency tree, the approach may be extended to apply the lessons from phrase-based models by mapping larger fragments (Bojar and Hajič, 2008). This idea was pioneered in the dependency treelet model, which only uses source side dependency structure (Quirk et al, 2005). Such models have been shown to be competitive with phrase-based models (Menezes and Quirk, 2005; Menezes et al, 2006). An extension of this model also allows fragments with gaps that are filled by variables, as in the hierarchical phrase-based model (Xiong et al, 2007). Dependency structure may be also used in a string-to-tree model: Shen et al (2008) use rules that map into dependency tree fragments of neighboring words and additional restrictions, which allows the use of a dependecy-based language model. Translating dependency tree fragments is also the idea behind synchronous dependency insertion grammars. Ding et al (2003); Ding and Palmer (2004) develop methods for aligning dependency trees, and then develop decoding algorithms (Ding and Palmer, 2005). More recent work integrates the use of an n-gram language model during decoding (Ding and Palmer, 2006). The generation of an output string from a dependency structure requires insertion of function words and defining an ordering. Hall and Nemec (2007) present a generative model with a search algorithm that proceeds through stages. Chang and Toutanova (2007) present a discriminatively trained model to the word ordering problem. Mapping into dependency structure and its ordering gives better results than separating the two steps (Menezes and Quirk, 2007). Bojar and Hajič (2008) present a dependency treelet approach trained with the aid of a parallel dependency treebank.

11.9. Context features in rule selection
Chan et al (2007) use a support vector machine to include context features for rules in a hierarchical translation model. Maximum entropy models may be used for the same purpose (Xiong et al, 2008; He et al, 2008). Related work on featuring translation rules is cited in the further reading section of Chapter 5 on phrase-based models.

11.10. Binarizing synchronous grammars
Tree-based models that use syntactic annotation may benefit from binarization of the required rules. Binarization may be driven by the source side grammar (Zhang et al, 2006; Wang et al, 2007). Instead of such synchronous binarization methods, binarization may be also driven by the target side (Huang, 2007). A k-arization method with linear time complexity is proposed by Zhang and Gildea (2007). Nesson et al (2008) present a k-arization method for STAG grammars.

11.11. Decoding
The use of n-gram language models in tree-generating decoding increases computational complexity significantly. One solution is to do a first pass translation without the language model, and then score the pruned search hyper graph in a second pass with the language model (Venugopal et al, 2007). Our presentation of cube pruning follows the description of Huang and Chiang (2007). Some more implementation details are presented by Li and Khudanpur (2008). To enable more efficient pruining, outside cost estimates may be obtained by first decoding with a lower-order n-gram model (Zhang and Gildea, 2008). Xiong et al (2008) introduce reordering constraints based on punctuation and maximum reordering distance for tree-based decoding.

11.12. Finite state implementations
Just as word-based and phrase-based models may be implemented with finite state toolkits, a general framework of tree transducers may subsume many of the proposed tree-based models (Graehl and Knight, 2004).

11.13. Complex models
Syntactical tree-based models may be viewed as a probabilistic building up to more complex machine translation models that use richly annotated syntactic or dependency structure, or even some form of interlingua. These systems are usually build as a series of sequential steps (Zabokrtsky et al, 2008). On the other hand, with such complex models as starting point, probabilistic components may be added. Using traditional rule-based for components for syntactic and semantic analysis for the input and a generation component for the output, allows the training of translation models for mapping of f-structures (Riezler and Maxwell, 2006). Cowan et al (2006) propose a translation model that explicitly models clause structure as aligned extended projections and that is trained discriminatively.

11.14. Parallel treebanks
Cuřín et al (2004) describe an effort to build a parallel corpus with syntactic annotation manually. The Prague Czech–English Dependency Treebank contains additional markup (Čmejrek et al, 2005). Similar projects to develop richly annotated parallel corpora are underway for Chinese–English (Palmer et al, 2005), Japanese–Chinese (Zhang et al, 2005).

11.15. Parse quality
Parse quality has obvious effects on the quality of syntax-based models. Quirk and Corston-Oliver (2006) show that relatively small treebanks of only 250 sentences give decent performance, and while there are gains with larger treebanks, they expect these to be diminishing with treebanks larger than the ones currently available, i.e. over 40,000 sentences.

11.16. Syntactic coherence between languages
Fox (2002); Hwa et al (2002) examine how well the underlying assumption of syntactic coherence between languages hold up in practice. In phrase-based systems, syntactic consitituents are not sufficient to maps units between languages (Koehn et al, 2003). Also, for establishing mappings in a tree-based transfer models, syntactic parsing of each side introduces constraints that may make it more difficult to match up sentences (Zhang and Gildea, 2004). When studying actual parallel sentences, the complexity of syntactic transfer rules to match them up are often more complex than expected (Wellington et al, 2006). Parallel tree-banks are also a source of data to examine the parallelism between two languages, although the parallelism also depends on the annotation standards (Buch-Kromann, 2007). Visulization and searching such treebanks may also provide important insights (Volk et al, 2007).