Prior Releases of
European Parliament Proceedings Parallel Corpus


This page contains information on previous releases of the Europarl corpus. Most users will want to look at the current data instead.

Version 1, the original release, contains data from April 1996 to December 2001.

Version 2 adds January 2002 to September 2003.

Unlike the current release, v1 and v2 are not in UTF-8. All languages excluding Greek are in ISO-8859-1 (Latin 1) encoding. Greek data is in ISO-8859-7.
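
In practice this means v1 and v2 text has to be re-encoded before it is combined with later releases. Below is a minimal Python sketch of that conversion. The function name `to_utf8` and the use of the language code `el` to identify Greek are assumptions made for this example and are not part of the corpus tools.

```python
def to_utf8(raw: bytes, language: str) -> bytes:
    """Re-encode v1/v2 corpus bytes as UTF-8.

    Assumes `language` is a two-letter code and that Greek is marked "el";
    every other language is treated as ISO-8859-1 (Latin 1).
    """
    source_encoding = "iso-8859-7" if language == "el" else "iso-8859-1"
    return raw.decode(source_encoding).encode("utf-8")


# The same byte value decodes to different characters in the two encodings.
latin_text = to_utf8(b"caf\xe9", "fr").decode("utf-8")
greek_text = to_utf8(b"\xe1", "el").decode("utf-8")
```

Picking the wrong source encoding does not raise an error for most bytes; it silently produces the wrong characters, so the language of each file needs to be known up front.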

Version 3 adds October 2003 to October 2006. All data is now in UTF-8.

Version 5 adds November 2007 to October 2009.

Version 6 adds November 2009 to December 2010.


Release v6

On 4 February 2011 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v5

Download


Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language      Sentences       Words
Bulgarian       229,649           -
Czech           479,636  10,770,230
Danish        2,117,839  49,615,228
German        1,985,560  48,648,697
Greek         1,344,198           -
English       2,032,006  54,720,731
Spanish       1,942,761  55,105,479
Estonian        493,198   9,455,337
Finnish       1,929,054  35,799,132
French        2,002,266  57,860,307
Hungarian       479,676  10,601,411
Italian       1,905,555  52,306,430
Lithuanian      493,204   9,731,052
Latvian         473,276  10,024,350
Dutch         2,147,195  53,459,456
Polish          387,537   8,142,067
Portuguese    1,942,700  53,799,459
Romanian        224,805   5,891,952
Slovak          487,416  10,783,688
Slovene         465,985  10,616,127
Swedish       2,037,945  45,562,972

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)    Sentences    L1 Words  English Words
Bulgarian-English            226,768           -      6,011,944
Czech-English                462,351  10,573,983     12,296,772
Danish-English             1,785,775  46,102,455     48,833,481
German-English             1,739,154  45,607,269     47,978,832
Greek-English              1,064,544           -     30,325,647
Spanish-English            1,786,594  51,551,485     49,411,045
Estonian-English             469,622   9,318,986     12,452,336
Finnish-English            1,742,553  34,123,013     47,601,416
French-English             1,825,077  54,568,499     50,551,047
Hungarian-English            455,270  10,429,935     12,111,122
Italian-English            1,737,081  49,065,283     49,981,015
Lithuanian-English           456,796   9,489,997     12,144,335
Latvian-English              453,879   9,854,124     12,051,769
Dutch-English              1,822,036  50,315,412     49,938,127
Polish-English               448,433  10,317,697     11,910,117
Portuguese-English         1,783,437  50,267,741     49,634,127
Romanian-English             222,854   5,866,203      5,908,150
Slovak-English               460,780  10,602,998     12,228,702
Slovene-English              456,818  10,475,913     12,121,729
Swedish-English            1,678,333  41,031,740     45,628,613

Release v5

On 20 January 2010 we released a further expanded and improved version of the corpus. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v3 (v4 was only released partially for WMT 2009)

Download


Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language      Sentences       Words
Danish        2,009,958  47,305,502
German        1,822,735  44,688,020
Greek         1,257,518           -
English       1,891,918  50,978,295
Spanish       1,871,700  52,503,808
Finnish       1,834,727  34,106,317
French        1,904,613  55,088,177
Italian       1,827,091  50,161,729
Dutch         2,054,417  50,926,645
Portuguese    1,849,973  51,294,994
Swedish       1,936,391  43,291,692

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)    Sentences    L1 Words  English Words
Danish-English             1,684,664  43,692,760     46,282,519
German-English             1,581,107  41,587,670     43,848,958
Greek-English                960,356           -     27,468,389
Spanish-English            1,689,850  48,860,242     46,843,295
Finnish-English            1,646,143  32,355,142     45,136,552
French-English             1,723,705  51,708,806     47,915,991
Italian-English            1,635,140  46,380,851     47,236,441
Dutch-English              1,715,710  47,477,378     47,166,762
Portuguese-English         1,681,991  47,621,552     47,000,805
Swedish-English            1,570,411  38,537,243     42,810,628


Release v3

On 28 September 2007 we released a further expanded and improved version of the corpus. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v2

Download

Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language      Sentences       Words
Danish        1,563,012  37,467,445
German        1,517,987  37,614,344
Greek           962,820  26,306,875
English       1,461,429  39,618,240
Spanish       1,476,106  41,408,300
Finnish       1,407,544  26,413,278
French        1,487,459  44,688,872
Italian       1,405,282  39,504,158
Dutch         1,616,104  39,778,617
Portuguese    1,441,203  40,862,310
Swedish       1,475,195  33,407,005

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)    Sentences    L1 Words       L2 Words
Danish-English             1,304,947  34,169,707     36,225,880
German-English             1,313,096  34,700,362     36,663,083
Greek-English                662,090  18,834,758     18,827,241
Spanish-English            1,304,116  37,870,751     36,429,274
Finnish-English            1,257,720  24,895,790     34,802,617
French-English             1,334,080  41,573,117     37,436,222
Italian-English            1,251,315  36,411,166     36,510,033
Dutch-English              1,326,412  36,784,168     36,690,392
Portuguese-English         1,287,757  37,342,426     36,355,907
Swedish-English            1,164,536  28,882,142     32,053,628


Release v2

On 4 December 2003 we released an extended and improved version of the corpus. Most of what is written below for Version 1 still applies.

Changes

Download


Release v1

The goal of the processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we separated out punctuation and identified sentence boundaries. We sentence-aligned the data using a tool based on the Church and Gale algorithm.
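
To give a feel for length-based sentence alignment, here is a minimal Python sketch. It is not the tool used to build the corpus, and it does not reproduce the probabilistic cost of the Church and Gale algorithm. It only illustrates the underlying idea that translated sentences tend to have similar lengths, using dynamic programming over a small set of alignment patterns. The function name `align`, the `BEADS` table, and its penalty values are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Allowed alignment patterns: (source sentences, target sentences) -> penalty.
# The penalty values are illustrative assumptions, not tuned parameters.
BEADS = {(1, 1): 0.0, (2, 1): 2.0, (1, 2): 2.0, (1, 0): 5.0, (0, 1): 5.0}

Bead = Tuple[List[str], List[str]]


def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Align two sentence lists by dynamic programming over character lengths."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i source and j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                # Relative length mismatch in [0, 1]; 0 means equal lengths.
                mismatch = abs(len_s - len_t) / max(len_s, len_t, 1)
                candidate = cost[i][j] + penalty + mismatch
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the chosen patterns.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align` with two sentence lists returns a list of pairs, where each pair holds the source sentences and the target sentences that were grouped together. A pair with an empty side marks a sentence that has no counterpart.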

Size of the corpus

Version 1.1 covers April 1996 to December 2001. It contains roughly 20 million words in 740,000 sentences per language.

Download

Currently available for download:


Test Sets

This common test set was used in the Koehn/Och/Marcu ACL 2003 paper. It is taken from the Q4/2000 portion of the data (2000-10 to 2000-12), with the other parts used for training.
  • common test set (7MB).

This is a superset of that test set, with true-casing:

  • common test set 2 (14MB).

Known Bugs

Some special HTML entities and noisy characters are not removed from the data.

Terms of Use

We are not aware of any copyright restrictions on the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data.
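
The recommended test period can be expressed as a simple date check. The following is a minimal Python sketch, assuming each document is available together with its session date. The names `is_test_document` and `split_by_date` and the `(date, text)` pair format are hypothetical and not part of the corpus tools.

```python
from datetime import date
from typing import Iterable, List, Tuple

# Recommended test period: the last quarter of 2000 (2000-10 until 2000-12).
TEST_START = date(2000, 10, 1)
TEST_END = date(2000, 12, 31)


def is_test_document(session_date: date) -> bool:
    """Return True if the session falls inside the recommended test period."""
    return TEST_START <= session_date <= TEST_END


def split_by_date(
    documents: Iterable[Tuple[date, str]]
) -> Tuple[List[str], List[str]]:
    """Split (session_date, text) pairs into training and test texts."""
    train: List[str] = []
    test: List[str] = []
    for session_date, text in documents:
        if is_test_document(session_date):
            test.append(text)
        else:
            train.append(text)
    return train, test
```

Keeping the period boundaries as named constants makes it easy to confirm that experiments reported on this data use the same held-out quarter.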