Prior Releases of
European Parliament Proceedings Parallel Corpus
This page contains information on previous releases of the Europarl corpus.
Most users will want to look at the current data instead.
Version 1, the original release, contains data from April 1996 to December 2001.
Version 2 adds January 2002 to September 2003.
Unlike the current release, v1 and v2 are not in UTF-8. All languages excluding Greek are in ISO-8859-1 (Latin 1) encoding. Greek data is in ISO-8859-7.
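For readers who want v1 or v2 text in UTF-8 like the later releases, the conversion can be sketched as follows. This is a minimal sketch, not a tool shipped with the corpus: the function names are hypothetical, and the language code `el` for Greek is an assumption about how files are labeled.

```python
from pathlib import Path

# Encodings as stated on this page for v1 and v2.
GREEK_ENCODING = "iso-8859-7"
DEFAULT_ENCODING = "iso-8859-1"


def legacy_encoding(language_code: str) -> str:
    """Return the v1/v2 encoding for a language code ('el' is assumed to mean Greek)."""
    return GREEK_ENCODING if language_code == "el" else DEFAULT_ENCODING


def to_utf8(raw: bytes, language_code: str) -> bytes:
    """Decode legacy bytes with the matching ISO-8859 variant and re-encode as UTF-8."""
    return raw.decode(legacy_encoding(language_code)).encode("utf-8")


def convert_file(src: Path, dst: Path, language_code: str) -> None:
    """Convert one file on disk; both paths are supplied by the caller."""
    dst.write_bytes(to_utf8(src.read_bytes(), language_code))
```

Decoding Greek data as Latin 1 does not raise an error, since every byte is valid in both encodings, so choosing the encoding by language is the only safeguard against silently garbled text.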
Version 3 adds October 2003 to October 2006. All data is now in UTF-8.
Version 5 adds November 2007 to October 2009.
Version 6 adds November 2009 to December 2010.
Release v6
On 4 February 2011 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.
Changes since v5
- added 11/2009 - 12/2010 data, now up to around 50 million words per language
- added corpora for 10 more official languages of more recent EU member countries (Bulgarian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, and Slovene), albeit smaller in size, from 01/2007
- further refined preprocessing, cleaning
Download
- source release (text files with preprocessing tools and sentence aligner), 1.3 GB
- tools (preprocessing tools and sentence aligner only), 8.6 KB
- parallel corpus Bulgarian-English, 23 MB, 01/2007-12/2010
- parallel corpus Czech-English, 43 MB, 01/2007-12/2010
- parallel corpus Danish-English, 164 MB, 04/1996-12/2010
- parallel corpus German-English, 172 MB, 04/1996-12/2010
- parallel corpus Greek-English, 125 MB, 04/1996-12/2010
- parallel corpus Spanish-English, 170 MB, 04/1996-12/2010
- parallel corpus Estonian-English, 41 MB, 01/2007-12/2010
- parallel corpus Finnish-English, 163 MB, 01/1997-12/2010
- parallel corpus French-English, 177 MB, 04/1996-12/2010
- parallel corpus Hungarian-English, 43 MB, 01/2007-12/2010
- parallel corpus Italian-English, 172 MB, 04/1996-12/2010
- parallel corpus Lithuanian-English, 41 MB, 01/2007-12/2010
- parallel corpus Latvian-English, 41 MB, 01/2007-12/2010
- parallel corpus Dutch-English, 174 MB, 04/1996-12/2010
- parallel corpus Polish-English, 42 MB, 01/2007-12/2010
- parallel corpus Portuguese-English, 173 MB, 04/1996-12/2010
- parallel corpus Romanian-English, 21 MB, 01/2007-12/2010
- parallel corpus Slovak-English, 43 MB, 01/2007-12/2010
- parallel corpus Slovene-English, 40 MB, 01/2007-12/2010
- parallel corpus Swedish-English, 155 MB, 01/1997-12/2010
Size of the Corpus
Sizes for single-language data after tokenizing and removing XML.
Language | Sentences | Words |
Bulgarian | 229,649 | - |
Czech | 479,636 | 10,770,230 |
Danish | 2,117,839 | 49,615,228 |
German | 1,985,560 | 48,648,697 |
Greek | 1,344,198 | - |
English | 2,032,006 | 54,720,731 |
Spanish | 1,942,761 | 55,105,479 |
Estonian | 493,198 | 9,455,337 |
Finnish | 1,929,054 | 35,799,132 |
French | 2,002,266 | 57,860,307 |
Hungarian | 479,676 | 10,601,411 |
Italian | 1,905,555 | 52,306,430 |
Lithuanian | 493,204 | 9,731,052 |
Latvian | 473,276 | 10,024,350 |
Dutch | 2,147,195 | 53,459,456 |
Polish | 387,537 | 8,142,067 |
Portuguese | 1,942,700 | 53,799,459 |
Romanian | 224,805 | 5,891,952 |
Slovak | 487,416 | 10,783,688 |
Slovene | 465,985 | 10,616,127 |
Swedish | 2,037,945 | 45,562,972 |
Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.
Parallel Corpus (L1-L2) | Sentences | L1 Words | English Words |
Bulgarian-English | 226,768 | - | 6,011,944 |
Czech-English | 462,351 | 10,573,983 | 12,296,772 |
Danish-English | 1,785,775 | 46,102,455 | 48,833,481 |
German-English | 1,739,154 | 45,607,269 | 47,978,832 |
Greek-English | 1,064,544 | - | 30,325,647 |
Spanish-English | 1,786,594 | 51,551,485 | 49,411,045 |
Estonian-English | 469,622 | 9,318,986 | 12,452,336 |
Finnish-English | 1,742,553 | 34,123,013 | 47,601,416 |
French-English | 1,825,077 | 54,568,499 | 50,551,047 |
Hungarian-English | 455,270 | 10,429,935 | 12,111,122 |
Italian-English | 1,737,081 | 49,065,283 | 49,981,015 |
Lithuanian-English | 456,796 | 9,489,997 | 12,144,335 |
Latvian-English | 453,879 | 9,854,124 | 12,051,769 |
Dutch-English | 1,822,036 | 50,315,412 | 49,938,127 |
Polish-English | 448,433 | 10,317,697 | 11,910,117 |
Portuguese-English | 1,783,437 | 50,267,741 | 49,634,127 |
Romanian-English | 222,854 | 5,866,203 | 5,908,150 |
Slovak-English | 460,780 | 10,602,998 | 12,228,702 |
Slovene-English | 456,818 | 10,475,913 | 12,121,729 |
Swedish-English | 1,678,333 | 41,031,740 | 45,628,613 |
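Counts of this kind can be reproduced approximately from a sentence-aligned pair of files. The sketch below assumes one sentence per line, lines paired by position, and text that is already tokenized so that whitespace separates words; the function name is hypothetical, and the exact counting rules behind the table above are not stated on this page.

```python
from itertools import zip_longest
from typing import Iterable, Tuple


def parallel_sizes(src_lines: Iterable[str], en_lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (sentence pairs, source-language words, English words).

    Assumes one sentence per line and whitespace-separated tokens.
    """
    pairs = src_words = en_words = 0
    for src, en in zip_longest(src_lines, en_lines):
        if src is None or en is None:
            # A sentence-aligned corpus must have the same line count on both sides.
            raise ValueError("the two sides have different numbers of lines")
        pairs += 1
        src_words += len(src.split())
        en_words += len(en.split())
    return pairs, src_words, en_words
```

Because the function accepts any iterable of strings, two open file handles can be passed directly, which avoids loading a corpus of this size into memory.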
Release v5
On 20 January 2010 we released a further expanded and improved version of the corpus. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.
Changes since v3 (v4 was only released partially for WMT 2009)
- added 11/2007 - 10/2009 data, now up to 55 million words per language
- further refined preprocessing, cleaning
Download
- source release (text files with preprocessing tools and sentence aligner), 1616 MB
- tools (preprocessing tools and sentence aligner only), 8.1 KB
- parallel corpus Danish-English, 163 MB, 04/1996-10/2009
- parallel corpus German-English, 164 MB, 04/1996-10/2009
- parallel corpus Greek-English, 120 MB, 04/1996-10/2009
- parallel corpus Spanish-English, 169 MB, 04/1996-10/2009
- parallel corpus Finnish-English, 162 MB, 01/1997-10/2009
- parallel corpus French-English, 176 MB, 04/1996-10/2009
- parallel corpus Italian-English, 170 MB, 04/1996-10/2009
- parallel corpus Dutch-English, 172 MB, 04/1996-10/2009
- parallel corpus Portuguese-English, 172 MB, 04/1996-10/2009
- parallel corpus Swedish-English, 153 MB, 01/1997-10/2009
Size of the Corpus
Sizes for single-language data after tokenizing and removing XML.
Language | Sentences | Words |
Danish | 2,009,958 | 47,305,502 |
German | 1,822,735 | 44,688,020 |
Greek | 1,257,518 | - |
English | 1,891,918 | 50,978,295 |
Spanish | 1,871,700 | 52,503,808 |
Finnish | 1,834,727 | 34,106,317 |
French | 1,904,613 | 55,088,177 |
Italian | 1,827,091 | 50,161,729 |
Dutch | 2,054,417 | 50,926,645 |
Portuguese | 1,849,973 | 51,294,994 |
Swedish | 1,936,391 | 43,291,692 |
Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.
Parallel Corpus (L1-L2) | Sentences | L1 Words | English Words |
Danish-English | 1,684,664 | 43,692,760 | 46,282,519 |
German-English | 1,581,107 | 41,587,670 | 43,848,958 |
Greek-English | 960,356 | - | 27,468,389 |
Spanish-English | 1,689,850 | 48,860,242 | 46,843,295 |
Finnish-English | 1,646,143 | 32,355,142 | 45,136,552 |
French-English | 1,723,705 | 51,708,806 | 47,915,991 |
Italian-English | 1,635,140 | 46,380,851 | 47,236,441 |
Dutch-English | 1,715,710 | 47,477,378 | 47,166,762 |
Portuguese-English | 1,681,991 | 47,621,552 | 47,000,805 |
Swedish-English | 1,570,411 | 38,537,243 | 42,810,628 |
Release v3
On 28 September 2007 we released a further expanded and improved version of the corpus. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.
Changes since v2
- added 10/2003 - 10/2006 data, now up to 44 million words per language
- all data is released in UTF-8 encoding
- some data now includes markup indicating the text's original language
- data previously in the wrong language has been detected and removed
- aligned data is not tokenized, but a tokenizer is provided
- further refined preprocessing
Download
- source release (text files with preprocessing tools and sentence aligner), 783 MB
- tools (preprocessing tools and sentence aligner only), 8.0 KB
- parallel corpus Danish-English, 126 MB, 04/1996-10/2006
- parallel corpus German-English, 136 MB, 04/1996-10/2006
- parallel corpus Greek-English, 82 MB, 04/1996-10/2006
- parallel corpus Spanish-English, 130 MB, 04/1996-10/2006
- parallel corpus Finnish-English, 124 MB, 01/1997-10/2006
- parallel corpus French-English, 136 MB, 04/1996-10/2006
- parallel corpus Italian-English, 130 MB, 04/1996-10/2006
- parallel corpus Dutch-English, 133 MB, 04/1996-10/2006
- parallel corpus Portuguese-English, 132 MB, 04/1996-10/2006
- parallel corpus Swedish-English, 114 MB, 01/1997-10/2006
Size of the Corpus
Sizes for single-language data after tokenizing and removing XML.
Language | Sentences | Words |
Danish | 1,563,012 | 37,467,445 |
German | 1,517,987 | 37,614,344 |
Greek | 962,820 | 26,306,875 |
English | 1,461,429 | 39,618,240 |
Spanish | 1,476,106 | 41,408,300 |
Finnish | 1,407,544 | 26,413,278 |
French | 1,487,459 | 44,688,872 |
Italian | 1,405,282 | 39,504,158 |
Dutch | 1,616,104 | 39,778,617 |
Portuguese | 1,441,203 | 40,862,310 |
Swedish | 1,475,195 | 33,407,005 |
Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.
Parallel Corpus (L1-L2) | Sentences | L1 Words | L2 Words |
Danish-English | 1,304,947 | 34,169,707 | 36,225,880 |
German-English | 1,313,096 | 34,700,362 | 36,663,083 |
Greek-English | 662,090 | 18,834,758 | 18,827,241 |
Spanish-English | 1,304,116 | 37,870,751 | 36,429,274 |
Finnish-English | 1,257,720 | 24,895,790 | 34,802,617 |
French-English | 1,334,080 | 41,573,117 | 37,436,222 |
Italian-English | 1,251,315 | 36,411,166 | 36,510,033 |
Dutch-English | 1,326,412 | 36,784,168 | 36,690,392 |
Portuguese-English | 1,287,757 | 37,342,426 | 36,355,907 |
Swedish-English | 1,164,536 | 28,882,142 | 32,053,628 |
Release v2
On 4 December 2003 we released an extended and improved version
of the corpus. Most of what is written below for Version 1 still
applies.
Changes
- added 1/2002 - 9/2003 data, now up to 28 million
words per language
- cleaned up preprocessing
- ships with a sentence aligner that can create a parallel corpus
for any pair of languages and lets you
plug in your own tokenizer and sentence splitter
Download
- source release (text files with sentence
aligner), 559 MB
- parallel corpus Danish-English, 99 MB, 04/1996-09/2003
- parallel corpus German-English, 105 MB, 04/1996-09/2003
- parallel corpus Greek-English, 75 MB, 04/1996-02/2002
- parallel corpus Spanish-English, 101 MB, 04/1996-09/2003
- parallel corpus Finnish-English, 91 MB, 01/1997-09/2003
- parallel corpus French-English, 103 MB, 04/1996-09/2003
- parallel corpus Italian-English, 101 MB, 04/1996-09/2003
- parallel corpus Dutch-English, 102 MB, 04/1996-09/2003
- parallel corpus Portuguese-English, 102 MB, 04/1996-09/2003
- parallel corpus Swedish-English, 90 MB, 01/1997-09/2003
Release v1
The goal of the processing was to generate sentence-aligned text
for statistical machine translation systems. For
this purpose we extracted matching items and labeled them with
corresponding document IDs. Using a preprocessor we separated out
punctuation and identified sentence boundaries. We sentence-aligned
the data using a tool based on the Church
and Gale algorithm.
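The page names the alignment algorithm but does not describe the tool itself. As an illustration only, here is a minimal sketch of length-based sentence alignment in the style of the published Gale and Church (1993) model: sentence lengths in characters are compared under a normal distribution, and dynamic programming picks the cheapest sequence of 1-1, 1-0, 0-1, 2-1, 1-2, and 2-2 groupings. The parameter values are those commonly quoted for that model and are an assumption here; the aligner shipped with the corpus may differ in all details.

```python
import math
from typing import List, Tuple

# Values commonly quoted for the published model; assumed, not taken from the tool.
MEAN_RATIO = 1.0   # expected target characters per source character
VARIANCE = 6.8     # variance of the length difference per character
PRIORS = {         # prior probability of each grouping (source count, target count)
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}


def match_cost(len_src: int, len_tgt: int) -> float:
    """Negative log probability that spans of these lengths are translations."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / MEAN_RATIO) / 2
    delta = (len_tgt - len_src * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(tail, 1e-300))


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return aligned groups of sentence indices, in document order."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]))
                total = cost[i][j] - math.log(prior) + step
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end of both documents to the start.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups
```

Each returned pair lists the source and target sentence indices that belong together, so `([0, 1], [0])` means the first two source sentences align with the first target sentence. The sketch uses only character lengths and ignores the words themselves.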
Size of the corpus
Version 1.1 covers April 1996 to December 2001.
It contains roughly 20 million words
in 740,000 sentences per language.
Download
Currently available for download:
- Danish-English:
document aligned (80MB),
sentence aligned (74MB).
- German-English:
document aligned (77MB),
sentence aligned (70MB).
- Greek-English:
document aligned (80MB),
sentence aligned (67MB).
- Spanish-English:
document aligned (83MB),
sentence aligned (75MB).
- Finnish-English:
document aligned (65MB),
sentence aligned (60MB).
- French-English:
document aligned (76MB),
sentence aligned (70MB).
- Dutch-English:
document aligned (82MB),
sentence aligned (74MB).
- Italian-English:
document aligned (81MB),
sentence aligned (73MB).
- Portuguese-English:
document aligned (76MB),
sentence aligned (69MB).
- Swedish-English:
document aligned (71MB),
sentence aligned (61MB).
Test Sets
This common test set was used in the
Koehn/Och/Marcu ACL 2003
paper. It is taken from the Q4/2000 portion of the data
(2000-10 to 2000-12),
with the other parts used for training.
common test set (7MB).
This is a superset of that test set, with true-casing:
common test set 2 (14MB).
Known Bugs
Some special HTML entities and noisy characters are not
removed from the data.
Terms of Use
We are not aware of any copyright restrictions on the material.
If you use this data in your research, please contact
pkoehn@inf.ed.ac.uk.
Please let us know if you find problems with the data
or if you want the data for other language pairs.
We recommend using the last quarter of 2000 for testing
(2000-10 until 2000-12) for consistency in reporting
research results on this data.
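The recommended split can be sketched as a date filter. This is a minimal sketch under the assumption that each document comes with a session date; the function name, the `(date, document ID)` input shape, and the IDs themselves are hypothetical and not part of the corpus tools.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Test period recommended on this page: the last quarter of 2000.
TEST_START = date(2000, 10, 1)
TEST_END = date(2000, 12, 31)


def split_by_date(docs: Iterable[Tuple[date, str]]) -> Dict[str, List[str]]:
    """Put documents dated in Q4/2000 into 'test' and all others into 'train'."""
    split: Dict[str, List[str]] = {"train": [], "test": []}
    for day, doc_id in docs:
        key = "test" if TEST_START <= day <= TEST_END else "train"
        split[key].append(doc_id)
    return split
```

Splitting by whole documents rather than by individual sentences keeps every sentence of a test session out of the training data.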