Shared Task: Bilingual Document Alignment

Motivation

Parallel corpora are especially important for statistical machine translation, but so far the collection of such data within the academic research community has been ad hoc and limited in scale. To promote research on this problem, we organize a shared task on aligning bilingual documents from crawled websites.

Task

The task is to identify pairs of English and French documents from a given collection of documents such that one document is the translation of the other. As possible pairs we consider all pairs of documents from the same webdomain for which the source side has been identified as (mostly) English and the target side as (mostly) French.
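As a rough illustration of how the candidate space grows, the following sketch counts the possible pairs for one webdomain from per-page language labels. The function name and the language codes "en" and "fr" are assumptions made for this example, not part of the task definition.

```python
from collections import Counter


def count_possible_pairs(page_langs):
    """Count candidate pairs for one webdomain.

    page_langs: iterable with one language label per crawled page.
    ASSUMPTION: English pages are labelled "en" and French pages "fr".
    Every English page can be paired with every French page, so the
    number of candidates is the product of the two counts.
    """
    counts = Counter(page_langs)
    return counts["en"] * counts["fr"]


if __name__ == "__main__":
    # 919 English and 779 French pages, as in the first training webdomain.
    labels = ["en"] * 919 + ["fr"] * 779
    print(count_possible_pairs(labels))
```

Pages in other languages do not contribute to the count, which is why the number of candidates depends only on the English and French page totals.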

Terminology

Unfortunately, the notion of domain is ambiguous in NLP applications. To avoid confusion we will instead use the term webdomain to refer to content from a specific website, e.g., "This page is from the statmt.org webdomain". We distinguish between webdomains using their Fully Qualified Domain Name (FQDN). Thus, www.example.com and example.com are considered to be different webdomains.
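A minimal sketch of this definition, assuming that URLs carry a scheme such as http:// so that the host name can be parsed. The helper names webdomain and same_webdomain are hypothetical.

```python
from urllib.parse import urlsplit


def webdomain(url):
    """Return the host name (FQDN) of a URL in lower case, without the port.

    ASSUMPTION: the URL includes a scheme ("http://..."); without one,
    urlsplit cannot identify the host and an empty string is returned.
    """
    host = urlsplit(url).hostname
    return host.lower() if host else ""


def same_webdomain(url_a, url_b):
    """True if both URLs have the same, non-empty FQDN."""
    a, b = webdomain(url_a), webdomain(url_b)
    return a != "" and a == b


if __name__ == "__main__":
    print(webdomain("http://www.example.com/en/index.html"))
    print(same_webdomain("http://www.example.com/", "http://example.com/"))
```

Because the comparison uses the full host name, no attempt is made to strip a leading "www." or to reduce the host to a registered domain.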

We will use source to denote English pages and target for French ones. This does not imply that translation was performed in that direction. In fact, we cannot know whether translation from one side to the other was performed at all; both sides could be translations of a document in a third language.

Training data

As training data we provide a set of 1,624 EN-FR pairs from 49 webdomains. The number of pairs per webdomain varies between 4 and over 200. All pairs are from within a single webdomain; possible matches between two different webdomains, e.g. siemens.de and siemens.com, are not considered in this task.

We also provide mirrors of all the pages in the webdomains which were crawled using httrack.

All crawls in a single file: lett.train.tgz (8.7G, md5:81b99f1a95a5153009bfd902b99857ab, sha1:f05754d63ba9b3fbc1b8e82bdbb45c3c950be129)

webdomain source pages target pages possible pairings training pairs
bugadacargnel.com 919 779 715,901 19
cbsc.ca 1,595 904 1,441,880 20
cineuropa.mobi 23,050 15,972 368,154,600 73
creationwiki.org 8,417 203 1,708,651 22
eu2007.de 3,201 2,488 7,964,088 11
eu.blizzard.com 10,493 6,640 69,673,520 10
forcesavenir.qc.ca 3,592 3,982 14,303,344 8
galacticchannelings.com 4,231 1,283 5,428,373 9
golftrotter.com 377 361 136,097 8
iiz-dvv.de 1,160 894 1,037,040 67
ironmaidencommentary.com 6,028 635 3,827,780 41
kicktionary.de 2,752 888 2,443,776 29
kustu.com 1,544 1,511 2,332,984 13
manchesterproducts.com 15,621 9,651 150,758,271 10
minelinks.com 736 212 156,032 66
pawpeds.com 983 135 132,705 19
rehazenter.lu 201 317 63,717 16
santabarbara-online.com 1,151 1,099 1,264,949 11
schackportalen.nu 33 29 957 14
tsb.gc.ca 5,885 5,828 34,297,780 236
virtualhospice.ca 43,500 22,327 971,224,500 46
www.acted.org 3,333 2,431 8,102,523 21
www.antennas.biz 812 327 265,524 30
www.artsvivants.ca 5,487 1,368 7,506,216 12
www.bonnke.net 414 129 53,406 27
www.bugadacargnel.com 919 779 715,901 7
www.cgfmanet.org 9,241 6,260 57,848,660 25
www.cyberspaceministry.org 1,534 958 1,469,572 29
www.dakar.com 17,420 14,582 254,018,440 45
www.dfo-mpo.gc.ca 25,277 19,087 482,462,099 97
www.ec.gc.ca 12,266 15,404 188,945,464 26
www.eohu.ca 2,277 2,136 4,863,672 4
www.eu2005.lu 5,649 5,704 32,221,896 34
www.eu2007.de 3,249 2,535 8,236,215 11
www.fao.org 11,931 5,004 59,702,724 6
www.inst.at 3,203 543 1,739,229 62
www.krn.org 115 115 13,225 67
www.lameca.org 692 1,567 1,084,364 6
www.luontoportti.com 3,645 1,796 6,546,420 30
www.nato.int 40,063 8,773 351,472,699 36
www.nauticnews.com 24,325 43,045 1,047,069,625 21
www.pawpeds.com 1,011 136 137,496 43
www.prohelvetia.ch 5,209 4,421 23,028,989 7
www.socialwatch.org 13,803 2,419 33,389,457 21
www.summerlea.ca 434 338 146,692 58
www.the-great-adventure.fr 2,038 2,460 5,013,480 18
www.ushmm.org 10,472 967 10,126,424 26
www.usw.ca 5,006 2,247 11,248,482 83
www.vinci.com 3,564 3,374 12,024,936 24

Data format

The training pairs are one pair per line:

Source_URL<TAB>Target_URL\n
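Reading this format takes only a few lines. The sketch below is one possible reader; the function name read_pairs is hypothetical, and the file name train.pairs in the usage note is taken from the baseline output further down.

```python
def read_pairs(lines):
    """Parse 'Source_URL<TAB>Target_URL' lines into (source, target) tuples.

    lines: any iterable of text lines, e.g. an open file.
    Empty lines are skipped; a line without exactly one TAB raises ValueError.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        source_url, target_url = line.split("\t")
        pairs.append((source_url, target_url))
    return pairs


if __name__ == "__main__":
    sample = ["http://example.com/en/a.html\thttp://example.com/fr/a.html\n"]
    print(read_pairs(sample))
```

To read a file, pass the open handle, for example: with open("train.pairs", encoding="utf-8") as f: pairs = read_pairs(f).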

For the crawled data we provide one file per webdomain in .lett format adapted from Bitextor. This is a plain text format with one line per page. Each line consists of 6 tab-separated values:

We make sure that the language id is reliable, at least for the documents in the train and test pairs. We also ensure that all known pairs have been crawled and no URLs are missing from the crawls.

Text extraction was performed using an HTML5 parser. As the original HTML pages are available, participants are welcome to implement their own text extraction, for example to remove boilerplate.

To facilitate use of the .lett files we provide a simple reader class in Python: lett.py.
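For readers who prefer not to depend on lett.py, the following is an independent sketch of a reader, not the provided class. It rests on an assumption that is not spelled out in this description: that the six fields are, in order, language ID, MIME type, encoding, URL, Base64-encoded HTML and Base64-encoded extracted text. Please check this against lett.py before relying on it. The names Page, parse_lett_line and read_lett are chosen for this example.

```python
import base64
import gzip
from collections import namedtuple

# ASSUMPTION: this field order is not stated in the task description.
Page = namedtuple("Page", "lang mime_type encoding url html text")


def parse_lett_line(line):
    """Parse one raw line (bytes) of a .lett file into a Page."""
    fields = line.rstrip(b"\n").split(b"\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    lang, mime_type, encoding, url, html_b64, text_b64 = fields
    return Page(
        lang=lang.decode("utf-8"),
        mime_type=mime_type.decode("utf-8"),
        encoding=encoding.decode("utf-8"),
        url=url.decode("utf-8", errors="replace"),
        # ASSUMPTION: both payloads are Base64-encoded UTF-8.
        html=base64.b64decode(html_b64).decode("utf-8", errors="replace"),
        text=base64.b64decode(text_b64).decode("utf-8", errors="replace"),
    )


def read_lett(path):
    """Yield one Page per line of a gzip-compressed .lett file."""
    with gzip.open(path, "rb") as fh:
        for line in fh:
            yield parse_lett_line(line)
```

The file is read in binary mode so that lines are split only on the newline byte, and the generator keeps memory use low for large crawls.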

Additionally, we have identified spans of French text for which we produced English translations using MT. These translations are not part of the .lett files but are provided separately: translations.train.gz. The format for the source segments and target segments is

URL<TAB>Text
where the same URL might occur multiple times if several lines/spans of French text were found. The URLs can be used to identify the corresponding documents in the .lett files.
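Since a URL can appear on several lines, it is convenient to group the spans per document. The sketch below shows one way to do this; the function name read_translations is hypothetical, and the text is assumed to contain no line breaks of its own.

```python
from collections import defaultdict


def read_translations(lines):
    """Group 'URL<TAB>Text' lines into {url: [text, text, ...]}.

    lines: any iterable of text lines.
    Only the first TAB separates URL and text, so TABs inside the text
    are preserved. Spans keep the order in which they appear.
    """
    spans = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        url, text = line.split("\t", 1)
        spans[url].append(text)
    return dict(spans)


if __name__ == "__main__":
    sample = [
        "http://example.com/fr/a.html\tHello world\n",
        "http://example.com/fr/a.html\tSecond span\n",
    ]
    print(read_translations(sample))
```

For the compressed file, the lines can be supplied with gzip.open("translations.train.gz", "rt", encoding="utf-8") from the standard library.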

Test data

For testing, we provide 203 additional crawls of new webdomains, distinct from those in the training data, in the same format. The official test pairs were released after the completion of the evaluation campaign.

All test .lett files in one file: lett.test.tgz (16G, md5:a16f7afdcf7de3c4bc992f0451ef89a3, sha1:6b978ee34c8803876e0dad5b760eccc3957e3a5e)
As for the training data, we provide translations of all French text spans: translations.test.tgz

Test set statistics:

webdomain source pages target pages possible pairings
www.domainepechlaurier.com 24 26 624
www.desmarais-robitaille.com 9,496 10,621 100,857,016
italiasullarete.it 3,630 2,959 10,741,170
egodesign.ca 11,376 7,384 84,000,384
www.gameonly.com 1,552 9,953 15,447,056
www.ledindon.com 1,995 2,019 4,027,905
ww-corp.com 53 34 1,802
www.specialimmo.com 92 92 8,464
www.festivalstoria.it 8 4 32
conancompletist.com 163 107 17,441
www.fsm-group.de 128 77 9,856
www.dvv-international.de 1,449 1,131 1,638,819
www.cornwall.ca 610 458 279,380
www.actualites-electroniques.com 60,906 834 50,795,604
www.epilepsiemuseum.org 214 178 38,092
www.lalettrediplomatique.fr 239 4,300 1,027,700
www.laprairie.ch 245 242 59,290
aucoeurduvin.com 19 50 950
bateaux-de-saint-malo.com 336 323 108,528
www.eaglebrand.ca 291 289 84,099
www.casaholidaysilvia.com 27 20 540
aeht.org 840 687 577,080
harmonie.cz 9 3 27
www.lifegrid.fr 211 276 58,236
www.dcc-cdc.gc.ca 669 668 446,892
selecta-tech.de 891 788 702,108
www.nserc-crsng.gc.ca 7,237 5,283 38,233,071
myriad-online.com 5,923 4,258 25,220,134
www.international.icomos.org 8,064 6,797 54,811,008
www.ottawariver.org 48 34 1,632
www.sirc-csars.gc.ca 569 546 310,674
communaute-dame.qc.ca 360 239 86,040
www.verderber.org 23 23 529
buehrle.ch 531 472 250,632
www.oacc.info 2,807 1,504 4,221,728
www.info-turk.be 186 242 45,012
www.lopera.be 8 5 40
busaroundglobe.com 851 633 538,683
www.consulfrance-atlanta.org 2,502 2,240 5,604,480
www.eurozine.com 18,096 168 3,040,128
artfactories.net 2,847 91,451 260,360,997
www.musee-mccord.qc.ca 4,265 9,774 41,686,110
vade-retro.fr 6 10 60
bjnewlife.org 12,869 1,129 14,529,101
panhuasca.org.br 121 116 14,036
www.hydrel.ch 3,421 1,480 5,063,080
mammusique.com 39 34 1,326
www.iaapa.de 40 1 40
sustainability.suncor.com 771 768 592,128
www.haro.com 2,125 1,022 2,171,750
www.swtor.com 2,773 644 1,785,812
www.genievres.com 90 113 10,170
www.lupusae.com 2,187 451 986,337
www.portboulogne.com 105 135 14,175
www.palaminy.com 8 10 80
hrcouncil.ca 381 389 148,209
www.axa.com 3,541 3,930 13,916,130
plongeecavalaire.com 13 14 182
www.presepiovenegono.it 399 372 148,428
www.fasska.com 49 50 2,450
www.rcmp-grc.gc.ca 10,952 10,914 119,530,128
www.lecanville.com 29 24 696
harasdesfrettes.com 110 114 12,540
www.mecanelec.com 147 137 20,139
projectavalon.net 5,475 271 1,483,725
www.carus-verlag.com 1,912 4,191 8,013,192
www.parischoralsociety.org 27 24 648
www.world-governance.org 2,840 2,481 7,046,040
www.eurovia.org 1,862 1,994 3,712,828
la-coulonniere.fr 21 19 399
www.jerome-alquie.com 90 71 6,390
www.confrontations.info 99 2,680 265,320
www.zigiz.com 502 162 81,324
www.lvbeethoven.com 341 275 93,775
www.coe.int 9,568 1,852 17,719,936
bopsecrets.org 563 95 53,485
www.kazior5.com 7 5 35
www.elenacaffe1863.com 3 4 12
www.pc.gc.ca 5,728 8,645 49,518,560
www.santegidio.org 4,198 2,051 8,610,098
wise.net 98 91 8,918
www.antebiel.com 91 252 22,932
www.cra-arc.gc.ca 14,000 9,351 130,914,000
fourdirectionsteachings.com 38 34 1,292
www.aertssen.com 158 79 12,482
www.plume-noire.com 1,568 1,057 1,657,376
ottawaheart.ca 953 792 754,776
www.redciencia.cu 297 273 81,081
www.technip.com 1,231 1,022 1,258,082
www.mayafiles.com 83 60 4,980
chablis-geoffroy.com 69 70 4,830
shorterworkweek.com 58 50 2,900
www.directkite.com 170 81 13,770
www.tatin.org 104 137 14,248
poubille.fr 7 122 854
www.good-will.ch 150 145 21,750
www.provencegiteventoux.fr 17 16 272
belnois.com 56 63 3,528
histalu.org 9,909 9,265 91,806,885
www.eufic.org 11,349 4,823 54,736,227
peregrinoslh.com 7 2 14
www.planet-diversity.org 573 73 41,829
www.toucherdubois.ca 3,148 3,249 10,227,852
www.dan42.com 112 90 10,080
cmhg-phmc.gc.ca 3,016 1,728 5,211,648
sciences.amatheurs.fr 91 118 10,738
sebsauvage.net 25,995 29,194 758,898,030
jointhealth.org 783 492 385,236
www.undefine.ca 69 13 897
cinedoc.org 49,672 49,215 2,444,607,480
www.ccpsa.ca 127 107 13,589
gadal-catharisme.org 132 129 17,028
www.tsb-bst.gc.ca 5,311 5,230 27,776,530
www.redcross.int 1,229 1,176 1,445,304
www.dural.de 229 184 42,136
www.garoo.net 16,984 9,939 168,803,976
www.raab-gruppe.de 55 62 3,410
www.nostalgic-images.co.uk 6,933 2,694 18,677,502
www.generalhieu.com 494 346 170,924
massviolence.org 1,489 497 740,033
julieguenette.com 50 16 800
arabpressnetwork.org 2,011 654 1,315,194
www.rfimusique.com 659 11,891 7,836,169
www.pjo.ca 26 18 468
fmnews.com 82 82 6,724
www.phytoclick.com 678 568 385,104
varengeville-sur-mer.fr 13 84 1,092
www.wipo.int 23,763 1,068 25,378,884
www.arabhumanrights.org 165 11 1,815
www.oras.com 4,040 1,518 6,132,720
www.educweb.org 1,609 4,482 7,211,538
www.conidia.fr 78 92 7,176
www.oag-bvg.gc.ca 10,872 10,865 118,124,280
www.kinnarps.com 11,547 3,118 36,003,546
www.iisd.ca 16,856 3,989 67,238,584
technologeeko.com 4,715 1,593 7,510,995
www.phares-balises.fr 996 572 569,712
www.metisse-music.com 667 634 422,878
www.bergerfoundation.ch 9,843 2,135 21,014,805
www.ipu.org 7,898 7,800 61,604,400
raken.com 5,518 3,435 18,954,330
www.afdb.org 29,445 28,209 830,614,005
www.gunt.de 9,771 4,784 46,744,464
lautsprecherversand.de 3,020 734 2,216,680
animalaidepontiac.ca 46 46 2,116
www.unv.org 16,150 6,390 103,198,500
www.lacliniqueducoureur.ca 416 428 178,048
www.lagardere.com 4,587 6,247 28,654,989
justin-time.com 2,267 310 702,770
brettspielwelt.de 5,810 323 1,876,630
www.bespoke.co.uk 2,724 512 1,394,688
www.multisailing.com 158 158 24,964
milltowndowntown.com 3,871 748 2,895,508
classicalguitarmidi.com 22 13 286
www.rustywords.net 183 142 25,986
www.tsunamichain.org 144 25 3,600
maurelles.com 17 15 255
www.maitresserokeuse.com 14 13 182
www.juratourisme.ch 510 525 267,750
1d-aquitaine.com 7,228 6,971 50,386,388
www.momento-films.com 69 72 4,968
www.le-cheval-bleu.com 847 1,146 970,662
www.sagaplanet.com 2,465 1,618 3,988,370
equilibrium-economicum.net 76 57 4,332
monalisa-prod.com 205 185 37,925
www.nametauinnu.ca 510 511 260,610
www.chermette.fr 41 54 2,214
www.tanbou.com 171 210 35,910
www.documentamusica.de 76 15 1,140
rollproductions.com 331 299 98,969
raison-publique.fr 144 13,437 1,934,928
www.publictendering.com 7,848 1,653 12,972,744
academiedesprez.org 263 334 87,842
daisukebike.be 450 17 7,650
www.aeht.eu 845 687 580,515
www.lesgaleriesdanjou.ca 412 459 189,108
www.taize.fr 7,808 2,502 19,535,616
www.wir-sind-kirche.de 334 34 11,356
www.waters-of-life.net 1,584 661 1,047,024
www.latexale.com 80 24 1,920
www.vaslui-turism.ro 78 110 8,580
meatballwiki.org 18,617 847 15,768,599
tulliana.eu 345 222 76,590
txt.ca 25 21 525
www.onphi.net 15 202 3,030
www.millauenjazz.org 346 316 109,336
www.burkiclinic.com 77 68 5,236
aloe.socioeco.org 287 397 113,939
maismemoria.org 3 2 6
www.atlantiqueberlines.com 101 82 8,282
www.ofm.org 7,120 111 790,320
passioncompassion1418.com 1,120 1,685 1,887,200
www.molior.ca 126 121 15,246
www.cirad.fr 2,075 2,787 5,783,025
caamp.org 13,958 9,174 128,050,692
www.transatbtob.com 1,497 255 381,735
www.fasken.com 23,860 13,143 313,591,980
www.luding.ru 1,996 1,204 2,403,184
www.lucistrust.org 13,794 4,335 59,796,990
palmerasyjardines.com 2,496 2,442 6,095,232
bushmeat.net 60 2 120
hagalleria.com 60 36 2,160
www.musee-orsay.fr 470 3,686 1,732,420

Evaluation

Our main metric is recall on the test set, i.e. the percentage of test-set pairs found in a submission after enforcing the 1-1 rule described below.

Participants are expected to produce a list of possible pairings in the format of the training data. Each source URL may be matched with at most one target URL and vice versa. Should a URL occur repeatedly, later occurrences are ignored. We provide an evaluation script to assess performance during development.
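The metric can be sketched as follows. This is an illustration, not the official evaluation script. It assumes that all pairs are given as (source, target) and that a pair is dropped when either of its URLs already appears in an earlier pair that was kept. The function names are hypothetical.

```python
def enforce_one_to_one(pairs):
    """Keep a pair only if neither URL was used by an earlier kept pair.

    ASSUMPTION: a URL counts as used only when it appears in a pair that
    was kept; the official script may treat dropped pairs differently.
    """
    seen_source, seen_target = set(), set()
    kept = []
    for source, target in pairs:
        if source in seen_source or target in seen_target:
            continue
        seen_source.add(source)
        seen_target.add(target)
        kept.append((source, target))
    return kept


def recall(predicted, reference):
    """Fraction of reference pairs found among the 1-1 filtered predictions."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference must not be empty")
    found = set(enforce_one_to_one(predicted)) & reference
    return len(found) / len(reference)


if __name__ == "__main__":
    predicted = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    reference = [("a", "x"), ("b", "z")]
    print(enforce_one_to_one(predicted))
    print(recall(predicted, reference))
```

Because the filter works in order of appearance, placing the most confident pairs first determines which of several competing pairs survives.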

Submission [NEW May 1st]

Submission of results is done via email to both Christian Buck and Philipp Koehn. Since results files might grow quite big, it is advisable to use compression or provide a download link. Submissions should be accompanied by a short name of either the participant (e.g. UEDIN) or system. The file format is the same as the training pairs, one pair per line, separated by TAB. The order of pairs is used to enforce the 1-1 rule. Pairs can be source-target or target-source.

Baseline

We provide a simple baseline based on URL matching. Get the code from GitHub:

git clone https://github.com/christianbuck/wmt16-document-alignment-task.git

The baseline iterates through all URLs and strips language identifiers such as /english/ or ?lang=FR from URLs and then produces pairs of URLs that have the same stripped representation. Check test_languagestripper.py for some examples.
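The idea can be illustrated with a much reduced sketch. It is not the baseline code from the repository: the two regular expressions cover only a handful of hypothetical patterns, and only the case where both the source and the target URL are stripped is shown.

```python
import re
from collections import defaultdict

# ASSUMPTION: illustrative patterns only; the real baseline knows many more.
_PATH_LANG = re.compile(r"/(english|french|francais|en|fr)(?=/|$)", re.IGNORECASE)
_QUERY_LANG = re.compile(r"[?&]lang(uage)?=(english|french|en|fr)\b", re.IGNORECASE)


def strip_language(url):
    """Remove simple language markers from a URL and lower-case the rest.

    The result is only used as a matching key; it need not be a valid URL.
    """
    url = _PATH_LANG.sub("", url)
    url = _QUERY_LANG.sub("", url)
    return url.lower()


def match_by_stripped_url(source_urls, target_urls):
    """Pair every source URL with each target URL sharing its stripped key."""
    by_key = defaultdict(list)
    for target in target_urls:
        by_key[strip_language(target)].append(target)
    pairs = []
    for source in source_urls:
        for target in by_key.get(strip_language(source), []):
            pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    sources = ["http://example.com/en/a.html"]
    targets = ["http://example.com/fr/a.html", "http://example.com/fr/b.html"]
    print(match_by_stripped_url(sources, targets))
```

Several URLs can map to the same key, so the output may contain competing pairs; these would still need to be filtered with the 1-1 rule before submission.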

To run the baseline, first download all .lett.gz files and adjust the path in baseline.sh. Running the script should yield the following output:
$ ./baseline.sh 
Read 222509/345106/224267 fr/en/other URLs from stdin
Read 1624 url pairs from train.pairs
0 urls missing from candidate URLs
160219/229406 stripped source/target urls
Found 8144 stripped source + unmodified target pairs (total: 8144) covering 107 pairs from devset
Found 9119 stripped target + unmodified source pairs (total: 17263) covering 121 pairs from devset
Found 126588 stripped source + stripped target pairs (total: 143851) covering 1131 pairs from devset
Total: 143851 candidate pairs
Keeping 119979 pairs after enforcing 1-1 rule
1103 pairs from devset

Running eval

Read 1624 reference pairs from train.pairs
Read 119979 predicted pairs from predicted.pairs
Keeping 119979 pairs after enforcing 1-1 rule
Found 1103 (67.92%) pairs from reference

Important dates

Release of training data February 12, 2016
Release of test data April 11, 2016
Results submission deadline May 2, 2016
Paper submission deadline May 15, 2016
Notification of acceptance June 5, 2016
Camera-ready deadline June 22, 2016

Organisers

Christian Buck (University of Edinburgh)
Philipp Koehn (Johns Hopkins University)

Sponsor

This shared task is partially supported by a Google Faculty Research Award.