Unfortunately, the notion of domain is ambiguous in NLP applications. To avoid confusion we will instead use the term webdomain to refer to content from a specific website, e.g. "This page is from the statmt.org webdomain". We distinguish between webdomains using their Fully Qualified Domain Name (FQDN). Thus, www.example.com and example.com are considered to be different webdomains.
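As an illustration of the FQDN rule, the following minimal sketch derives a webdomain key from a URL with Python's standard library. The helper name `webdomain` and the example URLs are assumptions made for this illustration and are not part of the task tooling.

```python
from urllib.parse import urlsplit


def webdomain(url: str) -> str:
    """Return the lower-cased host name of a URL (hypothetical helper)."""
    # urlsplit only recognises the host if the URL has a scheme or a
    # leading "//", so bare host names get "//" prepended.
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""


# The two hosts differ, so they count as two separate webdomains.
print(webdomain("http://www.example.com/en/index.html"))
print(webdomain("http://example.com/fr/index.html"))
```

No normalisation of the `www.` prefix is applied, which matches the definition above.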
We will use source to denote English pages and target for French ones. This does not imply that translation was performed in that direction. In fact, we cannot know whether either side was translated from the other at all; both sides could be translations of a document in a third language.
As training data we provide a set of 1,624 EN-FR pairs from 49 webdomains. The number of pairs per webdomain varies between 4 and over 200. All pairs are from within a single webdomain; possible matches between two different webdomains, e.g. siemens.de and siemens.com, are not considered in this task.
We also provide mirrors of all the pages in the webdomains which were crawled using httrack.
All crawls in a single file: lett.train.tgz (8.7G, md5:81b99f1a95a5153009bfd902b99857ab, sha1:f05754d63ba9b3fbc1b8e82bdbb45c3c950be129)
webdomain | source pages | target pages | possible pairings | training pairs |
---|---|---|---|---|
bugadacargnel.com | 919 | 779 | 715,901 | 19 |
cbsc.ca | 1,595 | 904 | 1,441,880 | 20 |
cineuropa.mobi | 23,050 | 15,972 | 368,154,600 | 73 |
creationwiki.org | 8,417 | 203 | 1,708,651 | 22 |
eu2007.de | 3,201 | 2,488 | 7,964,088 | 11 |
eu.blizzard.com | 10,493 | 6,640 | 69,673,520 | 10 |
forcesavenir.qc.ca | 3,592 | 3,982 | 14,303,344 | 8 |
galacticchannelings.com | 4,231 | 1,283 | 5,428,373 | 9 |
golftrotter.com | 377 | 361 | 136,097 | 8 |
iiz-dvv.de | 1,160 | 894 | 1,037,040 | 67 |
ironmaidencommentary.com | 6,028 | 635 | 3,827,780 | 41 |
kicktionary.de | 2,752 | 888 | 2,443,776 | 29 |
kustu.com | 1,544 | 1,511 | 2,332,984 | 13 |
manchesterproducts.com | 15,621 | 9,651 | 150,758,271 | 10 |
minelinks.com | 736 | 212 | 156,032 | 66 |
pawpeds.com | 983 | 135 | 132,705 | 19 |
rehazenter.lu | 201 | 317 | 63,717 | 16 |
santabarbara-online.com | 1,151 | 1,099 | 1,264,949 | 11 |
schackportalen.nu | 33 | 29 | 957 | 14 |
tsb.gc.ca | 5,885 | 5,828 | 34,297,780 | 236 |
virtualhospice.ca | 43,500 | 22,327 | 971,224,500 | 46 |
www.acted.org | 3,333 | 2,431 | 8,102,523 | 21 |
www.antennas.biz | 812 | 327 | 265,524 | 30 |
www.artsvivants.ca | 5,487 | 1,368 | 7,506,216 | 12 |
www.bonnke.net | 414 | 129 | 53,406 | 27 |
www.bugadacargnel.com | 919 | 779 | 715,901 | 7 |
www.cgfmanet.org | 9,241 | 6,260 | 57,848,660 | 25 |
www.cyberspaceministry.org | 1,534 | 958 | 1,469,572 | 29 |
www.dakar.com | 17,420 | 14,582 | 254,018,440 | 45 |
www.dfo-mpo.gc.ca | 25,277 | 19,087 | 482,462,099 | 97 |
www.ec.gc.ca | 12,266 | 15,404 | 188,945,464 | 26 |
www.eohu.ca | 2,277 | 2,136 | 4,863,672 | 4 |
www.eu2005.lu | 5,649 | 5,704 | 32,221,896 | 34 |
www.eu2007.de | 3,249 | 2,535 | 8,236,215 | 11 |
www.fao.org | 11,931 | 5,004 | 59,702,724 | 6 |
www.inst.at | 3,203 | 543 | 1,739,229 | 62 |
www.krn.org | 115 | 115 | 13,225 | 67 |
www.lameca.org | 692 | 1,567 | 1,084,364 | 6 |
www.luontoportti.com | 3,645 | 1,796 | 6,546,420 | 30 |
www.nato.int | 40,063 | 8,773 | 351,472,699 | 36 |
www.nauticnews.com | 24,325 | 43,045 | 1,047,069,625 | 21 |
www.pawpeds.com | 1,011 | 136 | 137,496 | 43 |
www.prohelvetia.ch | 5,209 | 4,421 | 23,028,989 | 7 |
www.socialwatch.org | 13,803 | 2,419 | 33,389,457 | 21 |
www.summerlea.ca | 434 | 338 | 146,692 | 58 |
www.the-great-adventure.fr | 2,038 | 2,460 | 5,013,480 | 18 |
www.ushmm.org | 10,472 | 967 | 10,126,424 | 26 |
www.usw.ca | 5,006 | 2,247 | 11,248,482 | 83 |
www.vinci.com | 3,564 | 3,374 | 12,024,936 | 24 |
The training pairs are one pair per line:
Source_URL<TAB>Target_URL\n
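A minimal sketch of reading this pair format is shown below. The function name `read_pairs` and the sample URLs are made up for illustration; in practice the lines would come from the pairs file (named train.pairs in the baseline output further down).

```python
import io


def read_pairs(lines):
    """Parse 'Source_URL<TAB>Target_URL' lines into (source, target) tuples."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate empty lines
        source_url, target_url = line.split("\t", 1)
        pairs.append((source_url, target_url))
    return pairs


# Hypothetical sample data standing in for a real pairs file.
sample = io.StringIO(
    "http://example.com/en/a.html\thttp://example.com/fr/a.html\n"
    "http://example.com/en/b.html\thttp://example.com/fr/b.html\n"
)
pairs = read_pairs(sample)
print(len(pairs), "pairs read")
```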
For the crawled data we provide one file per webdomain in .lett format, adapted from Bitextor. This is a plain-text format with one line per page. Each line consists of 6 tab-separated values:
1. Language ID (e.g. en)
2. MIME type (text/html)
3. Encoding (charset=utf-8)
4. URL
5. Complete HTML content, Base64-encoded
6. Extracted text, Base64-encoded
We make sure that the language ID is reliable, at least for the documents in the train and test pairs. We also ensure that all known pairs have been crawled and no URLs are missing from the crawls.
Text extraction was performed using an HTML5 parser. As the original HTML pages are available, participants are welcome to implement their own text extraction, for example to remove boilerplate.
To facilitate use of the .lett files we provide a simple reader class in Python: lett.py.
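The provided lett.py remains the reference reader. Purely to illustrate the file layout, here is a minimal sketch that assumes each line holds language ID, MIME type, encoding, URL, Base64-encoded HTML and Base64-encoded extracted text, in that order. The names `Page`, `parse_lett_line` and `read_lett` are hypothetical and do not mirror the API of lett.py.

```python
import base64
import gzip
from collections import namedtuple

# Assumed field order; check against lett.py before relying on it.
Page = namedtuple("Page", "lang mime encoding url html text")


def parse_lett_line(line: bytes) -> Page:
    """Split one .lett line into its 6 fields and decode the Base64 parts."""
    lang, mime, enc, url, html_b64, text_b64 = line.rstrip(b"\n").split(b"\t")
    return Page(
        lang.decode("utf-8"),
        mime.decode("utf-8"),
        enc.decode("utf-8"),
        url.decode("utf-8"),
        base64.b64decode(html_b64).decode("utf-8", errors="replace"),
        base64.b64decode(text_b64).decode("utf-8", errors="replace"),
    )


def read_lett(path):
    """Yield one Page per line of a gzipped .lett file."""
    with gzip.open(path, "rb") as handle:
        for line in handle:
            yield parse_lett_line(line)


# Self-made example line, not taken from the released data.
demo = b"\t".join([
    b"en", b"text/html", b"charset=utf-8", b"http://example.com/",
    base64.b64encode(b"<p>Hello</p>"), base64.b64encode(b"Hello"),
]) + b"\n"
page = parse_lett_line(demo)
print(page.lang, page.url, page.text)
```

Reading the file in binary mode and decoding each field separately keeps one badly encoded page from aborting the whole pass.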
Additionally, we have identified spans of French text for which we produced English translations using MT. These translations are not part of the .lett files but are provided separately: translations.train.gz. The format for the source segments and target segments is
URL<TAB>Text
where the same URL may occur multiple times if several lines/spans of French text were found. The URLs can be used to identify the corresponding documents in the .lett files.
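Because a URL can appear on several lines, it is often convenient to collect all spans per document. The sketch below does this for lines in the URL<TAB>Text format. The function name `group_by_url` and the sample lines are assumptions for illustration only.

```python
from collections import defaultdict


def group_by_url(lines):
    """Collect all text spans that share a URL, preserving their order."""
    spans = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, in case the text itself contains tabs.
        url, text = line.split("\t", 1)
        spans[url].append(text)
    return dict(spans)


# Hypothetical sample lines in the URL<TAB>Text format.
sample = [
    "http://example.com/fr/a.html\tFirst translated span\n",
    "http://example.com/fr/b.html\tAnother page\n",
    "http://example.com/fr/a.html\tSecond translated span\n",
]
grouped = group_by_url(sample)
print(len(grouped), "documents with translated spans")
```

The resulting keys can then be matched against the URL field of the pages in the .lett files.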
For testing, we provide 203 additional crawls of new webdomains, distinct from the ones in the training data, in the same format. The official test pairs were released after the completion of the evaluation campaign.
All test .lett files in one file: lett.test.tgz (16G, md5:a16f7afdcf7de3c4bc992f0451ef89a3, sha1:6b978ee34c8803876e0dad5b760eccc3957e3a5e)
As for the training data, we provide translations of all French text spans: translations.test.tgz
Test set statistics:
webdomain | source pages | target pages | possible pairings |
---|---|---|---|
www.domainepechlaurier.com | 24 | 26 | 624 |
www.desmarais-robitaille.com | 9,496 | 10,621 | 100,857,016 |
italiasullarete.it | 3,630 | 2,959 | 10,741,170 |
egodesign.ca | 11,376 | 7,384 | 84,000,384 |
www.gameonly.com | 1,552 | 9,953 | 15,447,056 |
www.ledindon.com | 1,995 | 2,019 | 4,027,905 |
ww-corp.com | 53 | 34 | 1,802 |
www.specialimmo.com | 92 | 92 | 8,464 |
www.festivalstoria.it | 8 | 4 | 32 |
conancompletist.com | 163 | 107 | 17,441 |
www.fsm-group.de | 128 | 77 | 9,856 |
www.dvv-international.de | 1,449 | 1,131 | 1,638,819 |
www.cornwall.ca | 610 | 458 | 279,380 |
www.actualites-electroniques.com | 60,906 | 834 | 50,795,604 |
www.epilepsiemuseum.org | 214 | 178 | 38,092 |
www.lalettrediplomatique.fr | 239 | 4,300 | 1,027,700 |
www.laprairie.ch | 245 | 242 | 59,290 |
aucoeurduvin.com | 19 | 50 | 950 |
bateaux-de-saint-malo.com | 336 | 323 | 108,528 |
www.eaglebrand.ca | 291 | 289 | 84,099 |
www.casaholidaysilvia.com | 27 | 20 | 540 |
aeht.org | 840 | 687 | 577,080 |
harmonie.cz | 9 | 3 | 27 |
www.lifegrid.fr | 211 | 276 | 58,236 |
www.dcc-cdc.gc.ca | 669 | 668 | 446,892 |
selecta-tech.de | 891 | 788 | 702,108 |
www.nserc-crsng.gc.ca | 7,237 | 5,283 | 38,233,071 |
myriad-online.com | 5,923 | 4,258 | 25,220,134 |
www.international.icomos.org | 8,064 | 6,797 | 54,811,008 |
www.ottawariver.org | 48 | 34 | 1,632 |
www.sirc-csars.gc.ca | 569 | 546 | 310,674 |
communaute-dame.qc.ca | 360 | 239 | 86,040 |
www.verderber.org | 23 | 23 | 529 |
buehrle.ch | 531 | 472 | 250,632 |
www.oacc.info | 2,807 | 1,504 | 4,221,728 |
www.info-turk.be | 186 | 242 | 45,012 |
www.lopera.be | 8 | 5 | 40 |
busaroundglobe.com | 851 | 633 | 538,683 |
www.consulfrance-atlanta.org | 2,502 | 2,240 | 5,604,480 |
www.eurozine.com | 18,096 | 168 | 3,040,128 |
artfactories.net | 2,847 | 91,451 | 260,360,997 |
www.musee-mccord.qc.ca | 4,265 | 9,774 | 41,686,110 |
vade-retro.fr | 6 | 10 | 60 |
bjnewlife.org | 12,869 | 1,129 | 14,529,101 |
panhuasca.org.br | 121 | 116 | 14,036 |
www.hydrel.ch | 3,421 | 1,480 | 5,063,080 |
mammusique.com | 39 | 34 | 1,326 |
www.iaapa.de | 40 | 1 | 40 |
sustainability.suncor.com | 771 | 768 | 592,128 |
www.haro.com | 2,125 | 1,022 | 2,171,750 |
www.swtor.com | 2,773 | 644 | 1,785,812 |
www.genievres.com | 90 | 113 | 10,170 |
www.lupusae.com | 2,187 | 451 | 986,337 |
www.portboulogne.com | 105 | 135 | 14,175 |
www.palaminy.com | 8 | 10 | 80 |
hrcouncil.ca | 381 | 389 | 148,209 |
www.axa.com | 3,541 | 3,930 | 13,916,130 |
plongeecavalaire.com | 13 | 14 | 182 |
www.presepiovenegono.it | 399 | 372 | 148,428 |
www.fasska.com | 49 | 50 | 2,450 |
www.rcmp-grc.gc.ca | 10,952 | 10,914 | 119,530,128 |
www.lecanville.com | 29 | 24 | 696 |
harasdesfrettes.com | 110 | 114 | 12,540 |
www.mecanelec.com | 147 | 137 | 20,139 |
projectavalon.net | 5,475 | 271 | 1,483,725 |
www.carus-verlag.com | 1,912 | 4,191 | 8,013,192 |
www.parischoralsociety.org | 27 | 24 | 648 |
www.world-governance.org | 2,840 | 2,481 | 7,046,040 |
www.eurovia.org | 1,862 | 1,994 | 3,712,828 |
la-coulonniere.fr | 21 | 19 | 399 |
www.jerome-alquie.com | 90 | 71 | 6,390 |
www.confrontations.info | 99 | 2,680 | 265,320 |
www.zigiz.com | 502 | 162 | 81,324 |
www.lvbeethoven.com | 341 | 275 | 93,775 |
www.coe.int | 9,568 | 1,852 | 17,719,936 |
bopsecrets.org | 563 | 95 | 53,485 |
www.kazior5.com | 7 | 5 | 35 |
www.elenacaffe1863.com | 3 | 4 | 12 |
www.pc.gc.ca | 5,728 | 8,645 | 49,518,560 |
www.santegidio.org | 4,198 | 2,051 | 8,610,098 |
wise.net | 98 | 91 | 8,918 |
www.antebiel.com | 91 | 252 | 22,932 |
www.cra-arc.gc.ca | 14,000 | 9,351 | 130,914,000 |
fourdirectionsteachings.com | 38 | 34 | 1,292 |
www.aertssen.com | 158 | 79 | 12,482 |
www.plume-noire.com | 1,568 | 1,057 | 1,657,376 |
ottawaheart.ca | 953 | 792 | 754,776 |
www.redciencia.cu | 297 | 273 | 81,081 |
www.technip.com | 1,231 | 1,022 | 1,258,082 |
www.mayafiles.com | 83 | 60 | 4,980 |
chablis-geoffroy.com | 69 | 70 | 4,830 |
shorterworkweek.com | 58 | 50 | 2,900 |
www.directkite.com | 170 | 81 | 13,770 |
www.tatin.org | 104 | 137 | 14,248 |
poubille.fr | 7 | 122 | 854 |
www.good-will.ch | 150 | 145 | 21,750 |
www.provencegiteventoux.fr | 17 | 16 | 272 |
belnois.com | 56 | 63 | 3,528 |
histalu.org | 9,909 | 9,265 | 91,806,885 |
www.eufic.org | 11,349 | 4,823 | 54,736,227 |
peregrinoslh.com | 7 | 2 | 14 |
www.planet-diversity.org | 573 | 73 | 41,829 |
www.toucherdubois.ca | 3,148 | 3,249 | 10,227,852 |
www.dan42.com | 112 | 90 | 10,080 |
cmhg-phmc.gc.ca | 3,016 | 1,728 | 5,211,648 |
sciences.amatheurs.fr | 91 | 118 | 10,738 |
sebsauvage.net | 25,995 | 29,194 | 758,898,030 |
jointhealth.org | 783 | 492 | 385,236 |
www.undefine.ca | 69 | 13 | 897 |
cinedoc.org | 49,672 | 49,215 | 2,444,607,480 |
www.ccpsa.ca | 127 | 107 | 13,589 |
gadal-catharisme.org | 132 | 129 | 17,028 |
www.tsb-bst.gc.ca | 5,311 | 5,230 | 27,776,530 |
www.redcross.int | 1,229 | 1,176 | 1,445,304 |
www.dural.de | 229 | 184 | 42,136 |
www.garoo.net | 16,984 | 9,939 | 168,803,976 |
www.raab-gruppe.de | 55 | 62 | 3,410 |
www.nostalgic-images.co.uk | 6,933 | 2,694 | 18,677,502 |
www.generalhieu.com | 494 | 346 | 170,924 |
massviolence.org | 1,489 | 497 | 740,033 |
julieguenette.com | 50 | 16 | 800 |
arabpressnetwork.org | 2,011 | 654 | 1,315,194 |
www.rfimusique.com | 659 | 11,891 | 7,836,169 |
www.pjo.ca | 26 | 18 | 468 |
fmnews.com | 82 | 82 | 6,724 |
www.phytoclick.com | 678 | 568 | 385,104 |
varengeville-sur-mer.fr | 13 | 84 | 1,092 |
www.wipo.int | 23,763 | 1,068 | 25,378,884 |
www.arabhumanrights.org | 165 | 11 | 1,815 |
www.oras.com | 4,040 | 1,518 | 6,132,720 |
www.educweb.org | 1,609 | 4,482 | 7,211,538 |
www.conidia.fr | 78 | 92 | 7,176 |
www.oag-bvg.gc.ca | 10,872 | 10,865 | 118,124,280 |
www.kinnarps.com | 11,547 | 3,118 | 36,003,546 |
www.iisd.ca | 16,856 | 3,989 | 67,238,584 |
technologeeko.com | 4,715 | 1,593 | 7,510,995 |
www.phares-balises.fr | 996 | 572 | 569,712 |
www.metisse-music.com | 667 | 634 | 422,878 |
www.bergerfoundation.ch | 9,843 | 2,135 | 21,014,805 |
www.ipu.org | 7,898 | 7,800 | 61,604,400 |
raken.com | 5,518 | 3,435 | 18,954,330 |
www.afdb.org | 29,445 | 28,209 | 830,614,005 |
www.gunt.de | 9,771 | 4,784 | 46,744,464 |
lautsprecherversand.de | 3,020 | 734 | 2,216,680 |
animalaidepontiac.ca | 46 | 46 | 2,116 |
www.unv.org | 16,150 | 6,390 | 103,198,500 |
www.lacliniqueducoureur.ca | 416 | 428 | 178,048 |
www.lagardere.com | 4,587 | 6,247 | 28,654,989 |
justin-time.com | 2,267 | 310 | 702,770 |
brettspielwelt.de | 5,810 | 323 | 1,876,630 |
www.bespoke.co.uk | 2,724 | 512 | 1,394,688 |
www.multisailing.com | 158 | 158 | 24,964 |
milltowndowntown.com | 3,871 | 748 | 2,895,508 |
classicalguitarmidi.com | 22 | 13 | 286 |
www.rustywords.net | 183 | 142 | 25,986 |
www.tsunamichain.org | 144 | 25 | 3,600 |
maurelles.com | 17 | 15 | 255 |
www.maitresserokeuse.com | 14 | 13 | 182 |
www.juratourisme.ch | 510 | 525 | 267,750 |
1d-aquitaine.com | 7,228 | 6,971 | 50,386,388 |
www.momento-films.com | 69 | 72 | 4,968 |
www.le-cheval-bleu.com | 847 | 1,146 | 970,662 |
www.sagaplanet.com | 2,465 | 1,618 | 3,988,370 |
equilibrium-economicum.net | 76 | 57 | 4,332 |
monalisa-prod.com | 205 | 185 | 37,925 |
www.nametauinnu.ca | 510 | 511 | 260,610 |
www.chermette.fr | 41 | 54 | 2,214 |
www.tanbou.com | 171 | 210 | 35,910 |
www.documentamusica.de | 76 | 15 | 1,140 |
rollproductions.com | 331 | 299 | 98,969 |
raison-publique.fr | 144 | 13,437 | 1,934,928 |
www.publictendering.com | 7,848 | 1,653 | 12,972,744 |
academiedesprez.org | 263 | 334 | 87,842 |
daisukebike.be | 450 | 17 | 7,650 |
www.aeht.eu | 845 | 687 | 580,515 |
www.lesgaleriesdanjou.ca | 412 | 459 | 189,108 |
www.taize.fr | 7,808 | 2,502 | 19,535,616 |
www.wir-sind-kirche.de | 334 | 34 | 11,356 |
www.waters-of-life.net | 1,584 | 661 | 1,047,024 |
www.latexale.com | 80 | 24 | 1,920 |
www.vaslui-turism.ro | 78 | 110 | 8,580 |
meatballwiki.org | 18,617 | 847 | 15,768,599 |
tulliana.eu | 345 | 222 | 76,590 |
txt.ca | 25 | 21 | 525 |
www.onphi.net | 15 | 202 | 3,030 |
www.millauenjazz.org | 346 | 316 | 109,336 |
www.burkiclinic.com | 77 | 68 | 5,236 |
aloe.socioeco.org | 287 | 397 | 113,939 |
maismemoria.org | 3 | 2 | 6 |
www.atlantiqueberlines.com | 101 | 82 | 8,282 |
www.ofm.org | 7,120 | 111 | 790,320 |
passioncompassion1418.com | 1,120 | 1,685 | 1,887,200 |
www.molior.ca | 126 | 121 | 15,246 |
www.cirad.fr | 2,075 | 2,787 | 5,783,025 |
caamp.org | 13,958 | 9,174 | 128,050,692 |
www.transatbtob.com | 1,497 | 255 | 381,735 |
www.fasken.com | 23,860 | 13,143 | 313,591,980 |
www.luding.ru | 1,996 | 1,204 | 2,403,184 |
www.lucistrust.org | 13,794 | 4,335 | 59,796,990 |
palmerasyjardines.com | 2,496 | 2,442 | 6,095,232 |
bushmeat.net | 60 | 2 | 120 |
hagalleria.com | 60 | 36 | 2,160 |
www.musee-orsay.fr | 470 | 3,686 | 1,732,420 |
Our main metric is recall on the test set, i.e. the percentage of test-set pairs found in a submission after enforcing the 1-1 rule described below.
Participants are expected to produce a list of possible pairings in the format of the training data. Each source URL may be matched with at most one target URL and vice versa. Should a URL occur repeatedly, later occurrences are ignored. We provide an evaluation script to assess performance during development.
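The provided evaluation script is authoritative. As a rough illustration of the 1-1 rule and the recall metric, the sketch below uses one possible reading of the rule: a predicted pair is dropped if its source URL or its target URL already occurred in a pair that was kept. The function names and sample data are hypothetical.

```python
def enforce_one_to_one(pairs):
    """Keep a pair only if neither of its URLs was used by an earlier kept pair."""
    seen_sources, seen_targets, kept = set(), set(), []
    for source_url, target_url in pairs:
        if source_url in seen_sources or target_url in seen_targets:
            continue  # later occurrence of a URL: ignored
        seen_sources.add(source_url)
        seen_targets.add(target_url)
        kept.append((source_url, target_url))
    return kept


def recall(predicted, reference):
    """Fraction of reference pairs present in the filtered predictions."""
    kept = set(enforce_one_to_one(predicted))
    found = sum(1 for pair in reference if pair in kept)
    return found / len(reference)


# Made-up URLs, abbreviated to single letters for readability.
predicted = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "z")]
reference = [("a", "x"), ("b", "y"), ("c", "z")]
print(enforce_one_to_one(predicted))
print(f"recall = {recall(predicted, reference):.2%}")
```

Since only the first occurrence of a URL counts, ordering the submitted list by confidence is the natural choice under this reading.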
We provide a simple baseline based on URL matching. Get the code from GitHub:
git clone https://github.com/christianbuck/wmt16-document-alignment-task.git
The baseline iterates through all URLs and strips language identifiers such as /english/ or ?lang=FR from URLs and then produces pairs of URLs that have the same stripped representation. Check test_languagestripper.py for some examples.
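The idea can be sketched as follows. This is not the code from the repository: the two regular expressions cover only a handful of language identifiers chosen for illustration, and the names `strip_language` and `candidate_pairs` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative patterns only; the real baseline handles many more variants.
PATH_RE = re.compile(r"/(?:english|french|en|fr)/", re.IGNORECASE)
QUERY_RE = re.compile(r"(?<=[?&])lang=(?:english|french|en|fr)\b&?", re.IGNORECASE)


def strip_language(url: str) -> str:
    """Remove language markers from the path and the query string."""
    url = PATH_RE.sub("/", url)   # /english/ -> /
    url = QUERY_RE.sub("", url)   # ?lang=FR -> ?
    return url.rstrip("?&")       # drop a dangling ? or &


def candidate_pairs(source_urls, target_urls):
    """Pair source and target URLs whose stripped forms are identical."""
    targets_by_key = defaultdict(list)
    for target in target_urls:
        targets_by_key[strip_language(target)].append(target)
    return [
        (source, target)
        for source in source_urls
        for target in targets_by_key.get(strip_language(source), [])
    ]


# Made-up URLs for demonstration.
sources = ["http://a.com/english/page.html", "http://a.com/x?lang=EN"]
targets = ["http://a.com/french/page.html", "http://a.com/x?lang=FR",
           "http://a.com/other"]
for pair in candidate_pairs(sources, targets):
    print(pair)
```

A source URL may match several targets in this sketch, so the resulting candidates would still need to pass the 1-1 rule before submission.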
$ ./baseline.sh
Read 222509/345106/224267 fr/en/other URLs from stdin
Read 1624 url pairs from train.pairs
0 urls missing from candidate URLs
160219/229406 stripped source/target urls
Found 8144 stripped source + unmodified target pairs (total: 8144)
covering 107 pairs from devset
Found 9119 stripped target + unmodified source pairs (total: 17263)
covering 121 pairs from devset
Found 126588 stripped source + stripped target pairs (total: 143851)
covering 1131 pairs from devset
Total: 143851 candidate pairs
Keeping 119979 pairs after enforcing 1-1 rule
1103 pairs from devset
Running eval
Read 1624 reference pairs from train.pairs
Read 119979 predicted pairs from predicted.pairs
Keeping 119979 pairs after enforcing 1-1 rule
Found 1103 (67.92%) pairs from reference
Release of training data | February 12, 2016 |
Release of test data | April 11, 2016 |
Results submission deadline | May 2, 2016 |
Paper submission deadline | May 15, 2016 |
Notification of acceptance | June 5, 2016 |
Camera-ready deadline | June 22, 2016 |