ParaCrawl

You probably came to this page because you found the Paracrawl user agent in your webserver logfiles and would like to know what exactly it is.

Paracrawl is an effort by mostly academic institutions (University of Edinburgh, Johns Hopkins University, University of Alicante) to create large corpora of texts in many foreign languages, each paired with English translations. The primary objective of this work is to aid research in machine translation systems, which are typically built by automatically learning from translated texts (so-called parallel texts). Most recently, this work is supported by the European Commission.

Do you not want your pages to be indexed by Paracrawl?

This can easily be done with the robots.txt file on your server. If you don't yet know what that is, then I suggest you take a look at this page for further information.

To specifically tell the Paracrawl robot NOT to index your pages you will need to add a few lines to your robots.txt:

User-agent: Paracrawl
Disallow: /

Or if you only want certain directories not to be indexed you can do that like this:

User-agent: Paracrawl
Disallow: /NotThisOne/
Disallow: /AndNeitherThisOne/
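
If you would like to check your rules before relying on them, the following is a minimal sketch using Python's standard urllib.robotparser module. It only shows how a standards-following parser reads the example above; it is not the Paracrawl robot's own code, and the robot's matching may differ in detail. The helper name is_allowed and the host example.com are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The directory-level example from this page, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Paracrawl
Disallow: /NotThisOne/
Disallow: /AndNeitherThisOne/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it wraps the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a real Paracrawl target.
    for path in ("/", "/NotThisOne/page.html", "/AndNeitherThisOne/", "/public/"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Paracrawl", "https://example.com" + path) else "blocked"
        print(path, verdict)
```

To test your own file, replace the contents of ROBOTS_TXT with the text of your robots.txt and adjust the paths in the loop.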

Changes to your robots.txt will not remove pages that are already in the collected corpus; they only affect the currently running robot. Even then, it can take up to 24 hours for the robot to re-check your robots.txt and see your changes. Only after that will the robot be able to act on any additional "Disallow" rules you have added to your robots.txt.

Do you have additional questions or have encountered problems with the Paracrawl robot?

If you have questions that are not answered in the text above, or if you have problems with the robot (for example, it does something you think it shouldn't), please send us an email.

Contact:

Kenneth Heafield firstname . lastname @ ed.ac.uk

Philipp Koehn, Johns Hopkins University, phi@jhu.edu

We will do our best to answer your questions as soon as possible and to fix any problems with the robot as quickly as we can.