This corpus is outdated. Please use its successor PAN-PC-11.
The PAN plagiarism corpus 2009 (PAN-PC-09) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.
To download the corpus, use the following links (consider using a download manager):
(0.95 GB, MD5 sum: b426e2d57a442d1119c3eb8dd481a64d),
(0.95 GB, MD5 sum: ecc755dbdb9c7599f1c7d4f842e53ec2), and
(0.38 GB, MD5 sum: bd25fb41577b7550570fa289e474c11a).
All three parts are required. Inflate only the first part; your archiver will inflate the other two parts automatically.
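Before inflating, it may be worth checking each downloaded part against the MD5 sums listed above. The following is a minimal sketch, not an official tool: the function names `md5_of_file` and `verify_parts` are hypothetical, and the local file paths are whatever names the three parts were saved under.

```python
import hashlib

# MD5 sums as listed above, in download order (part 1, part 2, part 3).
EXPECTED_MD5 = [
    "b426e2d57a442d1119c3eb8dd481a64d",
    "ecc755dbdb9c7599f1c7d4f842e53ec2",
    "bd25fb41577b7550570fa289e474c11a",
]


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_parts(paths, expected=EXPECTED_MD5):
    """Map each path to True if its MD5 digest matches the expected sum."""
    if len(paths) != len(expected):
        raise ValueError("need exactly one expected checksum per file")
    return {
        str(path): md5_of_file(path) == checksum.lower()
        for path, checksum in zip(paths, expected)
    }
```

Calling `verify_parts` with the three local paths, in the order of the links above, returns a dictionary that flags any part that should be downloaded again.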
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
You might also be interested in the following items:
- The corpus readme file: pan-pc-09-readme.txt.
- The results of the 1st International Competition on Plagiarism Detection.
- The reference implementation of the plagiarism detection performance measures used in the above competition.
The PAN-PC-09 can be used to evaluate two retrieval tasks pertaining to automatic plagiarism detection:
- External Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and their respective source sections in the source documents.
- Intrinsic Plagiarism Detection. Given only a set of suspicious documents, the task is to identify all plagiarized sections, e.g., by detecting writing style breaches. The comparison of a suspicious document with other documents is not allowed in this task.
The PAN-PC-09 contains documents in which artificial plagiarism has been inserted automatically. The plagiarism cases have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of random variables. The variables include the percentage of plagiarism in the whole corpus, the percentage of plagiarism per document, the length of a single plagiarized section, and the degree of obfuscation per plagiarized section.
A detailed description of the corpus construction can be found in the corpus readme file and in the Publications.
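To make the idea of a random plagiarist concrete, here is a toy sketch. It is not the program used to build the PAN-PC-09. It assumes whitespace-tokenized text, word-level offsets, and random word shuffling as a stand-in for obfuscation; all names (`random_plagiarist`, `obfuscate`, and the parameters) are hypothetical. It models three of the variables named above: the percentage of plagiarism per document, the length of a single plagiarized section, and the degree of obfuscation.

```python
import random


def obfuscate(words, degree, rng):
    """Permute a fraction `degree` (0..1) of the word positions at random."""
    words = list(words)
    k = int(round(degree * len(words)))
    positions = rng.sample(range(len(words)), k)
    targets = positions[:]
    rng.shuffle(targets)
    moved = [words[i] for i in positions]
    for target, word in zip(targets, moved):
        words[target] = word
    return words


def random_plagiarist(suspicious, source, doc_fraction, section_len, degree, seed=0):
    """Insert obfuscated source passages into a suspicious document.

    doc_fraction: plagiarized words as a fraction of the original document.
    section_len:  (min, max) words per plagiarized section, with min >= 1.
    degree:       obfuscation degree per section, between 0 and 1.
    Returns the new text and one annotation per case (word offsets).
    """
    rng = random.Random(seed)
    susp, src = suspicious.split(), source.split()
    budget = int(doc_fraction * len(susp))
    plan = []
    while budget > 0 and src:
        length = min(rng.randint(*section_len), budget, len(src))
        src_start = rng.randrange(len(src) - length + 1)
        passage = obfuscate(src[src_start:src_start + length], degree, rng)
        plan.append((rng.randrange(len(susp) + 1), src_start, passage))
        budget -= length
    plan.sort(key=lambda item: item[0])  # insert from left to right

    output, annotations, prev = [], [], 0
    for insert_at, src_start, passage in plan:
        output.extend(susp[prev:insert_at])
        annotations.append({
            "suspicious_offset": len(output),
            "source_offset": src_start,
            "length": len(passage),
        })
        output.extend(passage)
        prev = insert_at
    output.extend(susp[prev:])
    return " ".join(output), annotations
```

The returned annotations play the role of ground truth: a detector's output can be compared against them to see which inserted sections it found.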
Previous Corpus Versions. There have been two corpus versions prior to this one. The first version was the Webis-PC-08, in which we experimented for the first time with generating plagiarism semi-automatically. The second version was developed for the 1st International Competition on Plagiarism Detection at the PAN'09 workshop and was released in two steps, as a training corpus and a test corpus. Both versions are still available upon request, but we recommend using the current version in your research.
- Martin Potthast
- Benno Stein
- Alberto Barrón-Cedeño (NLEL at Universidad Politécnica de Valencia)
- Paolo Rosso (NLEL at Universidad Politécnica de Valencia)
Students: Andreas Eiselt