Datasets and Code

M2QA
Our brand-new large-scale multilingual and multi-domain benchmark for SQuAD-style question answering.
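SQuAD-style means extractive QA: each example pairs a context passage with a question, and the answer is a span of the context located by character offset. A minimal sketch of such a record (the field names follow the public SQuAD convention; this is not an actual M2QA example):

```python
# A minimal SQuAD-style record: the answer is an extractive span of the
# context, located by character offset (SQuAD convention; not an actual
# M2QA example).
example = {
    "context": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "question": "In which year did Marie Curie win the Nobel Prize in Physics?",
    "answers": {"text": ["1903"], "answer_start": [46]},
}

# Sanity check: the span at answer_start must reproduce the answer text.
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
assert example["context"][start:start + len(text)] == text
```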
CiteBench
Source code and data for CiteBench, the first benchmark for citation text generation.
CARE Source
The source code for CARE: our new open-source Collaborative AI-Assisted Reading Environment. Explore the extensive documentation and try the public demo!
NLPeer
An openly licensed, unified, multi-domain resource for the computational study of peer review: papers, reviews and paper revisions in a common format across a range of research communities, including new data from the ACL and COLING review collection campaigns.
The intertext-graph library
A pre-release of our general-purpose library for cross-document NLP modelling and analysis. The current version of the library provides converters from several document formats into a uniform data model, as well as an API for common graph operations that facilitate cross-document analysis at varying granularity levels. The library is constantly extended to cover more document formats and cross-document relation types; star the repo to stay up to date with new releases!
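The underlying idea of such a uniform data model can be sketched in plain Python. Note that this is NOT the intertext-graph API; all names below are illustrative only: documents become graphs whose nodes carry text at different granularity levels, with intra- and cross-document edges.

```python
# Illustrative sketch only -- NOT the intertext-graph API. It shows the
# general idea of a uniform cross-document data model: nodes at varying
# granularity (section, paragraph, comment), connected by intra-document
# ("parent") and cross-document ("links-to") edges.
from dataclasses import dataclass

@dataclass
class Node:
    id: str    # e.g. "paper/sec1/p1"
    kind: str  # "section", "paragraph", "comment", ...
    text: str

@dataclass
class Edge:
    src: str
    dst: str
    relation: str  # "parent" (intra-document) or "links-to" (cross-document)

nodes = [
    Node("paper/sec1", "section", "Introduction"),
    Node("paper/sec1/p1", "paragraph", "We study peer review."),
    Node("review/c1", "comment", "Please clarify the scope."),
]
edges = [
    Edge("paper/sec1", "paper/sec1/p1", "parent"),   # document structure
    Edge("review/c1", "paper/sec1/p1", "links-to"),  # cross-document link
]

# A typical graph operation: collect all cross-document links.
links = [(e.src, e.dst) for e in edges if e.relation == "links-to"]
print(links)
```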
F1000Research Discourse Corpus
The first openly licensed cross-document corpus in multi-domain, open, journal-style peer review. The corpus consists of peer review reports, manuscripts and their revisions across a wide range of domains. We provide multiple annotation layers that cover one full revise-and-resubmit cycle of academic publishing: pragmatic tagging determines the intention of reviewers' comments; linking connects peer reviews to their manuscripts; version alignment maps each manuscript to its next revision triggered by the peer review.
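To make the three annotation layers concrete, here is a sketch of how such records might look; the field names and tag values are hypothetical illustrations, not the corpus schema:

```python
# Hypothetical record shapes for the three annotation layers; field
# names and tag values are illustrative, not the actual corpus schema.

# Pragmatic tagging: each review comment gets an intent label.
comment = {"id": "r1-c3",
           "text": "The evaluation lacks a baseline.",
           "pragmatic_tag": "Todo"}  # e.g. a request for change

# Linking: a review comment is anchored to a manuscript passage.
link = {"comment_id": "r1-c3", "manuscript_span": ("v1", "sec4/p2")}

# Version alignment: a manuscript passage maps to its revised counterpart.
alignment = {"from": ("v1", "sec4/p2"), "to": ("v2", "sec4/p3")}

# Together the layers trace one revise-and-resubmit cycle:
# comment -> linked passage in v1 -> aligned passage in v2.
revised_passage = (alignment["to"]
                   if alignment["from"] == link["manuscript_span"]
                   else None)
print(revised_passage)  # ('v2', 'sec4/p3')
```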
3Y Data Collection Implementation
An open implementation of a peer reviewing data collection workflow for OpenReview.net. The code can be used to set up a licensing workflow for peer review data and paper drafts submitted to OpenReview-based venues. We provide the implementation for creating license tasks for reviewers and authors of selected submissions, as well as the code for retrieving the peer reviewing data in a privacy- and anonymity-aware fashion.
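The privacy-aware retrieval step described above can be sketched as a filter over already-downloaded review notes: only reviews whose authors completed the license task are released, and identifying fields are stripped. This is an illustrative sketch under assumed field names, not the workflow's actual schema or the OpenReview API:

```python
# Hedged sketch of privacy- and anonymity-aware release: given raw
# review notes (as plain dictionaries) and the set of reviewer
# signatures that granted a data license, keep only licensed reviews
# and drop identifying fields. All field names here are assumptions
# for illustration, not the actual workflow's schema.

def filter_licensed(notes, licensed_signatures):
    """Return anonymised copies of notes whose signature granted a license."""
    released = []
    for note in notes:
        if note["signature"] in licensed_signatures:
            released.append({
                "submission": note["submission"],
                "content": note["content"],  # review text / scores
                # "signature" is deliberately dropped for anonymity
            })
    return released

notes = [
    {"signature": "~Reviewer_One1", "submission": "Paper12",
     "content": {"review": "Solid work.", "rating": "7"}},
    {"signature": "~Reviewer_Two1", "submission": "Paper12",
     "content": {"review": "Needs baselines.", "rating": "4"}},
]
released = filter_licensed(notes, {"~Reviewer_One1"})
print(len(released))  # 1
```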
ACL-2018 Review Corpus
A corpus of anonymised structured peer reviews collected during the ACL-2018 reviewing campaign. ACL-2018 employed a rich reviewing schema, with each review containing a wide range of textual, binary, ternary and numerical fields, including Strengths, Weaknesses, Summary, aspect scores, overall score and confidence scores. While openly publishing the textual data is not possible due to ethical concerns, we make the numerical data publicly available to support the meta-scientific study of peer reviewing in the NLP community.