Area: Cross-Document NLP

Static, isolated, short texts have been the main focus of NLP research to date. However, many real-world tasks require humans to work simultaneously with multiple, connected, potentially long documents that change over time: from collaborative writing to fake news detection, and from peer review to social media management. While isolated, application-specific approaches to cross-document discourse modeling exist, a general NLP methodology for cross-document analysis has yet to be established.

Instead of treating each application scenario separately, the InterText initiative develops a joint, unified framework for cross-document modeling. Inspired by theoretical work in literary and discourse studies, we propose a typology of general cross-document relations that differ in type, granularity, and explicitness, aiming to cover a wide range of cross-document discourse phenomena. We use this typology to model cross-document discourse in diverse application scenarios.

Publications

Jun 2024

Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
Qian Ruan, Ilia Kuznetsov, Iryna Gurevych (2024)
🔥 arXiv, accepted at ACL-2024 [paper]

Apr 2024

Document Structure in Long Document Transformers
Jan Buchmann, Max Eichler, Jan-Micha Bodensohn, Ilia Kuznetsov, Iryna Gurevych (2024)
EACL-2024 [paper] [repo]

Dec 2023

CiteBench: A Benchmark for Scientific Citation Text Generation
Martin Funkquist, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych (2023)
EMNLP-2023 [paper] [repo]

Dec 2023

⤴️ Exploring Jiu-Jitsu Argumentation for Writing Peer Review Rebuttals
Sukannya Purkayastha, Anne Lauscher, Iryna Gurevych (2023)
EMNLP-2023 [paper] [repo]

Jul 2023

NLPeer: A Unified Resource for the Computational Study of Peer Review
Nils Dycke, Ilia Kuznetsov, Iryna Gurevych (2023)
ACL-2023 [paper] [repo]

Jul 2023

An Inclusive Notion of Text
Ilia Kuznetsov, Iryna Gurevych (2023)
ACL-2023 [paper]

Jul 2022

Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review
Ilia Kuznetsov, Jan Buchmann, Max Eichler, Iryna Gurevych (2022)
Computational Linguistics, 48(4) [paper] [repo]

May 2022

Assisting Decision Making in Scholarly Peer Review: A Preference Learning Perspective
Nils Dycke, Edwin Simpson, Ilia Kuznetsov, Iryna Gurevych (2022)
🔥 arXiv [paper]

Nov 2019

Does My Rebuttal Matter? Insights from a Major NLP Conference
Yang Gao, Steffen Eger, Ilia Kuznetsov, Iryna Gurevych, Yusuke Miyao (2019)
NAACL-2019 [paper] [repo]

Datasets and Code

F1000Research Discourse Corpus
The first openly licensed cross-document corpus in multi-domain, open, journal-style peer review. The corpus consists of peer review reports, manuscripts, and their revisions across a wide range of domains. We provide multiple annotation layers that cover one full revise-and-resubmit cycle of academic publishing: pragmatic tagging identifies the intent of reviewers' comments, linking connects peer reviews to their manuscripts, and version alignment maps manuscripts to their next revision triggered by the peer review.
The intertext-graph library
A pre-release of our general-purpose library for cross-document NLP modelling and analysis. The current version of the library provides converters from several document formats into a uniform data model, as well as an API for common graph operations that facilitate cross-document analysis at varying levels of granularity. The library is continually extended to cover more document formats and cross-document relation types; star the repo to stay up to date with new releases!
NLPeer
An openly licensed, unified, multi-domain resource for the computational study of peer review. It contains papers, reviews, and paper revisions in a unified format across a range of research communities, including new data from ACL and COLING review collection campaigns.
CiteBench
Source code and data for CiteBench, the first benchmark for citation text generation.
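
To make the three annotation layers of the F1000Research Discourse Corpus more concrete, the following is a minimal sketch of how pragmatic tags, links, and version alignments could be held in one small graph. It is not the intertext-graph API and not the corpus format: the classes `Node`, `Relation`, and `CrossDocGraph`, the helper `revisions_for_comment`, the relation labels `"link"` and `"version_alignment"`, the tag `"weakness"`, and all toy texts are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    """A text unit at some granularity (hypothetical schema)."""
    node_id: str
    doc_id: str
    granularity: str                     # e.g. "paragraph" or "sentence"
    text: str
    pragmatic_tag: Optional[str] = None  # intent label, used for review comments


@dataclass(frozen=True)
class Relation:
    """A typed, directed relation between two nodes (hypothetical schema)."""
    source: str
    target: str
    rel_type: str   # assumed labels: "link", "version_alignment"
    explicit: bool  # True if the text itself signals the relation


class CrossDocGraph:
    """Toy container for nodes and relations across several documents."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.relations: List[Relation] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_relation(self, rel: Relation) -> None:
        if rel.source not in self.nodes or rel.target not in self.nodes:
            raise KeyError("add both endpoint nodes before the relation")
        self.relations.append(rel)

    def targets(self, node_id: str, rel_type: str) -> List[Node]:
        """Nodes reached from node_id via relations of the given type."""
        return [self.nodes[r.target] for r in self.relations
                if r.source == node_id and r.rel_type == rel_type]


def revisions_for_comment(graph: CrossDocGraph, comment_id: str) -> List[Node]:
    """Follow link, then version alignment: comment -> passage -> revised passage."""
    revised: List[Node] = []
    for linked in graph.targets(comment_id, "link"):
        revised.extend(graph.targets(linked.node_id, "version_alignment"))
    return revised


# Toy data: one review comment, one manuscript paragraph, and its revision.
graph = CrossDocGraph()
graph.add_node(Node("r1", "review", "sentence",
                    "The evaluation lacks a baseline.", pragmatic_tag="weakness"))
graph.add_node(Node("v1-p3", "manuscript-v1", "paragraph",
                    "We evaluate our model on two datasets."))
graph.add_node(Node("v2-p3", "manuscript-v2", "paragraph",
                    "We evaluate our model and a baseline on two datasets."))
graph.add_relation(Relation("r1", "v1-p3", "link", explicit=False))
graph.add_relation(Relation("v1-p3", "v2-p3", "version_alignment", explicit=False))

for node in revisions_for_comment(graph, "r1"):
    print(node.doc_id, node.node_id, node.text)
```

Keeping granularity and explicitness as plain attributes on nodes and relations mirrors the typology dimensions named in the area description, while a single `targets` lookup is enough to chain the layers from a review comment to the revision it relates to.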