The InterText Initiative, UKP Lab

Area: Cross-Document NLP

Static, isolated, short texts have been the main focus of NLP research to date. However, many real-world tasks require humans to work simultaneously with multiple, connected, potentially long documents that change over time: from collaborative writing to fake news detection, and from peer review to social media management. While isolated, application-specific approaches to cross-document discourse modeling exist, a general NLP methodology for cross-document analysis has yet to be established.

Instead of treating each application scenario separately, the InterText initiative develops a joint, unified framework for cross-document modeling. Inspired by theoretical work in literary and discourse studies, we propose a typology of general cross-document relations that differ in type, granularity and explicitness, aiming to cover a wide range of cross-document discourse phenomena. We use this typology to model cross-document discourse in diverse application scenarios.

Publications

Jul 2025

STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond
Nils Dycke, Matej Zečević, Ilia Kuznetsov, Beatrix Suess, Kristian Kersting, Iryna Gurevych (2025)
ACL-2025 [paper] [repo]

Apr 2025

⤴️ Grounding Fallacies Misrepresenting Scientific Publications in Evidence
Max Glockner, Yufang Hou, Preslav Nakov, Iryna Gurevych (2025)
NAACL-2025 [paper] [repo]

Apr 2025

⤴️ COVE: COntext and VEracity prediction for out-of-context images
Jonathan Tonglet, Gabriel Thiem, Iryna Gurevych (2025)
NAACL-2025 [paper] [repo]

Dec 2024

Attribute or Abstain: Large Language Models as Long Document Assistants
Jan Buchmann, Xiao Liu, Iryna Gurevych (2024)
EMNLP-2024 [paper] [repo]

Nov 2024

⤴️ “Image, Tell me your story!” Predicting the original meta-context of visual misinformation
Jonathan Tonglet, Marie-Francine Moens, Iryna Gurevych (2024)
EMNLP-2024 [paper] [repo]

Oct 2024

Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions
Qian Ruan, Ilia Kuznetsov, Iryna Gurevych (2024)
EMNLP-2024 [paper] [repo]

Jul 2024

Systematic Task Exploration with LLMs: A Study in Citation Text Generation
Furkan Şahinuç, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych (2024)
ACL-2024 [paper] [repo]

Jul 2024

Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
Qian Ruan, Ilia Kuznetsov, Iryna Gurevych (2024)
ACL-2024 [paper] [repo]

Jul 2024

⤴️ HDT: Hierarchical Document Transformer
Haoyu He, Markus Flicke, Jan Buchmann, Iryna Gurevych, Andreas Geiger (2024)
COLM-2024 [paper] [repo]

Jun 2024

⤴️ Missci: Reconstructing Fallacies in Misrepresented Science
Max Glockner, Yufang Hou, Preslav Nakov, Iryna Gurevych (2024)
ACL-2024 [paper] [repo]

Apr 2024

Document Structure in Long Document Transformers
Jan Buchmann, Max Eichler, Jan-Micha Bodensohn, Ilia Kuznetsov, Iryna Gurevych (2024)
EACL-2024 [paper] [repo]

Dec 2023

⤴️ Exploring Jiu-Jitsu Argumentation for Writing Peer Review Rebuttals
Sukannya Purkayastha, Anne Lauscher, Iryna Gurevych (2023)
EMNLP-2023 [paper] [repo]

Dec 2023

CiteBench: A benchmark for Scientific Citation Text Generation
Martin Funkquist, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych (2023)
EMNLP-2023 [paper] [repo]

Jul 2023

An Inclusive Notion of Text
Ilia Kuznetsov, Iryna Gurevych (2023)
ACL-2023 [paper]

Jul 2023

NLPeer: A Unified Resource for the Computational Study of Peer Review
Nils Dycke, Ilia Kuznetsov, Iryna Gurevych (2023)
ACL-2023 [paper] [repo]

Jul 2022

Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review
Ilia Kuznetsov, Jan Buchmann, Max Eichler, Iryna Gurevych (2022)
Computational Linguistics, 48(4) [paper] [repo]

May 2022

Assisting Decision Making in Scholarly Peer Review: A Preference Learning Perspective
Nils Dycke, Edwin Simpson, Ilia Kuznetsov, Iryna Gurevych (2022)
arXiv [paper]

Nov 2019

Does My Rebuttal Matter? Insights from a Major NLP Conference
Yang Gao, Steffen Eger, Ilia Kuznetsov, Iryna Gurevych, Yusuke Miyao (2019)
NAACL-2019 [paper] [repo]

Datasets and Code

F1000Research Discourse Corpus
The first openly licensed cross-document corpus in multi-domain, open, journal-style peer review. The corpus consists of peer review reports, manuscripts and their revisions in a wide range of domains. We provide multiple annotation layers that cover one full revise-and-resubmit cycle of academic publishing: pragmatic tagging identifies the intention behind reviewers' comments, linking connects peer reviews to their manuscripts, and version alignment maps manuscripts to their next revision triggered by the peer review.

[link]
[paper]

The intertext-graph library
A pre-release of our general-purpose library for cross-document NLP modeling and analysis. The current version of the library provides converters from several document formats into a uniform data model, as well as an API for common graph operations that facilitate cross-document analysis at varying levels of granularity. The library is constantly being extended to cover more document formats and cross-document relation types; star the repo to stay up to date with new releases!

[link]
[paper]
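
To make the idea of a uniform, graph-based data model more concrete, the sketch below shows one way documents at different granularity levels and typed cross-document relations could be represented. It is an illustration only and does not use the intertext-graph API: the names `Node`, `CrossDocGraph`, `add_node`, `add_edge` and `cross_document_edges`, as well as the relation labels and example texts, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

NodeKey = Tuple[str, str]           # (doc_id, node_id), hypothetical keying scheme
Edge = Tuple[NodeKey, NodeKey, str]  # (source, target, relation label)


@dataclass(frozen=True)
class Node:
    """A unit of text at some granularity level (hypothetical data model)."""
    doc_id: str
    node_id: str
    level: str      # e.g. "section", "paragraph", "sentence"
    text: str = ""


@dataclass
class CrossDocGraph:
    """Toy graph holding nodes from several documents and typed edges."""
    nodes: Dict[NodeKey, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> NodeKey:
        key = (node.doc_id, node.node_id)
        self.nodes[key] = node
        return key

    def add_edge(self, src: NodeKey, dst: NodeKey, relation: str) -> None:
        # Reject edges whose endpoints were never registered.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges.append((src, dst, relation))

    def cross_document_edges(self, relation: Optional[str] = None) -> List[Edge]:
        """Return edges whose endpoints belong to different documents."""
        return [
            (s, d, r)
            for s, d, r in self.edges
            if s[0] != d[0] and (relation is None or r == relation)
        ]


# Example: one review comment, one manuscript paragraph and its revision.
graph = CrossDocGraph()
section = graph.add_node(Node("paper_v1", "s2", "section", "Method"))
para_v1 = graph.add_node(Node("paper_v1", "p3", "paragraph", "We train a model."))
para_v2 = graph.add_node(
    Node("paper_v2", "p3", "paragraph", "We train a model for ten epochs.")
)
comment = graph.add_node(
    Node("review", "c1", "sentence", "The training setup lacks detail.")
)

graph.add_edge(section, para_v1, "parent")             # within-document structure
graph.add_edge(comment, para_v1, "link")               # review -> manuscript
graph.add_edge(para_v1, para_v2, "version_alignment")  # manuscript -> revision

links = graph.cross_document_edges("link")
```

Keeping within-document structure and cross-document relations in the same edge list means a single query can move between granularity levels, for example from a review sentence to the paragraph it targets and on to that paragraph's revised version.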

Re3 Corpus
The first large-scale manually labeled corpus of document-level edits in the scholarly domain.

[link]
[paper]

NLPeer
An openly licensed, unified, multi-domain resource for the computational study of peer review. Papers, reviews and paper revisions in a unified format across a range of research communities, including new data from ACL and COLING review collection campaigns.

[link]
[paper]

LAB
A new six-task benchmark to study long-document attribution.

[link]
[paper]

CiteBench
Source code and data for CiteBench, the first benchmark for citation text generation.

[link]
[paper]

NLPeer 2
A brand-new, high-coverage, data-rich, clearly licensed dataset of papers, peer reviews, rebuttals and meta-reviews from the ACL community and beyond. Your one-stop shop for the empirical study of peer reviewing, reviewing assistance, edit analysis, and many other exciting problems.

[link]
[paper]