Area: Data and Benchmarking
Natural Language Processing critically depends on data. Yet most existing text collections consist of individual, isolated documents. To foster research in cross-document NLP, this area develops novel corpora and unified benchmarks for the study of interconnected, changing texts. We adopt an inclusive view of text that, unlike most prior work in NLP, takes both textual and non-textual elements into account to enable efficient cross-document processing.
As NLP finds its way into real-life applications, concerns arise regarding the provenance, quality and legal status of data. Collecting interconnected, living texts brings additional challenges, including multiple authorship, confidentiality and privacy concerns. Contributing to the growing body of research on ethics in NLP, this area puts special focus on developing general-purpose methodologies and workflows for ethics-, confidentiality- and copyright-aware data collection.