12th Workshop on the Challenges in the Management of Large
Corpora
3rd Call for Papers and deadline extension
Important dates
- Deadline for paper submission: extended from the 16th to the 25th of February 2026 (Wednesday, 23:59 UTC)
- Notification of acceptance: the 12th of March 2026 (Thursday)
- Deadline for the submission of camera-ready papers: the 30th of March 2026 (Monday)
- Meeting: the 11th of May, morning slot
Paper submission
- We invite anonymised extended abstracts for oral presentations on the topics
listed below, as PDFs created according to the LREC-2026 templates.
- Length and content: 4 to 8 pages, excluding acknowledgements, references,
potential Ethics Statements and discussion on Limitations. Appendices or
supplementary material are not permitted during the initial submission
phase, as papers should be self-contained and reviewable on their own.
However, appendices and supplementary material will be allowed in the
final, camera-ready version of the paper.
- CMLC has always reserved a track for national corpus project reports, and to
this end, we invite poster proposals of 500-750 words. National project
reports need not be anonymised.
- Submissions are accepted solely through the LREC START system.
- A volume of proceedings will be published online by ELRA. Oral and poster
contributions will have equal status.
Workshop description
As in the previous CMLC meetings, we wish to explore common areas of interest across
a range of issues in language resource management, corpus linguistics, natural
language processing, natural language generation, and data science.
Large textual datasets require careful design, collection, cleaning, encoding,
annotation, storage, retrieval, and curation to be of use for a wide range of
research questions and to users across a number of disciplines. A growing number of
national and other very large corpora are being made available, many historical
archives are being digitised, numerous publishing houses are opening their textual
assets for text mining, and many billions of words can be quickly sourced from the
web and online social media.
A mixed blessing of the times is that many such texts, in mono- and multilingual
arrangements, can now be created automatically by exploiting Large Language Models at
various scales. That, on the one hand, makes it possible to inflate the amounts of
data where normally data would be scarce: in under-resourced languages or language
varieties, in specific genres or for intricate and rarely attested constructions. On
the other hand, such procedures immediately raise concerns regarding the
authenticity and quality of such data, casting doubt on the possibility of
adequately (truthfully, verifiably, reproducibly) addressing the kind of research
questions that provoked the rapid but tainted increase of the available data volumes
in the first place. Similar doubts may be directed at mass creation of secondary and
tertiary data ordinarily crucial for linguistic research: apart from potential legal
constraints on the use of the initial amounts of human-created data, new questions
arise as to the legal status of the derived data, the ways to create e.g. provenance
metadata of the derived resources, and the level of trust regarding mass-produced
grammatical (and other) annotation layers.
These new as well as more traditional questions lie at the base of the list of topics
that management of large corpora (for any currently suitable definition of “large”)
invokes or at least strongly brushes against.
Topics of interest
This year's event adds new items to the standard range of CMLC themes and addresses
some of the LREC-2026 focus topics:
Interoperability and accessibility
- How to make corpora as accessible as possible
- Interoperable APIs for query and analysis software
- Provision of multiple levels of access for different tasks
Machine/Deep Learning
- Data preparation for machine learning input
- Creation, curation, maintenance and
dissemination of language models based on machine learning (e.g. word
embeddings and entire deep learning networks)
- Legal issues concerning language model distribution
Linguistic content challenges
- Dealing with the variety of language:
multilinguality, minority and/or underrepresented languages, historical
texts, noisy OCR texts, user-generated content, etc.
- Diversity and inclusion in language resources
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations
- Ensuring linguistic integrity of data
through deduplication, correction of typos and errors, removal of
incomplete or malformed sentences, and filtering harmful, offensive and
toxic content, etc.
- Integrating different linguistic data types (text, audio, video, facsimiles, experimental data, neuroimaging data, …)
Technical challenges
- Storage and retrieval solutions for large
text corpora: primary data (potentially including facsimiles, etc.),
metadata, and annotation data
- Corpus versioning and release management
- Scalable and efficient NLP tooling for
annotating and analysing large datasets: distributed and GPGPU
computing; using big data analysis frameworks for language processing
- Dealing with streaming data (e.g. Social Media) and rapidly changing corpora
- Environmental impact of big language data computing
- Engineering and management of research software
Exploitation challenges
- Legal and privacy issues
- Query languages, data models, and standardisation
- Licensing models of open and closed data, coping with intellectual property restrictions
- Innovative approaches for aggregation and visualisation of text analytics
- Repurposing or extending application areas of existing corpora and tools
National corpus initiatives
In the tradition of CMLC, we invite reports on national corpus initiatives;
submitters of these reports should be prepared to present a poster. Given that it's
been a while since the last round, we would be happy to have a little "What's the
news?" session, and we cordially invite both our veteran presenters and
colleagues who have not yet introduced their national corpus projects.
Our poster sessions are usually scheduled to overlap with the coffee break, to ensure
an informal atmosphere and to make the most of the time slot available to us. A flash
presentation section is planned for just before the poster session: ca. 3 minutes for
the highlights.
LRE 2026 Map and the "Share your LRs!" initiative
When submitting a paper from the START page, authors will be asked to provide
essential information about resources (in a broad sense, i.e. also technologies,
standards, evaluation kits, etc.) that have been used for the work described in the
paper or are a new result of their research. Moreover, ELRA encourages all LREC
authors to share the described LRs (data, tools, services, etc.) to enable their
reuse and the replicability of experiments (including evaluation ones).
Programme Committee
- Laurence Anthony (Waseda University, Japan)
- Vladimír Benko (Slovak Academy of Sciences)
- Felix Bildhauer (IDS Mannheim)
- Mark Davies (English-Corpora.org)
- Nils Diewald (IDS Mannheim)
- Kaja Dobrovoljc (University of Ljubljana / Jožef Stefan Institute)
- Jarle Ebeling (University of Oslo)
- Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
- Andrew Hardie (Lancaster University, UK)
- Serge Heiden (ENS de Lyon)
- Ulrich Heid (University of Hildesheim)
- Nancy Ide (Vassar College / Brandeis University)
- Olha Kanishcheva (Heidelberg University)
- Gražina Korvel (Vilnius University)
- Natalia Kocyba (Samsung Poland)
- Michal Křen (Charles University, Prague)
- Anna Latusek (ICS PAS, Warsaw)
- Paul Rayson (Lancaster University)
- Laurent Romary (INRIA)
- Thomas Schmidt (University of Duisburg-Essen)
- Serge Sharoff (University of Leeds)
- Maria Shvedova (Kharkiv Polytechnic Institute / University of Jena)
- Irena Spasić (Cardiff University)
- Martin Wynne (University of Oxford)
Organising Committee
- Piotr Bański (IDS Mannheim)
- Dawn Knight (Cardiff University)
- Marc Kupietz (IDS Mannheim)
- Andreas Witt (IDS Mannheim)
- Alina Wróblewska (ICS PAS, Warsaw)