TL;DR: SHROOM-CAP is an Indic-centric shared task co-located with CHOMPS-2025 that aims to advance the state of the art in hallucination detection for scientific content generated with LLMs. We have annotated hallucinated content produced by top-tier LLMs in 4* high-resource languages and 3* surprise low-resource Indic languages. Participate in as many languages as you like by accurately detecting the presence of hallucinated content.
Stay informed by joining our Google group!
Full Invitation
We are excited to announce the SHROOM-CAP shared task on cross-lingual hallucination detection for scientific publication (link to website). We invite participants to detect whether or not there is hallucination in the outputs of instruction-tuned LLMs within a cross-lingual scientific context.
About: This shared task builds upon our previous iteration, SHROOM, with three key highlights: LLM-centered data, cross-lingual annotations, and prediction of both hallucination and fluency errors.
LLMs frequently produce "hallucinations": plausible but incorrect outputs that existing metrics, which prioritize fluency over correctness, largely fail to penalize. This is an issue of growing concern as these models are increasingly adopted by the public.
With SHROOM-CAP, we want to advance the state of the art in detecting hallucinated scientific content. This new iteration of the shared task is held in a cross-lingual and multi-model context: we provide data produced by a variety of open-weights LLMs in 4*+3* high- and low-resource languages (English, French, Spanish, Hindi, and to-be-revealed Indic languages).
Participants are invited to take part in any of the available languages and are expected to develop systems that accurately identify hallucinations in generated scientific content. Additionally, participants will be invited to submit system description papers, with the option to present them in oral/poster format during the CHOMPS workshop (co-located with IJCNLP-AACL 2025, Mumbai, India). Participants who elect to write a system description paper will be asked to review their peers' submissions (max 2 papers per author).
Key Dates:
All deadlines are "anywhere on Earth" (23:59 UTC-12).
- Dev set available by: 31.07.2025
- Test set available by: 05.10.2025
- Evaluation phase ends: 15.10.2025
- System description papers due: 25.10.2025 (TBC)
- Notification of acceptance: 05.11.2025 (TBC)
- Camera-ready due: 11.11.2025 (TBC)
- Proceedings due: 01.12.2025 (TBC)
- CHOMPS workshop: 23/24 December 2025 (co-located with IJCNLP-AACL 2025)
Evaluation Metrics: Participants will be ranked along two criteria:
1. factuality mistakes, measured via macro-F1 between gold-reference and predicted labels;
2. fluency mistakes, measured via macro-F1 between gold-reference and predicted labels, based on our annotations.
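For concreteness, the ranking metric can be sketched as follows. This is a minimal, illustrative implementation of macro-F1 over binary labels; the label encoding (1 = hallucination present) and the example data are hypothetical, not from the organizers.

```python
def macro_f1(gold, pred, labels=(0, 1)):
    """Macro-F1: compute F1 per class, then average with equal class weight."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy example: 1 = hallucination present, 0 = absent (hypothetical labels).
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1]
print(round(macro_f1(gold, pred), 3))  # → 0.667
```

Macro-averaging weights both classes equally, so a system cannot score well by always predicting the majority class.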
Rankings and submissions will be done separately per language: you are welcome to focus only on the languages you are interested in!
How to Participate:
- Register: Please register your team at https://forms.gle/hWR9jwTBjZQmFKAE7 and join our Google group: https://groups.google.com/g/shroomcap
- Submit results: use our platform to submit your results before 15.10.2025
- Submit your system description: system description papers should be submitted by 25.10.2025 (TBC; further details will be announced at a later date).
Want to be kept in the loop?
Join our Google group mailing list! We look forward to your participation and to the exciting research that will emerge from this task.
Best regards,
SHROOM-CAP organizers
We welcome you to the next Natural Language Processing and Vision (NLPV) seminars at the University of Exeter.
Talk 1
Scheduled: Thursday 21 Aug 2025, 16:00 to 17:00 (GMT+1)
Location: https://Universityofexeter.zoom.us/j/97587944439?pwd=h4rnPO0PafT9oRrrqQsezG… (Meeting ID: 975 8794 4439 Password: 064414)
Title: Trustworthy Optimization of Pre-Trained Models for Healthcare: Generalizability, Adaptability, and Security
Abstract: Pre-trained language models have opened new possibilities in healthcare, showing promise in mining scientific literature, analyzing large-scale clinical data, identifying patterns in emerging diseases, and automating workflows, positioning themselves as intelligent research assistants. However, general-purpose models, typically trained on web-scale corpora, often lack the clinical grounding necessary for reliable deployment in high-stakes domains like healthcare. To be effective, they must be adapted to meet domain-specific requirements. My PhD thesis addresses three core challenges in leveraging pre-trained models for healthcare: (i) the scarcity of labeled data for fine-tuning, (ii) the evolving nature of healthcare data, and (iii) the need to ensure transparency and traceability of AI-generated content. In this talk, I will focus on the third challenge: enabling traceability of content generated by large language models. I will begin with an overview of prior watermarking approaches and then present our proposed solution. We introduce a watermarking algorithm applied at inference time that perturbs the model’s logits to bias generation toward a subset of vocabulary tokens determined by a secret key. To ensure that watermarking does not compromise generation quality, we propose a multi-objective optimization (MOO) framework that employs lightweight networks to produce token-specific watermarking logits and splitting ratios, specifying how many tokens to bias and by how much. This approach effectively balances watermark detectability with semantic coherence. Experimental results show that our method significantly improves detectability and robustness against removal attacks while preserving the semantics of the generated text, outperforming existing watermarking techniques.
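As a rough illustration of the inference-time logit-biasing idea described in the abstract, one can bias generation toward a keyed "green" subset of the vocabulary. The hashing scheme, toy vocabulary size, fixed bias delta, and key below are hypothetical simplifications for intuition only, not the speaker's actual method (which learns token-specific logits and splitting ratios via multi-objective optimization).

```python
import hashlib
import random

VOCAB_SIZE = 8            # toy vocabulary (hypothetical)
SECRET_KEY = "demo-key"   # hypothetical secret key

def green_list(prev_token, key=SECRET_KEY, fraction=0.5):
    """Keyed pseudo-random subset of the vocabulary, seeded by the previous
    token so the green/red split changes at every generation step."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB_SIZE), int(VOCAB_SIZE * fraction)))

def watermark_logits(logits, prev_token, delta=2.0):
    """Perturb logits at inference time: add a bias delta to 'green' tokens,
    nudging (not forcing) generation toward the keyed subset."""
    greens = green_list(prev_token)
    return [z + delta if i in greens else z for i, z in enumerate(logits)]

biased = watermark_logits([0.0] * VOCAB_SIZE, prev_token=3)
print(sum(1 for z in biased if z > 0))  # → 4 (half the toy vocabulary biased)
```

A detector holding the secret key can recompute the green lists and test whether generated text over-uses green tokens; the MOO framework in the talk addresses the trade-off this sketch ignores, namely how much bias to apply per token without degrading semantics.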
Speaker's bio: Dr. Sai Ashish Somayajula is a Senior Applied Scientist in Generative AI at Oracle Cloud Infrastructure, where he develops large-scale foundation models for enterprise applications. He earned his PhD in Electrical and Computer Engineering from the University of California (UC), San Diego. His research focused on addressing key challenges in adapting and utilizing pre-trained models for healthcare. Specifically, his work spanned three core areas: (1) synthetic data generation using meta-learning-based feedback mechanisms, (2) continual learning for handling dynamic data streams without catastrophic forgetting, and (3) token-level watermarking techniques to ensure content provenance and security. His research has been published in premier venues, including the International Conference on Machine Learning (ICML), Annual Meeting of the Association for Computational Linguistics (ACL), Transactions of the Association for Computational Linguistics (TACL), Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Scientific Reports (Nature Portfolio), and Transactions of Machine Learning Research (TMLR). He is a recipient of the Jacobs School of Engineering Departmental Fellowship at UC San Diego. Ashish has collaborated with leading industrial research labs through internships at Apple and Tencent AI Lab. He holds a Bachelor's degree in Electrical Engineering with a minor in Computer Science from the Indian Institute of Technology, Hyderabad, where he was twice awarded the Academic Excellence Award, and a Master’s in Intelligent Systems and Robotics from UC San Diego.
Talk 2
Scheduled: Thursday 4 Sep 2025, 13:00 to 14:00 (GMT+1)
Location: https://Universityofexeter.zoom.us/j/95827730937?pwd=Te1wejfgr68A5lplwLQjxw…
(Meeting ID: 958 2773 0937 Password: 879296)
Title: Towards end-to-end tokenization and adaptive memory in foundation models
Abstract: Foundation models (FMs) process information as a sequence of internal representations; however, the length of this sequence is fixed and entirely determined by tokenization. This essentially decouples representation granularity from information content, which exacerbates the deployment costs of FMs and narrows their “horizons” in long sequences. What if, instead, we could free FMs from tokenizers by modelling bytes directly, while making them faster than current tokenizer-bound FMs? I argue that a recipe to achieve this goal already exists. In particular, I helped prototype how to: 1) dynamically pool representations in internal layers, progressively learning abstractions from raw data; 2) compress the KV cache of Transformers during generation without loss of performance; 3) predict multiple bytes per time step in an efficient yet expressive way; 4) retrofit existing tokenizer-bound FMs into byte-level FMs through cross-tokenizer distillation. By blending these ingredients, we may soon witness the emergence of efficient byte-level FMs.
Speaker's short bio (based on website): Edoardo Ponti is an assistant professor in Natural Language Processing at the University of Edinburgh and a visiting professor at NVIDIA. His research focuses on efficient architectures (see the NeurIPS 2024 tutorial on dynamic sparsity), modular deep learning (designing neural architectures that route information to specialised modules, e.g., sparse subnetworks), and computational typology (understanding how languages vary, across the world and its cultures, within a computational and mathematical framework). Previously, Edoardo was a visiting postdoctoral scholar at Stanford University and a postdoctoral fellow in computer science at Mila - Quebec AI Institute in Montreal. In 2021, Edoardo obtained a PhD from the University of Cambridge, St John's College. Once upon a time he studied typological and historical linguistics at the University of Pavia. Edoardo's research has been featured in The Economist and Scientific American, among others. He received a Google Research Faculty Award and two Best Paper Awards, at EMNLP 2021 and RepL4NLP 2019. Edoardo is a board member of SIGTYP, the ACL special interest group for computational typology, a Scholar of the European Lab for Learning and Intelligent Systems (ELLIS), and part of the TACL journal editorial team.
We will update future talks at the website: https://sites.google.com/view/neurocognit-lang-viz-group/seminars
Join our *Google group* for future seminar and research information: https://groups.google.com/g/neurocognition-language-and-vision-processing-g…
RANLP 2025 TUTORIALS (6-7 September)
Call for Participation
Website - https://ranlp.org/ranlp2025/index.php/tutorials/
RANLP 2025 belongs to a sequence of events with a similar name and continues the tradition of successful training events held in Bulgaria since 1989.
RANLP 2025 plans 4 half-day tutorials, each lasting 185 minutes, structured as follows: 45 min presentation + 20 min break + 45 min presentation + 30 min coffee break + 45 min presentation.
Tutorial Presenters
* Burcu Can Buglalilar (University of Stirling, UK)
* Salima Lamsiyah (University of Luxembourg, Luxembourg)
* Tharindu Ranasinghe and Damith Dola Mullage Premasiri (Lancaster University, UK)
* Anna Rogers and Max Müller-Eberstein (IT University of Copenhagen, Denmark)
Programme
6th September 2025, 9am
Tharindu Ranasinghe and Damith Premasiri: Legal NLP in the LLM era
This tutorial examines the transformation of Legal NLP in the era of large language models, beginning with key principles of task formulation and data preparation. We will discuss retrieval and judgment prediction in detail, exploring their methodologies, challenges, and applications in legal contexts. We conclude with a forward-looking discussion on the future of Legal AI and the ethical considerations surrounding its applications in the practice of law.
6th September 2025, 2pm
Burcu Can Buglalilar: From Large to Small: Building Affordable Language Models with Limited Resources
This tutorial aims to question the limitations and harms of Large Language Models, followed by a comprehensive review of Small Language Models, covering prominent examples, their key techniques, and their capabilities. It will also give an overview of even smaller ‘baby’ language models. Finally, the tutorial will conclude by presenting some recent studies in which we developed baby language models using a very small amount of data.
7th September 2025, 9am
Anna Rogers and Max Müller-Eberstein: Studying Generalization in the Age of Contamination
The tutorial will discuss the challenges of doing NLP research in the age of LLMs, when we can no longer be sure that the test data was not observed in training. We will cover the main approaches to studying generalization in various settings, and present a new framework for working with controlled test-train splits across linguistically annotated data at scale.
7th September 2025, 2pm
Salima Lamsiyah: AI Content in NLP: Trends, Detection, and Applications
This tutorial provides a comprehensive overview of AI-generated content in Natural Language Processing (NLP). It covers recent trends in text generation, methods for detecting AI-generated text, and practical applications of such content. The content includes an exploration of state-of-the-art models and techniques for text generation, approaches to identifying machine-generated text, a review of key benchmarks and datasets, and a discussion of open research challenges.
We are looking forward to your participation!
The organisers of RANLP 2025
Call for Workshop Proposals
ECIR 2026 (https://ecir2026.eu/) workshops provide a platform for presenting novel ideas and research results in emerging areas in IR in a focused and interactive way.
Workshops can be either a half-day (3.5 hours plus breaks) or a full day (7 hours plus breaks). The organizers of approved workshops are expected to set up a webpage for the workshop, disseminate the call for papers and the call for participation, gather and review submissions, and prepare the final program. A camera-ready summary of the workshop, written by the organizers, will be included in the ECIR conference proceedings.
Workshops are encouraged to be as dynamic and interactive as possible and should lead to a concrete outcome, such as the publication of workshop proceedings. Organizers are also encouraged to write a summary article for the June edition of the ACM SIGIR Forum, highlighting the main results of the workshop.
Workshops are on site, and at least one organizer is expected to attend the workshop.
Topics of Interest
We welcome submissions on any topic relevant to the general field of Information Retrieval, including those mentioned in the Call for Full papers for ECIR 2026.
Submission Guidelines
Workshop proposals should contain the following information:
Title and abstract of the workshop;
Motivation and relevance to ECIR;
Workshop goals/objectives and overall vision, coupled with desired outcomes;
Format and structure, in particular: duration of the workshop (full-day or half-day); types of papers (e.g., full papers, demo papers, negative-results papers); type of presentation (e.g., oral, poster); proceedings (e.g., CEUR, special issue); planned activities and a tentative schedule of events; resources needed to deliver the workshop (e.g., poster boards);
Intended audience, including number of expected participants and how they will be selected/invited;
List of organizers, with a brief bio highlighting the relevance of their expertise to the workshop topics;
Names of potential programme committee members, invited speakers, etc.;
Indicate if the workshop is related to or follows on from another workshop; if so, please identify which conference it was previously held at, the past attendance and outcomes, and why another workshop is needed;
Any other relevant information to support your proposal.
Workshop proposals should be prepared using Springer proceedings templates available on the Springer webpage, with a maximum length of 8 pages. All proposals must be in English and will be submitted electronically through the conference submission system. Workshop proposals will be reviewed by the ECIR 2026 workshop committee based on the quality of their proposal, covered topics, relationship to ECIR, and likelihood of attracting participants. The ECIR workshop co-chairs will make final decisions.
Springer webpage:
https://www.springer.com/gp/computer-science/lncs/conference-proceedings-gu…
Submission page:
EasyChair submission page: https://easychair.org/conferences/?conf=ecir2026
Ethics and Professional Conduct
ECIR 2026 expects authors (as well as the PC, and the organising committee) to adhere to accepted standards on ethics and professionalism in our community, namely:
The ACM’s Policy on Authorship,
The ACM’s Code of Ethics and Professional Conduct,
The ACM’s Conflict of Interest Policy,
The ACM’s Policy on Plagiarism, Misrepresentation, and Falsification,
The ACM’s Policy Against Harassment
Workshop Proposals Track Dates
Workshop proposals submission: September 12, 2025, 11:59pm (AoE)
Workshop proposals notification: October 17, 2025
Workshop day: April 02, 2026
Workshop Proposals Track Chairs
Negar Arabzadeh (UC Berkeley)
Franco Maria Nardini (ISTI-CNR, Pisa, Italy)
Contact: ecir2026-workshops(a)easychair.org
Call for Short Papers
The European Conference on Information Retrieval (ECIR) is the prime European forum for the presentation of original research in the field of Information Retrieval. The 48th European Conference on Information Retrieval (ECIR 2026) will take place as a physical (in-person) conference from 29 March to 2 April 2026 in Delft, The Netherlands.
Topics of Interest
The Short Paper Track calls for original contributions presenting novel, thought-provoking ideas and addressing innovative application areas within the field of Information Retrieval, including those mentioned in the Call for Full Papers for ECIR 2026.
Short papers differ from full papers in that they present innovative new works, but may be narrower in scope or applications. Submissions may include preliminary ideas, but still should provide empirical or theoretical validation. Papers that stimulate discussion are particularly encouraged.
• Short papers are up to 6 pages in length, plus additional pages for references. Appendices count toward the page limit. Please put appendices before the references for paper submission.
• Short papers will be refereed through double-anonymous peer review. This means that all submitted papers must be fully anonymised.
Submission Guidelines
Authors should consult Springer's authors' guidelines and use their proceedings templates, either for LaTeX or for Word (to be found at https://www.springer.com/gp/computer-science/lncs/conference-proceedings-gu…), for the preparation of their papers. Springer encourages authors to include their ORCIDs in their papers (https://www.springer.com/gp/authors-editors/orcid).
All submissions must be written in English. All papers should be submitted electronically through the EasyChair submission system: https://easychair.org/conferences/?conf=ecir2026.
In addition, the corresponding author of each accepted paper, acting on behalf of all of the authors of that paper, must complete and sign a Consent-to-Publish form. The corresponding author signing the copyright form should match the corresponding author marked on the paper. Once the paper has been submitted, changes relating to its authorship cannot be made.
Accepted papers will be published in the conference proceedings in the Springer Lecture Notes in Computer Science series. The proceedings will be distributed to all delegates at the conference. Accepted papers will have to be presented at the conference by one of the authors in person, and at least one author for each accepted contribution will be required to register and attend.
Dual Submission Policy
Papers submitted to ECIR 2026 should be substantially different from papers that have been previously published, or accepted for publication, or that are under review at other venues. Exceptions to this rule are:
Submission is permitted for papers presented or to be presented at conferences or workshops without proceedings.
Submission is permitted for papers that have previously been made available as a technical report (e.g., in institutional archives or preprint archives like arXiv). Please do not cite your technical report, and make an effort to avoid any issues that may harm the anonymity of your submission. Reviewers will receive guidance asking them to refrain from trying to break the anonymity, but be aware that the availability of a technical report for an ECIR submission might cause some issues.
Ethics and Professional Conduct
ECIR 2026 expects authors (as well as the PC and the organising committee) to adhere to accepted standards on ethics and professionalism in our community, namely:
• The ACM's Policy on Authorship,
• The ACM's Code of Ethics and Professional Conduct,
• The ACM's Conflict of Interest Policy,
• The ACM's Policy on Plagiarism, Misrepresentation, and Falsification,
• The ACM's Policy Against Harassment
Short Paper Track Dates
• Short paper abstract submission: October 7, 2025, 11:59pm (AoE)
• Short paper submission: October 14, 2025, 11:59pm (AoE)
• Short paper notification: December 16, 2025 (AoE)
• Main conference: March 30 – April 1, 2026
Short Paper Track Chairs
• Sean MacAvaney (University of Glasgow, UK)
• Mohammad Aliannejadi (University of Amsterdam, The Netherlands)
• Christine Bauer (University of Salzburg, Austria)
• Contact: ecir2026-short AT easychair.org
Call for Papers: Historical Languages and AI
See the online version at https://daidalos-projekt.de/conference/cfp/ .
March 5-6, 2026
The intersection of historical languages and artificial intelligence
(AI) presents a rich and dynamic field of study, with the potential to
revolutionize our understanding of the past and the ways in which we
engage with historical texts. As digital technologies continue to
advance, the need for interdisciplinary collaboration becomes
increasingly apparent. The upcoming 2-day international conference on
“Historical Languages and AI” aims to foster this collaboration by
bringing together experts from computational literary studies, digital
history, linguistics, and other domains that work with historical
languages such as Latin.
The conference seeks to address the growing demand for innovative
methods and tools that can enhance the analysis, preservation, and
interpretation of historical languages. By leveraging AI technologies,
researchers can unlock new insights into historical texts, improve the
accuracy of translations, and develop more effective teaching methods
for historical languages. The conference will provide a platform for
scholars to share their latest findings, discuss emerging trends, and
explore the practical applications of AI in historical language
research. It explicitly includes historical stages of modern languages,
such as Old English or Early New High German.
The conference is hosted by the Daidalos research project (Humboldt
University Berlin, 2023-2026; https://daidalos-projekt.de ). The project
is building a research infrastructure for methods of natural language
processing (NLP). The target group is literary scholars in classical
philology and related disciplines. The research infrastructure consists,
on the one hand, of an interactive website on which interested parties
can apply NLP methods to text corpora. On the other hand, the Daidalos
project sees itself as a contact point for interested researchers. In
this function, the project regularly invites researchers to workshops
(https://daidalos-projekt.de/workshops), advises them within the
framework of research tandems (https://daidalos-projekt.de/tandems), and
provides materials for further training
(https://daidalos-projekt.de/jupyterlite).
Conference Dates: March 5-6, 2026
Venue: Humboldt-Universitaet zu Berlin (Berlin, Germany)
Unfortunately, we cannot offer travel bursaries. Attending the
conference itself is free of charge.
Topics of Interest
We welcome submissions on a wide range of topics related to historical
languages and AI, including but not limited to:
Machine Learning
- Large Language Models / Large Action Models
- Usage for data modeling or corpus construction
- Challenges in low-resource scenarios
- Neural machine translation for historical texts
Innovative approaches to historical language analysis
- Linguistic analysis for literary studies
- Part-of-speech tagging
- Topic modeling
- Sentiment analysis
- Named entity recognition
- Word embeddings
- Multilingual Information Retrieval, incl. cross-lingual embeddings
Evaluation of AI-driven methods and datasets
- Frameworks for mapping research questions to relevant AI models and methods
- Assessment of AI tools in historical language studies
Technical Infrastructure for Research & Teaching
- Integrating technologies like Jupyter Notebooks into larger software platforms
- Retrieval-augmented generation for domain-specific chatbots
Teaching & Learning Digital Literacies, incl. open educational resources for teaching natural language processing
Important Dates
Submission Deadline: September 1, 2025
Notification of Acceptance: October 15, 2025
Camera-Ready Submission: January 31, 2026
Conference Dates: March 5-6, 2026
Submission types
Included in the open-access proceedings:
*Long papers*: up to 4000 words (ca. 8 pages, excl. bibliography and
appendix). Long papers report on original and unpublished results. Long
papers are presented as oral presentations (30 min talk + 15 min
discussion). We welcome the use of appendices or other supplementary
information.
Published only in the book of abstracts in our Zenodo Community:
*Short papers*: up to 2000 words (ca. 4 pages, excl. bibliography and
appendix). Short papers report on focused contributions, and may present
work in progress. Short papers are presented as short oral presentations
(20 min talk + 10 min discussion). We welcome the use of appendices or
other supplementary information.
*Pitch Your Research Idea*: Submit an abstract of up to 200 words (excl.
bibliography and appendix) to give a 5-minute presentation during a
pitch session. The presentations are followed by a Scientific Speed
Dating Session, enabling researchers to get in touch quickly.
Workshops (90 min):
Submit a proposal for your intended workshop of up to 750 words.
Workshops should be organized as a hands-on research or learning
opportunity. The workshops will take place on the second day of the
conference (March 6, 2026). Workshop proposals should describe:
the aims and setup of the workshop,
the academic background for the work,
an outline of the workshop, including the types of activities,
the expected key outcomes,
a short bio of each organizer or presenter, including their
name, affiliation, email address,
a plan for promoting the workshop to attract participants,
specific requirements, including but not limited to special
equipment (e.g., audio/video), software, physical space arrangements,
any technical knowledge, skills, or experience participants
should have before attending the workshop.
Submission Guidelines and Participation
All submissions must be in English or German.
Papers should be formatted according to the conference template:
Template of the Association for Computational Linguistics
(https://github.com/acl-org/acl-style-files). It supports both Microsoft
Word and LaTeX.
Submissions will be peer-reviewed by the organizers.
Papers should be submitted as PDF documents via E-Mail:
daidalos-projekt(a)hu-berlin.de
At least one author of each accepted submission must register for
the conference and present the paper.
Proceedings of the conference will be published as a Propylaeum
eBook in the Digital Classics Books series (for long papers;
https://books.ub.uni-heidelberg.de/propylaeum/catalog/series/dcb) and on
Zenodo (for all other submissions; https://zenodo.org/communities/daidalos).
Hybrid conference: All paper presentations will be broadcast live.
Presenters can choose to participate remotely or on-site. On-site
attendance is required to participate in the more interactive activities
of the conference, e.g. workshops.
Contact Information
For any inquiries, please contact the conference organizers at
daidalos-projekt(a)hu-berlin.de .
We look forward to receiving your submissions and welcoming you to the
International Conference on Historical Languages and AI!
The Conference *Organizing Committee* of the Daidalos project: Andrea
Beyer, Konstantin Schulz, Anke Lüdeling, Florian Kotschka, Florian
Deichsler, Malte Dreyer
*Analysing Clinical Documents to Support Decision Making Processes in
Emergency Departments*
*Deadline for application: August 26 2025, 13:00 CEST*
One three-year PhD grant on Analysing Clinical Documents to Support
Decision Making Processes in Emergency Departments is offered by the
Doctoral Program in Brain, Mind & Computer Science (BMCS,
http://hit.psy.unipd.it/BMCS) at the University of Padua, jointly with the
Natural Language Processing research unit (https://nlplab.fbk.eu/) at
Fondazione Bruno Kessler (Trento, Italy), where most of the research
activities will be conducted.
The language of the PhD programme is English.
The deadline for application is: August 26 2025, 13:00 CEST
For more information, the call, and applications look at:
http://hit.psy.unipd.it/BMCS/admission
The candidate will have the unique opportunity to explore different fields
(Natural Language Processing, Machine Learning, Health & Well-Being) while
being directly coached by very experienced teammates.
The PhD student will work in an international environment at Fondazione
Bruno Kessler (Trento, Italy).
This PhD grant intends to exploit the capacity of Large Language Models
(LLMs) to interpret the content of clinical documents produced in Emergency
Departments (EDs) of hospitals, in order to improve service quality for
patients. The final goal of the project is to advance the integration of
generative AI models into healthcare, improving their alignment with
clinical expertise and the processes in EDs.
The major context of the PhD will be the Horizon project eCREAM (
ecreamproject.eu/), which involves several EDs in different EU countries
through active scientific protocols. On the one hand, the project will
take advantage of LLMs for automatic filling of Case Report Forms from
anonymized clinical notes in several languages. On the other hand, the
reasoning capacities of LLMs will be applied to the extracted information
to derive statistical analyses that help decision makers improve process
efficiency.
The adoption of LLMs in the clinical field raises a number of research
challenges, which will be addressed during the PhD. Such challenges include
improving accuracy, interpretability of decisions in classification tasks,
coherence of reasoning, mitigation of biases, and risks related to data
security.
Fondazione Bruno Kessler is an internationally well-known research center,
whose information technology department ranks first among the Engineering
and Information Science research centers in Italy.
The Natural Language Processing research unit (https://nlplab.fbk.eu/) is
an internationally well-known research group focused on text mining
(information extraction and ontology population from text, analysis of the
sentiment and emotional content of texts); conversational agents
(task-oriented dialogue systems, question answering, generation of
persuasive messages); and development of linguistic resources, particularly
for the Italian language.
To get in contact with the NLP research unit and discuss the opportunities
of this call, contact Bernardo Magnini (magnini(a)fbk.eu).
The Doctoral Program in Brain, Mind & Computer Science (BMCS) emerges from
the close collaboration between faculty from psychology, cognitive
neuroscience and information science around the unifying topic of
human-computer interaction. Its program rests on the assumption that the
ability to work in groups with people of different backgrounds is now a
fundamental condition for producing scientific excellence and developing
innovative skills that can be applied in the job market.
****Required/Preferred Candidate Skills and Competencies****
The candidate should possess basic knowledge of Natural Language Processing and Machine Learning techniques (particularly deep learning architectures and large language models). Experience with biomedical/clinical data will be a plus. Basic programming skills (e.g., Python) would complete the profile. Proficiency in English is required; basic knowledge of Italian is preferable.
****Instructions for applicants****
Interested applicants are invited to apply following the instructions given at
https://pica.cineca.it/unipd/dottorati41luglio
by August 26, 2025, 13:00 CEST.
For further information, please contact: Bernardo Magnini (magnini(a)fbk.eu)
Dear all,
Please join us for a free Lancaster webinar on a very current topic, a critical reflection on current developments and challenges in the use of AI: <https://x.com/danagablas/status/1955682624455819296>
Navigating Challenges in the Use of AI and GenAI in Applied Linguistics by Prof Tony McEnery
19 August 2025 | 2-3pm (UK time)
https://forms.office.com/e/uppRBrE5AF
Abstract: In the webinar, we will focus on the current developments in the use of GenAI and AI in applied linguistics. In particular, Prof Tony McEnery will explore the impact of AI on applied linguistics, reflecting on the alignment of contemporary AI research with the epistemological, ontological, and ethical traditions of applied linguistics. The talk will discuss the potential affordances of AI and GenAI for applied linguistics as well as some of the challenges that we face when employing AI and GenAI as part of applied linguistics research processes. The goal of this talk is to attempt to align perspectives in these disparate fields and forge a fruitful way ahead for further critical interrogation and integration of AI and GenAI into applied linguistics.
Best,
Vaclav
Professor Vaclav Brezina
Professor of Corpus Linguistics
Co-Director of the ESRC Centre for Corpus Approaches to Social Science
Lancaster, LA1 4YD
Office: County South, room C05
T: +44 (0)1524 510828
@vaclavbrezina
In this newsletter:
LDC at Interspeech 2025
Fall 2025 LDC data scholarship program
New publications:
Mixer 6 - CHiME 8 Transcribed Calls and Interviews<https://catalog.ldc.upenn.edu/LDC2025S07>
Abstract Meaning Representation 2.0 - Machine Translations<https://catalog.ldc.upenn.edu/LDC2025T10>
KAIROS Phase 1 Quizlet<https://catalog.ldc.upenn.edu/LDC2025T11>
________________________________
LDC at Interspeech 2025
LDC will be exhibiting at Interspeech 2025<https://www.interspeech2025.org/>, held this year August 17-21 in Rotterdam, the Netherlands. Stop by our booth to say hello and learn about the latest developments at the Consortium. Also be on the lookout for the following presentations, posters, and special sessions featuring LDC work:
Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis
Monday, August 18, 11:00-13:00 - Area5-Oral1 - Speech Analysis, Detection and Classification 1
Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language Models
Tuesday, August 19, 13:30-15:30 - Area1-Poster2B - Databases and Progress in Methodology
Special Session: Challenges in Speech Collection, Curation and Annotation<https://sites.google.com/view/speech-data-cca-is25/>
Wednesday, August 20, 13:30-15:30 - Area14-SS7 - Part 1
Wednesday, August 20, 16:00-18:00 - Area14-SS8 - Part 2
TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
Thursday, August 21, 13:30-15:30 - AREA4-Oral8 - Speaker Recognition
LDC also supported the Interspeech 2025 URGENT Challenge<https://urgent-challenge.github.io/urgent2025/> which aims to bring more attention to constructing Universal, Robust, and Generalizable speech EnhancemeNT models.
LDC will post conference updates via our social media platforms. We look forward to seeing you in Rotterdam!
Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
Mixer 6 - CHiME 8 Transcribed Calls and Interviews<https://catalog.ldc.upenn.edu/LDC2025S07> was developed for the 7th and 8th CHiME (Computational Hearing in Multisource Environments)<https://www.chimechallenge.org/> challenges. It contains 80 hours of English interviews and telephone speech from Mixer 6 Speech (LDC2013S03)<https://catalog.ldc.upenn.edu/LDC2013S03> with transcripts developed for the CHiME challenges divided into training, development, and test sets. This data was used in CHiME 7 Task 1<https://www.chimechallenge.org/challenges/chime7/task1/index> and CHiME 8 Task 1<https://www.chimechallenge.org/challenges/chime8/task1/>, both of which focused on transcription and segmentation across varied recording conditions such as interviews, meetings, and dinner parties, with an emphasis on generalization across recording device types and array topologies.
The data includes audio from Mixer 6 Speech recorded on 13 microphones for a total of 1,063 hours (corresponding to 80 hours of speech). The development and test sets are speaker-disjoint from the training data and consist of fully transcribed, multi-microphone interviews. Each transcript segment was labeled with the speaker, the uttered text, and the start and end times in seconds for that segment.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
Abstract Meaning Representation 2.0 - Machine Translations<https://catalog.ldc.upenn.edu/LDC2025T10> was developed at the University of Edinburgh, School of Informatics <https://www.ed.ac.uk/informatics> and the University of Zurich,<https://www.uzh.ch/en.html> Department of Computational Linguistics<https://www.cl.uzh.ch/en.html>. It consists of Spanish, German, Italian, and Mandarin Chinese automatic translations of the source English and professionally-translated Spanish, German, Italian, and Mandarin Chinese sentences in Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07)<https://catalog.ldc.upenn.edu/LDC2020T07>. The translations were collected through Google Translate between May 2018 and March 2024.
The source English sentences are a subset (1,371 sentences) of the sentences contained in Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10)<https://catalog.ldc.upenn.edu/LDC2017T10>, a semantic treebank of over 39,000 English natural language sentences from broadcast conversations, newswire, and web text.
Translations were produced from each of the five languages (English, Spanish, German, Italian, and Mandarin Chinese) into the other four, covering 20 language pairs. The dataset contains 1,371 source sentences in each language, each with a professionally translated source sentence and multiple dated translations by Google Translate.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
KAIROS Phase 1 Quizlet<https://catalog.ldc.upenn.edu/LDC2025T11> was developed by LDC and contains English and Spanish text, video, and image data and annotations used for pre-evaluation research and system development during Phase 1 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives, enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 1, which focused on two real-world complex events (CEs) within the Improvised Explosive Device bombing scenario: CE1001 (2018 Caracas drone attack) and CE1002 (Utah High School backpack bombing).
Source data was collected from the web; 30 root web pages were collected and processed, yielding 29 text data files, 216 image files and 5 video files. Annotation steps included labeling scenario-relevant events and relations for each document to develop a structured representation of temporally ordered events, relations, and arguments and generating a reference knowledge graph.
The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions, and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu<mailto:ldc@ldc.upenn.edu>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
The 3rd Summer School on Deep Learning and Large Language Models for NLP
Call for Participation
Website - https://ranlp2025-summer-school.github.io/
We invite everyone interested in Machine Learning and Natural Language Processing to attend the 3rd Summer School on Deep Learning and Large Language Models (LLMs) for Natural Language Processing (NLP), which will be held from September 3–5, 2025, in Varna, Bulgaria, as part of RANLP 2025.
Building on the success of the 1st and 2nd RANLP summer schools, held in 2019 and 2023 respectively, the RANLP 2025 summer school will explore a broad spectrum of NLP topics with a special emphasis on LLMs.
Each day will feature morning lectures focused on theoretical foundations, followed by afternoon lab sessions dedicated to hands-on implementation and experimentation. Participants will also have the opportunity to take part in a competition, with awards presented to the top performers.
The summer school will feature talks by leading researchers in NLP and deep learning from both academia and industry.
Summer School Lecturers
* Dr Salima Lamsiyah (Luxembourg University, Luxembourg)
* Dr Burcu Can Buglalilar (University of Stirling, UK)
* Dr Hansi Hettiarachchi (Lancaster University, UK)
* Dr Andrei Mikheev (Daxtra Technologies, UK)
* Dr Max Müller-Eberstein (IT University of Copenhagen, Denmark)
* Dr Tharindu Ranasinghe (Lancaster University, UK)
Summer School Tutors
* Maram Alharbi (Lancaster University, UK)
* Isuri Nanomi Arachchige (Lancaster University, UK)
* Salmane Chafik (Mohammed VI Polytechnic University, Morocco)
* Ernesto Luis Estevanell (University of Alicante, Spain)
* Alexander Mikheev (Daxtra Technologies, UK)
* Damith Dola Mullage Premasiri (Lancaster University, UK)
Programme
Day 1: NLP/DL Foundation
09:00-10:30  Introduction to NLP and Deep Learning (Dr Tharindu Ranasinghe)
10:30-11:00  Coffee/Tea Break
11:00-12:30  Language Models and Beyond (Dr Hansi Hettiarachchi)
12:30-14:00  Lunch
14:00-15:00  Practical Session I: Word Embeddings and Deep Learning in NLP
15:00-15:30  Introducing the Summer School Competition, Your Teams and Mentors
15:30-16:00  Coffee/Tea Break
16:00-17:30  Practical Session II: Transformers in NLP
17:30-18:00  Open Session with Mentors and Teams

Day 2: LLM Foundation
09:00-10:30  Introduction to LLMs (Dr Burcu Can)
10:30-11:00  Coffee/Tea Break
11:00-12:30  Evaluating and Benchmarking LLMs (Dr Salima Lamsiyah)
12:30-14:00  Lunch
14:00-15:00  Practical Session III: Prompting and Fine-tuning LLMs
15:30-16:00  Coffee/Tea Break
16:00-17:30  Practical Session IV: Tools for LLMs: LangGraph
17:30-18:00  Open Session with Mentors and Teams

Day 3: LLM Applications
09:00-10:30  Training a Danish LLM: Lessons Learned (Dr Max Müller-Eberstein)
10:30-11:00  Coffee/Tea Break
11:00-12:30  LLMs in the Recruitment Sector (Tentative) (Dr Andrei Mikheev and Alexander Mikheev)
12:30-14:00  Lunch
14:00-15:00  Practical Session V: Implementing LLMs in the Legal Domain
15:30-16:00  Coffee/Tea Break
16:00-17:30  Practical Session VI: LLMs for Code Generation
17:30-18:00  Awards and Closing
We are looking forward to your participation!
The organisers of RANLP 2025 Summer School