Dear colleagues,

We are happy to announce that we have released the second version of the South Slavic CLASSLA-web corpora. The corpus collection contains approximately 38 million texts and 17 billion words, collected from the web in 2024, and covers the full South Slavic language group: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. Compared to CLASSLA-web 1.0, the new web corpora are significantly expanded and largely consist of new texts.

The corpora are linguistically annotated, automatically classified by genre, and enriched with topic labels. The web corpus collection is intended for a wide range of uses, including corpus linguistics, lexicography, and other linguistic research, as well as for natural language processing tasks such as training and evaluating language models, and creating genre- or topic-specific datasets.

A detailed description of the resource can be found in the accompanying paper (https://doi.org/10.48550/arXiv.2601.11170). Further information on both CLASSLA-web 1.0 and 2.0 versions, including details on corpus construction, additional resources, a video describing the workflow, and citation guidelines, is available on the CLASSLA-web website: https://clarinsi.github.io/classla-web/

If you are interested in language resources and technologies for South Slavic languages, we invite you to browse the CLASSLA-web corpora via the CLARIN.SI concordancers (https://www.clarin.si/ske/#open) or download them under a CC0 license from the CLARIN.SI repository: http://hdl.handle.net/11356/2079

Best wishes,
CLASSLA-web authors: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel and Nikola Ljubešić, supported by CLARIN.SI and CLASSLA