In this newsletter:
LDC at ACL 2023
LDC data and commercial technology development
New publications:
Moroccan Arabic – English Lexical Database
LORELEI Indonesian Representative Language Pack
LDC at ACL 2023
LDC will be exhibiting at ACL 2023, held this year July 9-14 in Toronto, Canada. Stop by our booth to learn more about recent developments at the Consortium and the latest publications. LDC will post conference updates via Twitter and Facebook. We look forward to seeing you there!
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
New publications:
Moroccan Arabic – English Lexical Database was developed by LDC. It contains a set of five interrelated tables presenting each Moroccan Arabic word as an orthographic form in Arabic script and a pronunciation form in International Phonetic Alphabet (IPA) format. This release contains over 21,000 Moroccan Arabic words in Arabic script and IPA notation, and more than 33,000 English tokens.
This lexical database is the result of a collaboration with Georgetown University Press (GUP) to enhance and update three dialectal Arabic dictionaries (Iraqi, Moroccan, and Syrian) originally published in paper form in the 1960s by GUP. LDC also undertook to develop a lexical database for each dialect. The Georgetown Dictionary of Moroccan Arabic was published in 2019; this work was based on, and expanded, A Dictionary of Moroccan Arabic.
The enhancements developed by LDC included facilitating comparisons across Arabic dialects and Modern Standard Arabic by providing Arabic script spellings and IPA pronunciations for Moroccan words and phrases; promoting ease of use by language learners and researchers by developing reasonable orthographic conventions for applying the Arabic alphabet to the dialect; and facilitating a user's understanding of morphological and lexical relations by adding information on the linguistic structures of Moroccan Arabic.
2023 members can access this corpus through their LDC accounts provided they have submitted a signed copy of the special license agreement. Non-members may license this data for a fee.
*
LORELEI Indonesian Representative Language Pack comprises over 17 million words of Indonesian monolingual text, 950,000 words of found Indonesian-English parallel text, and 92,000 Indonesian words translated from English data. Over 113,000 words were annotated for named entities and more than 24,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs, and issues). Data was collected from discussion forums, news, reference sources, social networks, and weblogs.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance.
Membership Coordinator
University of Pennsylvania
T: +1-215-573-1275
M: 3600 Market St. Suite 810
Philadelphia, PA 19104