The rapid evolution of Large Language Models (LLMs) has reshaped the landscape of natural language processing (NLP), enabling remarkable progress in tasks such as machine translation, text summarisation, question answering, and speech recognition. Yet these advances remain unevenly distributed, with low-resource languages continuing to face significant barriers. Limited data availability, linguistic complexity, and the dominance of high-resource languages in public datasets have led to persistent disparities in LLM performance and accessibility, undermining efforts toward global linguistic inclusivity and digital equity.
The rise of Big Data offers new avenues to mitigate these challenges. By harnessing large-scale datasets from digital platforms, social media, online archives, and multilingual corpora, researchers are better positioned to build robust models that reflect the linguistic richness of underrepresented languages. Techniques such as web scraping, data augmentation, and scalable data pipelines have become instrumental in collecting and curating the diverse data necessary for training and fine-tuning LLMs. Moreover, advanced approaches such as cross-lingual transfer learning and multilingual embeddings enable knowledge transfer from high-resource to low-resource languages, improving model effectiveness despite limited local resources.
This workshop seeks to examine how Big Data methodologies and LLM architectures can be leveraged together to advance NLP research across low-resource settings. Topics will span innovative data collection strategies, efficient training techniques, and practical applications of LLMs in multilingual and low-resource contexts. Particular attention will be paid to solutions that address data scarcity, promote generalisability, and preserve linguistic diversity.
Given the location of this year’s conference in Macau, a core emphasis will be placed on low-resource Asian languages, which pose unique challenges due to the scarcity of high-quality annotated resources. The workshop will spotlight research tackling these challenges, including annotation methodologies and regionally grounded collaborations. More broadly, we welcome submissions on all low-resource and underrepresented languages, as well as efforts that enhance the accessibility of LLMs across diverse languages, countries, domains, and social contexts—whether through inclusive data practices, novel modelling strategies, or real-world deployment under resource constraints. We also encourage contributions on multilingual LLMs, scalable architectures, and responsible AI, with the aim of fostering cross-disciplinary dialogue and catalysing future collaborations toward equitable and inclusive language technologies.
Important dates:
Meet the core team behind the LLMs, Big Data, and Multilinguality for All (LLMs4All) workshop
Address: CECS, VinUniversity, Hanoi, Vietnam
Email: elhaj.m@vinuni.edu.vn
Website: vinnlp.com