LLMs, Big Data, and Multilinguality for All (LLMs4All)

Introduction

The rapid evolution of Large Language Models (LLMs) has reshaped the landscape of natural language processing (NLP), enabling remarkable progress in tasks such as machine translation, text summarisation, question answering, and speech recognition. Yet, these advances remain unevenly distributed, with low-resource languages continuing to face significant barriers. Limited data availability, linguistic complexity, and the dominance of high-resource languages in public datasets have led to persistent disparities in LLM performance and accessibility—undermining efforts toward global linguistic inclusivity and digital equity.

The rise of Big Data offers new avenues to mitigate these challenges. By harnessing large-scale datasets from digital platforms, social media, online archives, and multilingual corpora, researchers are better positioned to build robust models that reflect the linguistic richness of underrepresented languages. Techniques such as web scraping, data augmentation, and scalable data pipelines have become instrumental in collecting and curating the diverse data necessary for training and fine-tuning LLMs. Moreover, advanced approaches like cross-lingual transfer learning and multilingual embeddings allow for the transfer of knowledge from high-resource to low-resource languages, improving model effectiveness despite limited local resources.

This workshop seeks to examine how Big Data methodologies and LLM architectures can be leveraged together to advance NLP research across low-resource settings. Topics will span innovative data collection strategies, efficient training techniques, and practical applications of LLMs in multilingual and low-resource contexts. Particular attention will be paid to solutions that address data scarcity, promote generalisability, and preserve linguistic diversity.

Given the location of this year’s conference in Macau, a core emphasis will be placed on low-resource Asian languages, which pose unique challenges due to the scarcity of high-quality annotated resources. The workshop will spotlight research tackling these challenges, including annotation methodologies and regionally grounded collaborations. More broadly, we welcome submissions on all low-resource and underrepresented languages, as well as efforts that enhance the accessibility of LLMs across diverse languages, countries, domains, and social contexts—whether through inclusive data practices, novel modelling strategies, or real-world deployment under resource constraints. We also encourage contributions on multilingual LLMs, scalable architectures, and responsible AI, with the aim of fostering cross-disciplinary dialogue and catalysing future collaborations toward equitable and inclusive language technologies.

Research Topics:

  • Scalable Data Collection and Curation: Efficient strategies for collecting, annotating, and managing large-scale multilingual datasets from diverse sources such as text, speech, video, and social media, particularly for underrepresented or emerging languages.
  • Cross-Lingual and Multilingual Learning: Techniques such as transfer learning, multilingual embeddings, and knowledge distillation to improve LLM performance for low-resource settings by leveraging high-resource language data.
  • Efficient and Inclusive Model Training: Approaches that reduce the resource burden of training and fine-tuning LLMs, including parameter-efficient fine-tuning (PEFT) methods such as LoRA, as well as quantisation and model distillation—enabling deployment in constrained environments.
  • Retrieval-Augmented Generation (RAG) and External Knowledge Integration: Enhancing LLMs with dynamic access to structured or unstructured external data to improve context relevance and factual grounding in multilingual use cases.
  • Multimodal Language Models: Developing and applying LLMs that integrate multiple modalities (e.g., text, image, audio) to support richer interaction in linguistically diverse environments.
  • Real-World Applications: Deployments of LLMs for machine translation, summarisation, conversational AI, question answering, and speech processing in low-resource languages or multilingual contexts.
  • Big Data Infrastructure and Pipelines: Architectures and tools for managing data at scale, including distributed processing, data lakes, and cloud-based platforms tailored for NLP research.
  • Ethical, Fair, and Culturally Aware AI: Research addressing linguistic bias, representation, digital colonialism, and fairness in LLMs, with a focus on inclusive design and evaluation practices.
  • Benchmarking and Evaluation: Creation of shared benchmarks, evaluation frameworks, and metrics that reflect the realities of low-resource and multilingual settings, including measures for faithfulness, robustness, and generalisability.
  • Regional Case Studies and Collaborative Initiatives: Projects and insights from Southeast Asia and beyond that demonstrate the challenges and successes of building and deploying LLMs in real-world low-resource scenarios, including healthcare, education, legal, and governmental domains.

Important dates:

  • Oct 1, 2025: Full workshop paper submission deadline
  • Nov 4, 2025: Notification of paper acceptance
  • Nov 23, 2025: Camera-ready paper submission
  • Dec 5-8, 2025: Workshop dates

LLMs4All

Meet the core team behind the LLMs, Big Data, and Multilinguality for All (LLMs4All) workshop