LLMs, Big Data, and Multilinguality for All (LLMs4All)

Workshop at IEEE BigData 2025 Conference, Macau – December 8–11, 2025

Submissions are now closed.

The workshop will take place on 10 December 2025 as a hybrid event (09:00–17:00 Macau local time).

To attend, please register for free using the webinar link:
Registration Link (Free)

Introduction

The rapid evolution of Large Language Models (LLMs) has reshaped the landscape of natural language processing (NLP), enabling remarkable progress in tasks such as machine translation, text summarisation, question answering, and speech recognition. Yet, these advances remain unevenly distributed, with low-resource languages continuing to face significant barriers. Limited data availability, linguistic complexity, and the dominance of high-resource languages in public datasets have led to persistent disparities in LLM performance and accessibility—undermining efforts toward global linguistic inclusivity and digital equity.

The rise of Big Data offers new avenues to mitigate these challenges. By harnessing large-scale datasets from digital platforms, social media, online archives, and multilingual corpora, researchers are better positioned to build robust models that reflect the linguistic richness of underrepresented languages. Techniques such as web scraping, data augmentation, and scalable data pipelines have become instrumental in collecting and curating the diverse data necessary for training and fine-tuning LLMs. Moreover, advanced approaches like cross-lingual transfer learning and multilingual embeddings allow for the transfer of knowledge from high-resource to low-resource languages, improving model effectiveness despite limited local resources.
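To make the idea of multilingual embeddings concrete, the short sketch below encodes sentences from two languages with one shared multilingual encoder and compares them in a common vector space. It is a minimal illustration only: the model choice (xlm-roberta-base), the mean-pooling step, and the example sentence pair are our assumptions, not methods prescribed by the workshop.

```python
# Minimal sketch: cross-lingual sentence similarity with a shared
# multilingual encoder. Model and pooling choices are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# The same encoder maps different languages into one space, so sentences
# from a high-resource and a low-resource language can be compared directly.
en = embed("The weather is nice today.")
pt = embed("O tempo está bom hoje.")
print(torch.cosine_similarity(en, pt).item())
```

A high similarity score for translation pairs is precisely what cross-lingual transfer methods exploit: supervision attached to the high-resource side of the space carries over to nearby vectors in other languages.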

This workshop seeks to examine how Big Data methodologies and LLM architectures can be leveraged together to advance NLP research across low-resource settings. Topics will span innovative data collection strategies, efficient training techniques, and practical applications of LLMs in multilingual and low-resource contexts. Particular attention will be paid to solutions that address data scarcity, promote generalisability, and preserve linguistic diversity.

Given the location of this year’s conference in Macau, a core emphasis will be placed on low-resource languages, which pose unique challenges due to the scarcity of high-quality annotated resources. The workshop will spotlight research tackling these challenges, including annotation methodologies and regionally grounded collaborations. More broadly, we welcome submissions on all low-resource and underrepresented languages, as well as efforts that enhance the accessibility of LLMs across diverse languages, countries, domains, and social contexts—whether through inclusive data practices, novel modelling strategies, or real-world deployment under resource constraints. We also encourage contributions on multilingual LLMs, scalable architectures, and responsible AI, with the aim of fostering cross-disciplinary dialogue and catalysing future collaborations toward equitable and inclusive language technologies.

Research Topics:

  • Scalable Data Collection and Curation: Efficient strategies for collecting, annotating, and managing large-scale multilingual datasets from diverse sources such as text, speech, video, and social media, particularly for underrepresented or emerging languages.
  • Cross-Lingual and Multilingual Learning: Techniques such as transfer learning, multilingual embeddings, and knowledge distillation to improve LLM performance for low-resource settings by leveraging high-resource language data.
  • Efficient and Inclusive Model Training: Approaches that reduce the resource burden of training and fine-tuning LLMs, including parameter-efficient fine-tuning (e.g., LoRA, PEFT), quantisation, and model distillation—enabling deployment in constrained environments (a minimal LoRA sketch appears after this list).
  • Retrieval-Augmented Generation (RAG) and External Knowledge Integration: Enhancing LLMs with dynamic access to structured or unstructured external data to improve context relevance and factual grounding in multilingual use cases (a minimal retrieval sketch appears after this list).
  • Multimodal Language Models: Developing and applying LLMs that integrate multiple modalities (e.g., text, image, audio) to support richer interaction in linguistically diverse environments.
  • Real-World Applications: Deployments of LLMs for machine translation, summarisation, conversational AI, question answering, and speech processing in low-resource languages or multilingual contexts.
  • Big Data Infrastructure and Pipelines: Architectures and tools for managing data at scale, including distributed processing, data lakes, and cloud-based platforms tailored for NLP research.
  • Ethical, Fair, and Culturally Aware AI: Research addressing linguistic bias, representation, digital colonialism, and fairness in LLMs, with a focus on inclusive design and evaluation practices.
  • Benchmarking and Evaluation: Creation of shared benchmarks, evaluation frameworks, and metrics that reflect the realities of low-resource and multilingual settings, including measures for faithfulness, robustness, and generalisability.
  • Regional Case Studies and Collaborative Initiatives: Projects and insights from Southeast Asia and beyond that demonstrate the challenges and successes of building and deploying LLMs in real-world low-resource scenarios, including healthcare, education, legal, and governmental domains.
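To ground the parameter-efficient fine-tuning topic above, here is a minimal LoRA sketch using the Hugging Face peft library. The base model (gpt2), the rank, and the target module are illustrative assumptions chosen to keep the example small and runnable, not recommendations.

```python
# Minimal LoRA sketch with Hugging Face's peft library. Base model, rank,
# and target modules are illustrative; real choices depend on the
# architecture and the compute budget.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
# Only the small adapter matrices are trainable; the base weights stay
# frozen, which is what makes fine-tuning feasible on constrained hardware.
model.print_trainable_parameters()
```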
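Likewise, to ground the RAG topic, a minimal retrieve-then-prompt sketch. The TF-IDF retriever, toy corpus, and query are assumptions chosen to keep the example self-contained; a deployed system would typically use a dense multilingual retriever and send the final prompt to an actual LLM.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant passage with TF-IDF, then splice it into the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Macau's official languages are Chinese and Portuguese.",
    "IEEE BigData 2025 takes place in Macau in December.",
    "LoRA reduces the number of trainable parameters in fine-tuning.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

query = "Which languages are official in Macau?"
query_vector = vectorizer.transform([query])

# Rank passages by similarity and keep the best one as grounding context.
scores = cosine_similarity(query_vector, doc_vectors)[0]
context = corpus[scores.argmax()]

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this assembled prompt would be sent to the LLM
```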

Paper Submission:

The submission deadline has passed and we are no longer accepting papers. Thank you to all authors who submitted their work; the accepted papers and the workshop schedule are listed below.

Important Dates:

  • First Call for Papers: 25 June 2025
  • Second Call for Papers: 28 September 2025
  • Final Call for Papers: 17 October 2025
  • Submission Deadline: 1 November 2025 (now closed)
  • Notification of Acceptance: 15 November 2025
  • Camera-Ready Deadline: 23 November 2025 (strict)

The workshop will take place on 10 December 2025, 09:00–17:00 (Macau local time).

Accepted Papers

  • IDR-RAG: An Iterative Draft-Revision Agent-like RAG Pipeline for Efficient and Accurate Knowledge Retrieval in Large-Scale Private Domains (Xia Jiaju, Cao Lei)
  • Polypersona: Persona-Grounded LLM for Synthetic Survey Responses (Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima Rafi, Rohith Tangudu)
  • Evaluation of Large Language Models for Understanding Counterfactual Reasoning in Texts (S. I. M. Adnan, Abrar Hameem, Shikha Anirban, Md. Saiful Islam, Md. Musfique Anwar)
  • How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation (Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa)
  • Scaling Classical NLP Pipelines for Under-Resourced Old English: Character-Level Models, Unsupervised Pretraining, and Supervised Data Growth (Ana Elvira Ojanguren López, Javier Martín Arista, Darío Metola Rodríguez)
  • Enhanced Old English NER via Morphology-Aware Analysis, Cross-Germanic Transfer, and Domain-Specific Patterns (Javier Martín Arista, Darío Metola Rodríguez, Daniel B. Morris)
  • Benchmarking LLM Optimization Strategies for Clinical NER: A Comparative Analysis of DSPy GEPA against Domain-Specific Transformers (Justin Varghese, Yi Shang)
  • Arabic Prompts with English Tools: A Benchmark (Konstantin Kubrak, Ahmed El-Moselhy, Ammar Alsulami, Remaz Altuwaim, Hassan Ismail Fawaz, Faisal Alsaby)
  • AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs (Mo El-Haj, Paul Rayson)
  • Temporal-Aware RAG for Multilingual ESG Document Retrieval: A Low-Resource Approach to Time-Sensitive Question Answering (Nguyen Anh Kiet Truong)
  • UniFi-LLM: A Unified Large Language Model for Financial Data Generation and Fraud Prediction (Giridhar Pamisetty, Subbareddy Batreddy, Priya Verma, Sobhan Babu Chintapalli)
  • Tackling Low-Resource K-12 Hand-Drawn Mathematics VQA: Unified Regularization with Compute-Aware Expert Token Architecture (Hai Li, Wanli Xing, Chenglu Li, Bailing Lyu)
  • Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis (Felipe Ribeiro Fujita de Mello, Hideyuki Takada)
  • Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models (Haomin Qi, Chengbo Huang, Zihan Dai, Yunkai Gao)
  • Leveraging LLM Agents for Autonomous Web Penetration Testing Targeting SQL Injection Vulnerability (Thanh Phong Tran, Le Bao Phuc Nguyen, Trong Nghia To, Van Hau Pham, The Duy Phan)
  • XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs (Iñaki Lacunza, José Javier Saiz, Alexander Shvets, Aitor Gonzalez-Agirre, Marta Villegas)
  • Arabic OCR in the Age of Multimodal Models: A Comprehensive Comparative Evaluation (Hossam Elsafty, Farizeh Aldabbas, Rafet Sifa)
  • Copyright Infringement Issues and Mitigations in Data for Training Generative AI (Anna Arnaudo, Riccardo Coppola, Maurizio Morisio, Antonio Vetrò, Maurizio Borghi, Bryan Khan, Riccardo Raso)
  • Cluster-aware Item Prompt Learning for Session-based Recommendation (Wooseong Yang, Chen Wang, Zihe Song, Weizhi Zhang, Philip S. Yu)
  • Fine-tuning Large-Language-Models using Federated Learning & Blockchain (Soham Ratnaparkhi, Saeed Samet)
  • Targeted Knowledge Enhancement: A Systematic Continual Pre-training Approach for Effective Domain Adaptation (Yiqun Wang, Chaoqun Wan, Xiang Tian, Xuesong Liu, Yaowu Chen)

Workshop Schedule (10 December 2025)

  • 09:00–09:15  Evaluation of Large Language Models for Understanding Counterfactual Reasoning in Texts (Md. Saiful Islam)
  • 09:15–09:30  Scaling Classical NLP Pipelines for Under-Resourced Old English: Character-Level Models, Unsupervised Pretraining, and Supervised Data Growth (Ana Elvira Ojanguren López)
  • 09:30–09:45  Polypersona: Persona-Grounded LLM for Synthetic Survey Responses (Anudeep Vurity)
  • 09:45–10:00  IDR-RAG: An Iterative Draft-Revision Agent-like RAG Pipeline (Cao Lei)
  • 10:00–10:15  Benchmarking LLM Optimisation Strategies for Clinical NER (Justin Varghese)
  • 10:15–10:30  Targeted Knowledge Enhancement (Yiqun Wang)
  • 10:30–11:00  Coffee break
  • 11:00–11:15  Enhanced Old English NER (Javier Martín Arista)
  • 11:15–11:30  Arabic Prompts with English Tools: A Benchmark (Konstantin Kubrak)
  • 11:30–11:45  UniFi-LLM (Giridhar Pamisetty)
  • 11:45–12:00  Federated Learning & Blockchain (Soham Ratnaparkhi)
  • 12:00–12:15  LLM Agents for Web Penetration Testing (Thanh Phong Tran)
  • 12:15–12:30  Temporal-Aware RAG for Multilingual ESG (Nguyen Anh Kiet Truong)
  • 12:30–14:00  Lunch break
  • 14:00–14:15  AraFinNews: Arabic Financial Summarisation (Mo El-Haj)
  • 14:15–14:30  XDoGE: Multilingual Data Reweighting (Iñaki Lacunza)
  • 14:30–14:45  Copyright Infringement in Generative AI (Anna Arnaudo)
  • 14:45–15:00  Arabic OCR in the Age of Multimodal Models (Hossam Elsafty)
  • 15:00–15:15  Compact LMs for On-Device MT Error Detection (Muskaan Chopra)
  • 15:15–15:30  Better Data Selection for LLM Fine-Tuning (Felipe Ribeiro Fujita de Mello)
  • 15:30–16:00  Coffee break
  • 16:00–16:15  Governance-Aware Hybrid Fine-Tuning (Haomin Qi)
  • 16:15–16:30  Low-Resource K-12 Math VQA (Hai Li)
  • 16:30–16:45  Cluster-aware Item Prompt Learning (Wooseong Yang)

LLMs4All

Meet the core team behind the LLMs, Big Data, and Multilinguality for All (LLMs4All) workshop.