Paper submissions are now closed.
The workshop will take place on 10 December 2025 as a hybrid event (09:00–17:00 Macau local time).
To attend, please register for free using the webinar link:
Registration Link (Free)
The rapid evolution of Large Language Models (LLMs) has reshaped the landscape of natural language processing (NLP), enabling remarkable progress in tasks such as machine translation, text summarisation, question answering, and speech recognition. Yet, these advances remain unevenly distributed, with low-resource languages continuing to face significant barriers. Limited data availability, linguistic complexity, and the dominance of high-resource languages in public datasets have led to persistent disparities in LLM performance and accessibility—undermining efforts toward global linguistic inclusivity and digital equity.
The rise of Big Data offers new avenues to mitigate these challenges. By harnessing large-scale datasets from digital platforms, social media, online archives, and multilingual corpora, researchers are better positioned to build robust models that reflect the linguistic richness of underrepresented languages. Techniques such as web scraping, data augmentation, and scalable data pipelines have become instrumental in collecting and curating the diverse data necessary for training and fine-tuning LLMs. Moreover, advanced approaches like cross-lingual transfer learning and multilingual embeddings allow for the transfer of knowledge from high-resource to low-resource languages, improving model effectiveness despite limited local resources.
This workshop seeks to examine how Big Data methodologies and LLM architectures can be leveraged together to advance NLP research across low-resource settings. Topics will span innovative data collection strategies, efficient training techniques, and practical applications of LLMs in multilingual and low-resource contexts. Particular attention will be paid to solutions that address data scarcity, promote generalisability, and preserve linguistic diversity.
Given this year’s conference location in Macau, a core emphasis will be placed on low-resource languages of the region, which pose unique challenges due to the scarcity of high-quality annotated resources. The workshop will spotlight research tackling these challenges, including annotation methodologies and regionally grounded collaborations. More broadly, we welcome submissions on all low-resource and underrepresented languages, as well as efforts that enhance the accessibility of LLMs across diverse languages, countries, domains, and social contexts—whether through inclusive data practices, novel modelling strategies, or real-world deployment under resource constraints. We also encourage contributions on multilingual LLMs, scalable architectures, and responsible AI, with the aim of fostering cross-disciplinary dialogue and catalysing future collaborations toward equitable and inclusive language technologies.
Paper Submission:
The submission deadline has passed and we are no longer accepting papers.
Thank you to all authors who submitted their work. Accepted papers are now listed below along with the workshop schedule.
Important Dates:
The workshop will take place on 10 December 2025, 09:00–17:00 (Macau local time).
Accepted Papers:
| Paper Title | Authors |
|---|---|
| IDR-RAG: An Iterative Draft-Revision Agent-like RAG Pipeline for Efficient and Accurate Knowledge Retrieval in Large-Scale Private Domains | Xia Jiaju, Cao Lei |
| Polypersona: Persona-Grounded LLM for Synthetic Survey Responses | Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima Rafi, Rohith Tangudu |
| Evaluation of Large Language Models for Understanding Counterfactual Reasoning in Texts | S. I. M. Adnan, Abrar Hameem, Shikha Anirban, Md. Saiful Islam, Md. Musfique Anwar |
| How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation | Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa |
| Scaling Classical NLP Pipelines for Under-Resourced Old English: Character-Level Models, Unsupervised Pretraining, and Supervised Data Growth | Ana Elvira Ojanguren López, Javier Martín Arista, Darío Metola Rodríguez |
| Enhanced Old English NER via Morphology-Aware Analysis, Cross-Germanic Transfer, and Domain-Specific Patterns | Javier Martín Arista, Darío Metola Rodríguez, Daniel B. Morris |
| Benchmarking LLM Optimization Strategies for Clinical NER: A Comparative Analysis of DSPy GEPA against Domain-Specific Transformers | Justin Varghese, Yi Shang |
| Arabic Prompts with English Tools: A Benchmark | Konstantin Kubrak, Ahmed El-Moselhy, Ammar Alsulami, Remaz Altuwaim, Hassan Ismail Fawaz, Faisal Alsaby |
| AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs | Mo El-Haj, Paul Rayson |
| Temporal-Aware RAG for Multilingual ESG Document Retrieval: A Low-Resource Approach to Time-Sensitive Question Answering | Nguyen Anh Kiet Truong |
| UniFi-LLM: A Unified Large Language Model for Financial Data Generation and Fraud Prediction | Giridhar Pamisetty, Subbareddy Batreddy, Priya Verma, Sobhan Babu Chintapalli |
| Tackling Low-Resource K-12 Hand-Drawn Mathematics VQA: Unified Regularization with Compute-Aware Expert Token Architecture | Hai Li, Wanli Xing, Chenglu Li, Bailing Lyu |
| Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis | Felipe Ribeiro Fujita de Mello, Hideyuki Takada |
| Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models | Haomin Qi, Chengbo Huang, Zihan Dai, Yunkai Gao |
| Leveraging LLM Agents for Autonomous Web Penetration Testing Targeting SQL Injection Vulnerability | Thanh Phong Tran, Le Bao Phuc Nguyen, Trong Nghia To, Van Hau Pham, The Duy Phan |
| XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs | Iñaki Lacunza, José Javier Saiz, Alexander Shvets, Aitor Gonzalez-Agirre, Marta Villegas |
| Arabic OCR in the Age of Multimodal Models: A Comprehensive Comparative Evaluation | Hossam Elsafty, Farizeh Aldabbas, Rafet Sifa |
| Copyright Infringement Issues and Mitigations in Data for Training Generative AI | Anna Arnaudo, Riccardo Coppola, Maurizio Morisio, Antonio Vetrò, Maurizio Borghi, Bryan Khan, Riccardo Raso |
| Cluster-aware Item Prompt Learning for Session-based Recommendation | Wooseong Yang, Chen Wang, Zihe Song, Weizhi Zhang, Philip S. Yu |
| Fine-tuning Large-Language-Models using Federated Learning & Blockchain | Soham Ratnaparkhi, Saeed Samet |
| Targeted Knowledge Enhancement: A Systematic Continual Pre-training Approach for Effective Domain Adaptation | Yiqun Wang, Chaoqun Wan, Xiang Tian, Xuesong Liu, Yaowu Chen |
Workshop Schedule:
| Time | Title | Presenter |
|---|---|---|
| 09:00–09:15 | Evaluation of Large Language Models for Understanding Counterfactual Reasoning in Texts | Md. Saiful Islam |
| 09:15–09:30 | Scaling Classical NLP Pipelines for Under-Resourced Old English: Character-Level Models, Unsupervised Pretraining, and Supervised Data Growth | Ana Elvira Ojanguren López |
| 09:30–09:45 | Polypersona: Persona-Grounded LLM for Synthetic Survey Responses | Anudeep Vurity |
| 09:45–10:00 | IDR-RAG: An Iterative Draft-Revision Agent-like RAG Pipeline | Cao Lei |
| 10:00–10:15 | Benchmarking LLM Optimization Strategies for Clinical NER | Justin Varghese |
| 10:15–10:30 | Targeted Knowledge Enhancement | Yiqun Wang |
| 10:30–11:00 | Coffee break | |
| 11:00–11:15 | Enhanced Old English NER | Javier Martín Arista |
| 11:15–11:30 | Arabic Prompts with English Tools: A Benchmark | Konstantin Kubrak |
| 11:30–11:45 | UniFi-LLM | Giridhar Pamisetty |
| 11:45–12:00 | Federated Learning & Blockchain | Soham Ratnaparkhi |
| 12:00–12:15 | LLM Agents for Web Penetration Testing | Thanh Phong Tran |
| 12:15–12:30 | Temporal-Aware RAG for Multilingual ESG | Nguyen Anh Kiet Truong |
| 12:30–14:00 | Lunch break | |
| 14:00–14:15 | AraFinNews: Arabic Financial Summarisation | Mo El-Haj |
| 14:15–14:30 | XDoGE: Multilingual Data Reweighting | Iñaki Lacunza |
| 14:30–14:45 | Copyright Infringement in Generative AI | Anna Arnaudo |
| 14:45–15:00 | Arabic OCR in the Age of Multimodal Models | Hossam Elsafty |
| 15:00–15:15 | Compact LMs for On-Device MT Error Detection | Muskaan Chopra |
| 15:15–15:30 | Better Data Selection for LLM Fine-Tuning | Felipe Ribeiro Fujita de Mello |
| 15:30–16:00 | Coffee break | |
| 16:00–16:15 | Governance-Aware Hybrid Fine-Tuning | Haomin Qi |
| 16:15–16:30 | Low-Resource K-12 Math VQA | Hai Li |
| 16:30–16:45 | Cluster-aware Item Prompt Learning | Wooseong Yang |
Meet the core team behind the LLMs, Big Data, and Multilinguality for All (LLMs4All) workshop.
Address: CECS, VinUniversity, Hanoi, Vietnam
Email: elhaj.m@vinuni.edu.vn
Website: NLP @ VinUniversity