LLMs, Big Data, and Multilinguality for All (LLMs4All)

Introduction

The rapid evolution of Large Language Models (LLMs) has reshaped the landscape of natural language processing (NLP), enabling remarkable progress in tasks such as machine translation, text summarisation, question answering, and speech recognition. Yet, these advances remain unevenly distributed, with low-resource languages continuing to face significant barriers. Limited data availability, linguistic complexity, and the dominance of high-resource languages in public datasets have led to persistent disparities in LLM performance and accessibility—undermining efforts toward global linguistic inclusivity and digital equity.

The rise of Big Data offers new avenues to mitigate these challenges. By harnessing large-scale datasets from digital platforms, social media, online archives, and multilingual corpora, researchers are better positioned to build robust models that reflect the linguistic richness of underrepresented languages. Techniques such as web scraping, data augmentation, and scalable data pipelines have become instrumental in collecting and curating the diverse data necessary for training and fine-tuning LLMs. Moreover, advanced approaches like cross-lingual transfer learning and multilingual embeddings allow for the transfer of knowledge from high-resource to low-resource languages, improving model effectiveness despite limited local resources.

This workshop seeks to examine how Big Data methodologies and LLM architectures can be leveraged together to advance NLP research across low-resource settings. Topics will span innovative data collection strategies, efficient training techniques, and practical applications of LLMs in multilingual and low-resource contexts. Particular attention will be paid to solutions that address data scarcity, promote generalisability, and preserve linguistic diversity.

Given the location of this year’s conference in Macau, a core emphasis will be placed on low-resource Asian languages, which pose unique challenges due to the scarcity of high-quality annotated resources. The workshop will spotlight research tackling these challenges, including annotation methodologies and regionally grounded collaborations. More broadly, we welcome submissions on all low-resource and underrepresented languages, as well as efforts that enhance the accessibility of LLMs across diverse languages, countries, domains, and social contexts—whether through inclusive data practices, novel modelling strategies, or real-world deployment under resource constraints. We also encourage contributions on multilingual LLMs, scalable architectures, and responsible AI, with the aim of fostering cross-disciplinary dialogue and catalysing future collaborations toward equitable and inclusive language technologies.

Research Topics:

  • Scalable Data Collection and Curation: Efficient strategies for collecting, annotating, and managing large-scale multilingual datasets from diverse sources such as text, speech, video, and social media, particularly for underrepresented or emerging languages.
  • Cross-Lingual and Multilingual Learning: Techniques such as transfer learning, multilingual embeddings, and knowledge distillation to improve LLM performance for low-resource settings by leveraging high-resource language data.
  • Efficient and Inclusive Model Training: Approaches that reduce the resource burden of training and fine-tuning LLMs, including parameter-efficient fine-tuning (PEFT) methods such as LoRA, as well as quantisation and model distillation—enabling deployment in constrained environments.
  • Retrieval-Augmented Generation (RAG) and External Knowledge Integration: Enhancing LLMs with dynamic access to structured or unstructured external data to improve context relevance and factual grounding in multilingual use cases.
  • Multimodal Language Models: Developing and applying LLMs that integrate multiple modalities (e.g., text, image, audio) to support richer interaction in linguistically diverse environments.
  • Real-World Applications: Deployments of LLMs for machine translation, summarisation, conversational AI, question answering, and speech processing in low-resource languages or multilingual contexts.
  • Big Data Infrastructure and Pipelines: Architectures and tools for managing data at scale, including distributed processing, data lakes, and cloud-based platforms tailored for NLP research.
  • Ethical, Fair, and Culturally Aware AI: Research addressing linguistic bias, representation, digital colonialism, and fairness in LLMs, with a focus on inclusive design and evaluation practices.
  • Benchmarking and Evaluation: Creation of shared benchmarks, evaluation frameworks, and metrics that reflect the realities of low-resource and multilingual settings, including measures for faithfulness, robustness, and generalisability.
  • Regional Case Studies and Collaborative Initiatives: Projects and insights from Southeast Asia and beyond that demonstrate the challenges and successes of building and deploying LLMs in real-world low-resource scenarios, including healthcare, education, legal, and governmental domains.

Important dates:

  • Oct 1, 2025: Full workshop paper submission deadline
  • Nov 4, 2025: Notification of paper acceptance
  • Nov 23, 2025: Camera-ready paper submission
  • Dec 5-8, 2025: Workshop dates

LLMs4All

Meet the core team behind the LLMs, Big Data, and Multilinguality for All (LLMs4All) workshop