publications | Shahriyar Zaman Ridoy

2026

ACL ARR

From Language Specifications to Executable Turing Machines: Evaluating LLMs as Computational Machine Designers

Shahriyar Zaman Ridoy, S. M. Muhtasimul Hasan, Azmine Toushik Wasi, and 3 more authors

2026

ACL ARR 2026 May Submission (Under Review)
ACL ARR

Can LLMs Design Computational Machines? Pushdown Automaton Synthesis as a Test of Structured Computational Reasoning

Syed Mumtahin Mahmud, Nazira Jesmin Lina, Shahriyar Zaman Ridoy, and 2 more authors

2026

ACL ARR 2026 March Submission; to be committed to EMNLP 2026
ACL ARR

Can LLMs Follow the Pulse of a Crisis? Evaluating Crisis Sentiment in Bangladesh’s July Uprising

Md. Samiul Alim, Mahir Shahriar Tamim, Tanvir Ahmed Khan, and 4 more authors

2026

ACL ARR 2026 May Submission (Under Review)
ACL ARR

Beyond Build Validity: Evaluating Behavioral Correctness in LLM-Generated Context-Free Grammars

S. M. Muhtasimul Hasan^*, Shahriyar Zaman Ridoy^*, Syed Mumtahin Mahmud, and 2 more authors

2026

ACL ARR 2026 May Submission (Under Review)
ACL ARR

Geo-spatial and Geo-temporal Reasoning in Vision–Language and Large Language Models: A Review

Shahriyar Zaman Ridoy^*, Azmine Toushik Wasi^*, Koushik Ahamed Tonmoy, and 1 more author

2026

ACL ARR 2026 March Submission (Under Review)
NeurIPS

SEISMOS: A Statistical Signal Detection Framework for Semantic Chunking

Ishtiak Mahmud Saad, Mominul Islam, Shahriyar Zaman Ridoy, and 2 more authors

2026

NeurIPS 2026 Conference Submission (Under Review)
NeurIPS

InvariantBench: Can Large Language Models Exhibit Inherent Reasoning Across Equivalent Transformations?

Azmine Toushik Wasi, Mahir Absar Khan, Abdur Rahman, and 14 more authors

2026

NeurIPS 2026 Evaluations and Datasets Track Submission (Under Review)
ICML main track

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision–Language Models in Real-World Settings

Azmine Toushik Wasi^*, Shahriyar Zaman Ridoy^*, Koushik Ahamed Tonmoy, and 5 more authors

In International Conference on Machine Learning (ICML), 2026

Accepted

Geo-temporal understanding, the ability to identify the location, time, and contextual features of an image from visual cues alone, is a fundamental aspect of human intelligence with wide-ranging applications, from disaster response to autonomous navigation and geography education. While recent vision–language models (VLMs) have shown progress in image geo-localization using conspicuous cues like landmarks or road signs, their ability to understand temporal signals and related spatial reasoning cues remains underexplored. To address this gap, we introduce TimeSpot, a comprehensive benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 images spanning 80 countries, where models must infer temporal attributes (season, month, time of day, daylight phase) and geolocation attributes (continent, country, climate zone, environment type, latitude–longitude coordinates) directly from the visual input. In addition, it includes spatial reasoning tasks that require integrating geographical, spatial, and temporal cues to solve complex understanding problems. Unlike prior benchmarks that emphasize obvious cues or iconic imagery, TimeSpot prioritizes diverse and subtle settings, reflecting the difficulty of reasoning under real-world uncertainty. Our evaluation of state-of-the-art VLMs, including both open- and closed-source models, reveals consistently low performance across tasks, highlighting substantial challenges in achieving robust temporal and geographic reasoning. These findings underscore the pressing need for improved methods to enable reliable and trustworthy geo-temporal understanding in VLMs, paving the way for future research in this critical domain.
ICLR

SpatiaLab: Can Vision–Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, and 12 more authors

In International Conference on Learning Representations (ICLR), 2026

Accepted

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision–language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs’ spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question–answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10–25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs’ spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding.
ACM FAccT

BengaliMoralBench: A Benchmark for Auditing Ethical Reasoning in Large Language Models in Bengali Language and Culture

Shahriyar Zaman Ridoy^*, Azmine Toushik Wasi^*, Koushik Ahamed Tonmoy, and 2 more authors

2026

Accepted at ACM FAccT 2026

As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50–91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.
Position: Real-World Clinical NLP Robustness Requires Messy, Multimodal, Longitudinal, Privacy-Preserving Corpora

Azmine Toushik Wasi^* and Shahriyar Zaman Ridoy^*

2026

EurIPS 2025 Workshop MMRL4H

Real-world healthcare data is inherently complex, riddled with noise, incompleteness, heterogeneity, and temporal irregularities, making the development of clinically robust AI systems particularly challenging. Traditional AI models in healthcare are often developed on idealized, clean datasets that fail to reflect the operational realities of clinical settings. This limits their generalizability and effectiveness when deployed in real-world environments. There is a critical need for a paradigm shift toward methodologies that embrace the "messiness" of clinical data while ensuring patient privacy and ethical integrity. Current models and pipelines are ill-equipped to manage this complexity in a scalable, trustworthy manner. We introduce the concept of the "Messy Clinic" dataset and a corresponding blueprint designed to guide the development of healthcare AI systems grounded in real-world data characteristics. This framework supports seamless multimodal data integration, spanning EHRs, imaging, genomics, and wearables, while incorporating advanced privacy-preserving techniques such as federated learning, synthetic data generation, and differential privacy. The blueprint offers (1) a structured methodology for longitudinal, multimodal data management; (2) robust AI/ML techniques tailored to noisy and incomplete inputs; and (3) a comprehensive governance framework addressing ethical concerns like consent, accountability, and bias mitigation. Embracing the "Messy Clinic" paradigm represents a foundational shift in healthcare AI development, from artificial idealism to real-world robustness. This approach promises to accelerate medical research, enable personalized care, and ultimately transform healthcare delivery by aligning AI systems with the authentic complexity of clinical practice.
ICML workshop

When Machines Decide If a Human Wrote It: Creativity in the Age of AI Detectors

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, and Md Manjurul Ahsan

In ICML 2026 Workshop on Human-AI Co-Creativity, 2026

Workshop paper

AI detectors, originally developed to flag machine-generated content, are now widely used to evaluate human writing in educational and creative contexts. By relying on rigid linguistic metrics and statistical norms, these systems subtly reshape human expression through algorithmic conformity. Acting as gatekeepers of authentic expression, they privilege conformity over originality and disproportionately penalize non-native English speakers and marginalized voices, causing psychological harm, academic penalties, and cultural erasure. This paper argues that such systems impose an algorithmic aesthetic that suppresses rebellion, hinders discovery, and diminishes the joy of creation. Reframing the issue as one of civil rights and human flourishing, we propose three interventions: restricting AI detectors in creative and educational spaces, promoting glitch aesthetics to valorize imperfection, and protecting creative anonymity to foster free experimentation. Drawing on legal and cultural policy precedents, we contend that these measures are essential to safeguarding human imagination. Ultimately, we advocate for institutional safeguards that champion risk, surprise, and dissent as vital components of a thriving creative future.

2025

EMNLP-Industry

CAPSTONE: Composable Attribute-Prompted Scene Translation for Zero-Shot Vision–Language Reasoning

Md. Ismail Hossain^*, Shahriyar Zaman Ridoy^*, Moshiur Farazi, and 2 more authors

In EMNLP Industry, 2025

Accepted

Interpreting visual scenes with high-level reasoning is essential for many real-world applications, such as autonomous systems and content moderation, but training and maintaining Vision–Language Models (VLMs) remains resource-intensive and opaque. In this work, we present CAPSTONE, a lightweight, modular framework designed for industrial settings. Instead of relying on multimodal training or fine-tuning large models, CAPSTONE transforms outputs from off-the-shelf vision models into structured text prompts that can be interpreted by a frozen Large Language Model (LLM). This plug-and-play architecture enables reasoning over visual input without access to raw pixels, dramatically reducing computational cost and complexity. On the POPE dataset, our system, using a 7B LLM, outperforms several fully trained VLMs in zero-shot evaluations, while on the VSR benchmark, the 4B model achieves competitive results, together demonstrating strong generalization without retraining. CAPSTONE offers a scalable and interpretable alternative for companies looking to integrate visual reasoning capabilities without the burden of full-scale VLM pipelines.
Array

PANCHINI: Predictive Analytics for Child Marriage in Bangladesh Using Machine Learning Insights

Jawad Ibn Ahad, Zarin Akter, Shahriyar Zaman Ridoy, and 1 more author

2025

Under review at the Array journal

Child marriage remains a significant social issue, particularly in developing countries such as Bangladesh, where deep-rooted cultural traditions and socio-economic disparities perpetuate the practice despite legal restrictions and awareness campaigns. This study explores the underlying factors contributing to child marriage among Bangladeshi women, especially in rural and underdeveloped regions where traditional intervention strategies have struggled. By leveraging machine learning (ML) and natural language processing (NLP), we develop a predictive model to assess the likelihood of child marriage from individual demographic profiles. Our analysis focuses on key socio-economic, educational, and cultural factors, offering insights into regional variations across Bangladesh. We construct a novel dataset that combines macro- and micro-level socio-economic variables tailored to this problem. Model performance is evaluated across multiple algorithms, with ensemble Blending—particularly the EMNet configuration—achieving the strongest results, including accuracy (94.80%) and AUC (97.58%). We also evaluate large language models (LLMs) as prompt-based classifiers on relevant textual fields and as assistants for schema normalization and feature enrichment, benchmarking their outputs against supervised ML baselines. This data- and language-driven approach improves precision in identifying high-risk individuals and communities and yields actionable insights for policymakers. By integrating predictive analytics and LLM-based evaluation into intervention design, the work supports more targeted, effective policies to mitigate child marriage and promote long-term social empowerment.
SNCS

Context-Aware Data Cleaning: Optimizing Bengali Text for Contextual Text Classification

Moshiur Rahman Faisal, Abdur Rahman Fahad, Shahriyar Zaman Ridoy, and 5 more authors

SN Computer Science, 2025

In Natural Language Processing (NLP), textual data is foundational, yet it presents substantial challenges, especially for under-resourced languages like Bengali. The complexity and volume of Bengali textual data require sophisticated data cleaning techniques. Traditional methods often neglect critical contextual information essential for effective textual analysis. This study highlights the need for context-aware data cleaning, a methodology that maintains linguistic context while removing noise. The study compares context-aware and traditional data cleaning approaches tailored for Bengali text to improve the performance of contextual transformer-based models. Conventional techniques in this study include symbol and punctuation removal, stop-word elimination, stemming, and removing HTML tags or URLs. In contrast, context-aware techniques involve spelling correction, tagging HTML and URLs, preserving punctuation and emojis, and selectively removing less important words using TF-IDF. The current initiative assesses the impact of these strategies through rigorous dataset curation and extensive training in machine learning, deep learning, and transformer-based models on four prominent Bengali datasets: BEmoC, SentNoB, UBMEC, and EmoNoBa. Results show context-aware data cleaning significantly outperforms traditional methods, particularly in enhancing transformer-based model performance. The developed context-aware data cleaning pipeline integrates various techniques, achieving a baseline accuracy improvement of up to 4% across three of the four datasets. These findings underscore the importance of preserving sentence-level context in Bengali for optimal NLP performance while minimizing noise. Additionally, the research introduces a novel context-aware data cleaning pipeline and provides detailed algorithms for its implementation, advancing NLP research and applications in Bengali and similar linguistic contexts.
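
The contrast the abstract draws between conventional and context-aware cleaning can be illustrated with a minimal sketch. This is not the paper's pipeline: the function names, the `[HTML]` and `[URL]` placeholder tags, and the regular expressions are assumptions made for illustration, and the spelling-correction and TF-IDF word-selection steps are omitted.

```python
import re
import unicodedata

# Hypothetical patterns; the published pipeline may detect these differently.
HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")


def _squeeze(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def traditional_clean(text: str) -> str:
    """Conventional cleaning: delete HTML tags, URLs, punctuation, and symbols."""
    text = HTML_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    # Drop Unicode punctuation (P*) and symbols (S*, which includes emojis).
    # Category-based filtering keeps Bengali vowel signs (M*) intact.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    return _squeeze(text)


def context_aware_clean(text: str) -> str:
    """Context-aware cleaning: tag HTML and URLs, keep punctuation and emojis."""
    text = HTML_RE.sub(" [HTML] ", text)
    text = URL_RE.sub(" [URL] ", text)
    return _squeeze(text)


if __name__ == "__main__":
    sample = "Great movie!!! <b>Watch</b> it at https://example.com :)"
    print(traditional_clean(sample))
    print(context_aware_clean(sample))
```

The context-aware variant keeps the signals a transformer tokenizer can still use (emphasis punctuation, emojis, the fact that a link or markup was present), while the conventional variant discards them.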

2024

IEEE BigData

EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code

Shahriyar Zaman Ridoy, Md. Shazzad Hossain Shaon, Alfredo Cuzzocrea, and 1 more author

In Proceedings of the IEEE International Conference on Big Data (IEEE BigData), 2024

Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques. Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding: CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. By fine-tuning these models on the Draper VDISC dataset and integrating their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack effectively captures intricate code patterns and vulnerabilities that individual models may overlook. The meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities across diverse programming contexts. Experimental results demonstrate that EnStack significantly outperforms existing methods, achieving notable improvements in accuracy, precision, recall, and F1-score. This work highlights the potential of ensemble LLM approaches in code analysis tasks and offers valuable insights into applying NLP techniques for advancing automated vulnerability detection.
IEEE IS (🏆 Best Paper)

An Efficient Text Cleaning Pipeline for Clinical Text for Transformer Encoder Models

Shahriyar Zaman Ridoy, Jannat Sultana, Zinnat Fowzia Ria, and 3 more authors

In IEEE 12th International Conference on Intelligent Systems (IS), 2024

Awarded: https://drive.google.com/file/d/1Jx1ypVfoxjk7_C7jL1nq6U5x2kbybVVh/view

Choosing the best text preprocessing strategy in natural language processing (NLP) can be challenging because of the variety of techniques available. Given the popularity of transformer models, we asked whether preprocessing is necessary and, if so, which methods improve the models’ performance. Accuracy is especially crucial when working with clinical text data. Our goal was to find an appropriate preprocessing pipeline for clinical texts that maintains or improves model performance. We experimented with four common preprocessing techniques and their combinations on two datasets from MIMIC-3 and PubMed. We used four models: BERT base, BioBERT, BioClinicalBERT, and RoBERTa. The varied accuracy results from existing techniques inspired us to develop a new pipeline to improve accuracy. Our pipeline starts by removing repeated punctuation, then normalizes the text with a CleanText function, and finally filters less important words using TF-IDF scores to keep clinically relevant terms and reduce noise. Our results showed that our pipeline outperformed the base models. For the MIMIC-3 dataset, the BERT base model achieved 90.16% accuracy, and for the PubMed dataset, BioBERT achieved 64.20% accuracy. We also found that removing stop words decreased accuracy, while using TF-IDF either maintained accuracy or improved it by up to 3%. Additionally, because it removes less important words from the documents, our pipeline reduced training time by up to 17%.
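
The three pipeline stages named in the abstract (collapsing repeated punctuation, normalizing, and filtering by TF-IDF) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `normalize` is a hypothetical stand-in for the paper's CleanText function, the smoothed IDF formula and the per-document `keep_ratio` cutoff are choices made here, and tokenization is plain whitespace splitting.

```python
import math
import re
from collections import Counter


def collapse_repeated_punctuation(text: str) -> str:
    """Turn runs such as '!!!' or '...' into a single mark."""
    return re.sub(r"([!?.,;:\-])\1+", r"\1", text)


def normalize(text: str) -> str:
    """Hypothetical stand-in for CleanText: lowercase and squeeze whitespace."""
    return " ".join(text.lower().split())


def tfidf_filter(docs: list[str], keep_ratio: float = 0.8) -> list[str]:
    """Keep the top `keep_ratio` share of each document's unique tokens by TF-IDF."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    filtered = []
    for toks in tokenized:
        if not toks:
            filtered.append("")
            continue
        counts = Counter(toks)
        # Smoothed IDF (assumed here): log((1 + N) / (1 + df)) + 1.
        scores = {
            tok: (cnt / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        }
        n_keep = max(1, math.ceil(keep_ratio * len(scores)))
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keep = set(ranked[:n_keep])
        # Preserve the original token order for the tokens that survive.
        filtered.append(" ".join(t for t in toks if t in keep))
    return filtered


def clean_corpus(docs: list[str], keep_ratio: float = 0.8) -> list[str]:
    """Run the three stages in order over a list of documents."""
    prepared = [normalize(collapse_repeated_punctuation(d)) for d in docs]
    return tfidf_filter(prepared, keep_ratio)


if __name__ == "__main__":
    notes = [
        "The patient has fever!!!",
        "The patient has cough...",
        "The patient denies pain",
    ]
    for line in clean_corpus(notes, keep_ratio=0.5):
        print(line)
```

Because tokens shared by every document receive the lowest IDF, words such as "the" and "patient" in the toy notes are dropped first, which shortens inputs without a fixed stop-word list.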