publications
publications by category in reverse chronological order. * denotes joint first author.
2026
- ACL ARR
Can LLMs Design Computational Machines? Pushdown Automaton Synthesis as a Test of Structured Computational Reasoning
Syed Mumtahin Mahmud, Nazira Jesmin Lina, Shahriyar Zaman Ridoy, and 2 more authors
2026
ACL ARR 2026 March Submission (Under Review)
- ACL ARR
Geo-spatial and Geo-temporal Reasoning in Vision–Language and Large Language Models: A Review
Shahriyar Zaman Ridoy*, Azmine Toushik Wasi*, Koushik Ahamed Tonmoy, and 1 more author
2026
ACL ARR 2026 March Submission (Under Review)
- ACL ARR
CFGBench: Evaluating Language Correctness Beyond Syntactic Validity in LLM-Generated Context-Free Grammars
S. M. Muhtasimul Hasan*, Shahriyar Zaman Ridoy*, Syed Mumtahin Mahmud, and 2 more authors
2026
ACL ARR 2026 March Submission (Under Review)
- ICML main track
TimeSpot: Benchmarking Geo-Temporal Understanding in Vision–Language Models in Real-World Settings
Azmine Toushik Wasi*, Shahriyar Zaman Ridoy*, Koushik Ahamed Tonmoy, and 5 more authors
In International Conference on Machine Learning (ICML), 2026
Accepted
Geo-temporal understanding, the ability to identify the location, time, and contextual features of an image from visual cues alone, is a fundamental aspect of human intelligence with wide-ranging applications, from disaster response to autonomous navigation and geography education. While recent vision–language models (VLMs) have shown progress in image geo-localization using conspicuous cues like landmarks or road signs, their ability to understand temporal signals and related spatial reasoning cues remains underexplored. To address this gap, we introduce TimeSpot, a comprehensive benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 images spanning 80 countries, where models must infer temporal attributes (season, month, time of day, daylight phase) and geolocation attributes (continent, country, climate zone, environment type, latitude–longitude coordinates) directly from the visual input. In addition, it includes spatial reasoning tasks that require integrating geographical, spatial, and temporal cues to solve complex understanding problems. Unlike prior benchmarks that emphasize obvious cues or iconic imagery, TimeSpot prioritizes diverse and subtle settings, reflecting the difficulty of reasoning under real-world uncertainty. Our evaluation of state-of-the-art VLMs, including both open- and closed-source models, reveals consistently low performance across tasks, highlighting substantial challenges in achieving robust temporal and geographic reasoning. These findings underscore the pressing need for improved methods to enable reliable and trustworthy geo-temporal understanding in VLMs, paving the way for future research in this critical domain.
- ICLR
SpatiaLab: Can Vision–Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, and 12 more authors
In International Conference on Learning Representations (ICLR), 2026
Accepted
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision–language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs’ spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question–answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10–25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry.
By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs’ spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding.
- ACM FAccT
BengaliMoralBench: A Benchmark for Auditing Ethical Reasoning in Large Language Models in Bengali Language and Culture
Shahriyar Zaman Ridoy*, Azmine Toushik Wasi*, Koushik Ahamed Tonmoy, and 2 more authors
2026
Accepted at ACM FAccT 2026
As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50–91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.
-
Position: Real-World Clinical NLP Robustness Requires Messy, Multimodal, Longitudinal, Privacy-Preserving Corpora
Azmine Toushik Wasi* and Shahriyar Zaman Ridoy*
2026
EurIPS 2025 Workshop MMRL4H
Real-world healthcare data is inherently complex, riddled with noise, incompleteness, heterogeneity, and temporal irregularities, making the development of clinically robust AI systems particularly challenging. Traditional AI models in healthcare are often developed on idealized, clean datasets that fail to reflect the operational realities of clinical settings. This limits their generalizability and effectiveness when deployed in real-world environments. There is a critical need for a paradigm shift toward methodologies that embrace the "messiness" of clinical data while ensuring patient privacy and ethical integrity. Current models and pipelines are ill-equipped to manage this complexity in a scalable, trustworthy manner. We introduce the concept of the "Messy Clinic" dataset and a corresponding blueprint designed to guide the development of healthcare AI systems grounded in real-world data characteristics. This framework supports seamless multimodal data integration, spanning EHRs, imaging, genomics, and wearables, while incorporating advanced privacy-preserving techniques such as federated learning, synthetic data generation, and differential privacy. The blueprint offers (1) a structured methodology for longitudinal, multimodal data management; (2) robust AI/ML techniques tailored to noisy and incomplete inputs; and (3) a comprehensive governance framework addressing ethical concerns like consent, accountability, and bias mitigation. Embracing the "Messy Clinic" paradigm represents a foundational shift in healthcare AI development, from artificial idealism to real-world robustness.
This approach promises to accelerate medical research, enable personalized care, and ultimately transform healthcare delivery by aligning AI systems with the authentic complexity of clinical practice.
2025
- EMNLP-Industry
CAPSTONE: Composable Attribute-Prompted Scene Translation for Zero-Shot Vision–Language Reasoning
Md. Ismail Hossain*, Shahriyar Zaman Ridoy*, Moshiur Farazi, and 2 more authors
In EMNLP Industry, 2025
Accepted
Interpreting visual scenes with high-level reasoning is essential for many real-world applications, such as autonomous systems and content moderation, but training and maintaining Vision–Language Models (VLMs) remains resource-intensive and opaque. In this work, we present CAPSTONE, a lightweight, modular framework designed for industrial settings. Instead of relying on multimodal training or fine-tuning large models, CAPSTONE transforms outputs from off-the-shelf vision models into structured text prompts that can be interpreted by a frozen Large Language Model (LLM). This plug-and-play architecture enables reasoning over visual input without access to raw pixels, dramatically reducing computational cost and complexity. On the POPE dataset, our system, using a 7B LLM, outperforms several fully trained VLMs in zero-shot evaluations, while on the VSR benchmark, the 4B model achieves competitive results, together demonstrating strong generalization without retraining. CAPSTONE offers a scalable and interpretable alternative for companies looking to integrate visual reasoning capabilities without the burden of full-scale VLM pipelines.
- Array
PANCHINI: Predictive Analytics for Child Marriage in Bangladesh Using Machine Learning Insights
Jawad Ibn Ahad, Zarin Akter, Shahriyar Zaman Ridoy, and 1 more author
2025
Under review at the Array journal
Child marriage remains a significant social issue, particularly in developing countries such as Bangladesh, where deep-rooted cultural traditions and socio-economic disparities perpetuate the practice despite legal restrictions and awareness campaigns. This study explores the underlying factors contributing to child marriage among Bangladeshi women, especially in rural and underdeveloped regions where traditional intervention strategies have struggled. By leveraging machine learning (ML) and natural language processing (NLP), we develop a predictive model to assess the likelihood of child marriage from individual demographic profiles. Our analysis focuses on key socio-economic, educational, and cultural factors, offering insights into regional variations across Bangladesh. We construct a novel dataset that combines macro- and micro-level socio-economic variables tailored to this problem. Model performance is evaluated across multiple algorithms, with ensemble Blending, particularly the EMNet configuration, achieving the strongest results, including accuracy (94.80%) and AUC (97.58%). We also evaluate large language models (LLMs) as prompt-based classifiers on relevant textual fields and as assistants for schema normalization and feature enrichment, benchmarking their outputs against supervised ML baselines. This data- and language-driven approach improves precision in identifying high-risk individuals and communities and yields actionable insights for policymakers. By integrating predictive analytics and LLM-based evaluation into intervention design, the work supports more targeted, effective policies to mitigate child marriage and promote long-term social empowerment.
- SNCS
Context-Aware Data Cleaning: Optimizing Bengali Text for Contextual Text Classification
Moshiur Rahman Faisal, Abdur Rahman Fahad, Shahriyar Zaman Ridoy, and 5 more authors
SN Computer Science, 2025
In Natural Language Processing (NLP), textual data is foundational, yet it presents substantial challenges, especially for under-resourced languages like Bengali. The complexity and volume of Bengali textual data require sophisticated data cleaning techniques. Traditional methods often neglect critical contextual information essential for effective textual analysis. This study highlights the need for context-aware data cleaning, a methodology that maintains linguistic context while removing noise. The study compares context-aware and traditional data cleaning approaches tailored for Bengali text to improve the performance of contextual transformer-based models. Conventional techniques in this study include symbol and punctuation removal, stop-word elimination, stemming, and removing HTML tags or URLs. In contrast, context-aware techniques involve spelling correction, tagging HTML and URLs, preserving punctuation and emojis, and selectively removing less important words using TF-IDF. The current initiative assesses the impact of these strategies through rigorous dataset curation and extensive training in machine learning, deep learning, and transformer-based models on four prominent Bengali datasets: BEmoC, SentNoB, UBMEC, and EmoNoBa. Results show context-aware data cleaning significantly outperforms traditional methods, particularly in enhancing transformer-based model performance. The developed context-aware data cleaning pipeline integrates various techniques, achieving a baseline accuracy improvement of up to 4% across three of the four datasets. These findings underscore the importance of preserving sentence-level context in Bengali for optimal NLP performance while minimizing noise.
Additionally, the research introduces a novel context-aware data cleaning pipeline and provides detailed algorithms for its implementation, advancing NLP research and applications in Bengali and similar linguistic contexts.
2024
- IEEE BigData
EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code
Shahriyar Zaman Ridoy, Md. Shazzad Hossain Shaon, Alfredo Cuzzocrea, and 1 more author
In Proceedings of the IEEE International Conference on Big Data (IEEE BigData), 2024
Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques. Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding: CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. By fine-tuning these models on the Draper VDISC dataset and integrating their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack effectively captures intricate code patterns and vulnerabilities that individual models may overlook. The meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities across diverse programming contexts. Experimental results demonstrate that EnStack significantly outperforms existing methods, achieving notable improvements in accuracy, precision, recall, and F1-score. This work highlights the potential of ensemble LLM approaches in code analysis tasks and offers valuable insights into applying NLP techniques for advancing automated vulnerability detection.
- IEEE IS (🏆 Best Paper)
An Efficient Text Cleaning Pipeline for Clinical Text for Transformer Encoder Models
Shahriyar Zaman Ridoy, Jannat Sultana, Zinnat Fowzia Ria, and 3 more authors
In IEEE 12th International Conference on Intelligent Systems (IS), 2024
https://drive.google.com/file/d/1Jx1ypVfoxjk7_C7jL1nq6U5x2kbybVVh/view
Choosing the best text preprocessing strategy in natural language processing (NLP) can be challenging because of the variety of techniques available. Given the popularity of transformer models, we wondered if preprocessing was necessary and, if so, what methods would improve the models’ performance. Accuracy is especially crucial when working with clinical text data. Our goal was to find an appropriate preprocessing pipeline for clinical texts that maintains or improves model performance. We experimented with four common preprocessing techniques and their combinations on two datasets from MIMIC-3 and PubMed. We used four models: BERT base, BioBERT, BioClinicalBERT, and RoBERTa. The varied accuracy results from existing techniques inspired us to develop a new pipeline to improve accuracy. Our pipeline starts with removing repeated punctuation, then normalizes the text with a CleanText function, and finally filters less important words using TF-IDF scores to keep clinically relevant terms and reduce noise. Our results showed that our pipeline outperformed the base models. For the MIMIC-3 dataset, the BERT base model achieved 90.16% accuracy, and for the PubMed dataset, BioBERT achieved 64.20% accuracy. We also found that removing stop words decreased accuracy, while using TF-IDF either maintained it or improved it by up to 3%. Additionally, because it removes less important words from the documents, our pipeline reduced training time considerably, by up to 17%.