Researcher profile

Faisal Ahmed

Faisal Ahmed contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2026arXiv

Four-Stage Alzheimer's Disease Classification from MRI Using Topological Feature Extraction, Feature Selection, and Ensemble Learning

Accurate and efficient classification of Alzheimer's disease (AD) severity from brain magnetic resonance imaging (MRI) remains a critical challenge, particularly when limited data and model interpretability are of concern. In this work, we propose TDA-Alz, a novel framework for four-stage Alzheimer's disease severity classification (non-demented, moderate dementia, mild, and very mild) using topological data analysis (TDA) and ensemble learning. Instead of relying on deep convolutional architectures or extensive data augmentation, our approach extracts topological descriptors that capture intrinsic structural patterns of brain MRI, followed by feature selection to retain the most discriminative topological features. These features are then classified using an ensemble learning strategy to achieve robust multiclass discrimination. Experiments conducted on the OASIS-1 MRI dataset demonstrate that the proposed method achieves an accuracy of 98.19% and an AUC of 99.75%, outperforming or matching state-of-the-art deep learning--based methods reported on OASIS and OASIS-derived datasets. Notably, the proposed framework does not require data augmentation, pretrained networks, or large-scale computational resources, making it computationally efficient and fast compared to deep neural network approaches. Furthermore, the use of topological descriptors provides greater interpretability, as the extracted features are directly linked to the underlying structural characteristics of brain MRI rather than opaque latent representations. These results indicate that TDA-Alz offers a powerful, lightweight, and interpretable alternative to deep learning models for MRI-based Alzheimer's disease severity classification, with strong potential for real-world clinical decision-support systems.

preprint2024arXiv

Broadband miniaturized spectrometers with a van der Waals tunnel diode

Miniaturized spectrometers are of immense interest for various on-chip and implantable photonic and optoelectronic applications. State-of-the-art conventional spectrometer designs rely heavily on bulky dispersive components (such as gratings, photodetector arrays, and interferometric optics) to capture different input spectral components that increase their integration complexity. Here, we report a high-performance broadband spectrometer based on a simple and compact van der Waals heterostructure diode, leveraging a careful selection of active van der Waals materials -- molybdenum disulfide and black phosphorus, their electrically tunable photoresponse, and advanced computational algorithms for spectral reconstruction. We achieve remarkably high peak wavelength accuracy of ~2 nanometers, and broad operation bandwidth spanning from ~500 to 1600 nanometers in a device with a ~30x20 μm2 footprint. This diode-based spectrometer scheme with broadband operation offers an attractive pathway for various applications, such as sensing, surveillance and spectral imaging.

preprint2022arXiv

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaption to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on 5 video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. The learned sparse attention masks in addition push the limit to new state of the arts, and can be transferred between different video lengths and between different datasets. Code is available at https://github.com/microsoft/SwinBERT

preprint2022arXiv

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate desired text and box outputs together, and meanwhile indicate the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text and box outputs with a shared token sequence, and introduces a special <obj> token to naturally indicate word-box alignments in the sequence. UniTAB thus could provide a more comprehensive and interpretable image description, by freely grounding generated words to object regions. On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms state of the art in both grounding and captioning evaluations. On general VL tasks that have different desired output formats (i.e., text, box, or their combination), UniTAB with a single network achieves better or comparable performance than task-specific state of the art. Experiments cover 7 VL benchmarks, including grounded captioning, visual grounding, image captioning, and visual question answering. Furthermore, UniTAB&#39;s unified multi-task network and the task-agnostic output sequence design make the model parameter efficient and generalizable to new tasks.

preprint2020arXiv

UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.