Researcher profile

Sarwan Ali

Sarwan Ali contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, unclaimed author
11 works
0 followers
10 topics
4 close collaborators

Published work

11 published items

preprint, 2026, arXiv

Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

Recent work has shown that Transformers' compositional generalization is governed by complexity control (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open when during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i) weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy ($0.93$ vs $0.91$); (ii) holding the total regularization budget constant, placing it in the middle of training yields $5$ to $9\times$ higher OOD accuracy than placing it early; (iii) the boundary of the critical window is remarkably sharp: shifting the window onset by as little as $100$ optimization steps causes mean OOD accuracy to jump from chance ($0.15$) to the reasoning regime ($0.61$); (iv) the window's position depends systematically on initialization scale, but the basin of attraction for reasoning solutions shrinks at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better. We further show that the critical-window phenomenon is task-specific: it does not appear on grokking with modular arithmetic, where properly tuned constant weight decay matches scheduled weight decay.
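
The idea of applying weight decay only inside a window of training can be sketched as a step-dependent schedule. This is a minimal illustration, not the authors' implementation: the function name, the coefficient `wd=0.1`, and the onset of 37.5% of training are assumptions chosen for the example; only the 25%-of-training window width comes from the abstract.

```python
def windowed_weight_decay(step, total_steps, onset_frac, width_frac=0.25, wd=0.1):
    """Return the weight-decay coefficient to use at a given optimization step.

    Weight decay is `wd` inside the window [onset, onset + width) and 0 elsewhere.
    All parameter names and defaults are hypothetical, for illustration only.
    """
    onset = int(onset_frac * total_steps)
    end = onset + int(width_frac * total_steps)
    return wd if onset <= step < end else 0.0


# Example: a 1000-step run with the window placed in the middle of training.
total = 1000
schedule = [windowed_weight_decay(s, total, onset_frac=0.375) for s in range(total)]
active_steps = sum(1 for v in schedule if v > 0)  # number of steps with decay on
```

In a training loop, the returned value would be written into the optimizer's weight-decay setting before each step; shifting `onset_frac` while keeping `width_frac` fixed holds the regularization budget constant.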

preprint, 2022, arXiv

Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
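
Perturbing a sequence to simulate sequencing errors can be illustrated with a very simple model. The sketch below applies uniform random substitutions at a fixed rate; this is an assumption made for illustration and is not one of the paper's perturbation schemes. Real Illumina and PacBio error profiles are position-dependent and include insertions and deletions, which this example does not model. The function name and parameters are hypothetical.

```python
import random


def perturb(seq, error_rate, alphabet="ACGT", seed=0):
    """Substitute each character with a different one with probability error_rate."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for ch in seq:
        if rng.random() < error_rate:
            # Pick a replacement that differs from the original character.
            out.append(rng.choice([a for a in alphabet if a != ch]))
        else:
            out.append(ch)
    return "".join(out)


original = "ACGTACGTACGT"
noisy = perturb(original, error_rate=0.1)
```

A robustness benchmark along these lines would train a classifier on clean sequences and compare its accuracy on `original`-style versus `noisy`-style inputs.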

preprint, 2022, arXiv

Efficient Approximate Kernel Based Spike Sequence Classification

Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods; they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.
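
The $k$-mer and minimizer notions mentioned above can be made concrete with a small sketch. It uses one common definition of a minimizer (the lexicographically smallest length-$m$ substring of a $k$-mer) and an exact spectrum-style kernel over minimizer counts. This is an illustration under those assumptions, not the approximate kernel proposed in the paper; all function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All contiguous substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m):
    """Lexicographically smallest m-mer inside a k-mer (forward direction only)."""
    return min(kmers(kmer, m))


def minimizer_spectrum(seq, k, m):
    """Count how often each minimizer occurs across the k-mers of a sequence."""
    return Counter(minimizer(km, m) for km in kmers(seq, k))


def kernel(a, b, k=4, m=2):
    """Inner product of the two minimizer-count vectors."""
    sa, sb = minimizer_spectrum(a, k, m), minimizer_spectrum(b, k, m)
    return sum(sa[x] * sb[x] for x in sa)


print(kernel("ACGTAC", "ACGTAC"))  # 5
print(kernel("ACGTAC", "TTTTTT"))  # 0
```

Replacing each $k$-mer by its minimizer shrinks the feature alphabet from size $|\Sigma|^k$ to $|\Sigma|^m$, which is the preprocessing benefit the abstract alludes to.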

preprint, 2022, arXiv

Information We Can Extract About a User From 'One Minute Mobile Application Usage'

Understanding human behavior is an important task and has applications in many domains such as targeted advertisement, health analytics, security, and entertainment. For this purpose, designing a system for activity recognition (AR) is important. However, since every human can have different behaviors, understanding and analyzing common patterns becomes a challenging task. Since smartphones are easily available to every human being in the modern world, using them to track human activities becomes possible. In this paper, we extracted different human activities using the accelerometer, magnetometer, and gyroscope sensors of Android smartphones by building an Android mobile application. Using different social media applications, such as Facebook, Instagram, WhatsApp, and Twitter, we extracted the raw sensor values of $29$ subjects along with their attributes (class labels) such as age, gender, and left/right/both hands application usage. We extract features from the raw signals and use them to perform classification using different machine learning (ML) algorithms. Using statistical analysis, we show the importance of different features towards the prediction of class labels. In the end, we use the ML model trained on our data to extract unknown features from a well-known activity recognition dataset from the UCI repository, which highlights the potential for privacy breaches using ML models. This security analysis could help researchers in the future to take appropriate steps to preserve the privacy of human subjects.

preprint, 2022, arXiv

PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences

COVID-19 pandemic, is still unknown and is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona-) viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among avians, bats, camels, swine, humans and weasels, to name a few. We propose a feature embedding based on the well-known position-weight matrix (PWM), which we call PWM2Vec, and use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications such as determining protein function or identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs in the context of host classification from viral sequences to generate a fixed-length feature vector representation. The results on real-world data show that, using PWM2Vec, we are able to perform comparably to baseline models. We also measure the importance of different amino acids using information gain to show which amino acids are important for predicting the host of a given coronavirus.
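
How a position-weight matrix can turn a sequence into a numeric vector is sketched below. The abstract does not spell out the PWM2Vec procedure, so this is only a generic PWM illustration under stated assumptions: the matrix is built from the $k$-mers of a single sequence, a pseudocount of 1 and a uniform background are used, and each $k$-mer is scored by summing its log-odds weights. A DNA alphabet is used for brevity; function names are hypothetical.

```python
import math


def kmers(seq, k):
    """All contiguous substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def pwm_from_kmers(kms, alphabet, pseudocount=1.0):
    """Build a k x |alphabet| log-odds matrix from a list of equal-length k-mers."""
    k = len(kms[0])
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pwm = []
    for pos in range(k):
        counts = {a: pseudocount for a in alphabet}
        for km in kms:
            counts[km[pos]] += 1
        total = sum(counts.values())
        pwm.append({a: math.log2((c / total) / background) for a, c in counts.items()})
    return pwm


def pwm_scores(seq, k, alphabet):
    """Score every k-mer of seq against the PWM built from seq's own k-mers."""
    kms = kmers(seq, k)
    pwm = pwm_from_kmers(kms, alphabet)
    return [sum(pwm[i][ch] for i, ch in enumerate(km)) for km in kms]


vec = pwm_scores("ACGTACGT", 3, "ACGT")  # one score per k-mer
```

For sequences of equal (aligned) length, the resulting vector has a fixed length of `len(seq) - k + 1`, which is what makes it usable as input to a standard classifier.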

preprint, 2022, arXiv

Short-Term Load Forecasting Using AMI Data

Accurate short-term load forecasting is essential for the efficient operation of the power sector. Forecasting load at a fine granularity such as hourly loads of individual households is challenging due to higher volatility and inherent stochasticity. At the aggregate levels, such as monthly load at a grid, the uncertainties and fluctuations are averaged out; hence predicting load is more straightforward. This paper proposes a method called Forecasting using Matrix Factorization (FMF) for short-term load forecasting (STLF). FMF only utilizes historical data from consumers' smart meters to forecast future loads (does not use any non-calendar attributes, consumers' demographics or activity patterns information, etc.) and can be applied to any locality. A prominent feature of FMF is that it works at any level of user-specified granularity, both in the temporal (from a single hour to days) and spatial dimensions (a single household to groups of consumers). We empirically evaluate FMF on three benchmark datasets and demonstrate that it significantly outperforms the state-of-the-art methods in terms of load forecasting. The computational complexity of FMF is also substantially less than known methods for STLF such as long short-term memory neural networks, random forest, support vector machines, and regression trees.

preprint, 2022, arXiv

SsAG: Summarization and sparsification of Attributed Graphs

We present SsAG, an efficient and scalable lossy graph summarization method that retains the essential structure of the original graph. SsAG computes a sparse representation (summary) of the input graph and also caters to graphs with node attributes. The summary of a graph $G$ is stored as a graph on supernodes (subsets of vertices of $G$), and a weighted superedge connects two supernodes. The proposed method constructs a summary graph on $k$ supernodes that minimizes the reconstruction error (the difference between the original graph and the graph reconstructed from the summary) and maximizes homogeneity with respect to attributes. We construct the summary by iteratively merging a pair of nodes. We derive a closed-form expression to efficiently compute the reconstruction error after merging a pair and approximate this score in constant time. To reduce the search space for selecting the best pair for merging, we assign a weight to each supernode that closely quantifies the contribution of the node to the score of the pairs containing it. We choose the best pair for merging from a random sample of supernodes selected with probability proportional to their weights. With weighted sampling, a logarithmic-sized sample yields a comparable summary based on various quality measures. We propose a sparsification step for the constructed summary to reduce the storage cost to a given target size with a marginal increase in reconstruction error. Empirical evaluation on several real-world graphs and comparison with state-of-the-art methods show that SsAG is up to $5\times$ faster and generates summaries of comparable quality.
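
To make "reconstruction error" concrete, the sketch below computes it by brute force for a given partition of vertices into supernodes. It assumes one common definition: each vertex pair is reconstructed as the edge density of the supernode pair it falls in, and the error is the sum of squared differences from the true adjacency. The paper's closed-form expression and constant-time approximation are not reproduced here, and the function name is hypothetical.

```python
from itertools import combinations


def reconstruction_error(nodes, edges, partition):
    """Sum of squared differences between true and expected adjacency over all pairs."""
    edge_set = {frozenset(e) for e in edges}
    block = {v: i for i, part in enumerate(partition) for v in part}

    # Count vertex pairs and actual edges for every (supernode, supernode) block.
    pairs_in, edges_in = {}, {}
    for u, v in combinations(nodes, 2):
        key = tuple(sorted((block[u], block[v])))
        pairs_in[key] = pairs_in.get(key, 0) + 1
        if frozenset((u, v)) in edge_set:
            edges_in[key] = edges_in.get(key, 0) + 1

    error = 0.0
    for u, v in combinations(nodes, 2):
        key = tuple(sorted((block[u], block[v])))
        expected = edges_in.get(key, 0) / pairs_in[key]  # block edge density
        actual = 1.0 if frozenset((u, v)) in edge_set else 0.0
        error += (actual - expected) ** 2
    return error


nodes = [0, 1, 2, 3]
edges = [(0, 1), (2, 3)]
good = reconstruction_error(nodes, edges, [[0, 1], [2, 3]])  # merges endpoints of each edge
bad = reconstruction_error(nodes, edges, [[0, 2], [1, 3]])   # merges non-adjacent vertices
```

The brute-force version costs time quadratic in the number of vertices per evaluation, which is why a closed-form update after each merge matters for scalability.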

preprint, 2021, arXiv

Predicting Attributes of Nodes Using Network Structure

In many graphs such as social networks, nodes have associated attributes representing their behavior. Predicting node attributes in such graphs is an important problem with applications in many domains like recommendation systems, privacy preservation, and targeted advertisement. Attribute values can be predicted by analyzing patterns and correlations among attributes and employing classification/regression algorithms. However, these approaches do not utilize readily available network topology information. In this regard, interconnections between different attributes of nodes can be exploited to improve the prediction accuracy. In this paper, we propose an approach to represent a node by a feature map with respect to an attribute $a_i$ (which is used as input for machine learning algorithms) using all attributes of neighbors to predict attribute values for $a_i$. We perform extensive experimentation on ten real-world datasets and show that the proposed feature map significantly improves the prediction accuracy as compared to baseline approaches on these datasets.
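
Building a node representation from neighbors' attributes can be illustrated as follows. The abstract does not define the feature map, so this sketch assumes the simplest choice, the per-attribute mean over a node's neighbors; the function name and the toy graph are hypothetical.

```python
def neighbor_feature_map(adj, attrs, node):
    """Mean of each attribute over the neighbors of `node` (zeros if isolated)."""
    neighbors = adj[node]
    n_attrs = len(attrs[node])
    if not neighbors:
        return [0.0] * n_attrs
    return [sum(attrs[u][j] for u in neighbors) / len(neighbors) for j in range(n_attrs)]


# Toy graph: node 0 is connected to nodes 1 and 2; node 3 is isolated.
adj = {0: [1, 2], 1: [0], 2: [0], 3: []}
attrs = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [1.0, 4.0], 3: [5.0, 5.0]}

print(neighbor_feature_map(adj, attrs, 0))  # [0.5, 3.0]
```

The resulting vector would be passed to a classifier or regressor whose target is the node's own value of the attribute $a_i$.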

preprint, 2020, arXiv

Detecting DDoS Attack on SDN Due to Vulnerabilities in OpenFlow

Software Defined Networking (SDN) is a network paradigm shift that facilitates comprehensive network programmability to cope with emerging technologies such as cloud computing and big data. SDN facilitates simplified and centralized network management, enabling it to operate in dynamic scenarios. Further, SDN uses the OpenFlow protocol for communication between the controller and its switches. OpenFlow creates vulnerabilities to network attacks, especially Distributed Denial of Service (DDoS). DDoS attacks are launched from the compromised hosts connected to the SDN switches. In this paper, we introduce a time- and space-efficient solution for the identification of these compromised hosts. Our solution consumes fewer computational resources and less space and does not require any special equipment.

preprint, 2020, arXiv

Effect of Analysis Window and Feature Selection on Classification of Hand Movements Using EMG Signal

Electromyography (EMG) signals have been successfully employed for driving prosthetic limbs of a single or double degree of freedom. This principle works by using the amplitude of the EMG signals to decide between one or two simpler movements. This method underperforms compared to contemporary advances on the mechanical, electronics, and robotics side, and it lacks intuition. Recently, research on myoelectric control based on pattern recognition (PR) has shown promising results with the aid of machine learning classifiers. In this approach, termed EMG-PR, EMG signals are divided into analysis windows, and features are extracted for each window. These features are then fed to the machine learning classifiers as input. By offering multiple class movements and intuitive control, this method has the potential to enable an amputee to perform everyday life movements. In this paper, we investigate the effect of the analysis window and feature selection on the classification accuracy of different hand and wrist movements using time-domain features. We show that effective data preprocessing and optimum feature selection help to improve the classification accuracy of hand movements. We use a publicly available hand and wrist gesture dataset of $40$ intact subjects for experimentation. Results computed using different classification algorithms show that the proposed preprocessing and feature selection outperform the baseline and achieve up to $98\%$ classification accuracy.
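
The windowing and feature-extraction step of an EMG-PR pipeline can be sketched as below. The three features shown (mean absolute value, waveform length, zero crossings) are standard time-domain EMG features, but the abstract does not list which features or window sizes the paper uses, so the choices here, along with the function names and the toy signal, are assumptions.

```python
def sliding_windows(signal, size, step):
    """Split a 1-D signal into windows of `size` samples, advancing by `step`."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def time_domain_features(window):
    """Compute three common time-domain features for one analysis window."""
    mav = sum(abs(x) for x in window) / len(window)              # mean absolute value
    wl = sum(abs(b - a) for a, b in zip(window, window[1:]))     # waveform length
    zc = sum(1 for a, b in zip(window, window[1:]) if a * b < 0) # zero crossings
    return {"mav": mav, "wl": wl, "zc": zc}


signal = [1, -1, 2, -2, 3, -3, 4, -4]
windows = sliding_windows(signal, size=4, step=2)  # 50% overlap between windows
features = [time_domain_features(w) for w in windows]

print(features[0])  # {'mav': 1.5, 'wl': 9, 'zc': 3}
```

Each feature dictionary becomes one training example for the classifier; changing `size` and `step` is the "analysis window" effect the abstract investigates.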

preprint, 2020, arXiv

Fair Allocation Based Soft Load Shedding

Renewable sources are taking center stage in electricity generation. Due to the intermittent nature of these renewable resources, the problem of the demand-supply gap arises. To solve this problem, several techniques have been proposed in the literature in terms of cost (adding peaker plants), availability of data (Demand Side Management "DSM"), hardware infrastructure (appliance controlling DSM) and safety (voltage reduction). However, these solutions are not fair in terms of electricity distribution. In many cases, although the available supply may not match the demand in peak hours, the total aggregated demand remains less than the total supply for the whole day. Load shedding (complete blackout) is a commonly used solution to deal with the demand-supply gap, which can cause substantial economic losses. To solve the demand-supply gap problem, we propose a solution called Soft Load Shedding (SLS), which assigns an electricity quota to each household in a fair way. We measure the fairness of SLS by defining a function for household satisfaction level. We model the household utilities by a parametric function and formulate the problem of SLS as a social welfare problem. We also consider revenue generated from the fair allocation as a performance measure. To evaluate our approach, extensive experiments have been performed on both synthetic and real-world datasets, and our model is compared with several baselines to show its effectiveness in terms of fair allocation and revenue generation.