Researcher profile

Srinivasan Parthasarathy

Srinivasan Parthasarathy contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
15 works
0 followers
9 topics
4 close collaborators

Published work

15 published items

preprint, 2026, arXiv

Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

In sensitive domains, Retrieval-Augmented Generation (RAG) must be interpretable and robust because errors do not just mislead, they invite lawsuits, undermine scholarly credibility, and breach compliance. Stakeholders require traceable evidence, clear rationales for why specific evidence is selected, and safeguards against poisoned or misleading content. Yet current RAG pipelines rely on similarity-based retrieval with arbitrary top-k cutoffs, provide no explanation for selections, and remain vulnerable to poisoning attacks. We propose METEORA, which replaces these drawbacks with rationale-driven selection, using explicit reasoning to guide evidence choice, explain decisions, and improve robustness to RAG poisoning. METEORA operates in three stages: (1) a general-purpose LLM is preference-tuned to generate query-conditioned rationales using direct preference optimization; (2) these rationales drive an Evidence Chunk Selection Engine that pairs rationales with retrieved evidence for query-specific relevance and applies elbow detection to choose an adaptive cutoff (optionally expanding context with neighboring chunks); and (3) a Verifier LLM uses the rationales to detect and filter poisoned or misleading evidence before generation. Across six datasets, METEORA achieves 13.41% higher recall and, without expansion, 21.05% higher precision than the strongest baseline. It reduces the evidence needed for comparable recall by 80%, improving downstream answer accuracy by 33.34%, and strengthens adversarial defense by increasing F1 from 0.10 to 0.44. Code is available at: https://anonymous.4open.science/r/METEORA-DC46/README.md
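The adaptive cutoff in stage (2) can be illustrated with a small sketch. The abstract does not say which elbow-detection rule METEORA uses, so the largest-gap heuristic below, the function name `elbow_cutoff`, and the example scores are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def elbow_cutoff(scores: Sequence[float]) -> int:
    """Return how many top-ranked chunks to keep.

    Hypothetical heuristic: sort relevance scores in descending order and
    cut just before the largest drop between consecutive scores.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    # drops[i] is the gap between the i-th and (i+1)-th ranked score.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    largest = max(range(len(drops)), key=drops.__getitem__)
    return largest + 1


if __name__ == "__main__":
    # Made-up scores: three strong chunks followed by a sharp drop.
    example = [0.90, 0.85, 0.80, 0.30, 0.25]
    print(elbow_cutoff(example))
```

Unlike a fixed top-k, the number of chunks kept here depends on the shape of the score distribution for each query.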

preprint, 2022, arXiv

A Deep Generative Model for Molecule Optimization via One Fragment Modification

Molecule optimization is a critical step in drug development to improve desired properties of drug candidates through chemical modification. We developed a novel deep generative model Modof over molecular graphs for molecule optimization. Modof modifies a given molecule through the prediction of a single site of disconnection in the molecule and the removal and/or addition of fragments at that site. A pipeline of multiple, identical Modof models is implemented into Modof-pipe to modify an input molecule at multiple disconnection sites. Here we show that Modof-pipe is able to retain major molecular scaffolds, allow control over intermediate optimization steps and better constrain molecule similarities. Modof-pipe outperforms the state-of-the-art methods on benchmark datasets: without molecular similarity constraints, Modof-pipe achieves 81.2% improvement in octanol-water partition coefficient penalized by synthetic accessibility and ring size; and 51.2%, 25.6% and 9.2% improvement if the optimized molecules are at least 0.2, 0.4 and 0.6 similar to those before optimization, respectively. Modof-pipe is further enhanced into Modof-pipem to allow modifying one molecule to multiple optimized ones. Modof-pipem achieves a further improvement of at least 17.8% over Modof-pipe.

preprint, 2022, arXiv

Fairness-aware Summarization for Justified Decision-Making

In consequential domains such as recidivism prediction, facility inspection, and benefit assignment, it's important for individuals to know the decision-relevant information for the model's prediction. In addition, predictions should be fair both in terms of the outcome and the justification of the outcome. In other words, decision-relevant features should provide sufficient information for the predicted outcome and should be independent of the membership of individuals in protected groups such as race and gender. In this work, we focus on the problem of (un)fairness in the justification of text-based neural models. We tie the explanatory power of the model to fairness in the outcome and propose a fairness-aware summarization mechanism to detect and counteract the bias in such models. Given a potentially biased natural language explanation for a decision, we use a multi-task neural model and an attribution mechanism based on integrated gradients to extract high-utility and low-bias justifications in the form of a summary. The extracted summary is then used for training a model to make decisions for individuals. Results on several real-world datasets suggest that our method drastically limits the demographic leakage in the input (fairness in justification) while moderately enhancing the fairness in the outcome. Our model is also effective in detecting and counteracting several types of data poisoning attacks that synthesize race-coded reasoning or irrelevant justifications.
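The attribution mechanism named in this abstract, integrated gradients, can be shown on a toy function. The sketch below is a generic midpoint-rule approximation and not the paper's multi-task text model; `integrated_gradients`, `toy_gradient`, the all-zero baseline, and the toy function F(x) = x0^2 + 3*x1 are all assumptions chosen so that the result can be checked by hand.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[List[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output at a given point.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by the input displacement.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_gradient(p: List[float]) -> List[float]:
    """Gradient of the toy function F(p) = p[0]**2 + 3 * p[1]."""
    return [2.0 * p[0], 3.0]


if __name__ == "__main__":
    attributions = integrated_gradients(toy_gradient, [2.0, 1.0], [0.0, 0.0])
    print(attributions)
```

For this toy function the attributions sum to F(x) - F(baseline) = 7, which is the completeness property that makes integrated gradients convenient for ranking input features.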

preprint, 2022, arXiv

M2: Mixed Models with Preferences, Popularities and Transitions for Next-Basket Recommendation

Next-basket recommendation considers the problem of recommending a set of items into the next basket that users will purchase as a whole. In this paper, we develop a novel mixed model with preferences, popularities and transitions (M2) for next-basket recommendation. This method models three important factors in the next-basket generation process: 1) users' general preferences, 2) items' global popularities and 3) transition patterns among items. Unlike existing recurrent neural network-based approaches, M2 does not use complicated networks to model the transitions among items, or generate embeddings for users. Instead, it has a simple encoder-decoder based approach (ed-Trans) to better model the transition patterns among items. We compared M2 with different combinations of the factors with 5 state-of-the-art next-basket recommendation methods on 4 public benchmark datasets in recommending the first, second and third next basket. Our experimental results demonstrate that M2 significantly outperforms the state-of-the-art methods on all the datasets in all the tasks, with an improvement of up to 22.1%. In addition, our ablation study demonstrates that ed-Trans is more effective than recurrent neural networks in terms of the recommendation performance. We also have a thorough discussion on various experimental protocols and evaluation metrics for next-basket recommendation evaluation.

preprint, 2022, arXiv

MultiBiSage: A Web-Scale Recommendation System Using Multiple Bipartite Graphs at Pinterest

Graph Convolutional Networks (GCN) can efficiently integrate graph structure and node features to learn high-quality node embeddings. These embeddings can then be used for several tasks such as recommendation and search. At Pinterest, we have developed and deployed PinSage, a data-efficient GCN that learns pin embeddings from the Pin-Board graph. The Pin-Board graph contains pin and board entities, and the graph captures the "pin belongs to a board" interaction. However, there exist several other entities at Pinterest, such as users, idea pins, and creators, and there exist heterogeneous interactions among these entities, such as add-to-cart, follow, and long-click. In this work, we show that training deep learning models on graphs that capture these diverse interactions results in higher-quality pin embeddings than training PinSage on only the Pin-Board graph. To that end, we model the diverse entities and their diverse interactions through multiple bipartite graphs and propose a novel data-efficient MultiBiSage model. MultiBiSage can capture the graph structure of multiple bipartite graphs to learn high-quality pin embeddings. We take this pragmatic approach as it allows us to utilize the existing infrastructure developed at Pinterest -- such as the Pixie system, which can perform optimized random walks on billion-node graphs, along with existing training and deployment workflows. We train MultiBiSage on six bipartite graphs including our Pin-Board graph. Our offline metrics show that MultiBiSage significantly outperforms the latest deployed version of PinSage on multiple user engagement metrics.

preprint, 2022, arXiv

UBERT: A Novel Language Model for Synonymy Prediction at Scale in the UMLS Metathesaurus

The UMLS Metathesaurus integrates more than 200 biomedical source vocabularies. During the Metathesaurus construction process, synonymous terms are clustered into concepts by human editors, assisted by lexical similarity algorithms. This process is error-prone and time-consuming. Recently, a deep learning model (LexLM) has been developed for the UMLS Vocabulary Alignment (UVA) task. This work introduces UBERT, a BERT-based language model, pretrained on UMLS terms via a supervised Synonymy Prediction (SP) task replacing the original Next Sentence Prediction (NSP) task. The effectiveness of UBERT for the UMLS Metathesaurus construction process is evaluated using the UVA task. We show that UBERT outperforms LexLM, as well as biomedical BERT-based models. Key to the performance of UBERT are the synonymy prediction task specifically developed for UBERT, the tight alignment of training data to the UVA task, and the similarity of the models used for pretrained UBERT.

preprint, 2021, arXiv

Driving Style Representation in Convolutional Recurrent Neural Network Model of Driver Identification

Identifying driving styles is the task of analyzing the behavior of drivers in order to capture variations that will serve to discriminate different drivers from each other. This task has become a prerequisite for a variety of applications, including usage-based insurance, driver coaching, driver action prediction, and even the design of autonomous vehicles, because driving style encodes essential information needed by these applications. In this paper, we present a deep-neural-network architecture, which we term D-CRNN, for building high-fidelity representations of driving style; it combines the power of convolutional neural networks (CNN) and recurrent neural networks (RNN). Using CNN, we capture semantic patterns of driver behavior from trajectories (such as a turn or a braking event). We then find temporal dependencies between these semantic patterns using RNN to encode driving style. We demonstrate the effectiveness of these techniques for driver identification by learning driving style through extensive experiments conducted on several large, real-world datasets, and comparing the results with the state-of-the-art deep-learning and non-deep-learning solutions. These experiments also demonstrate a useful example of bias removal, by presenting how we preprocess the input data by sampling dissimilar trajectories for each driver to prevent spatial memorization. Finally, this paper presents an analysis of the contribution of different attributes for driver identification; we find that engine RPM, Speed, and Acceleration are the best combination of features.

preprint, 2021, arXiv

HAM: Hybrid Associations Models for Sequential Recommendation

Sequential recommendation aims to identify and recommend the next few items that a user is most likely to purchase/review, given the user's purchase/rating trajectories. It has become an effective tool to help users select favorite items from a variety of options. In this manuscript, we developed hybrid associations models (HAM) to generate sequential recommendations using three factors: 1) users' long-term preferences, 2) sequential, high-order and low-order association patterns in the users' most recent purchases/ratings, and 3) synergies among those items. HAM uses simplistic pooling to represent a set of items in the associations, and element-wise product to represent item synergies of arbitrary orders. We compared HAM models with the most recent, state-of-the-art methods on six public benchmark datasets in three different experimental settings. Our experimental results demonstrate that HAM models significantly outperform the state of the art in all the experimental settings, with an improvement of as much as 46.6%. In addition, our run-time performance comparison in testing demonstrates that HAM models are much more efficient than the state-of-the-art methods, and are able to achieve a significant speedup of as much as 139.7-fold.

preprint, 2020, arXiv

DrugDBEmbed : Semantic Queries on Relational Database using Supervised Column Encodings

Traditional relational databases contain a lot of latent semantic information that has largely remained untapped due to the difficulty involved in automatically extracting such information. Recent works have proposed unsupervised machine learning approaches to extract such hidden information by textifying the database columns and then projecting the text tokens onto a fixed-dimensional semantic vector space. However, in certain databases, task-specific class labels may be available, which unsupervised approaches are unable to leverage in a principled manner. Also, when embeddings are generated at the individual token level, the column encoding of a multi-token text column has to be computed by taking the average of the vectors of the tokens present in that column for any given row. Such an averaging approach may not produce the best semantic vector representation of the multi-token text column, as observed while encoding paragraphs or documents in the natural language processing domain. With these shortcomings in mind, we propose a supervised machine learning approach using a Bi-LSTM based sequence encoder to directly generate column encodings for multi-token text columns of the DrugBank database, which contains gold standard drug-drug interaction (DDI) labels. Our text data driven encoding approach achieves very high accuracy on the supervised DDI prediction task for some columns, and we use those supervised column encodings to simulate and evaluate the Analogy SQL queries on relational data to demonstrate the efficacy of our technique.
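The token-averaging baseline that this abstract argues against can be sketched in a few lines. The function name `average_column_encoding` and the two-dimensional toy vectors are hypothetical; real token embeddings would come from a trained model, and the paper's Bi-LSTM encoder is not shown here.

```python
from typing import Dict, List, Sequence


def average_column_encoding(
    tokens: Sequence[str],
    token_vectors: Dict[str, List[float]],
) -> List[float]:
    """Encode a multi-token cell as the mean of its known token vectors."""
    vectors = [token_vectors[t] for t in tokens if t in token_vectors]
    if not vectors:
        raise ValueError("no token in this cell has a known vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


if __name__ == "__main__":
    # Hypothetical toy embeddings, for illustration only.
    toy_vectors = {"inhibits": [1.0, 0.0], "enzyme": [0.0, 1.0]}
    print(average_column_encoding(["inhibits", "enzyme"], toy_vectors))
    print(average_column_encoding(["enzyme", "inhibits"], toy_vectors))
```

Both calls yield the same vector, since a mean ignores token order. That is one plausible reason a sequence encoder could represent multi-token columns better than averaging.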

preprint, 2020, arXiv

Graph Embedding on Biomedical Networks: Methods, Applications, and Evaluations

Graph embedding learning, which aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods are evaluated on social and information networks and are not comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding method) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) prediction, protein-protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification, protein function prediction. Our experimental results demonstrate that the recent graph embedding methods achieve promising results and deserve more attention in future biomedical graph analysis. Compared with three state-of-the-art methods for DDAs, DDIs and protein function predictions, the recent graph embedding methods achieve competitive performance without using any biological features and the learned embeddings can be treated as complementary representations for the biological features. By summarizing the experimental results, we provide general guidelines for properly selecting graph embedding methods and setting their hyper-parameters for different biomedical tasks.

preprint, 2020, arXiv

HPRA: Hyperedge Prediction using Resource Allocation

Many real-world systems involve higher-order interactions and thus demand complex models such as hypergraphs. For instance, a research article could have multiple collaborating authors, and therefore the co-authorship network is best represented as a hypergraph. In this work, we focus on the problem of hyperedge prediction. This problem has immense applications in multiple domains, such as predicting new collaborations in social networks, discovering new chemical reactions in metabolic networks, etc. Despite having significant importance, the problem of hyperedge prediction hasn't received adequate attention, mainly because of its inherent complexity. In a graph with $n$ nodes the number of potential edges is $\mathcal{O}(n^{2})$, whereas in a hypergraph, the number of potential hyperedges is $\mathcal{O}(2^{n})$. To avoid searching through such a huge space, current methods restrain the original problem in the following two ways. One class of algorithms assumes the hypergraphs to be $k$-uniform. However, many real-world systems are not confined only to have interactions involving $k$ components. Thus, these algorithms are not suitable for many real-world applications. The second class of algorithms requires a candidate set of hyperedges from which the potential hyperedges are chosen. In the absence of domain knowledge, the candidate set can have $\mathcal{O}(2^{n})$ possible hyperedges, which makes this problem intractable. We propose HPRA - Hyperedge Prediction using Resource Allocation, a first-of-its-kind algorithm, which overcomes these issues and predicts hyperedges of any cardinality without using any candidate hyperedge set. HPRA is a similarity-based method working on the principles of the resource allocation process. In addition to recovering missing hyperedges, we demonstrate that HPRA can predict future hyperedges in a wide range of hypergraphs.
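For orientation, the resource allocation principle mentioned in this abstract is easiest to see in its classic pairwise form on ordinary graphs, where each common neighbor of two nodes contributes the reciprocal of its degree. The sketch below shows only that classic graph index; HPRA's hypergraph formulation is not described in the abstract and is not reproduced here. The function name and the toy adjacency map are assumptions.

```python
from typing import Dict, Hashable, Set


def resource_allocation_index(
    adjacency: Dict[Hashable, Set[Hashable]],
    x: Hashable,
    y: Hashable,
) -> float:
    """Classic pairwise resource allocation similarity on a simple graph.

    Each common neighbor z of x and y contributes 1 / degree(z), so
    low-degree intermediaries count for more than hubs.
    """
    common = adjacency[x] & adjacency[y]
    return sum(1.0 / len(adjacency[z]) for z in common)


if __name__ == "__main__":
    # Toy undirected graph, given as node -> set of neighbors.
    graph = {
        "a": {"c", "d"},
        "b": {"c", "d"},
        "c": {"a", "b"},
        "d": {"a", "b", "e"},
        "e": {"d"},
    }
    print(resource_allocation_index(graph, "a", "b"))
```

In the toy graph, nodes `a` and `b` share neighbors `c` (degree 2) and `d` (degree 3), so their score is 1/2 + 1/3.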

preprint, 2020, arXiv

MILE: A Multi-Level Framework for Scalable Graph Embedding

Recently there has been a surge of interest in designing graph embedding methods. Few, if any, can scale to a large-sized graph with millions of nodes due to both computational complexity and memory requirements. In this paper, we relax this limitation by introducing the MultI-Level Embedding (MILE) framework -- a generic methodology allowing contemporary graph embedding methods to scale to large graphs. MILE repeatedly coarsens the graph into smaller ones using a hybrid matching technique to maintain the backbone structure of the graph. It then applies existing embedding methods on the coarsest graph and refines the embeddings to the original graph through a graph convolution neural network that it learns. The proposed MILE framework is agnostic to the underlying graph embedding techniques and can be applied to many existing graph embedding methods without modifying them. We employ our framework on several popular graph embedding techniques and conduct embedding for real-world graphs. Experimental results on five large-scale datasets demonstrate that MILE significantly boosts the speed (order of magnitude) of graph embedding while generating embeddings of better quality, for the task of node classification. MILE can comfortably scale to a graph with 9 million nodes and 40 million edges, on which existing methods run out of memory or take too long to compute on a modern workstation. Our code and data are publicly available with detailed instructions for adding new base embedding methods: https://github.com/jiongqian/MILE.
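A single coarsening pass of the kind this abstract describes can be sketched with a plain greedy edge matching. This is a simplification for illustration only: the abstract mentions a hybrid matching technique whose details are not given here, and the function name `coarsen_once` and the toy path graph are assumptions.

```python
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def coarsen_once(
    nodes: Iterable[Hashable],
    edges: Sequence[Edge],
) -> Tuple[Dict[Hashable, int], Set[Tuple[int, int]]]:
    """Collapse greedily matched node pairs into supernodes.

    Returns a node -> supernode mapping and the edge set of the coarser
    graph. Unmatched nodes become singleton supernodes.
    """
    super_of: Dict[Hashable, int] = {}
    next_id = 0
    # Greedy matching: take an edge if both endpoints are still free.
    for u, v in edges:
        if u != v and u not in super_of and v not in super_of:
            super_of[u] = super_of[v] = next_id
            next_id += 1
    for node in nodes:
        if node not in super_of:
            super_of[node] = next_id
            next_id += 1
    # Keep only edges that connect two different supernodes.
    coarse_edges = set()
    for u, v in edges:
        a, b = super_of[u], super_of[v]
        if a != b:
            coarse_edges.add((min(a, b), max(a, b)))
    return super_of, coarse_edges


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3 collapses to two supernodes joined by one edge.
    mapping, coarse = coarsen_once([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
    print(mapping, coarse)
```

Applying such a pass repeatedly shrinks the graph; a base embedding method then runs on the smallest graph, and the mapping is what a refinement step would use to project embeddings back to the original nodes.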

preprint, 2020, arXiv

Towards Quantifying the Distance between Opinions

Increasingly, critical decisions in public policy, governance, and business strategy rely on a deeper understanding of the needs and opinions of constituent members (e.g. citizens, shareholders). While it has become easier to collect a large number of opinions on a topic, there is a need for automated tools to help navigate the space of opinions. In such contexts, understanding and quantifying the similarity between opinions is key. We find that measures based solely on text similarity or on overall sentiment often fail to effectively capture the distance between opinions. Thus, we propose a new distance measure for capturing the similarity between opinions that leverages the nuanced observation -- similar opinions express similar sentiment polarity on specific relevant entities-of-interest. Specifically, in an unsupervised setting, our distance measure achieves significantly better Adjusted Rand Index scores (up to 56x) and Silhouette coefficients (up to 21x) compared to existing approaches. Similarly, in a supervised setting, our opinion distance measure achieves considerably better accuracy (up to 20% increase) compared to extant approaches that rely on text similarity, stance similarity, and sentiment similarity.

preprint, 2020, arXiv

Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing

Visual Question Answering (VQA) systems are tasked with answering natural language questions corresponding to a presented image. Traditional VQA datasets typically contain questions related to the spatial information of objects, object attributes, or general scene questions. Recently, researchers have recognized the need to improve the balance of such datasets to reduce the system's dependency on memorized linguistic features and statistical biases, while aiming for enhanced visual understanding. However, it is unclear whether any latent patterns exist to quantify and explain these failures. As an initial step towards better quantifying our understanding of the performance of VQA models, we use a taxonomy of Knowledge Gaps (KGs) to tag questions with one or more types of KGs. Each Knowledge Gap (KG) describes the reasoning abilities needed to arrive at a resolution. After identifying KGs for each question, we examine the skew in the distribution of questions for each KG. We then introduce a targeted question generation model to reduce this skew, which allows us to generate new types of questions for an image. These new questions can be added to existing VQA datasets to increase the diversity of questions and reduce the skew.

preprint, 2019, arXiv

Twitter Watch: Leveraging Social Media to Monitor and Predict Collective-Efficacy of Neighborhoods

Sociologists associate the spatial variation of crime within an urban setting with the concept of collective efficacy. The collective efficacy of a neighborhood is defined as social cohesion among neighbors combined with their willingness to intervene on behalf of the common good. Sociologists measure collective efficacy by conducting survey studies designed to measure individuals' perception of their community. In this work, we employ the curated data from a survey study (ground truth) and examine the effectiveness of substituting costly survey questionnaires with proxies derived from social media. We enrich a corpus of tweets mentioning a local venue with several linguistic and topological features. We then propose a pairwise learning-to-rank model with the goal of identifying a ranking of neighborhoods that is similar to the ranking obtained from the ground truth collective efficacy values. In our experiments, we find that our generated ranking of neighborhoods achieves 0.77 Kendall tau-x ranking agreement with the ground truth ranking. Overall, our results are up to 37% better than traditional baselines.
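The ranking-agreement figure reported in this abstract can be read against a plain Kendall tau computation. The sketch below covers only rankings without ties; the tau-x variant named in the abstract differs in how it treats ties, which this sketch does not handle. The function name and the neighborhood labels are assumptions.

```python
from itertools import combinations
from typing import Dict, Hashable


def kendall_tau(rank_a: Dict[Hashable, int], rank_b: Dict[Hashable, int]) -> float:
    """Kendall tau between two tie-free rankings of the same items.

    Counts item pairs ordered the same way (concordant) and the opposite
    way (discordant), then normalizes by the total number of pairs.
    """
    items = list(rank_a)
    if len(items) < 2:
        raise ValueError("need at least two items to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(items, 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    # Hypothetical neighborhood rankings: survey-based vs. model-based.
    survey = {"north": 1, "east": 2, "south": 3, "west": 4}
    model = {"north": 1, "east": 2, "south": 4, "west": 3}
    print(kendall_tau(survey, model))
```

A value of 1 means the two rankings agree on every pair and -1 means they disagree on every pair, so a score of 0.77 indicates that most neighborhood pairs are ordered consistently.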