Source author record

Mayank Kejriwal

Mayank Kejriwal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Databases Computation and Language Computational Complexity cs.CY Data Structures and Algorithms Information Retrieval Machine Learning physics.soc-ph Social and Information Networks

Catalog footprint

What is connected

13works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

A Compound AI Agent for Conversational Grant Discovery

Research funding discovery remains fundamentally fragmented: researchers navigate disparate agency portals (e.g., in the United States, NSF, NIH, DARPA, Grants.gov, and many others) with heterogeneous interfaces, search capabilities, and data schemas. We present a compound AI system that unifies this landscape through two tightly coupled components: (1) an aggregation layer that autonomously collects, normalizes, and indexes almost 12,000 federal and nonprofit opportunities from fragmented sources via LLM-equipped browser agents, maintaining a biweekly-updated unified database; and (2) an agentic ReAct-based query processing layer that interprets research context (including from PDF documents) and employs hybrid search combining a structured index with selective web search to retrieve relevant opportunities - while avoiding LLM hallucination. The conversational interface supports iterative refinement through multi-turn interactions, allowing researchers to progressively apply constraints without reformulating their core research description. Results stream in real time with full transparency of intermediate reasoning, enabling appropriate calibration of user trust. Currently used by almost 3,000+ users, our approach demonstrates the feasibility of compound AI in reducing grant discovery time from 30--45 minutes (manual, fragmented portal searches) to under 10 minutes (unified, conversational search).

preprint2022arXiv

A Theoretically Grounded Benchmark for Evaluating Machine Commonsense

Programming machines with commonsense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community. Current CSR benchmarks use multiple-choice (and in relatively fewer cases, generative) question-answering instances to evaluate machine commonsense. Recent progress in transformer-based language representation models suggest that considerable progress has been made on existing benchmarks. However, although tens of CSR benchmarks currently exist, and are growing, it is not evident that the full suite of commonsense capabilities have been systematically evaluated. Furthermore, there are doubts about whether language models are 'fitting' to a benchmark dataset's training partition by picking up on subtle, but normatively irrelevant (at least for CSR), statistical features to achieve good performance on the testing partition. To address these challenges, we propose a benchmark called Theoretically-Grounded Commonsense Reasoning (TG-CSR) that is also based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense, such as space, time, and world states. TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs. The benchmark is also designed to be few-shot (and in the future, zero-shot), with only a few training and validation examples provided. This report discusses the structure and construction of the benchmark. Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks. Benchmark access and leaderboard: https://codalab.lisn.upsaclay.fr/competitions/3080 Benchmark website: https://usc-isi-i2.github.io/TGCSR/

preprint2022arXiv

Can Scale-free Network Growth with Triad Formation Capture Simplicial Complex Distributions in Real Communication Networks?

In recent years, there has been a growing recognition that higher-order structures are important features in real-world networks. A particular class of structures that has gained prominence is known as a simplicial complex. Despite their application to complex processes such as social contagion and novel measures of centrality, not much is currently understood about the distributional properties of these complexes in communication networks. Furthermore, it is also an open question as to whether an established growth model, such as scale-free network growth with triad formation, is sophisticated enough to capture the distributional properties of simplicial complexes. In this paper, we use empirical data on five real-world communication networks to propose a functional form for the distributions of two important simplicial complex structures. We also show that, while the scale-free network growth model with triad formation captures the form of these distributions in networks evolved using the model, the best-fit parameters are significantly different between the real network and its simulated equivalent. An auxiliary contribution is an empirical profile of the two simplicial complexes in these five real-world networks.

preprint2022arXiv

Decision Making in Monopoly using a Hybrid Deep Reinforcement Learning Approach

Learning to adapt and make real-time informed decisions in a dynamic and complex environment is a challenging problem. Monopoly is a popular strategic board game that requires players to make multiple decisions during the game. Decision-making in Monopoly involves many real-world elements such as strategizing, luck, and modeling of opponent's policies. In this paper, we present novel representations for the state and action space for the full version of Monopoly and define an improved reward function. Using these, we show that our deep reinforcement learning agent can learn winning strategies for Monopoly against different fixed-policy agents. In Monopoly, players can take multiple actions even if it is not their turn to roll the dice. Some of these actions occur more frequently than others, resulting in a skewed distribution that adversely affects the performance of the learning agent. To tackle the non-uniform distribution of actions, we propose a hybrid approach that combines deep reinforcement learning (for frequent but complex decisions) with a fixed policy approach (for infrequent but straightforward decisions). Experimental results show that our hybrid agent outperforms a standard deep reinforcement learning agent by 30% in the number of games won against fixed-policy agents.

preprint2022arXiv

Robust Quantification of Gender Disparity in Pre-Modern English Literature using Natural Language Processing

Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the Natural Language Processing (NLP) literature have been proposed for measuring such disparity using relatively extensive datasets and empirically rigorous methodologies. In this paper, we contribute to this line of research by studying gender disparity, at scale, in copyright-expired literary texts published in the pre-modern period (defined in this work as the period ranging from the mid-nineteenth through the mid-twentieth century). One of the challenges in using such tools is to ensure quality control, and by extension, trustworthy statistical analysis. Another challenge is in using materials and methods that are publicly available and have been established for some time, both to ensure that they can be used and vetted in the future, and also, to add confidence to the methodology itself. We present our solution to addressing these challenges, and using multiple measures, demonstrate the significant discrepancy between the prevalence of female characters and male characters in pre-modern literature. The evidence suggests that the discrepancy declines when the author is female. The discrepancy seems to be relatively stable as we plot data over the decades in this century-long period. Finally, we aim to carefully describe both the limitations and ethical caveats associated with this study, and others like it.

preprint2021arXiv

A Data-Driven Study of Commonsense Knowledge using the ConceptNet Knowledge Base

Acquiring commonsense knowledge and reasoning is recognized as an important frontier in achieving general Artificial Intelligence (AI). Recent research in the Natural Language Processing (NLP) community has demonstrated significant progress in this problem setting. Despite this progress, which is mainly on multiple-choice question answering tasks in limited settings, there is still a lack of understanding (especially at scale) of the nature of commonsense knowledge itself. In this paper, we propose and conduct a systematic study to enable a deeper understanding of commonsense knowledge by doing an empirical and structural analysis of the ConceptNet knowledge base. ConceptNet is a freely available knowledge base containing millions of commonsense assertions presented in natural language. Detailed experimental results on three carefully designed research questions, using state-of-the-art unsupervised graph representation learning ('embedding') and clustering techniques, reveal deep substructures in ConceptNet relations, allowing us to make data-driven and computational claims about the meaning of phenomena such as 'context' that are traditionally discussed only in qualitative terms. Furthermore, our methodology provides a case study in how to use data-science and computational methodologies for understanding the nature of an everyday (yet complex) psychological phenomenon that is an essential feature of human intelligence.

preprint2020arXiv

An Experimental Study of The Effects of Position Bias on Emotion CauseExtraction

Emotion Cause Extraction (ECE) aims to identify emotion causes from a document after annotating the emotion keywords. Some baselines have been proposed to address this problem, such as rule-based, commonsense based and machine learning methods. We show, however, that a simple random selection approach toward ECE that does not require observing the text achieves similar performance compared to the baselines. We utilized only position information relative to the emotion cause to accomplish this goal. Since position information alone without observing the text resulted in higher F-measure, we therefore uncovered a bias in the ECE single genre Sina-news benchmark. Further analysis showed that an imbalance of emotional cause location exists in the benchmark, with a majority of cause clauses immediately preceding the central emotion clause. We examine the bias from a linguistic perspective, and show that high accuracy rate of current state-of-art deep learning models that utilize location information is only evident in datasets that contain such position biases. The accuracy drastically reduced when a dataset with balanced location distribution is introduced. We therefore conclude that it is the innate bias in this benchmark that caused high accuracy rate of these deep learning models in ECE. We hope that the case study in this paper presents both a cautionary lesson, as well as a template for further studies, in interpreting the superior fit of deep learning models without checking for bias.

preprint2020arXiv

On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices

Schema.org has experienced high growth in recent years. Structured descriptions of products embedded in HTML pages are now not uncommon, especially on e-commerce websites. The Web Data Commons (WDC) project has extracted schema.org data at scale from webpages in the Common Crawl and made it available as an RDF `knowledge graph' at scale. The portion of this data that specifically describes products offers a golden opportunity for researchers and small companies to leverage it for analytics and downstream applications. Yet, because of the broad and expansive scope of this data, it is not evident whether the data is usable in its raw form. In this paper, we do a detailed empirical study on the product-specific schema.org data made available by WDC. Rather than simple analysis, the goal of our study is to devise an empirically grounded set of best practices for using and consuming WDC product-specific schema.org data. Our studies reveal six best practices, each of which is justified by experimental data and analysis.

preprint2016arXiv

An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

Entity Resolution, also called record linkage or deduplication, refers to the process of identifying and merging duplicate versions of the same entity into a unified representation. The standard practice is to use a Rule based or Machine Learning based model that compares entity pairs and assigns a score to represent the pairs' Match/Non-Match status. However, performing an exhaustive pair-wise comparison on all pairs of records leads to quadratic matcher complexity and hence a Blocking step is performed before the Matching to group similar entities into smaller blocks that the matcher can then examine exhaustively. Several blocking schemes have been developed to efficiently and effectively block the input dataset into manageable groups. At CareerBuilder (CB), we perform deduplication on massive datasets of people profiles collected from disparate sources with varying informational content. We observed that, employing a single blocking technique did not cover the base for all possible scenarios due to the multi-faceted nature of our data sources. In this paper, we describe our ensemble approach to blocking that combines two different blocking techniques to leverage their respective strengths.

preprint2016arXiv

Experience: Type alignment on DBpedia and Freebase

Linked Open Data exhibits growth in both volume and variety of published data. Due to this variety, instances of many different types (e.g. Person) can be found in published datasets. Type alignment is the problem of automatically matching types (in a possibly many-many fashion) between two such datasets. Type alignment is an important preprocessing step in instance matching. Instance matching concerns identifying pairs of instances referring to the same underlying entity. By performing type alignment a priori, only instances conforming to aligned types are processed together, leading to significant savings. This article describes a type alignment experience with two large-scale cross-domain RDF knowledge graphs, DBpedia and Freebase, that contain hundreds, or even thousands, of unique types. Specifically, we present a MapReduce-based type alignment algorithm and show that there are at least three reasonable ways of evaluating type alignment within the larger context of instance matching. We comment on the consistency of those results, and note some general observations for researchers evaluating similar algorithms on cross-domain graphs.

preprint2016arXiv

Self-contained NoSQL Resources for Cross-Domain RDF

Cross-domain knowledge bases such as DBpedia, Freebase and YAGO have emerged as encyclopedic hubs in the Web of Linked Data. Despite enabling several practical applications in the Semantic Web, the large-scale, schema-free nature of such graphs often precludes research groups from employing them widely as evaluation test cases for entity resolution and instance-based ontology alignment applications. Although the ground-truth linkages between the three knowledge bases above are available, they are not amenable to resource-limited applications. One reason is that the ground-truth files are not self-contained, meaning that a researcher must usually perform a series of expensive joins (typically in MapReduce) to obtain usable information sets. In this paper, we upload several publicly licensed data resources to the public cloud and use simple Hadoop clusters to compile, and make accessible, three cross-domain self-contained test cases involving linked instances from DBpedia, Freebase and YAGO. Self-containment is enabled by virtue of a simple NoSQL JSON-like serialization format. Potential applications for these resources, particularly related to testing transfer learning research hypotheses, are also briefly described.

preprint2015arXiv

A DNF Blocking Scheme Learner for Heterogeneous Datasets

Entity Resolution concerns identifying co-referent entity pairs across datasets. A typical workflow comprises two steps. In the first step, a blocking method uses a one-many function called a blocking scheme to map entities to blocks. In the second step, entities sharing a block are paired and compared. Current DNF blocking scheme learners (DNF-BSLs) apply only to structurally homogeneous tables. We present an unsupervised algorithmic pipeline for learning DNF blocking schemes on RDF graph datasets, as well as structurally heterogeneous tables. Previous DNF-BSLs are admitted as special cases. We evaluate the pipeline on six real-world dataset pairs. Unsupervised results are shown to be competitive with supervised and semi-supervised baselines. To the best of our knowledge, this is the first unsupervised DNF-BSL that admits RDF graphs and structurally heterogeneous tables as inputs.

preprint2015arXiv

On the Complexity of Sorted Neighborhood

Record linkage concerns identifying semantically equivalent records in databases. Blocking methods are employed to avoid the cost of full pairwise similarity comparisons on $n$ records. In a seminal work, Hernandez and Stolfo proposed the Sorted Neighborhood blocking method. Several empirical variants have been proposed in recent years. In this paper, we investigate the complexity of the Sorted Neighborhood procedure on which the variants are built. We show that achieving maximum performance on the Sorted Neighborhood procedure entails solving a sub-problem, which is shown to be NP-complete by reducing from the Travelling Salesman Problem. We also show that the sub-problem can occur in the traditional blocking method. Finally, we draw on recent developments concerning approximate Travelling Salesman solutions to define and analyze three approximation algorithms.

Mayank Kejriwal

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

A Compound AI Agent for Conversational Grant Discovery

A Theoretically Grounded Benchmark for Evaluating Machine Commonsense

Can Scale-free Network Growth with Triad Formation Capture Simplicial Complex Distributions in Real Communication Networks?

Decision Making in Monopoly using a Hybrid Deep Reinforcement Learning Approach

Robust Quantification of Gender Disparity in Pre-Modern English Literature using Natural Language Processing

A Data-Driven Study of Commonsense Knowledge using the ConceptNet Knowledge Base

An Experimental Study of The Effects of Position Bias on Emotion CauseExtraction

On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices

An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

Experience: Type alignment on DBpedia and Freebase

Self-contained NoSQL Resources for Cross-Domain RDF

A DNF Blocking Scheme Learner for Heterogeneous Datasets

On the Complexity of Sorted Neighborhood