Source author record

Dwaipayan Roy

Dwaipayan Roy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Information Retrieval, Digital Libraries, Information Theory, math.IT

Catalog footprint

What is connected

6 works

4 topics

4 close collaborators


preprint | 2026 | arXiv

Cultural Analytics for Good: Building Inclusive Evaluation Frameworks for Historical IR

This work bridges the fields of information retrieval and cultural analytics to support equitable access to historical knowledge. Using the British Library BL19 digital collection (more than 35,000 works from 1700-1899), we construct a benchmark for studying changes in language, terminology and retrieval in 19th-century fiction and non-fiction. Our approach combines expert-driven query design, paragraph-level relevance annotation, and Large Language Model (LLM) assistance to create a scalable evaluation framework grounded in human expertise. We focus on knowledge transfer from fiction to non-fiction, investigating how narrative understanding and semantic richness in fiction can improve retrieval for scholarly and factual materials. This interdisciplinary framework not only improves retrieval accuracy but also fosters interpretability, transparency, and cultural inclusivity in digital archives. Our work provides both practical evaluation resources and a methodological paradigm for developing retrieval systems that support richer, historically aware engagement with digital archives, ultimately working towards more emancipatory knowledge infrastructures.

preprint | 2026 | arXiv

MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval

Users increasingly expect modern search systems to offer a unified interface that seamlessly retrieves information from diverse data sources and formats. However, current information retrieval (IR) evaluation benchmarks have not kept pace with this development, primarily due to the lack of test collections that represent the diversity of contemporary search domains. We address this critical gap with MIRA, a novel benchmark based on a large-scale social science search platform. MIRA is designed for category-aware ranking across heterogeneous categories - Publications, Research Data, Variables, and Instruments & Tools - within a single, unified evaluation framework. The proposed collection is distinctive in several ways: (1) it is built upon real user queries, providing a more realistic basis for evaluation; (2) it covers scholarly items from four distinct categories, enabling multi-faceted evaluation; and (3) it leverages a Large Language Model to generate topic descriptions and narratives, as well as for relevance assessment with respect to these topics, substantially reducing the labor and cost of test collection generation. We release this resource to benefit the community by providing a foundational testbed for research on multi-faceted, category-aware, integrated, or cross-category information retrieval.

preprint | 2022 | arXiv

A Birds Eye View on Knowledge Graph Embeddings, Software Libraries, Applications and Challenges

In recent years, Knowledge Graph (KG) development has attracted significant research attention because of its applications in web search, relation prediction, natural language processing, information retrieval, and question answering, to name a few. However, KGs are often incomplete, and Knowledge Graph Completion (KGC) has therefore emerged as a research sub-domain that aims to automatically find the missing connections in a KG. Numerous strategies have been proposed for KGC, based on different representation techniques that embed triples into a low-dimensional vector space. Given the difficulties of KGC, researchers around the world are working to understand the characteristics of the problem. This study provides an overview of knowledge bases together with the associated challenges and their impacts. We discuss existing KGC approaches, including state-of-the-art Knowledge Graph Embeddings (KGE), not only on static graphs but also for recent trends such as multimodal, temporal, and uncertain knowledge graphs. In addition, reinforcement learning techniques that model complex queries as a link prediction problem are reviewed. Finally, we explore popular software packages for model training and examine open research challenges that can guide future research.
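
To make the idea of embedding triples into a low-dimensional vector space more concrete, here is a minimal sketch of translational scoring in the style of TransE, one well-known KGE model. The abstract does not name a specific model, so the choice of TransE, the function names `transe_score` and `rank_tails`, and the toy two-dimensional vectors are all assumptions made for illustration only.

```python
import math

def transe_score(head, relation, tail):
    """TransE-style plausibility: negative Euclidean distance of head + relation from tail.

    Scores closer to zero mean the triple (head, relation, tail) is more plausible.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_tails(head, relation, entity_vecs):
    """Rank every candidate tail entity for (head, relation, ?), best first."""
    scored = [(name, transe_score(head, relation, vec)) for name, vec in entity_vecs.items()]
    return sorted(scored, key=lambda pair: -pair[1])

# Hypothetical toy embeddings; real models learn these from the observed triples.
entities = {
    "paris": [1.0, 1.0],
    "berlin": [3.0, 0.0],
    "tokyo": [-2.0, 2.0],
}
france = [0.0, 0.0]
has_capital = [1.0, 1.0]

ranking = rank_tails(france, has_capital, entities)
```

Completing a missing link then amounts to taking the top-ranked candidate, here `ranking[0][0]`, as the predicted tail entity.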

preprint | 2022 | arXiv

Studying Retrievability of Publications and Datasets in an Integrated Retrieval System

In this paper, we investigate the retrievability of datasets and publications in a real-life Digital Library (DL). The measure of retrievability was originally developed to quantify the influence that a retrieval system has on access to information. Retrievability can also enable DL engineers to evaluate their search engine and determine the ease with which the content in the collection can be accessed. Following this methodology, we propose a system-oriented approach for studying dataset and publication retrieval. A distinctive feature of this paper is its focus on measuring the accessibility biases of various types of DL items and its inclusion of a metric of usefulness. Among other metrics, we use Lorenz curves and Gini coefficients to visualize the differences between the two retrievable document types (specifically datasets and publications). Empirical results reported in the paper show a distinguishable diversity in the retrievability scores among the documents of different types.
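
As a rough illustration of the Lorenz curve and Gini coefficient mentioned in the abstract above, the sketch below computes both from a list of per-document retrievability scores. The function names `lorenz_points` and `gini` and the sample scores are hypothetical; the paper's own scoring setup is not reproduced here.

```python
def lorenz_points(scores):
    """Return (share of documents, share of total retrievability) points, lowest scores first."""
    ordered = sorted(scores)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    if n == 0 or total == 0:
        return points
    running = 0.0
    for i, score in enumerate(ordered, start=1):
        running += score
        points.append((i / n, running / total))
    return points

def gini(scores):
    """Gini coefficient: 0 means all documents are equally retrievable; values near 1 mean strong bias."""
    ordered = sorted(scores)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form: sum((2i - n - 1) * x_i) / (n * sum(x)).
    weighted = sum((2 * i - n - 1) * score for i, score in enumerate(ordered, start=1))
    return weighted / (n * total)

# Hypothetical retrievability scores for two document types.
dataset_scores = [0, 0, 0, 4]
publication_scores = [1, 1, 1, 1]
```

Comparing `gini(dataset_scores)` with `gini(publication_scores)` gives a single-number summary of how unevenly each document type is exposed by the retrieval system.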

preprint | 2016 | arXiv

Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval

A major difficulty in applying word vector embeddings in IR is devising an effective and efficient strategy for obtaining representations of compound units of text, such as whole documents (as opposed to atomic words), for the purpose of indexing and scoring documents. Instead of striving for a suitable method for obtaining a single vector representation of a large document, we aim to develop a similarity metric that makes use of the similarities between the individual embedded word vectors in a document and a query. More specifically, we represent a document and a query as sets of word vectors, and use a standard notion of similarity between these sets, computed as a function of the similarities between each constituent word pair from these sets. We then use this similarity measure in combination with standard IR-based similarities for document ranking. The results of our initial experimental investigations show that our proposed method improves MAP by up to $5.77\%$, in comparison to standard text-based language model similarity, on the TREC ad-hoc dataset.
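
The set-based similarity described above can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes the set similarity is the plain average of pairwise cosine similarities and that it is mixed with a text-based score by linear interpolation. The names `set_similarity` and `combined_score` and the weight `alpha` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def set_similarity(query_vecs, doc_vecs):
    """Average cosine similarity over every (query word, document word) vector pair."""
    if not query_vecs or not doc_vecs:
        return 0.0
    total = sum(cosine(q, d) for q in query_vecs for d in doc_vecs)
    return total / (len(query_vecs) * len(doc_vecs))

def combined_score(text_score, query_vecs, doc_vecs, alpha=0.5):
    """Linear mix of a text-based retrieval score and the embedding set similarity (assumed form)."""
    return alpha * text_score + (1 - alpha) * set_similarity(query_vecs, doc_vecs)

# Hypothetical two-dimensional word vectors for a one-word query and a two-word document.
query = [[1.0, 0.0]]
document = [[1.0, 0.0], [0.0, 1.0]]
```

Scoring all pairs costs time proportional to the product of the two set sizes, which is one reason a practical system might first cluster or prune the document's word vectors.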

preprint | 2016 | arXiv

Using Word Embeddings for Automatic Query Expansion

In this paper, a framework for Automatic Query Expansion (AQE) is proposed using the distributed neural language model word2vec. Using semantic and contextual relations in a distributed and unsupervised framework, word2vec learns a low-dimensional embedding for each vocabulary entry. Using such a framework, we devise a query expansion technique in which terms related to a query are obtained by a K-nearest neighbor approach. We explore the performance of the AQE methods, with and without feedback query expansion, and a variant of simple K-nearest neighbor in the proposed framework. Experiments on standard TREC ad-hoc data (Disk 4, 5 with query sets 301-450, 601-700) and web data (WT10G data with query set 451-550) show significant improvement over standard term-overlap-based retrieval methods. However, the proposed method fails to achieve performance comparable to statistical co-occurrence-based feedback methods such as RM3. We have also found that the word2vec-based query expansion methods perform similarly with and without any feedback information.
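
The K-nearest neighbor expansion step can be sketched as below. This is an illustrative approximation under stated assumptions: the toy embedding table stands in for trained word2vec vectors, each candidate is scored by its best cosine similarity to any query term, and the function name `expand_query` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_query(query_terms, embeddings, k=2):
    """Append the k vocabulary terms nearest (by cosine) to any original query term."""
    best = {}
    for term in query_terms:
        if term not in embeddings:
            continue  # out-of-vocabulary query terms contribute no neighbors
        for word, vec in embeddings.items():
            if word in query_terms:
                continue
            sim = cosine(embeddings[term], vec)
            best[word] = max(sim, best.get(word, -1.0))
    # Sort by descending similarity, breaking ties alphabetically for determinism.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return list(query_terms) + ranked[:k]

# Hypothetical toy embeddings standing in for trained word2vec vectors.
toy_embeddings = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "vehicle": [0.8, 0.3],
    "banana": [0.0, 1.0],
}

expanded = expand_query(["car"], toy_embeddings, k=2)
```

In a feedback variant, the candidate vocabulary would be restricted to terms from the top-ranked documents rather than the whole embedding table.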
