Source author record

Debojyoti Dutta

Debojyoti Dutta appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Artificial Intelligence; Cryptography and Security; cs.CY; Databases; Distributed, Parallel, and Cluster Computing; Information Retrieval; Methodology; Networking and Internet Architecture; Performance; Social and Information Networks

Catalog footprint

What is connected

7 works

11 topics

4 close collaborators

preprint · 2026 · arXiv

Action Shapley: A Training Data Selection Metric for World Model in Reinforcement Learning

Numerous offline and model-based reinforcement learning systems incorporate world models to emulate the inherent environments. A world model is particularly important in scenarios where direct interaction with the real environment is costly, dangerous, or impractical. The efficacy and interpretability of such world models are notably contingent upon the quality of the underlying training data. In this context, we introduce Action Shapley as an agnostic metric for the judicious and unbiased selection of training data. To facilitate the computation of Action Shapley, we present a randomized dynamic algorithm specifically designed to mitigate the exponential complexity inherent in traditional Shapley value computations. Through empirical validation across five data-constrained real-world case studies, the algorithm demonstrates a computational efficiency improvement exceeding 80% in comparison to conventional exponential time computations. Furthermore, our Action Shapley-based training data selection policy consistently outperforms ad-hoc training data selection.
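As a rough illustration of why exact Shapley values are expensive and how random sampling can reduce the cost, the sketch below compares an exact computation over all orderings with a generic permutation-sampling estimate. This is not the randomized dynamic algorithm from the abstract; the function names, the toy value function, and its weights are all hypothetical and chosen only for illustration.

```python
import random
from itertools import permutations


def exact_shapley(players, value):
    """Average each player's marginal contribution over all orderings.

    The number of orderings grows factorially with the number of players,
    which is the cost that sampling-based methods try to avoid.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def sampled_shapley(players, value, num_samples, seed=0):
    """Estimate Shapley values from a fixed number of random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / num_samples for p, t in totals.items()}


# Hypothetical stand-in for "model quality when trained on this subset".
WEIGHTS = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_value(coalition):
    """Additive score plus a bonus when 'a' and 'b' are used together."""
    base = sum(WEIGHTS[p] for p in coalition)
    bonus = 1.0 if {"a", "b"} <= coalition else 0.0
    return base + bonus


if __name__ == "__main__":
    print(exact_shapley(["a", "b", "c"], toy_value))
    print(sampled_shapley(["a", "b", "c"], toy_value, num_samples=200))
```

In this toy setting the bonus is split evenly between "a" and "b", while "c" always contributes its own weight. In a real data-selection setting each call to the value function would typically mean training and scoring a model, so the number of evaluations dominates the cost.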

preprint · 2022 · arXiv

Data-Driven Evaluation of Training Action Space for Reinforcement Learning

Training action space selection for reinforcement learning (RL) is conflict-prone due to complex state-action relationships. To address this challenge, this paper proposes a Shapley-inspired methodology for training action space categorization and ranking. To reduce exponential-time Shapley computations, the methodology includes a Monte Carlo simulation to avoid unnecessary explorations. The effectiveness of the methodology is illustrated using a cloud infrastructure resource tuning case study. It reduces the search space by 80% and categorizes the training action sets into dispensable and indispensable groups. Additionally, it ranks different training actions to facilitate high-performance yet cost-efficient RL model design. The proposed data-driven methodology is extensible to different domains, use cases, and reinforcement learning algorithms.

preprint · 2020 · arXiv

MLPerf Training Benchmark

Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.

preprint · 2014 · arXiv

Detecting fraudulent activity in a cloud using privacy-friendly data aggregates

More users and companies make use of cloud services every day. They all expect perfect performance and expect any issue to remain transparent to them. This expectation is very challenging to meet. A user's activities in our cloud can affect the overall performance of our servers, having an impact on other resources. We can consider these kinds of activities as fraudulent. They can be either illegal activities, such as launching a DDoS attack, or just activities which are undesired by the cloud provider, such as Bitcoin mining, which uses substantial power, reduces the life of the hardware and can possibly slow down other users' activities. This article discusses a method to detect such activities by using non-intrusive, privacy-friendly data: billing data. We use OpenStack as an example with data provided by Telemetry, the component in charge of measuring resource usage for billing purposes. We present results showing the efficiency of this method, discuss its advantages and disadvantages, and suggest ways to improve it.

preprint · 2012 · arXiv

Optimal bandwidth-aware VM allocation for Infrastructure-as-a-Service

Infrastructure-as-a-Service (IaaS) providers need to offer richer services to be competitive while optimizing their resource usage to keep costs down. Richer service offerings include new resource request models involving bandwidth guarantees between virtual machines (VMs). Thus we consider the following problem: given a VM request graph (where nodes are VMs and edges represent virtual network connectivity between the VMs) and a real data center topology, find an allocation of VMs to servers that satisfies the bandwidth guarantees for every virtual network edge (which maps to a path in the physical network) and minimizes congestion of the network. Previous work has shown that for arbitrary networks and requests, finding the optimal embedding satisfying bandwidth requests is $\mathcal{NP}$-hard. However, in most data center architectures, the routing protocols employed are based on a spanning tree of the physical network. In this paper, we prove that the problem remains $\mathcal{NP}$-hard even when the physical network topology is restricted to be a tree, and the request graph topology is also restricted. We also present a dynamic programming algorithm for computing the optimal embedding in a tree network which runs in time $O(3^kn)$, where $n$ is the number of nodes in the physical topology and $k$ is the size of the request graph, which is well suited for practical requests which have small $k$. Such requests form a large class of web-service and enterprise workloads. Also, if we restrict the request topology to a clique (all VMs connected to a virtual switch with uniform bandwidth requirements), we show that the dynamic programming algorithm can be modified to output the minimum congestion embedding in time $O(k^2n)$.
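To make the objective concrete, the sketch below scores a given VM allocation on a tree-shaped physical network. It is not the dynamic programming algorithm from the abstract; it only evaluates one candidate embedding. The assumptions are loud ones: congestion is taken to be the maximum ratio of routed bandwidth to link capacity, several VMs may share a server, and the topology, capacities, and all names are hypothetical.

```python
def path_edges(parent, a, b):
    """Return the tree edges on the unique path between nodes a and b.

    Each edge is named by its child endpoint; parent[root] must be None.
    """
    ancestors_of_a = set()
    node = a
    while node is not None:
        ancestors_of_a.add(node)
        node = parent[node]

    # Climb from b until reaching a node that is also an ancestor of a.
    edges_from_b = []
    node = b
    while node not in ancestors_of_a:
        edges_from_b.append(node)
        node = parent[node]
    meeting_point = node

    # Climb from a up to that meeting point.
    edges_from_a = []
    node = a
    while node != meeting_point:
        edges_from_a.append(node)
        node = parent[node]
    return edges_from_a + edges_from_b


def congestion(parent, capacity, allocation, requests):
    """Maximum load/capacity ratio over all tree edges for one allocation.

    requests is a list of (vm_u, vm_v, bandwidth) virtual edges.
    """
    load = {edge: 0.0 for edge in capacity}
    for vm_u, vm_v, bandwidth in requests:
        for edge in path_edges(parent, allocation[vm_u], allocation[vm_v]):
            load[edge] += bandwidth
    return max(load[edge] / capacity[edge] for edge in capacity)


# Hypothetical two-rack topology: core -> tor1 -> {s1, s2}, core -> tor2 -> {s3}.
PARENT = {"core": None, "tor1": "core", "tor2": "core",
          "s1": "tor1", "s2": "tor1", "s3": "tor2"}
CAPACITY = {"tor1": 10.0, "tor2": 10.0, "s1": 4.0, "s2": 4.0, "s3": 4.0}
REQUESTS = [("vm1", "vm2", 2.0), ("vm2", "vm3", 1.0)]

SPREAD = {"vm1": "s1", "vm2": "s2", "vm3": "s3"}
PACKED = {"vm1": "s1", "vm2": "s1", "vm3": "s2"}

if __name__ == "__main__":
    print(congestion(PARENT, CAPACITY, SPREAD, REQUESTS))
    print(congestion(PARENT, CAPACITY, PACKED, REQUESTS))
```

Under these assumptions the packed allocation yields lower congestion than the spread one, because the heaviest virtual edge never leaves a server. An optimizer would search over allocations for the one minimizing this score.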

preprint · 2011 · arXiv

Widescope - A social platform for serious conversations on the Web

There are several web platforms that people use to interact and exchange ideas, such as social networks like Facebook, Twitter, and Google+; Q&A sites like Quora and Yahoo! Answers; and myriad independent fora. However, there is a scarcity of platforms that facilitate discussion of complex subjects where people with divergent views can easily rationalize their points of view using a shared knowledge base, and leverage it towards shared objectives, e.g. to arrive at a mutually acceptable compromise. In this paper, as a first step, we present Widescope, a novel collaborative web platform for catalyzing shared understanding of the US Federal and State budget debates in order to help users reach data-driven consensus about the complex issues involved. It aggregates disparate sources of financial data from different budgets (i.e. from past, present, and proposed) and presents a unified interface using interactive visualizations. It leverages distributed collaboration to encourage exploration of ideas and debate. Users can propose budgets ab initio, support existing proposals, compare different budgets, and collaborate with others in real time. We hypothesize that such a platform can be useful in bringing people's thoughts and opinions closer. Toward this, we present preliminary evidence from a simple pilot experiment, using triadic voting (which we also formally analyze to show that it is better than hot-or-not voting), that 5 out of 6 groups of users with divergent views (conservatives vs liberals) come to a consensus while aiming to halve the deficit using Widescope. We believe that tools like Widescope could have a positive impact on other complex, data-driven social issues.

preprint · 2010 · arXiv

Similarity Search and Locality Sensitive Hashing using TCAMs

Similarity search methods are widely used as kernels in various machine learning applications. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors is known to be a hard problem, and in high dimensions the best known solutions offer little improvement over a linear scan. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in $n$ with exponent greater than $1$, and query time sublinear, but still polynomial in $n$, where $n$ is the size of the database. In this work we present a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near linear space and has $O(1)$ query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near linear in the size of the database. TCAMs are high performance associative memories widely used in networking applications such as access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents $0$, $1$ or $*$. The $*$ is a wild card representing either a $0$ or a $1$. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH), wherein we hash database entries represented by vectors in the Euclidean space into $\{0,1,*\}$. By using the added functionality of a TLSH scheme with respect to the $*$ character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We believe that this work can open new avenues in very high speed data mining.
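The ternary matching rule described above can be sketched in a few lines of software. This only emulates the lookup semantics of a TCAM (a real device compares all stored entries in parallel, which is where the single-access query time comes from); it does not implement the TLSH hashing scheme itself. The function names and the example table are hypothetical, and returning the first matching entry is an assumption modeled on priority-ordered TCAM lookups.

```python
def ternary_match(query, pattern):
    """Return True if the query bits agree with every non-wildcard position.

    query is a string over {'0', '1'}; pattern is a string over {'0', '1', '*'},
    where '*' matches either bit.
    """
    if len(query) != len(pattern):
        raise ValueError("query and pattern must have the same length")
    return all(p == "*" or p == q for q, p in zip(query, pattern))


def tcam_lookup(query, table):
    """Return the index of the first stored pattern matching the query.

    This loop is a sequential software stand-in for a parallel hardware
    comparison; it returns None when no entry matches.
    """
    for index, pattern in enumerate(table):
        if ternary_match(query, pattern):
            return index
    return None


# Hypothetical table of ternary codes, e.g. hashes of database entries.
TABLE = ["10*1", "0**0", "11**"]

if __name__ == "__main__":
    for q in ["1011", "0110", "1100", "0001"]:
        print(q, tcam_lookup(q, TABLE))
```

The wildcard positions are what let one stored code cover several nearby binary queries, which is the property a ternary hashing scheme can exploit.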