Source author record

K. V. Rashmi

K. V. Rashmi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Information Theory; math.IT; Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture; Cryptography and Security; Machine Learning; Artificial Intelligence; Computation and Language

Catalog footprint

What is connected

19 works

8 topics

4 close collaborators

preprint, 2026, arXiv

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\times$ over the best baseline, and delivers up to 2.39$\times$ higher goodput under scarce resource availability.

preprint, 2022, arXiv

Bandwidth Cost of Code Conversions in the Split Regime

Distributed storage systems must store large amounts of data over long periods of time. To avoid data loss due to device failures, an $[n,k]$ erasure code is used to encode $k$ data symbols into a codeword of $n$ symbols that are stored across different devices. However, device failure rates change throughout the life of the data, and tuning $n$ and $k$ according to these changes has been shown to save significant storage space. Code conversion is the process of converting multiple codewords of an initial $[n^I,k^I]$ code into codewords of a final $[n^F,k^F]$ code that decode to the same set of data symbols. In this paper, we study conversion bandwidth, defined as the total amount of data transferred between nodes during conversion. In particular, we consider the case where the initial and final codes are MDS and a single initial codeword is split into several final codewords ($k^I=\lambda^F k^F$ for integer $\lambda^F \geq 2$), called the split regime. We derive lower bounds on the conversion bandwidth in the split regime and propose constructions that significantly reduce conversion bandwidth and are optimal for certain parameters.

preprint, 2022, arXiv

Learning-Augmented Streaming Codes are Approximately Optimal for Variable-Size Messages

Real-time streaming communication requires a high quality of service despite contending with packet loss. Streaming codes are a class of codes best suited for this setting. A key challenge for streaming codes is that they operate in an "online" setting in which the amount of data to be transmitted varies over time and is not known in advance. Mitigating the adverse effects of variability requires spreading the data that arrives at a time slot over multiple future packets, and the optimal strategy for spreading depends on the arrival pattern. Algebraic coding techniques alone are therefore insufficient for designing rate-optimal codes. We combine algebraic coding techniques with a learning-augmented algorithm for spreading to design the first approximately rate-optimal streaming codes for a range of parameter regimes that are important for practical applications.

preprint, 2020, arXiv

A locality-based approach for coded computation

Modern distributed computation infrastructures are often plagued by unavailabilities such as failing or slow servers. These unavailabilities adversely affect the tail latency of computation in distributed infrastructures. The simple solution of replicating computation entails significant resource overhead. Coded computation has emerged as a resource-efficient alternative, wherein multiple units of data are encoded to create parity units and the function to be computed is applied to each of these units on distinct servers. A decoder can use the available function outputs to decode the unavailable ones. Existing coded computation approaches are resource efficient only for simple variants of linear functions such as multilinear functions, with even the class of low-degree polynomials requiring the same multiplicative overhead as replication for practically relevant straggler tolerance. In this paper, we present a new approach to model coded computation via the lens of locality of codes. We introduce a generalized notion of locality, denoted computational locality, building upon the locality of an appropriately defined code. We show that computational locality is equivalent to the required number of workers for coded computation and leverage results from the well-studied locality of codes to design coded computation schemes. We show that recent results on coded computation of multivariate polynomials can be derived using local recovering schemes for Reed-Muller codes. We present coded computation schemes for multivariate polynomials that adaptively exploit locality properties of input data, a technique that is inadmissible under existing frameworks. These schemes require fewer workers than the lower bound under existing coded computation frameworks, showing that the existing multiplicative overhead on the number of servers is not fundamental for coded computation of nonlinear functions.

preprint, 2020, arXiv

Access-optimal Linear MDS Convertible Codes for All Parameters

In large-scale distributed storage systems, erasure codes are used to achieve fault tolerance in the face of node failures. Tuning code parameters to observed failure rates has been shown to significantly reduce storage cost. Such tuning of redundancy requires "code conversion", i.e., a change in code dimension and length on already encoded data. Convertible codes are a new class of codes designed to perform such conversions efficiently. The access cost of conversion is the number of nodes accessed during conversion. Existing literature has characterized the access cost of conversion of linear MDS convertible codes only for a specific and small subset of parameters. In this paper, we present lower bounds on the access cost of conversion of linear MDS codes for all valid parameters. Furthermore, we show that these lower bounds are tight by presenting an explicit construction for access-optimal linear MDS convertible codes for all valid parameters. En route, we show that one of the degrees of freedom in the design of convertible codes, which was inconsequential in the previously studied parameter regimes, turns out to be crucial when going beyond these regimes and adds to the challenge in the analysis and code construction.

preprint, 2020, arXiv

Bandwidth Cost of Code Conversions in Distributed Storage: Fundamental Limits and Optimal Constructions

Erasure codes have become an integral part of distributed storage systems as a tool for providing data reliability and durability under the constant threat of device failures. In such systems, an $[n, k]$ code over a finite field $\mathbb{F}_q$ encodes $k$ message symbols into $n$ codeword symbols from $\mathbb{F}_q$ which are then stored on $n$ different nodes in the system. Recent work has shown that significant savings in storage space can be obtained by tuning $n$ and $k$ to variations in device failure rates. Such a tuning necessitates code conversion: the process of converting already encoded data under an initial $[n^I, k^I]$ code to its equivalent under a final $[n^F, k^F]$ code. The default approach to conversion is to reencode data, which places significant burden on system resources. Convertible codes are a recently proposed class of codes for enabling resource-efficient conversions. Existing work on convertible codes has focused on minimizing access cost, i.e., the number of code symbols accessed during conversion. Bandwidth, which corresponds to the amount of data read and transferred, is another important resource to optimize. In this paper, we initiate the study on the fundamental limits on bandwidth used during code conversion and present constructions for bandwidth-optimal convertible codes. First, we model the code conversion problem using network information flow graphs with variable capacity edges. Second, focusing on MDS codes and an important parameter regime called the merge regime, we derive tight lower bounds on the bandwidth cost of conversion. The derived bounds show that bandwidth cost can be significantly reduced even in regimes where access cost cannot be reduced as compared to the default approach. Third, we present a new construction for MDS convertible codes which matches the proposed lower bound and is thus bandwidth-optimal during conversion.

preprint, 2015, arXiv

DART: Dropouts meet Multiple Additive Regression Trees

Multiple Additive Regression Trees (MART), an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and it is widely used in practice. However, it suffers from an issue which we call over-specialization, wherein trees added at later iterations tend to impact the prediction of only a few instances, and make negligible contribution towards the remaining instances. This negatively affects the performance of the model on unseen data, and also makes the model over-sensitive to the contributions of the few, initially added trees. We show that the commonly used tool to address this issue, that of shrinkage, alleviates the problem only to a certain extent and the fundamental issue of over-specialization still remains. In this work, we explore a different approach to address the problem, that of employing dropouts, a tool that has been recently proposed in the context of learning deep neural networks. We propose a novel way of employing dropouts in MART, resulting in the DART algorithm. We evaluate DART on ranking, regression and classification tasks, using large scale, publicly available datasets, and show that DART outperforms MART in each of the tasks, with a significant margin. We also show that DART overcomes the issue of over-specialization to a considerable extent.

preprint, 2015, arXiv

Information-theoretically Secure Erasure Codes for Distributed Storage

Repair operations in distributed storage systems potentially expose the data to malicious acts of passive eavesdroppers or active adversaries, which can be detrimental to the security of the system. This paper presents erasure codes and repair algorithms that ensure security of the data in the presence of passive eavesdroppers and active adversaries, while maintaining high availability, reliability and efficiency in the system. Our codes are optimal in that they meet previously proposed lower bounds on the storage, network-bandwidth, and reliability requirements for a wide range of system parameters. Our results thus establish the capacity of such systems. Our codes for security from active adversaries provide an additional appealing feature of 'on-demand security' where the desired level of security can be chosen separately for each instance of repair, and our algorithms remain optimal simultaneously for all possible levels. The paper also provides necessary and sufficient conditions governing the transformation of any (non-secure) code into one providing on-demand security.

preprint, 2015, arXiv

Optimal Systematic Distributed Storage Codes with Fast Encoding

Erasure codes are being increasingly used in distributed-storage systems in place of data-replication, since they provide the same level of reliability with much lower storage overhead. We consider the problem of constructing explicit erasure codes for distributed storage with the following desirable properties motivated by practice: (i) Maximum-Distance-Separable (MDS): to provide maximal reliability at minimum storage overhead, (ii) Optimal repair-bandwidth: to minimize the amount of data that must be transferred to repair a failed node from the remaining ones, (iii) Flexibility in repair: to allow maximal flexibility in selecting the subset of nodes to use for repair, which includes not requiring that all surviving nodes be used for repair, (iv) Systematic form: to ensure that the original data exists in uncoded form, and (v) Fast encoding: to minimize the cost of generating encoded data (enabled by a sparse generator matrix). This paper presents the first explicit code construction which theoretically guarantees all five desired properties simultaneously. Our construction builds on a powerful class of codes called Product-Matrix (PM) codes. PM codes satisfy properties (i)-(iii), and either (iv) or (v), but not both simultaneously. Indeed, native PM codes have inherent structure that leads to sparsity, but this structure is destroyed when the codes are made systematic. We first present an analytical framework for understanding the interaction between the design of PM codes and the systematic property. Using this framework, we provide an explicit code construction that simultaneously achieves all the above desired properties. We also present general ways of transforming existing storage and repair optimal codes to enable fast encoding through sparsity. In practice, such sparse codes result in encoding speedup by a factor of about 4 for typical parameters.

preprint, 2014, arXiv

Distributed Secret Dissemination Across a Network

Shamir's (n, k) threshold secret sharing is an important component of several cryptographic protocols, such as those for secure multiparty-computation and key management. These protocols typically assume the presence of direct communication links from the dealer to all participants, in which case the dealer can directly pass the shares of the secret to each participant. In this paper, we consider the problem of secret sharing when the dealer does not have direct communication links to all the participants, and instead, the dealer and the participants form a general network. Existing methods are based on secure message transmissions from the dealer to each participant requiring considerable coordination in the network. In this paper, we present a distributed algorithm for disseminating shares over a network, which we call the SNEAK algorithm, requiring each node to know only the identities of its one-hop neighbours. While SNEAK imposes a stronger condition on the network by requiring the dealer to be what we call k-propagating rather than k-connected as required by the existing solutions, we show that in addition to being distributed, SNEAK achieves significant reduction in the communication cost and the amount of randomness required.
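For readers unfamiliar with the primitive this abstract builds on, the following is a minimal sketch of classical Shamir (n, k) threshold sharing with a dealer that hands shares out directly. It is not the SNEAK algorithm, whose details are not given in the abstract. The field size `PRIME` and the function names `make_shares` and `recover_secret` are illustrative assumptions.

```python
import secrets

# Assumption: a Mersenne prime chosen only for illustration as the field size.
PRIME = 2**61 - 1


def make_shares(secret, n, k, prime=PRIME):
    """Split `secret` into n shares so that any k of them recover it."""
    if not (1 <= k <= n < prime):
        raise ValueError("need 1 <= k <= n < prime")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret % prime] + [secrets.randbelow(prime) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Lagrange-interpolate the polynomial at x = 0 from k (or more) shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


shares = make_shares(123456789, n=5, k=3)
recovered = recover_secret(shares[1:4])  # any 3 of the 5 shares suffice
```

In this direct-link setting the dealer computes every share itself; the abstract's contribution concerns the case where such links are missing and shares must travel through a general network.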

preprint, 2014, arXiv

Fundamental Limits on Communication for Oblivious Updates in Storage Networks

In distributed storage systems, storage nodes intermittently go offline for numerous reasons. On coming back online, nodes need to update their contents to reflect any modifications to the data in the interim. In this paper, we consider a setting where no information regarding modified data needs to be logged in the system. In such a setting, a 'stale' node needs to update its contents by downloading data from already updated nodes, while neither the stale node nor the updated nodes have any knowledge as to which data symbols are modified and what their value is. We investigate the fundamental limits on the amount of communication necessary for such an "oblivious" update process. We first present a generic lower bound on the amount of communication that is necessary under any storage code with a linear encoding (while allowing non-linear update protocols). This lower bound is derived under a set of extremely weak conditions, giving all updated nodes access to the entire modified data and the stale node access to the entire stale data as side information. We then present codes and update algorithms that are optimal in that they meet this lower bound. Next, we present a lower bound for an important subclass of codes, that of linear Maximum-Distance-Separable (MDS) codes. We then present an MDS code construction and an associated update algorithm that meets this lower bound. These results thus establish the capacity of oblivious updates in terms of the communication requirements under these settings.

preprint, 2013, arXiv

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes

We present a new 'piggybacking' framework for designing distributed storage codes that are efficient in data-read and download required during node-repair. We illustrate the power of this framework by constructing classes of explicit codes that entail the smallest data-read and download for repair among all existing solutions for three important settings: (a) codes meeting the constraints of being Maximum-Distance-Separable (MDS), high-rate and having a small number of substripes, arising out of practical considerations for implementation in data centers, (b) binary MDS codes for all parameters where binary MDS codes exist, (c) MDS codes with the smallest repair-locality. In addition, we employ this framework to enable efficient repair of parity nodes in existing codes that were originally constructed to address the repair of only the systematic nodes. The basic idea behind our framework is to take multiple instances of existing codes and add carefully designed functions of the data of one instance to the other. Typical savings in data-read during repair are 25% to 50% depending on the choice of the code parameters.
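The basic idea, adding a function of one instance's data to another instance, can be sketched on a toy case. The (4, 2) code over a small prime field, the choice of piggyback, and the names `encode` and `repair_node1` are assumptions made here for illustration; the abstract does not specify any particular construction.

```python
P = 257  # Assumption: a small prime field, chosen only for illustration.


def encode(a, b):
    """Store two instances (a and b) of a toy (4, 2) MDS code on four nodes.

    Each node holds one symbol from each instance. The second symbol of
    node 4 additionally carries the piggyback a1, a function of the data
    of the first instance.
    """
    a1, a2 = a
    b1, b2 = b
    return [
        (a1 % P, b1 % P),                              # node 1 (systematic)
        (a2 % P, b2 % P),                              # node 2 (systematic)
        ((a1 + a2) % P, (b1 + b2) % P),                # node 3 (parity)
        ((a1 + 2 * a2) % P, (b1 + 2 * b2 + a1) % P),   # node 4 (parity + piggyback)
    ]


def repair_node1(nodes):
    """Rebuild node 1 from three downloaded symbols instead of four."""
    b2 = nodes[1][1]       # second symbol of node 2
    b_sum = nodes[2][1]    # second symbol of node 3: b1 + b2
    piggy = nodes[3][1]    # second symbol of node 4: b1 + 2*b2 + a1
    downloaded = 3
    b1 = (b_sum - b2) % P
    a1 = (piggy - b1 - 2 * b2) % P
    return (a1, b1), downloaded


nodes = encode((5, 9), (200, 31))
repaired, downloaded = repair_node1(nodes)
```

Without the piggyback, rebuilding node 1 in this toy layout would mean downloading both symbols from two surviving nodes, four symbols in total, so the three-symbol repair is a 25% reduction. Instance b stays decodable because a1 can be subtracted from node 4's second symbol once instance a has been decoded.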

preprint, 2013, arXiv

A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster

Erasure codes, such as Reed-Solomon (RS) codes, are being increasingly employed in data centers to combat the cost of reliably storing large amounts of data. Although these codes provide optimal storage efficiency, they require significantly high network and disk usage during recovery of missing data. In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook's warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data. To address this issue, we present a new storage code using our recently proposed "Piggybacking" framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day.

preprint, 2012, arXiv

Regenerating Codes for Errors and Erasures in Distributed Storage

Regenerating codes are a class of codes proposed for providing reliability of data and efficient repair of failed nodes in distributed storage systems. In this paper, we address the fundamental problem of handling errors and erasures during the data-reconstruction and node-repair operations. We provide explicit regenerating codes that are resilient to errors and erasures, and show that these codes are optimal with respect to storage and bandwidth requirements. As a special case, we also establish the capacity of a class of distributed storage systems in the presence of malicious adversaries. While our code constructions are based on previously constructed Product-Matrix codes, we also provide necessary and sufficient conditions for introducing resilience in any regenerating code.

preprint, 2011, arXiv

Enabling Node Repair in Any Erasure Code for Distributed Storage

Erasure codes are an efficient means of storing data across a network in comparison to data replication, as they tend to reduce the amount of data stored in the network and offer increased resilience in the presence of node failures. These codes perform poorly, though, when repair of a failed node is called for, as they typically require the entire file to be downloaded to repair a failed node. A new class of erasure codes, termed regenerating codes, was recently introduced that does much better in this respect. However, given the variety of efficient erasure codes available in the literature, there is considerable interest in the construction of coding schemes that would enable traditional erasure codes to be used, while retaining the feature that only a fraction of the data need be downloaded for node repair. In this paper, we present a simple, yet powerful, framework that does precisely this. Under this framework, the nodes are partitioned into two 'types' and encoded using two codes in a manner that reduces the problem of node-repair to that of erasure-decoding of the constituent codes. Depending upon the choice of the two codes, the framework can be used to avail one or more of the following advantages: simultaneous minimization of storage space and repair-bandwidth, low complexity of operation, fewer disk reads at helper nodes during repair, and error detection and correction.
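The repair cost of a traditional erasure code can be seen in a minimal sketch using a single XOR parity block, the simplest MDS code. This illustrates only the baseline problem the abstract describes, not the paper's two-code framework; the block contents and the helper name `xor_blocks` are assumptions for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# k = 3 data blocks make up the whole file; one parity block sits on a 4th node.
data = [b"node", b"fail", b"ures"]
parity = xor_blocks(data)

# Suppose the node holding data[1] fails. Rebuilding it requires reading
# every surviving block, data and parity alike.
survivors = [data[0], data[2], parity]
repaired = xor_blocks(survivors)

bytes_downloaded = sum(len(block) for block in survivors)
file_size = sum(len(block) for block in data)
```

Here `bytes_downloaded` equals `file_size`: repairing one block costs as much network traffic as fetching the entire file, which is the overhead that repair-efficient schemes aim to cut.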

preprint, 2011, arXiv

Information-theoretically Secure Regenerating Codes for Distributed Storage

Regenerating codes are a class of codes for distributed storage networks that provide reliability and availability of data, and also perform efficient node repair. Another important aspect of a distributed storage network is its security. In this paper, we consider a threat model where an eavesdropper may gain access to the data stored in a subset of the storage nodes, and possibly also, to the data downloaded during repair of some nodes. We provide explicit constructions of regenerating codes that achieve information-theoretic secrecy capacity in this setting.

preprint, 2011, arXiv

Optimal Exact-Regenerating Codes for Distributed Storage at the MSR and MBR Points via a Product-Matrix Construction

Regenerating codes are a class of distributed storage codes that optimally trade the bandwidth needed for repair of a failed node with the amount of data stored per node of the network. Minimum Storage Regenerating (MSR) codes minimize first, the amount of data stored per node, and then the repair bandwidth, while Minimum Bandwidth Regenerating (MBR) codes carry out the minimization in the reverse order. An [n, k, d] regenerating code permits the data to be recovered by connecting to any k of the n nodes in the network, while requiring that repair of a failed node be made possible by connecting (using links of lesser capacity) to any d nodes. Previous, explicit and general constructions of exact-regenerating codes have been confined to the case n=d+1. In this paper, we present optimal, explicit constructions of MBR codes for all feasible values of [n, k, d] and MSR codes for all [n, k, d >= 2k-2], using a product-matrix framework. The particular product-matrix nature of the constructions is shown to significantly simplify system operation. To the best of our knowledge, these are the first constructions of exact-regenerating codes that allow the number n of nodes in the distributed storage network, to be chosen independent of the other parameters. The paper also contains a simpler description, in the product-matrix framework, of a previously constructed MSR code in which the parameter d satisfies [n=d+1, k, d >= 2k-1].

preprint, 2010, arXiv

Distributed Storage Codes with Repair-by-Transfer and Non-achievability of Interior Points on the Storage-Bandwidth Tradeoff

Regenerating codes are a class of recently developed codes for distributed storage that, like Reed-Solomon codes, permit data recovery from any subset of k nodes within the n-node network. However, regenerating codes possess in addition, the ability to repair a failed node by connecting to an arbitrary subset of d nodes. It has been shown that for the case of functional-repair, there is a tradeoff between the amount of data stored per node and the bandwidth required to repair a failed node. A special case of functional-repair is exact-repair where the replacement node is required to store data identical to that in the failed node. Exact-repair is of interest as it greatly simplifies system implementation. The first result of the paper is an explicit, exact-repair code for the point on the storage-bandwidth tradeoff corresponding to the minimum possible repair bandwidth, for the case when d=n-1. This code has a particularly simple graphical description and most interestingly, has the ability to carry out exact-repair through mere transfer of data and without any need to perform arithmetic operations. Hence the term 'repair-by-transfer'. The second result of this paper shows that the interior points on the storage-bandwidth tradeoff cannot be achieved under exact-repair, thus pointing to the existence of a separate tradeoff under exact-repair. Specifically, we identify a set of scenarios, termed 'helper node pooling', and show that it is the necessity to satisfy such scenarios that over-constrains the system.
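Repair through mere transfer of data can be sketched with a graph layout. The sketch assumes, as an illustration of the abstract's "graphical description", that each edge of the complete graph on n nodes carries one coded symbol and that each node stores the symbols on its incident edges. The symbols below are arbitrary placeholders rather than the output of an actual MDS encoder, and the names `place_symbols` and `repair_by_transfer` are hypothetical.

```python
from itertools import combinations


def place_symbols(n, symbols):
    """Assign one symbol to each edge of the complete graph on n nodes.

    Node i stores the symbols on its n-1 incident edges, so every symbol
    is held by exactly two nodes.
    """
    edges = list(combinations(range(n), 2))
    if len(symbols) != len(edges):
        raise ValueError(f"need exactly {len(edges)} symbols for n={n}")
    storage = {i: {} for i in range(n)}
    for (u, v), symbol in zip(edges, symbols):
        storage[u][(u, v)] = symbol
        storage[v][(u, v)] = symbol
    return storage


def repair_by_transfer(storage, failed):
    """Rebuild a failed node from its d = n-1 neighbours by copying only.

    Each helper sends the single symbol on the edge it shares with the
    failed node; no arithmetic is performed anywhere.
    """
    rebuilt = {}
    for helper, contents in storage.items():
        if helper == failed:
            continue
        edge = (min(helper, failed), max(helper, failed))
        rebuilt[edge] = contents[edge]
    return rebuilt


storage = place_symbols(5, list(range(100, 110)))  # 10 edges for n = 5
rebuilt = repair_by_transfer(storage, failed=2)
```

Each helper contributes exactly one stored symbol, and the replacement node ends up with contents identical to the failed node's, which is the exact-repair requirement for d = n-1.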

preprint, 2010, arXiv

Interference Alignment in Regenerating Codes for Distributed Storage: Necessity and Code Constructions

Regenerating codes are a class of recently developed codes for distributed storage that, like Reed-Solomon codes, permit data recovery from any arbitrary k of n nodes. However regenerating codes possess in addition, the ability to repair a failed node by connecting to any arbitrary d nodes and downloading an amount of data that is typically far less than the size of the data file. This amount of download is termed the repair bandwidth. Minimum storage regenerating (MSR) codes are a subclass of regenerating codes that require the least amount of network storage; every such code is a maximum distance separable (MDS) code. Further, when a replacement node stores data identical to that in the failed node, the repair is termed as exact. The four principal results of the paper are (a) the explicit construction of a class of MDS codes for d = n-1 >= 2k-1 termed the MISER code, that achieves the cut-set bound on the repair bandwidth for the exact-repair of systematic nodes, (b) proof of the necessity of interference alignment in exact-repair MSR codes, (c) a proof showing the impossibility of constructing linear, exact-repair MSR codes for d < 2k-3 in the absence of symbol extension, and (d) the construction, also explicit, of MSR codes for d = k+1. Interference alignment (IA) is a theme that runs throughout the paper: the MISER code is built on the principles of IA and IA is also a crucial component to the non-existence proof for d < 2k-3. To the best of our knowledge, the constructions presented in this paper are the first, explicit constructions of regenerating codes that achieve the cut-set bound.