Source author record

K. K. Ramakrishnan

K. K. Ramakrishnan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Networking and Internet Architecture; Distributed, Parallel, and Cluster Computing; eess.SY; Machine Learning; Neural and Evolutionary Computing; Performance; Systems and Control

Catalog footprint

What is connected

10 works

7 topics

4 close collaborators

preprint · 2026 · arXiv

Making MoE-based LLM Inference Resilient with Tarragon

Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart, discarding accumulated progress and halting the entire inference pipeline during recovery, an approach clearly ill-suited to latency-sensitive LLM services. We present Tarragon, a resilient MoE inference framework that confines a failure's impact to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between the attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to healthy workers. On top of this datapath, Tarragon implements a self-healing mechanism that relaxes the tightly synchronized execution of existing MoE frameworks. For stateful AWs, Tarragon performs asynchronous, incremental KV cache checkpointing with per-request restoration, and for stateless EWs, it leverages residual GPU memory to deploy shadow experts. Together, these keep recovery cost and recomputation overhead extremely low. Our evaluation shows that, compared to state-of-the-art MegaScale-Infer, Tarragon reduces failure-induced stalls by 160-213x (from ~64 s down to 0.3-0.4 s) while preserving performance when no failures occur.

preprint · 2021 · arXiv

CoShare: An Efficient Approach for Redundancy Allocation in NFV

An appealing feature of Network Function Virtualization (NFV) is that in an NFV-based network, a network function (NF) instance may be placed at any node. On the one hand, this offers great flexibility in allocating redundant instances; on the other hand, it makes the allocation a unique and difficult challenge. One particular concern is the inherent correlation among nodes due to the structure of the network, which requires special care in this allocation. To this end, we propose a novel approach called CoShare. First, its design takes into consideration the effect of network structural dependency, where the failure of one node might make other nodes of the network unavailable. Second, to make efficient use of resources, CoShare proposes the idea of shared reservation, where multiple flows may be allowed to share the same reserved backup capacity at an NF instance. Furthermore, CoShare factors the heterogeneity in nodes, NF instances, and availability requirements of flows into the design. The results from a number of experiments conducted using realistic network topologies show that the integration of structural dependency allows meeting availability requirements for more flows compared to a baseline approach. Specifically, CoShare is able to meet diverse availability requirements in a resource-efficient manner, requiring up to 85% less resource overbuild in some studied cases than the baseline approach, which uses the dedicated reservation commonly adopted for redundancy allocation in NFV.
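The difference between dedicated and shared reservation described in this abstract can be illustrated with a minimal sketch. This is not the CoShare algorithm: the flow records, the `failure_group` mapping, and both function names are hypothetical, and the sketch assumes that only one failure group fails at a time.

```python
from collections import defaultdict


def dedicated_reservation(flows):
    """Every flow reserves its own backup capacity, so the total is the sum."""
    return sum(flow["demand"] for flow in flows)


def shared_reservation(flows, failure_group=None):
    """Flows share one backup pool sized for the worst single failure group.

    failure_group (hypothetical) maps a node to the group of nodes that become
    unavailable together; a node with no entry forms its own group.
    """
    failure_group = failure_group or {}
    demand_by_group = defaultdict(float)
    for flow in flows:
        group = failure_group.get(flow["primary_node"], flow["primary_node"])
        demand_by_group[group] += flow["demand"]
    # Assumption: only one group fails at a time, so the pool must cover
    # the largest per-group demand rather than the sum over all flows.
    return max(demand_by_group.values(), default=0.0)


# Illustrative numbers only, not taken from the paper.
flows = [
    {"name": "f1", "primary_node": "A", "demand": 4.0},
    {"name": "f2", "primary_node": "B", "demand": 3.0},
    {"name": "f3", "primary_node": "A", "demand": 2.0},
]

dedicated = dedicated_reservation(flows)
shared_independent = shared_reservation(flows)
# If a failure at A also makes B unavailable, both nodes fall in one group.
shared_correlated = shared_reservation(flows, failure_group={"B": "A"})
```

In this toy setting, sharing reduces the reserved capacity only when the flows' primary nodes fail independently; once structural dependency places them in the same failure group, the shared pool must grow back to the dedicated total, which is why the abstract treats that dependency as a design input.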

preprint · 2020 · arXiv

Spatial Sharing of GPU for Autotuning DNN models

GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of a GPU enables multiplexing several DNNs on the GPU and can improve GPU utilization, thus improving throughput and lowering latency. DNN models given just the right amount of GPU resources can still provide inference latency as low as when the entire GPU is dedicated to their inference task. One approach to improving DNN inference is tuning the DNN model. Autotuning frameworks find the optimal low-level implementation for a certain target device based on the trained machine learning model, thus reducing the DNN's inference latency and increasing inference throughput. We observe an interdependency between the tuned model and its inference latency. A DNN model tuned with specific GPU resources provides the best inference latency when inferred with close to the same amount of GPU resources, while a model tuned with the maximum amount of the GPU's resources has poorer inference latency once the GPU resources are limited for inference. On the other hand, a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the causes that impact the tuning of a model at different amounts of GPU resources. We present several techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of the GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by up to 75 percent and increase throughput by a factor of 5.

preprint · 2016 · arXiv

SDNFV: Flexible and Dynamic Software Defined Control of an Application- and Flow-Aware Data Plane

Software Defined Networking (SDN) promises greater flexibility for directing packet flows, and Network Function Virtualization promises to enable dynamic management of software-based network functions. However, the current divide between an intelligent control plane and an overly simple, stateless data plane results in the inability to exploit the flexibility of a software-based network. In this paper we propose SDNFV, a framework that expands the capabilities of network processing-and-forwarding elements to flexibly manage packet flows, while retaining both a high-performance data plane and an easily managed control plane. SDNFV proposes a hierarchical control framework where decisions are made across the SDN controller, a host-level manager, and individual VMs to best exploit state available at each level. This increases the network's flexibility compared to existing SDNs, where controllers often make decisions based solely on the first packet header of a flow. SDNFV intelligently places network services across hosts and connects them in sequential and parallel chains, giving both the SDN controller and individual network functions the ability to enhance and update flow rules to adapt to changing conditions. Our prototype demonstrates how to efficiently and flexibly reroute flows based on data plane state such as packet payloads and traffic characteristics.

preprint · 2015 · arXiv

SAID: A Control Protocol for Scalable and Adaptive Information Dissemination in ICN

Information dissemination applications (video, news, social media, etc.) with a large number of receivers need to be efficient but also have limited loss tolerance. The new Information-Centric Networks (ICN) paradigm offers an alternative approach for reliably delivering data by naming content and exploiting data available at any intermediate point (e.g., caches). However, receivers are often heterogeneous, with widely varying receive rates. When existing ICN congestion control mechanisms are used with in-sequence delivery, the particularly thorny problem of receivers going out of sync results in inefficiency and unfairness among heterogeneous receivers. We argue that separating reliability from congestion control leads to more scalable, efficient and fair data dissemination, and propose SAID, a Control Protocol for Scalable and Adaptive Information Dissemination in ICN. To maximize the amount of data transmitted at the first attempt, receivers request any next packet (ANP) of a flow instead of the next-in-sequence packet, independent of the provider's transmit rate. This allows providers to transmit at an application-efficient rate, without being limited by the slower receivers. SAID ensures eventual reliable delivery to all receivers through cooperative repair, while preserving privacy without unduly trusting other receivers.

preprint · 2014 · arXiv

Evaluating Opportunistic Delivery of Large Content with TCP over WiFi in I2V Communication

With the increasing interest in connected vehicles, it is useful to evaluate the capability of delivering large content over a WiFi infrastructure to vehicles. The throughput achieved over WiFi channels can be highly variable and also rapidly degrades as the distance from the access point increases. While this behavior is well understood at the data link layer, the interactions across the various protocol layers (data link and up through the transport layer) and the effect of mobility may reduce the amount of content transferred to the vehicle as it travels along the roadway. This paper examines the throughput achieved at the TCP layer over a carefully designed outdoor WiFi environment and the interactions across the layers that impact the performance achieved, as a function of the receiver mobility. The experimental studies conducted reveal that impairments over the WiFi link (frame loss, ARQ and increased delay) and the residual loss seen by TCP cause a cascade of duplicate ACKs to be generated. This triggers large congestion window reductions at the sender, leading to a drastic degradation of throughput to the vehicular client. To ensure outdoor WiFi infrastructures have the potential to sustain reasonable downlink throughput for drive-by vehicles, we speculate that there is a need to adapt how WiFi and TCP (as well as mobility protocols) function for such vehicular applications.

preprint · 2014 · arXiv

Opportunities in a Federated Cloud Marketplace

Recent measurement studies show that there are massively distributed hosting and computing infrastructures deployed in the Internet. Such infrastructures include large data centers and organizations' computing clusters. When idle, these resources can readily serve local users. Such users can be smartphone or tablet users wishing to access services such as remote desktop or CPU/bandwidth-intensive activities, particularly when they are likely to have high-latency access, or no access at all, to centralized cloud providers. Today, however, there is no global marketplace where sellers and buyers of available resources can trade. The recently introduced marketplaces of Amazon and other cloud infrastructures are limited by the network footprint of their own infrastructures and the availability of such services in the target country and region. In this article we discuss the potential for a federated cloud marketplace where sellers and buyers of a number of resources, including storage, computing, and network bandwidth, can freely trade. This ecosystem can be regulated through brokers who act as service level monitors and auctioneers. We conclude by discussing the challenges and opportunities in this space.

preprint · 2013 · arXiv

Internames: a name-to-name principle for the future Internet

We propose Internames, an architectural framework in which names are used to identify all entities involved in communication: contents, users, devices, logical as well as physical points involved in the communication, and services. By not having a static binding between the name of a communication entity and its current location, we allow entities to be mobile, enable them to be reached by any of a number of basic communication primitives, enable communication to span networks with different technologies and allow for disconnected operation. Furthermore, with the ability to communicate between names, the communication path can be dynamically bound to any of a number of end-points, and the end-points themselves could change as needed. A key benefit of our architecture is its ability to accommodate gradual migration from the current IP infrastructure to a future that may be a ubiquitous Information Centric Network. The basic building blocks of Internames are: i) a name-based Application Programming Interface; ii) a separation of identifiers (names) and locators; iii) a powerful Name Resolution Service (NRS) that dynamically maps names to locators, as a function of time/location/context/service; iv) a built-in capacity for evolution, allowing a transparent migration from current networks and the ability to include current specific architectures as particular cases. To achieve this vision, shared by many other researchers, we exploit and expand on Information Centric Networking principles, extending ICN functionality beyond content retrieval, easing send-to-name and push services, and allowing names to be used also to route data in the return path. A key role in this architecture is played by the NRS, which allows for the co-existence of multiple network "realms", including current IP and non-IP networks, glued together by a name-to-name overarching communication primitive.

preprint · 2012 · arXiv

Design and Characterization of a Full-duplex Multi-antenna System for WiFi networks

In this paper, we present an experimental and simulation-based study to evaluate the use of full-duplex as a mode in practical IEEE 802.11 networks. To enable the study, we designed a 20 MHz multi-antenna OFDM full-duplex physical layer and a full-duplex-capable MAC protocol that is backward compatible with current 802.11. Our extensive over-the-air experiments, simulations and analysis demonstrate the following two results. First, the use of multiple antennas at the physical layer leads to a higher ergodic throughput than its hardware-equivalent multi-antenna half-duplex counterparts, for SNRs above the median SNR encountered in practical WiFi deployments. Second, the proposed MAC translates the physical layer rate gain into near doubling of throughput for multi-node single-AP networks. Combined, the two results allow us to conclude that there are potentially significant benefits to be gained from including a full-duplex mode in future WiFi standards.

preprint · 2012 · arXiv

Intra- and Inter-Session Network Coding in Wireless Networks

In this paper, we are interested in improving the performance of constructive network coding schemes in lossy wireless environments. We propose I2NC, a cross-layer approach that combines inter-session and intra-session network coding and has two strengths. First, the error-correcting capabilities of intra-session network coding make our scheme resilient to loss. Second, redundancy allows intermediate nodes to operate without knowledge of the decoding buffers of their neighbors. Based only on the knowledge of the loss rates on the direct and overhearing links, intermediate nodes can make decisions for both intra-session (i.e., how much redundancy to add in each flow) and inter-session (i.e., what percentage of flows to code together) coding. Our approach is grounded in a network utility maximization (NUM) formulation of the problem. We propose two practical schemes, I2NC-state and I2NC-stateless, which mimic the structure of the NUM optimal solution. We also address the interaction of our approach with the transport layer. We demonstrate the benefits of our schemes through simulations.
