Researcher profile

Junchen Jiang

Junchen Jiang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
8works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

8 published item(s)

preprint2026arXiv

AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving

Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by storing the KV caches of processed context and loading the corresponding KV cache when a new request reuses the context. Further, as these LLM applications scale, the total size of KV caches becomes excessively large and requires both DRAM and SSD for full storage. However, prior work that stores KV caches in DRAM and SSD suffers from high loading delays, as most KV cache hits come from SSD, which is slow to load. To increase the KV cache hit rate on DRAM, we identify lossy KV cache compression as a promising approach. We design a lossy compression system that decides the compression algorithm, compression rate and device placement for each KV cache entry to maximise DRAM hits and minimise loading delay without significantly degrading generation quality. Compared to various static compression baselines across three tasks, our system AdaptCache achieves 1.43--2.4 x delay savings at the same quality and 6--55% quality improvements at the same delay.

preprint2026arXiv

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy-despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system challenge to work-keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4X higher throughput than full-KV inference while producing identical outputs.

preprint2022arXiv

AccMPEG: Optimizing Video Encoding for Video Analytics

With more videos being recorded by edge sensors (cameras) and analyzed by computer-vision deep neural nets (DNNs), a new breed of video streaming systems has emerged, with the goal to compress and stream videos to remote servers in real time while preserving enough information to allow highly accurate inference by the server-side DNNs. An ideal design of the video streaming system should simultaneously meet three key requirements: (1) low latency of encoding and streaming, (2) high accuracy of server-side DNNs, and (3) low compute overheads on the camera. Unfortunately, despite many recent efforts, such video streaming system has hitherto been elusive, especially when serving advanced vision tasks such as object detection or semantic segmentation. This paper presents AccMPEG, a new video encoding and streaming system that meets all the three requirements. The key is to learn how much the encoding quality at each (16x16) macroblock can influence the server-side DNN accuracy, which we call accuracy gradient. Our insight is that these macroblock-level accuracy gradient can be inferred with sufficient precision by feeding the video frames through a cheap model. AccMPEG provides a suite of techniques that, given a new server-side DNN, can quickly create a cheap model to infer the accuracy gradient on any new frame in near realtime. Our extensive evaluation of AccMPEG on two types of edge devices (one Intel Xeon Silver 4100 CPU or NVIDIA Jetson Nano) and three vision tasks (six recent pre-trained DNNs) shows that AccMPEG (with the same camera-side compute resources) can reduce the end-to-end inference delay by 10-43% without hurting accuracy compared to the state-of-the-art baselines

preprint2022arXiv

Automatic Curriculum Generation for Learning Adaptation in Networking

As deep reinforcement learning (RL) showcases its strengths in networking and systems, its pitfalls also come to the public's attention--when trained to handle a wide range of network workloads and previously unseen deployment environments, RL policies often manifest suboptimal performance and poor generalizability. To tackle these problems, we present Genet, a new training framework for learning better RL-based network adaptation algorithms. Genet is built on the concept of curriculum learning, which has proved effective against similar issues in other domains where RL is extensively employed. At a high level, curriculum learning gradually presents more difficult environments to the training, rather than choosing them randomly, so that the current RL model can make meaningful progress in training. However, applying curriculum learning in networking is challenging because it remains unknown how to measure the "difficulty" of a network environment. Instead of relying on handcrafted heuristics to determine the environment's difficulty level, our insight is to utilize traditional rule-based (non-RL) baselines: If the current RL model performs significantly worse in a network environment than the baselines, then the model's potential to improve when further trained in this environment is substantial. Therefore, Genet automatically searches for the environments where the current model falls significantly behind a traditional baseline scheme and iteratively promotes these environments as the training progresses. Through evaluating Genet on three use cases--adaptive video streaming, congestion control, and load balancing, we show that Genet produces RL policies which outperform both regularly trained RL policies and traditional baselines in each context, not only under synthetic workloads but also in real environments.

preprint2022arXiv

Bandwidth-Efficient Multi-video Prefetching for Short Video Streaming

Applications that allow sharing of user-created short videos exploded in popularity in recent years. A typical short video application allows a user to swipe away the current video being watched and start watching the next video in a video queue. Such user interface causes significant bandwidth waste if users frequently swipe a video away before finishing watching. Solutions to reduce bandwidth waste without impairing the Quality of Experience (QoE) are needed. Solving the problem requires adaptively prefetching of short video chunks, which is challenging as the download strategy needs to match unknown user viewing behavior and network conditions. In our work, we first formulate the problem of adaptive multi-video prefetching in short video streaming. Then, to facilitate the integration and comparison of researchers' algorithms towards solving the problem, we design and implement a discrete-event simulator, which we release as open source. Finally, based on the organization of the Short Video Streaming Grand Challenge at ACM Multimedia 2022, we analyze and summarize the algorithms of the contestants, with the hope of promoting the research community towards addressing this problem.

preprint2020arXiv

A New Abstraction for Internet QoE Optimization

A perennial quest in networking research is how to achieve higher quality of experience (QoE) for users without incurring more resources. This work revisits an important yet often overlooked piece of the puzzle: what should the QoE abstraction be? A QoE abstraction is a representation of application quality that describes how decisions affect QoE. The conventional wisdom has relied on developing hand-crafted quality metrics (e.g., video rebuffering events, web page loading time) that are specialized to each application, content, and setting. We argue that in many cases, it maybe fundamentally hard to capture a user's perception of quality using a list of handcrafted metrics, and that expanding the metric list may lead to unnecessary complexity in the QoE model without a commensurate gain. Instead, we advocate for a new approach based on a new QoE abstraction called visual rendering. Rather than a list of metrics, we model the process of quality perception as a user watching a continuous "video" (visual rendering) of all the pixels on their screen. The key advantage of visual rendering is that it captures the full experience of a user with the same abstraction for all applications. This new abstraction opens new opportunities (e.g., the possibility of end-to-end deep learning models that infer QoE directly from a visual rendering) but it also gives rise to new research challenges (e.g., how to emulate the effect on visual rendering of an application decision). This paper makes the case for visual rendering as a unifying abstraction for Internet QoE and outlines a new research agenda to unleash its opportunities.

preprint2020arXiv

Domain-specific Communication Optimization for Distributed DNN Training

Communication overhead poses an important obstacle to distributed DNN training and draws increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication overlapping and layer-wise flow scheduling, etc., are still coarse-grained and insufficient for an efficient distributed training especially when the network is under pressure. We present DLCP, a novel solution exploiting the domain-specific properties of deep learning to optimize communication overhead of DNN training in a fine-grained manner. At its heart, DLCP comprises of several key innovations beyond prior work: e.g., it exploits {\em bounded loss tolerance} of SGD-based training to improve tail communication latency which cannot be avoided purely through gradient compression. It then performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on layers and magnitudes of gradients to further speedup model convergence without affecting accuracy. In addition, it leverages inter-packet order-independency to perform per-packet load balancing without causing classical re-ordering issues. DLCP works with both Parameter Server and collective communication routines. We have implemented DLCP with commodity switches, integrated it with various training frameworks including TensorFlow, MXNet and PyTorch, and deployed it in our small-scale testbed with 10 Nvidia V100 GPUs. Our testbed experiments and large-scale simulations show that DLCP delivers up to $84.3\%$ additional training acceleration over the best existing solutions.

preprint2020arXiv

SENSEI: Aligning Video Streaming Quality with Dynamic User Sensitivity

This paper aims to improve video streaming by leveraging a simple observation: users are more sensitive to low quality in certain parts of a video than in others. For instance, rebuffering during key moments of a sports video (e.g., before a goal is scored) is more annoying than rebuffering during normal gameplay. Such dynamic quality sensitivity, however, is rarely captured by current approaches, which predict QoE (quality-of-experience) using one-size-fits-all heuristics that are too simplistic to understand the nuances of video content. Instead of proposing yet another heuristic, we take a different approach: we run a separate crowdsourcing experiment for each video to derive users' quality sensitivity at different parts of the video. Of course, the cost of doing this at scale can be prohibitive, but we show that careful experiment design combined with a suite of pruning techniques can make the cost negligible compared to how much content providers invest in content generation and distribution. Our ability to accurately profile time-varying user sensitivity inspires a new approach: dynamically aligning higher (lower) quality with higher (lower) sensitivity periods. We present a new video streaming system called SENSEI that incorporates dynamic quality sensitivity into existing quality adaptation algorithms. We apply SENSEI to two state-of-the-art adaptation algorithms. SENSEI can take seemingly unusual actions: e.g., lowering bitrate (or initiating a rebuffering event) even when bandwidth is sufficient so that it can maintain a higher bitrate without rebuffering when quality sensitivity becomes higher in the near future. Compared to state-of-the-art approaches, SENSEI improves QoE by 15.1% or achieves the same QoE with 26.8% less bandwidth on average.