Researcher profile

Sriram Sankar

Sriram Sankar contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified | Verification L1 | Unclaimed author
5 works
0 followers
6 topics
4 close collaborators

Published work

5 published items

preprint | 2022 | arXiv

Detecting silent data corruptions in the wild

Silent errors within hardware devices occur when an internal defect manifests in a part of the circuit which does not have check logic to detect the incorrect circuit operation. The results of such a defect can range from flipping a single bit in a single data value, up to causing the software to execute the wrong instructions. Silent data corruptions (SDC) in hardware impact computational integrity for large-scale applications. Manifestations of silent errors are accelerated by datapath variations, temperature variance, and age, among other silicon factors. These errors do not leave any record or trace in system logs. As a result, silent errors stay undetected within workloads, and their effects can propagate across several services, causing problems to appear in systems far removed from the original defect. In this paper, we describe testing strategies to detect silent data corruptions within a large-scale infrastructure. Given the challenging nature of the problem, we experimented with different methods for detection and mitigation. We compare and contrast two such approaches: (1) Fleetscanner (out-of-production testing) and (2) Ripple (in-production testing). We evaluate the infrastructure tradeoffs associated with the silicon testing funnel across 3+ years of production experience.

preprint | 2022 | arXiv

Probing the physicochemical properties of the Leo Ring and the Leo I group

We present an absorption line study of the physical and chemical properties of the Leo HI Ring and the Leo I Group as traced by 11 quasar sightlines spread over a 600 kpc × 800 kpc region. Using HST/COS G130/G160 archival observations as constraints, we couple cloud-by-cloud, multiphase, Bayesian ionization modeling with galaxy property information to determine the plausible origin of the absorbing gas along these sightlines. We search for absorption in the range 600 km/s - 1400 km/s consistent with the kinematics of the Leo Ring/Group. We find absorption plausibly associated with the Leo Ring towards five sightlines. Along three other sightlines, we find absorption likely to be associated with individual galaxies, intragroup gas, and/or large-scale filamentary structure. The absorption along these five sightlines is stronger in metal lines than expected from individual galaxies, indicative of multiple contributions, and of the complex kinematics of the region. We also identify three sightlines within a 7-degree × 6-degree field around the Leo Ring, along which we do not find any absorption. We find that the metallicities associated with the Leo Ring are generally high, with values between solar and several times solar. The inferred high metallicities are consistent with the origin of the ring as tidal debris from a major galaxy merger.

preprint | 2021 | arXiv

Silent Data Corruptions at Scale

Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs detected for these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

preprint | 2020 | arXiv

Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Root cause analysis in a large-scale production environment is challenging due to the complexity of services running across global data centers. Due to the distributed nature of a large-scale system, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly for understanding production issues. Another challenge in reviewing the logs for identifying issues is the scale - there could easily be millions of entities, each described by hundreds of features. In this paper we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e. combinations of feature values, that could identify groups of samples with sufficient support for the target failures using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. We propose pre-processing steps with the use of a large-scale real-time database and post-processing techniques and parallelism to further speed up the analysis and improve interpretability, and demonstrate that such optimization is necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation purposes in a large-scale infrastructure. We also present the setup and results from multiple production use cases in this paper.
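The abstract above describes mining item-sets (combinations of feature values) with sufficient support among the target failures and then ranking them by lift. The following is a minimal Python sketch of that idea, not the authors' implementation: it uses a small Apriori-style level-wise search rather than FP-Growth, and the function name `mine_itemsets`, the parameters `min_support` and `max_len`, the definition of support as the fraction of failures covered, and the toy log rows are all assumptions made for illustration.

```python
from itertools import combinations


def mine_itemsets(rows, is_failure, min_support=0.5, max_len=2):
    """Return (itemset, support, lift) tuples, highest lift first.

    rows:        list of dicts mapping feature name -> value (structured log rows)
    is_failure:  list of bools, True where the row is a target failure
    support:     assumed here to be the fraction of failures containing the itemset
    lift:        P(failure | itemset) / P(failure)
    """
    n, n_fail = len(rows), sum(is_failure)
    if n == 0 or n_fail == 0:
        return []
    base_rate = n_fail / n

    # Represent each row as a set of (feature, value) items.
    transactions = [frozenset(r.items()) for r in rows]
    fail_tx = [t for t, f in zip(transactions, is_failure) if f]

    # Level 1 candidates: every item that appears in at least one failing row.
    items = sorted({item for t in fail_tx for item in t})
    current = [frozenset([i]) for i in items]

    results = []
    for k in range(1, max_len + 1):
        survivors = []
        for cand in current:
            fail_hits = sum(1 for t in fail_tx if cand <= t)
            support = fail_hits / n_fail
            if support < min_support:
                continue  # Apriori pruning: supersets cannot reach min_support
            total_hits = sum(1 for t in transactions if cand <= t)
            lift = (fail_hits / total_hits) / base_rate
            survivors.append(cand)
            results.append((sorted(cand), support, lift))

        # Join surviving k-itemsets into (k+1)-candidates whose k-subsets all survived.
        surv_set = set(survivors)
        next_level = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            if len({feature for feature, _ in union}) != len(union):
                continue  # one row cannot hold two values of the same feature
            if all(frozenset(s) in surv_set for s in combinations(union, k)):
                next_level.add(union)
        current = sorted(next_level, key=sorted)

    results.sort(key=lambda r: -r[2])
    return results


# Hypothetical toy log: 8 hosts, 3 of which failed.
rows = [
    {"kernel": "5.2", "rack": "A"},
    {"kernel": "5.2", "rack": "B"},
    {"kernel": "5.2", "rack": "A"},
    {"kernel": "4.9", "rack": "A"},
    {"kernel": "4.9", "rack": "B"},
    {"kernel": "4.9", "rack": "A"},
    {"kernel": "4.9", "rack": "B"},
    {"kernel": "5.2", "rack": "B"},
]
is_failure = [True, True, True, False, False, False, False, False]

results = mine_itemsets(rows, is_failure)
for itemset, support, lift in results:
    print(itemset, round(support, 3), round(lift, 3))
```

In this toy data the combination of kernel 5.2 and rack A ranks first because every host with that combination failed, which is the kind of signal a lift-based ranking is meant to surface. A production-scale version would replace the repeated scans with FP-Growth or a database-side aggregation, as the abstract suggests.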

preprint | 2020 | arXiv

Solar-Metallicity Gas in the Extended Halo of a Galaxy at $z \sim 0.12$

We present the detection and analysis of a weak low-ionization absorber at $z = 0.12122$ along the blazar sightline PG~$1424+240$, using spectroscopic data from both $HST$/COS and STIS. The absorber is a weak Mg II analogue, with incidence of weak C II and Si II, along with multi-component C IV and O VI. The low ions are tracing a dense ($n_{H} \sim 10^{-3}$ cm$^{-3}$) parsec scale cloud of solar or higher metallicity. The kinematically coincident higher ions are either from a more diffuse ($n_{H} \sim 10^{-5} - 10^{-4}$ cm$^{-3}$) photoionized phase of kiloparsec scale dimensions, or are tracing a warm (T $\sim 2 \times 10^{5}$ K) collisionally ionized transition temperature plasma layer. The absorber resides in a galaxy overdense region, with 18 luminous ($> L^*$) galaxies within a projected radius of $5$ Mpc and $750$ km s$^{-1}$ of the absorber. The multi-phase properties, high metallicity and proximity to a $1.4$ $L^*$ galaxy, at $ρ\sim 200$ kpc and $|Δv| = 11$ km s$^{-1}$ separation, favor the possibility of the absorption tracing circumgalactic gas. The absorber serves as an example of weak Mg II - O VI systems as a means to study multiphase high velocity clouds in external galaxies.