Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
13 works
0 followers
13 topics
4 close collaborators

Published work

13 published item(s)

Preprint, 2026, arXiv

NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution

Recent advances in neuroimaging have deepened our understanding of the brain's complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI) - particularly resting-state fMRI (rs-fMRI) - has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.

Preprint, 2022, arXiv

Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm

We improve on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement. Our algorithmic improvements reduce the memory footprint by 24x and the number of memory accesses by 12x. We efficiently parallelize the algorithm for GPUs, achieving a 4.1x speedup over a CPU implementation of the same algorithm, a 62x speedup over minimap2's CPU-based KSW2 and a 7.2x speedup over the CPU-based Edlib for long reads.

Preprint, 2022, arXiv

COVIDHunter: An Accurate, Flexible, and Environment-Aware Open-Source COVID-19 Outbreak Simulation Model

Background: Early detection and isolation of COVID-19 patients are essential for successful implementation of mitigation strategies and eventually curbing the disease spread. With a limited number of daily COVID-19 tests performed in every country, simulating the COVID-19 spread along with the potential effect of each mitigation strategy currently remains one of the most effective ways of managing the healthcare system and guiding policy-makers. Methods: We introduce COVIDHunter, a flexible and accurate COVID-19 outbreak simulation model that evaluates the current mitigation measures that are applied to a region and provides suggestions on what strength the upcoming mitigation measure should be. The key idea of COVIDHunter is to quantify the spread of COVID-19 in a geographical region by simulating the average number of new infections caused by an infected person considering the effect of external factors, such as environmental conditions (e.g., climate, temperature, humidity) and mitigation measures. Results: Using Switzerland as a case study, COVIDHunter estimates that if the policy-makers relax the mitigation measures by 50% for 30 days then both the daily capacity need for hospital beds and the daily number of deaths increase exponentially by an average of 5.1x, with the additional patients potentially occupying ICU beds and ventilators for a period of time. Unlike existing models, the COVIDHunter model accurately monitors and predicts the daily number of cases, hospitalizations, and deaths due to COVID-19. Our model is flexible to configure and simple to modify for modeling different scenarios under different environmental conditions and mitigation measures. Availability: We release the source code of the COVIDHunter implementation at https://github.com/CMU-SAFARI/COVIDHunter and show how to flexibly configure our model for any scenario and easily extend it for different measures and conditions than we account for.

Preprint, 2022, arXiv

Demeter: A Fast and Energy-Efficient Food Profiler using Hyperdimensional Computing in Memory

Food profiling is an essential step in any food monitoring system needed to prevent health risks and potential fraud in the food industry. Significant improvements in sequencing technologies are pushing food profiling to become the main computational bottleneck. State-of-the-art profilers are unfortunately too costly for food profiling. Our goal is to design a food profiler that solves the main limitations of existing profilers, namely (1) working on massive data structures and (2) incurring considerable data movement for a real-time monitoring system. To this end, we propose Demeter, the first platform-independent framework for food profiling. Demeter overcomes the first limitation through the use of hyperdimensional computing (HDC) and efficiently performs the accurate few-species classification required in food profiling. We overcome the second limitation by using an in-memory hardware accelerator for Demeter (named Acc-Demeter) based on memristor devices. Acc-Demeter actualizes several domain-specific optimizations and exploits the inherent characteristics of memristors to improve the overall performance and energy consumption of Acc-Demeter. We compare Demeter's accuracy with other industrial food profilers using detailed software modeling. We synthesize Acc-Demeter's required hardware using UMC's 65nm library by considering an accurate PCM model based on silicon-based prototypes. Our evaluations demonstrate that Acc-Demeter achieves (1) a throughput improvement of 192x and 724x and (2) a memory reduction of 36x and 33x compared to Kraken2 and MetaCache (two state-of-the-art profilers), respectively, on typical food-related databases. Demeter maintains an acceptable profiling accuracy (within 2% of existing tools) and incurs a very low area overhead.

Preprint, 2022, arXiv

Exploiting Near-Data Processing to Accelerate Time Series Analysis

Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, and more. Matrix profile, the state-of-the-art algorithm to perform time series analysis, computes the most similar subsequence for a given query subsequence within a sliced time series. Matrix profile has low arithmetic intensity, but it typically operates on large amounts of time series data. In current computing systems, this data needs to be moved between the off-chip memory units and the on-chip computation units for performing matrix profile. This causes a major performance bottleneck as data movement is extremely costly in terms of both execution time and energy. In this work, we present NATSA, the first Near-Data Processing accelerator for time series analysis. The key idea is to exploit modern 3D-stacked High Bandwidth Memory (HBM) to enable efficient and fast specialized matrix profile computation near memory, where time series data resides. NATSA provides three key benefits: 1) quickly computing the matrix profile for a wide range of applications by building specialized energy-efficient floating-point arithmetic processing units close to HBM, 2) improving the energy efficiency and execution time by reducing the need for data movement over slow and energy-hungry buses between the computation units and the memory units, and 3) analyzing time series data at scale by exploiting low-latency, high-bandwidth, and energy-efficient memory access provided by HBM. Our experimental evaluation shows that NATSA improves performance by up to 14.2x (9.9x on average) and reduces energy by up to 27.2x (19.4x on average), over the state-of-the-art multi-core implementation. NATSA also improves performance by 6.3x and reduces energy by 10.2x over a general-purpose NDP platform with 64 in-order cores.
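To make the matrix profile computation described above concrete, here is a minimal brute-force sketch in Python. It is not NATSA and not an optimized matrix profile algorithm; it only illustrates what is being computed. The function names `znorm` and `matrix_profile`, the use of z-normalized Euclidean distance, and the exclusion zone of `m // 2` positions are assumptions made for this illustration.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0.0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def matrix_profile(series, m):
    """Brute-force matrix profile with subsequence length m.

    Returns (profile, index): profile[i] is the distance from the
    subsequence starting at i to its nearest non-trivial neighbour,
    and index[i] is where that neighbour starts.
    """
    n_sub = len(series) - m + 1
    if m < 2 or n_sub < 2:
        raise ValueError("need m >= 2 and at least two subsequences")
    subs = [znorm(series[i:i + m]) for i in range(n_sub)]
    excl = max(1, m // 2)  # assumed exclusion zone to skip trivial matches
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < excl:
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < profile[i]:
                profile[i] = d
                index[i] = j
    return profile, index


profile, index = matrix_profile([0, 1, 0, 5, 9, 2, 0, 1, 0, 3], 3)
```

Every pair of subsequences is compared, and each comparison does only a handful of arithmetic operations per value read. This is the low arithmetic intensity combined with heavy data access that the abstract identifies as the bottleneck.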

Preprint, 2022, arXiv

Going From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis

We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent. The analysis script and data used in our experimental evaluation are available at: https://github.com/CMU-SAFARI/Molecules2Variations

Preprint, 2022, arXiv

Packaging, containerization, and virtualization of computational omics methods: Advances, challenges, and opportunities

Omics software tools have reshaped the landscape of modern biology and become an essential component of biomedical research. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging, virtualization, and containerization are different approaches to satisfy this need by wrapping omics tools in additional software that makes the omics tools easier to install and use. Here, we systematically review practices across prominent packaging, virtualization, and containerization platforms. We outline the challenges, advantages, and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers, and system administrators. We also propose principles to make packaging, virtualization, and containerization of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.

Preprint, 2022, arXiv

SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph and sequence-to-sequence mapping pipelines.
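As a rough illustration of what minimizer-based seeding selects, the sketch below picks (w, k)-minimizers from a linear sequence in plain Python. It is not MinSeed and it says nothing about graph-based references or hardware. The function name `minimizers`, the lexicographic ordering of k-mers, and the leftmost tie-breaking rule are assumptions chosen for simplicity.

```python
def minimizers(seq, k, w):
    """Return the (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties). Each selected
    position is reported once, in increasing order.
    """
    n_kmers = len(seq) - k + 1
    if k < 1 or w < 1 or n_kmers < w:
        raise ValueError("sequence too short for the given k and w")
    chosen = set()
    for start in range(n_kmers - w + 1):
        # Compare (k-mer, position) pairs so ties go to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        chosen.add(best)
    return [(i, seq[i:i + k]) for i in sorted(chosen)]


seeds = minimizers("ACGTACGA", 3, 3)
# seeds == [(0, "ACG"), (1, "CGT"), (4, "ACG")]
```

Because adjacent windows often share the same smallest k-mer, far fewer seeds are kept than there are k-mers, which is why minimizers are a common way to reduce seeding work.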

Preprint, 2020, arXiv

FlexWatts: A Power- and Workload-Aware Hybrid Power Delivery Network for Energy-Efficient Microprocessors

Modern client processors typically use one of three commonly used power delivery networks (PDNs): 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics. This leads to energy inefficiency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads. We propose FlexWatts, a hybrid adaptive PDN for modern client processors whose goal is to provide high energy-efficiency across the processor's wide range of power consumption and workloads by dynamically allocating PDNs to processor domains. FlexWatts is based on three key ideas. First, it combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources. This hybrid PDN is allocated for processor domains with a wide power consumption range and it dynamically switches between two modes: IVR-Mode and LDO-Mode, depending on the power consumption. Second, for all other processor domains, FlexWatts statically allocates off-chip VRs. Third, FlexWatts introduces a prediction algorithm that switches the hybrid PDN to the mode that is the most beneficial. To evaluate the tradeoffs of PDNs, we develop and open-source PDNspot, the first validated architectural PDN model that enables quantitative analysis of PDN metrics. Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, 3DMark06, and battery life workloads against IVR, the state-of-the-art PDN in modern client processors. For a 4W TDP processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. FlexWatts has comparable cost and area overhead to IVR.

Preprint, 2020, arXiv

GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM). We propose GenASM, the first ASM acceleration framework for genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint, and we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth. We demonstrate that GenASM is a flexible, high-performance, and low-power framework, which provides significant performance and power benefits for three different use cases in genome sequence analysis: 1) GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while consuming 37x and 2.7x less power. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. 2) GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while consuming 1.7x less power and significantly improving the filtering accuracy. 3) GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while consuming 548-582x and 67x less power.
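For readers unfamiliar with Bitap, the following is a minimal software sketch of the classic bit-parallel approximate matching idea (in the Wu-Manber style, allowing up to k substitutions, insertions, or deletions). It is the textbook baseline only, not GenASM's modified algorithm or its hardware design. The function name `bitap_search` and the choice to report match end positions are assumptions for this illustration.

```python
def bitap_search(text, pattern, k):
    """Return end indices in text where pattern matches with <= k edits.

    R[d] is a bitvector: bit i is set when pattern[0..i] matches some
    suffix of the text read so far with at most d edits.
    """
    m = len(pattern)
    if m == 0 or k < 0:
        raise ValueError("pattern must be non-empty and k non-negative")
    full = (1 << m) - 1
    # masks[ch] has bit i set wherever pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # Before reading any text, a prefix of length d matches with d deletions.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    accept = 1 << (m - 1)
    ends = []
    for j, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & b
        for d in range(1, k + 1):
            cur_old = R[d]
            R[d] = ((((cur_old << 1) | 1) & b)  # character match
                    | ((prev_old << 1) | 1)     # substitution
                    | prev_old                  # extra character in text
                    | ((R[d - 1] << 1) | 1)     # pattern character skipped
                    ) & full
            prev_old = cur_old
        if R[k] & accept:
            ends.append(j)
    return ends


exact = bitap_search("GATTACA", "TTA", 0)   # [4]
fuzzy = bitap_search("GATTACA", "TCA", 1)   # [4, 6]
```

Each text character updates every bitvector with a few shifts and logical operations. The inner loop over d carries a dependency from R[d-1] to R[d], which is one of the limits on parallelism in this baseline form.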

Preprint, 2020, arXiv

SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors

There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and memory domains can operate at a higher frequency and voltage than necessary, increasing power consumption and 2) the unused power budget of the IO and memory domains cannot be used to increase the throughput of the compute domain, hampering performance. To avoid these issues, it is crucial to dynamically orchestrate the distribution of the SoC power budget across the three domains based on their actual performance demands. We propose SysScale, a new multi-domain power management technique to improve the energy efficiency of mobile SoCs. SysScale is based on three key ideas. First, SysScale introduces an accurate algorithm to predict the performance (e.g., bandwidth and latency) demands of the three SoC domains. Second, SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to distribute the SoC power to each domain according to the predicted performance demands. Third, in addition to using a global DVFS mechanism, SysScale uses domain-specialized techniques to optimize the energy efficiency of each domain at different operating points. We implement SysScale on an Intel Skylake microprocessor for mobile devices and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and battery life workloads (e.g., video playback). On a 2-core Skylake, SysScale improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and 8.9% (9.2% and 7.9% on average), respectively.

Preprint, 2017, arXiv

GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies

Motivation: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments. Results: We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x-6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x-3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm. Availability: The code is available online at: https://github.com/CMU-SAFARI/GRIM