Source author record

Dennis Sylvester

Dennis Sylvester appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Hardware Architecture; Artificial Intelligence; Cryptography and Security; eess.SP; Machine Learning

Catalog footprint

What is connected

5 works

5 topics

4 close collaborators


preprint, 2026, arXiv

Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs

Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices. While traditional low-rank (LR) methods often incur sharp accuracy drops, BLR approaches such as Monarch and BLAST can better capture the underlying structure, thus preserving accuracy while reducing computations and memory footprints. In this work, we use roofline analysis to show that, although BLR methods achieve theoretical savings and practical speedups for single-token inference, multi-token inference often becomes memory-bound in practice, increasing latency despite compiler-level optimizations in PyTorch. To address this, we introduce custom Triton kernels with partial fusion and memory layout optimizations for both Monarch and BLAST. On memory-constrained NVIDIA GPUs such as Jetson Orin Nano and A40, our kernels deliver up to $3.76\times$ speedups and $3\times$ model size compression over PyTorch dense baselines using CUDA backend and compiler-level optimizations, while supporting various models including Llama-7/1B, GPT2-S, DiT-XL/2, and ViT-B. Our code is available at https://github.com/pabillam/mem-efficient-blr.
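The roofline argument in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's analysis: it uses a plain two-factor low-rank layer rather than Monarch or BLAST, assumes the intermediate activation is written to and re-read from memory (no fusion), and the function names, layer sizes, and hardware figures are all hypothetical.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)


def dense_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for Y = X @ W, with X of shape (tokens, d_in)."""
    flops = 2 * tokens * d_in * d_out
    traffic = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / traffic


def low_rank_intensity(tokens, d_in, d_out, rank, bytes_per_elem=2):
    """FLOPs per byte for Y = (X @ A) @ B, computed as two unfused matmuls."""
    flops = 2 * tokens * rank * (d_in + d_out)
    traffic = bytes_per_elem * (
        tokens * d_in + d_in * rank        # read X and A
        + 2 * tokens * rank                # write, then re-read, the intermediate
        + rank * d_out + tokens * d_out    # read B, write Y
    )
    return flops / traffic


if __name__ == "__main__":
    # Hypothetical device: 10 TFLOP/s peak, 100 GB/s memory bandwidth.
    peak, bandwidth = 10e12, 100e9
    for tokens in (1, 256):
        dense = dense_intensity(tokens, 4096, 4096)
        low_rank = low_rank_intensity(tokens, 4096, 4096, rank=256)
        print(tokens, round(dense, 2), round(low_rank, 2),
              attainable_flops(low_rank, peak, bandwidth))
```

Under these assumptions the factorized layer does fewer FLOPs but also has a lower arithmetic intensity than the dense layer at 256 tokens, which is the kind of regime where fusing the two matmuls, so the intermediate never reaches memory, can help.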

preprint, 2022, arXiv

Hardware Acceleration for Third-Generation FHE and PSI Based on It

With the expansion of cloud services, serious concerns about the privacy of users' data arise due to the exposure of the unencrypted data to the server during computation. Various security primitives are under investigation to preserve privacy while evaluating private data, including Fully Homomorphic Encryption (FHE), Private Set Intersection (PSI), and others. However, the prohibitive processing time of these primitives hinders their practical applications. This work proposes and implements an architecture for accelerating third-generation FHE with Amazon Web Services (AWS) cloud FPGAs, marking the first hardware acceleration solution for third-generation FHE. We also introduce a novel unbalanced PSI protocol based on third-generation FHE, optimized for the proposed hardware architecture. Several algorithm-architecture co-optimization techniques are introduced to allow the communication and computation costs to be independent of the Sender's set size. The measurement results show that the proposed accelerator achieves $>21\times$ performance improvement compared to a software implementation for various crucial subroutines of third-generation FHE and the proposed PSI.

preprint, 2022, arXiv

Millimeter-Scale Ultra-Low-Power Imaging System for Intelligent Edge Monitoring

Millimeter-scale embedded sensing systems have unique advantages over larger devices as they are able to capture, analyze, store, and transmit data at the source while being unobtrusive and covert. However, area-constrained systems pose several challenges, including a tight energy budget and peak power, limited data storage, costly wireless communication, and physical integration at a miniature scale. This paper proposes a novel 6.7$\times$7$\times$5 mm imaging system with deep-learning and image processing capabilities for intelligent edge applications, which is demonstrated in a home-surveillance scenario. The system is implemented by vertically stacking custom ultra-low-power (ULP) ICs and uses techniques such as dynamic behavior-specific power management, hierarchical event detection, and a combination of data compression methods. It demonstrates a new image-correcting neural network that compensates for non-idealities caused by a mm-scale lens and ULP front-end. The system can store 74 frames or offload data wirelessly, consuming 49.6 $\mu$W on average for an expected battery lifetime of 7 days.
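The two figures at the end of the abstract (49.6 microwatts average, 7 days) imply a usable energy budget. The arithmetic below is a rough sketch that ignores battery self-discharge and conversion losses, neither of which the abstract quantifies; the function names are illustrative.

```python
AVERAGE_POWER_W = 49.6e-6   # average power from the abstract
LIFETIME_DAYS = 7           # expected battery lifetime from the abstract


def energy_budget_joules(average_power_w, days):
    """Energy drawn over the lifetime: power times elapsed seconds."""
    return average_power_w * days * 24 * 3600


def energy_budget_mwh(average_power_w, days):
    """Same quantity in milliwatt-hours: power in mW times elapsed hours."""
    return average_power_w * 1e3 * days * 24


if __name__ == "__main__":
    print(energy_budget_joules(AVERAGE_POWER_W, LIFETIME_DAYS))  # about 30 J
    print(energy_budget_mwh(AVERAGE_POWER_W, LIFETIME_DAYS))     # about 8.3 mWh
```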

preprint, 2007, arXiv

DVS for On-Chip Bus Designs Based on Timing Error Correction

On-chip buses are typically designed to meet performance constraints at worst-case conditions, including process corner, temperature, IR-drop, and neighboring net switching pattern. This can result in significant performance slack at more typical operating conditions. In this paper, we propose a dynamic voltage scaling (DVS) technique for buses, based on a double sampling latch which can detect and correct for delay errors without the need for retransmission. The proposed approach recovers the available slack at non-worst-case operating points through more aggressive voltage scaling and tracks changing conditions by monitoring the error recovery rate. Voltage margins needed in traditional designs to accommodate worst-case performance conditions are therefore eliminated, resulting in a significant improvement in energy efficiency. The approach was implemented for a 6 mm memory read bus operating at 1.5 GHz (0.13 $\mu$m technology node) and was simulated for a number of benchmark programs. Even at the worst-case process and environment conditions, energy gains of up to 17% are achieved, with error recovery rates under 2.3%. At more typical process and environment conditions, energy gains range from 35% to 45%, with a performance degradation under 2%. An analysis of optimum interconnect architectures for maximizing energy gains with this approach shows that the proposed approach performs well with technology scaling.
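The feedback idea in the abstract, scaling voltage while monitoring the error recovery rate, can be sketched as a simple step controller. This is an illustrative sketch only: the abstract does not describe the control policy, so the step rule, the target rate, and the voltage limits here are assumptions, and the energy function uses the textbook relation that dynamic switching energy scales with the square of the supply voltage.

```python
def relative_dynamic_energy(v, v_nominal):
    """Dynamic switching energy scales as C * V^2, so the ratio is (V / Vnom)^2."""
    return (v / v_nominal) ** 2


def next_voltage(v, error_rate, target_rate, step, v_min, v_max):
    """One control step: raise V if errors exceed the target, otherwise lower it."""
    if error_rate > target_rate:
        v += step
    else:
        v -= step
    # Clamp to the allowed supply range (hypothetical limits).
    return max(v_min, min(v_max, v))


if __name__ == "__main__":
    v = 1.2  # hypothetical nominal supply in volts
    # Hypothetical observed error recovery rates over successive windows.
    for rate in (0.0, 0.0, 0.005, 0.03, 0.01):
        v = next_voltage(v, rate, target_rate=0.02, step=0.05, v_min=0.8, v_max=1.2)
        print(round(v, 2), round(relative_dynamic_energy(v, 1.2), 3))
```

A real controller would also need to account for regulator settling time and the cost of each error recovery, which this sketch leaves out.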

preprint, 2007, arXiv

Power-Performance Trade-Offs in Nanometer-Scale Multi-Level Caches Considering Total Leakage

In this paper, we investigate the impact of $T_{ox}$ and $V_{th}$ on power-performance trade-offs for on-chip caches. We start by examining the optimization of the various components of a single-level cache and then extend this to two-level cache systems. In addition to leakage, our studies also account for the dynamic power expended as a result of cache misses. Our results show that one can often reduce overall power by increasing the size of the L2 cache if we only allow one pair of $V_{th}$/$T_{ox}$ in L2. However, if we allow the memory cells and the peripherals to have their own $V_{th}$'s and $T_{ox}$'s, we show that a two-level cache system with smaller L2's will yield less total leakage. We further show that two $V_{th}$'s and two $T_{ox}$'s are sufficient to get close to an optimal solution, and that $V_{th}$ is generally a better design knob than $T_{ox}$ for leakage optimization; thus it is better to restrict the number of $T_{ox}$'s rather than $V_{th}$'s if cost is a concern.
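As background for why the threshold voltage is such a strong leakage knob, the textbook first-order model has subthreshold current falling by one decade for every subthreshold swing of added threshold voltage. The sketch below uses that generic model, not the paper's; the default swing of 100 mV per decade is an assumed round number, and gate leakage through the oxide is not modeled.

```python
def subthreshold_leakage_ratio(delta_vth_mv, swing_mv_per_decade=100.0):
    """Leakage relative to baseline after raising Vth by delta_vth_mv.

    First-order model: I_sub is proportional to 10 ** (-Vth / S),
    where S is the subthreshold swing in mV per decade.
    """
    return 10 ** (-delta_vth_mv / swing_mv_per_decade)


if __name__ == "__main__":
    for delta in (0, 50, 100, 200):
        print(delta, subthreshold_leakage_ratio(delta))
```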
