Source author record

Chi-Ying Tsui

Chi-Ying Tsui appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Hardware Architecture, Information Theory, math.IT, Artificial Intelligence, Computer Vision, eess.IV, eess.SP, Machine Learning, Other Computer Science, physics.app-ph

Catalog footprint

What is connected

8 works

10 topics

4 close collaborators


preprint, 2026, arXiv

A 28nm 0.22μJ/token memory-compute-intensity-aware CNN-Transformer accelerator with hybrid-attention-based layer-fusion and cascaded pruning for semantic-segmentation

This work presents a 28 nm, 13.93 mm² CNN-Transformer accelerator for semantic segmentation that achieves a 3.86x to 10.91x energy reduction over previous designs. It features a hybrid attention unit, a layer-fusion scheduler, and a cascaded feature-map pruner, with a peak energy efficiency of 52.90 TOPS/W (INT8).

preprint, 2026, arXiv

DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models

Stochastic computing (SC) offers hardware simplicity but suffers from low throughput, while high-throughput Digital Computing-in-Memory (DCIM) is bottlenecked by costly adder logic for matrix-vector multiplication (MVM). To address this trade-off, this paper introduces a digital stochastic CIM (DS-CIM) architecture that achieves both high accuracy and efficiency. We implement signed multiply-accumulation (MAC) in a compact, unsigned OR-based circuit by modifying the data representation. Throughput is enhanced by replicating this low-cost circuit 64 times with only a 1x area increase. Our core strategy, a shared Pseudo Random Number Generator (PRNG) with 2D partitioning, enables single-cycle mutually exclusive activation to eliminate OR-gate collisions. We also resolve the 1s saturation issue via stochastic process analysis and data remapping, significantly improving accuracy and resilience to input sparsity. Our high-accuracy DS-CIM1 variant achieves 94.45% accuracy for INT8 ResNet18 on CIFAR-10 with a root-mean-squared error (RMSE) of just 0.74%. Meanwhile, our high-efficiency DS-CIM2 variant attains an energy efficiency of 3566.1 TOPS/W and an area efficiency of 363.7 TOPS/mm^2, while maintaining a low RMSE of 3.81%. The DS-CIM capability with larger models is further demonstrated through experiments with INT8 ResNet50 on ImageNet and the FP8 LLaMA-7B model.
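The collision problem that the abstract refers to can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the DS-CIM circuit or its 2D partitioning: the function names, the cumulative-interval mapping, and the example probabilities are all hypothetical. It shows that when several bitstreams are driven by one shared random number and assigned disjoint sample regions, at most one stream is high per cycle, so an OR gate counts exactly the same ones as an adder would. With independent streams, simultaneous ones collide and the OR output undercounts.

```python
import random


def exclusive_streams(probs, length, rng):
    """Bitstreams driven by ONE shared random number per cycle.

    Each stream owns a disjoint interval of [0, 1) whose width equals its
    probability, so two streams can never be 1 in the same cycle.
    (Hypothetical mapping, chosen only to illustrate mutual exclusion.)
    """
    if sum(probs) > 1.0:
        raise ValueError("probabilities must sum to at most 1")
    bounds, lo = [], 0.0
    for p in probs:
        bounds.append((lo, lo + p))
        lo += p
    streams = [[0] * length for _ in probs]
    for t in range(length):
        r = rng.random()  # shared sample for all streams in this cycle
        for i, (a, b) in enumerate(bounds):
            if a <= r < b:
                streams[i][t] = 1
    return streams


def independent_streams(probs, length, rng):
    """Baseline: every stream draws its own random number each cycle."""
    return [[1 if rng.random() < p else 0 for _ in range(length)]
            for p in probs]


def or_accumulate(streams):
    """Bitwise OR across streams, one output bit per cycle."""
    return [int(any(bits)) for bits in zip(*streams)]


if __name__ == "__main__":
    probs = [0.2, 0.3, 0.1]
    excl = exclusive_streams(probs, 10_000, random.Random(0))
    indep = independent_streams(probs, 10_000, random.Random(0))
    for name, s in (("exclusive", excl), ("independent", indep)):
        exact = sum(sum(bits) for bits in s)   # what an adder would count
        ored = sum(or_accumulate(s))           # what the OR gate counts
        print(name, "adder:", exact, "OR:", ored)
```

For the exclusive streams the two counts always agree; for the independent streams the OR count can only be equal or lower.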

preprint, 2023, arXiv

Accelerating Large Kernel Convolutions with Nested Winograd Transformation

Recent literature has shown that convolutional neural networks (CNNs) with large kernels outperform vision transformers (ViTs) and CNNs with stacked small kernels in many computer vision tasks, such as object detection and image restoration. The Winograd transformation helps reduce the number of repetitive multiplications in convolution and is widely supported by many commercial AI processors. Researchers have proposed accelerating large kernel convolutions by linearly decomposing them into many small kernel convolutions and then sequentially accelerating each small kernel convolution with the Winograd algorithm. This work proposes a nested Winograd algorithm that iteratively decomposes a large kernel convolution into small kernel convolutions and proves it to be more effective than the linear decomposition Winograd transformation algorithm. Experiments show that compared to the linear decomposition Winograd algorithm, the proposed algorithm reduces the total number of multiplications by 1.4 to 10.5 times for computing 4x4 to 31x31 convolutions.
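As background for the multiplication savings mentioned in this abstract, the sketch below shows the standard small-tile Winograd F(2,3) case in one dimension: two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is not the nested algorithm proposed in the paper, and the function names are illustrative only.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs with 4 multiplications (Winograd F(2,3))."""
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four element-wise products on the transformed input tile.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [0.5, -1.0, 2.0]
    print(direct_f23(d, g))
    print(winograd_f23(d, g))
```

The saving comes from moving work into additions and into the filter transform, which is computed once per filter rather than once per tile.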

preprint, 2021, arXiv

Polyimide-Based Flexible Coupled-Coils Design and Load-Shift Keying Analysis

Wireless power transfer using inductive coupling is commonly used for medical implantable devices. The design of the secondary coil on the implantable device is important as it will affect the power transfer efficiency, the size of the implant, and also the data transmission between the implant and the in-vitro controller. In this paper, we present a design of the secondary coil on a polyimide-based flexible substrate to achieve high power transfer efficiency. Load shift keying modulation is used for the data communication between the primary and secondary coils. A thorough analysis is done for the ideal and practical scenario and it shows that a mismatched secondary LC tank will affect the communication range and communication correctness. A solution to achieve robust data transmission is proposed and then verified by SPICE simulations.

preprint, 2016, arXiv

Hardware Decoders for Polar Codes: An Overview

Polar codes are an exciting new class of error correcting codes that achieve the symmetric capacity of memoryless channels. Many decoding algorithms were developed and implemented, addressing various application requirements: from error-correction performance rivaling that of LDPC codes to very high throughput or low-complexity decoders. In this work, we review the state of the art in polar decoders implementing the successive-cancellation, belief propagation, and list decoding algorithms, illustrating their advantages.

preprint, 2015, arXiv

Low Complexity Belief Propagation Polar Code Decoders

Since their invention, polar codes have received a lot of attention because of their capacity-achieving performance and low encoding and decoding complexity. Successive cancellation decoding (SCD) and belief propagation decoding (BPD) are two of the most popular approaches for decoding polar codes. SCD achieves good error-correcting performance and is less computationally expensive than BPD. However, SC decoders suffer from long latency and low throughput due to the serial nature of the successive cancellation algorithm. BPD is parallel in nature and hence is more attractive for high-throughput applications. However, since it is iterative, the required latency and energy dissipation increase linearly with the number of iterations. In this work, we borrow the idea of SCD and propose a novel scheme based on sub-factor-graph freezing to reduce the average number of computations as well as the average number of iterations required by BPD, which directly translates into lower latency and energy dissipation. Simulation results show that the proposed scheme has no performance degradation and achieves a significant reduction in computational complexity over the existing methods.
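The serial nature of successive cancellation can be seen in a minimal textbook-style sketch. It is not the sub-factor-graph freezing scheme proposed in the paper. It assumes the natural-order (no bit-reversal) polar transform, the min-sum approximation for the check-node update, and a hypothetical frozen-bit pattern. The point to notice is that the right half of each recursion cannot start until the left half has produced its bit decisions.

```python
import math


def polar_encode(u):
    """x = u * F^(tensor n) over GF(2), F = [[1, 0], [1, 1]], natural order."""
    x = list(u)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]
        step *= 2
    return x


def f(a, b):
    """Check-node LLR update (min-sum approximation)."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g(a, b, bit):
    """Variable-node LLR update; needs an already decided partial-sum bit."""
    return b + (1 - 2 * bit) * a


def sc_decode(llr, frozen):
    """Recursive SC decoder. Returns (decoded bits u, re-encoded bits x)."""
    n = len(llr)
    if n == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    half = n // 2
    a, b = llr[:half], llr[half:]
    # The left half is decoded first ...
    u_left, x_left = sc_decode([f(p, q) for p, q in zip(a, b)], frozen[:half])
    # ... because the right half depends on the left half's decisions.
    u_right, x_right = sc_decode(
        [g(p, q, s) for p, q, s in zip(a, b, x_left)], frozen[half:])
    x = [l ^ r for l, r in zip(x_left, x_right)] + x_right
    return u_left + u_right, x


if __name__ == "__main__":
    frozen = [True, True, True, False, True, False, False, False]  # hypothetical
    u = [0, 0, 0, 1, 0, 1, 1, 0]  # frozen positions carry 0
    x = polar_encode(u)
    llr = [4.0 if bit == 0 else -4.0 for bit in x]  # noiseless channel LLRs
    print(sc_decode(llr, frozen)[0] == u)
```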

preprint, 2015, arXiv

Low-latency List Decoding Of Polar Codes With Double Thresholding

For polar codes with short-to-medium code length, list successive cancellation decoding is used to achieve a good error-correcting performance. However, list pruning in the current list decoding is based on the sorting strategy and its timing complexity is high. This results in a long decoding latency for large list size. In this work, aiming at a low-latency list decoding implementation, a double thresholding algorithm is proposed for a fast list pruning. As a result, with a negligible performance degradation, the list pruning delay is greatly reduced. Based on the double thresholding, a low-latency list decoding architecture is proposed and implemented using a UMC 90nm CMOS technology. Synthesis results show that, even for a large list size of 16, the proposed low-latency architecture achieves a decoding throughput of 220 Mbps at a frequency of 641 MHz.

preprint, 2007, arXiv

Exploiting Dynamic Workload Variation in Low Energy Preemptive Task Scheduling

A novel energy reduction strategy to maximally exploit the dynamic workload variation is proposed for the offline voltage scheduling of preemptive systems. The idea is to construct a fully preemptive schedule that leads to minimum energy consumption when the tasks take approximately their average number of execution cycles, yet still guarantees no deadline violation in the worst-case scenario. The end time for each sub-instance of the tasks obtained from the schedule is used for the online dynamic voltage scaling (DVS) of the tasks. For tasks that normally require a small number of cycles but occasionally a large number of cycles to complete, such a schedule provides more opportunities for slack utilization and hence results in larger energy savings. The concept is realized by formulating the problem as a Non-Linear Programming (NLP) optimization problem. Experimental results show that, with the proposed scheme, the total energy consumption at runtime is reduced by as much as 60% for randomly generated task sets compared with a static scheduling approach that uses only the worst-case workload.
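The trade-off described in this abstract can be illustrated with a toy single-task calculation. This is a sketch under stated assumptions, not the paper's NLP formulation: it assumes energy per cycle grows with the square of a normalized speed (a common simplification when voltage scales with frequency), and the two-speed plan, function names, and numbers are all hypothetical. The plan runs the first `avg` cycles slowly and reserves full speed for the rarely needed remainder, so the deadline still holds in the worst case.

```python
def energy(cycles, speed):
    """Assumed model: energy per cycle is proportional to speed squared."""
    return cycles * speed ** 2


def constant_speed(wcec, deadline):
    """One speed sized so the worst-case cycle count ends at the deadline."""
    return wcec / deadline


def two_speed(wcec, avg, deadline, s_max=1.0):
    """Slow speed for the first `avg` cycles, s_max for the remaining ones.

    The tail of (wcec - avg) cycles is budgeted at s_max, and all the time
    left before that is given to the first `avg` cycles.
    """
    if wcec / s_max > deadline:
        raise ValueError("worst case cannot meet the deadline even at s_max")
    tail_time = (wcec - avg) / s_max
    s_low = avg / (deadline - tail_time)
    return s_low, s_max


if __name__ == "__main__":
    wcec, avg, deadline = 100, 40, 200  # hypothetical cycles and time units
    s_const = constant_speed(wcec, deadline)
    s_low, s_high = two_speed(wcec, avg, deadline)
    print("typical run, constant speed:", energy(avg, s_const))
    print("typical run, two-speed plan:", energy(avg, s_low))
    print("worst-case finish time:", avg / s_low + (wcec - avg) / s_high)
```

Under this model the two-speed plan uses less energy when the task takes its typical number of cycles, at the cost of more energy in the runs where the worst case actually occurs.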
