Researcher profile

Jiaqi Gu

Jiaqi Gu contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified · Verification L1 · Unclaimed author
5 works
0 followers
5 topics
4 close collaborators

Published work

5 published items

preprint · 2026 · arXiv

H3PIMAP: A Heterogeneity-Aware Multi-Objective DNN Mapping Framework on Electronic-Photonic Processing-in-Memory Architectures

The future of artificial intelligence (AI) acceleration demands a paradigm shift beyond the limitations of purely electronic or photonic architectures. Photonic analog computing delivers unmatched speed and parallelism but struggles with data movement, robustness, and precision, while electronic processing-in-memory (PIM) enables energy-efficient computing by co-locating storage and computation but suffers from endurance and reconfiguration constraints, limiting it to static weight mapping. Neither approach alone achieves the balance needed for adaptive, efficient AI. To break this impasse, we study a hybrid electronic-photonic-PIM computing architecture and introduce H3PIMAP, a heterogeneity-aware mapping framework that seamlessly orchestrates workloads across electronic and optical tiers. By optimizing workload partitioning through a two-stage multi-objective exploration method, H3PIMAP harnesses light speed for high-throughput operations and PIM efficiency for memory-bound tasks. In system-level evaluations, H3PIMAP delivers a 3.32x latency reduction across language and vision models and, on large language models, achieves 77.0% lower latency with 14.6% lower energy at matched quality, outperforming homogeneous and naive mapping strategies. This proposed framework lays the foundation for hybrid AI accelerators, bridging the gap between electronic and photonic computation for next-generation efficiency and scalability.
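The abstract does not spell out the two-stage exploration method, so the following is only a generic sketch of the multi-objective idea behind it: among candidate workload mappings, keep those that no other candidate beats on both latency and energy (the Pareto front). The `Mapping` type, the candidate names, and all numbers are hypothetical and are not taken from H3PIMAP.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """A hypothetical candidate mapping with two costs to minimize."""
    name: str
    latency_ms: float
    energy_mj: float


def dominates(a: Mapping, b: Mapping) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.energy_mj <= b.energy_mj
    strictly_better = a.latency_ms < b.latency_ms or a.energy_mj < b.energy_mj
    return no_worse and strictly_better


def pareto_front(candidates: list[Mapping]) -> list[Mapping]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up numbers for illustration only; they are not results from the paper.
candidates = [
    Mapping("all-photonic", latency_ms=1.0, energy_mj=9.0),
    Mapping("all-pim", latency_ms=8.0, energy_mj=2.0),
    Mapping("hybrid-a", latency_ms=2.0, energy_mj=3.0),
    Mapping("hybrid-b", latency_ms=3.0, energy_mj=4.0),  # dominated by hybrid-a
]

front = pareto_front(candidates)
print([m.name for m in front])
```

A real mapper would likely score candidates with a hardware cost model and search a much larger space; the dominance check itself stays the same.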

preprint · 2025 · arXiv

Democratizing Electronic-Photonic AI Systems: An Open-Source AI-Infused Cross-Layer Co-Design and Design Automation Toolflow

Photonics is becoming a cornerstone technology for high-performance AI systems and scientific computing, offering unparalleled speed, parallelism, and energy efficiency. Despite this promise, the design and deployment of electronic-photonic AI systems remain highly challenging due to a steep learning curve across multiple layers, spanning device physics, circuit design, system architecture, and AI algorithms. The absence of a mature electronic-photonic design automation (EPDA) toolchain leads to long, inefficient design cycles and limits cross-disciplinary innovation and co-evolution. In this work, we present a cross-layer co-design and automation framework aimed at democratizing photonic AI system development. We begin by introducing our architecture designs for scalable photonic edge AI and Transformer inference, followed by SimPhony, an open-source modeling tool for rapid EPIC AI system evaluation and design-space exploration. We then highlight advances in AI-enabled photonic design automation, including physical AI-based Maxwell solvers, a fabrication-aware inverse design framework, and a scalable inverse training algorithm for meta-optical neural networks, enabling a scalable EPDA stack for next-generation electronic-photonic AI systems.

preprint · 2025 · arXiv

Toward Large-Scale Photonics-Empowered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration

In this work, we identify three considerations that are essential for realizing practical photonic AI systems at scale: (1) dynamic tensor operation support for modern models rather than only weight-static kernels, especially for attention/Transformer-style workloads; (2) systematic management of conversion, control, and data-movement overheads, where multiplexing and dataflow must amortize electronic costs instead of letting ADC/DAC and I/O dominate; and (3) robustness under hardware non-idealities that become more severe as integration density grows. To study these coupled tradeoffs quantitatively, and to ensure they remain meaningful under real implementation constraints, we build a cross-layer toolchain that supports photonic AI design from early exploration to physical realization. SimPhony provides implementation-aware modeling and rapid cross-layer evaluation, translating physical costs into system-level metrics so architectural decisions are grounded in realistic assumptions. ADEPT and ADEPT-Z enable end-to-end circuit and topology exploration, connecting system objectives to feasible photonic fabrics under practical device and circuit constraints. Finally, Apollo and LiDAR provide scalable photonic physical design automation, turning candidate circuits into manufacturable layouts while accounting for routing, thermal, and crosstalk constraints.

preprint · 2023 · arXiv

Lightening-Transformer: A Dynamically-operated Optically-interconnected Photonic Transformer Accelerator

The wide adoption and significant computing resource of attention-based transformers, e.g., Vision Transformers and large language models (LLM), have driven the demand for efficient hardware accelerators. There is a growing interest in exploring photonics as an alternative technology to digital electronics due to its high energy efficiency and ultra-fast processing speed. Photonic accelerators have shown promising results for CNNs, which mainly rely on weight-static linear operations. However, they encounter issues when efficiently supporting Transformer architectures, questioning the applicability of photonics to advanced ML tasks. The primary hurdle lies in their inefficiency in handling unique workloads in Transformers, i.e., dynamic and full-range tensor multiplication. In this work, we propose Lightening-Transformer, the first light-empowered, high-performance, and energy-efficient photonic Transformer accelerator. To overcome prior designs' fundamental limitations, we introduce a novel dynamically-operated photonic tensor core, DPTC, a crossbar array of interference-based optical vector dot-product engines supporting highly parallel, dynamic, and full-range matrix multiplication. Furthermore, we design a dedicated accelerator that integrates our novel photonic computing cores with photonic interconnects for inter-core data broadcast, fully unleashing the power of optics. Comprehensive evaluations show that ours achieves >2.6x energy and >12x latency reductions compared to prior photonic accelerators and delivers the lowest energy cost and 2 to 3 orders of magnitude lower energy-delay product compared to electronic Transformer accelerators, all while maintaining digital-comparable accuracy. Our work highlights the immense potential of photonics for advanced ML workloads, such as Transformer-backboned LLM. Our work is available at https://github.com/zhuhanqing/Lightening-Transformer.
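As a rough illustration of how an interference-based engine can compute a dynamic, full-range (signed) dot product, the sketch below models an idealized 50:50 coupler followed by balanced photodetection. It assumes lossless, noise-free, real-valued optical fields with sign carried by a 0 or pi phase. The function name `coupler_dot` is hypothetical, and this is a simplified model rather than the DPTC design itself.

```python
import math


def coupler_dot(x: list[float], y: list[float]) -> float:
    """Signed dot product from idealized 50:50 interference and balanced detection."""
    total = 0.0
    for xi, yi in zip(x, y):
        # An ideal 50:50 coupler outputs the scaled sum and difference of its inputs.
        # Sign is carried by optical phase (0 or pi), so field values may be negative.
        out_sum = (xi + yi) / math.sqrt(2)
        out_diff = (xi - yi) / math.sqrt(2)
        # Photodetectors measure intensity, i.e. the squared field amplitude.
        # Subtracting the two photocurrents leaves 2 * xi * yi.
        total += out_sum ** 2 - out_diff ** 2
    return total / 2  # undo the factor of 2 from balanced detection


x = [0.5, -1.0, 2.0]
y = [-3.0, 0.25, 1.0]
optical = coupler_dot(x, y)
reference = sum(a * b for a, b in zip(x, y))
print(optical, reference)
```

Because both operands enter as optical signals, neither has to be programmed into the device ahead of time. This is the property that separates dynamic operation from the weight-static approach the abstract describes.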

preprint · 2023 · arXiv

M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core based on PRogrammable Multi-Operand Multimode Interference

Photonic computing shows promise for transformative advancements in machine learning (ML) acceleration, offering ultra-fast speed, massive parallelism, and high energy efficiency. However, current photonic tensor core (PTC) designs based on standard optical components hinder scalability and compute density due to their large spatial footprint. To address this, we propose an ultra-compact PTC using customized programmable multi-operand multimode interference (MOMMI) devices, named M3ICRO. The programmable MOMMI leverages the intrinsic light propagation principle, providing a single-device programmable matrix unit beyond the conventional computing paradigm of one multiply-accumulate (MAC) operation per device. To overcome the optimization difficulty of customized devices that often requires time-consuming simulation, we apply ML for optics to predict the device behavior and enable a differentiable optimization flow. We thoroughly investigate the reconfigurability and matrix expressivity of our customized PTC, and introduce a novel block unfolding method to fully exploit the computing capabilities of a complex-valued PTC for near-universal real-valued linear transformations. Extensive evaluations demonstrate that M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x higher compute density, 3.7-12x higher system throughput, and superior noise robustness compared to state-of-the-art coherent PTC designs, while maintaining close-to-digital task accuracy across various ML benchmarks. Our code is open-sourced at https://github.com/JeremieMelo/M3ICRO-MOMMI.
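The abstract does not define the block unfolding method, so the sketch below shows only a basic identity that any complex-to-real scheme can build on. For a real input x, the real and imaginary parts of Wx equal the stacked real blocks [Re(W); Im(W)] applied to x. The helper names and the example matrix are hypothetical, and this should not be read as the M3ICRO algorithm.

```python
def complex_matvec(w: list[list[complex]], x: list[float]) -> list[complex]:
    """Multiply a complex matrix by a real input vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def stacked_real_matvec(w: list[list[complex]], x: list[float]) -> list[float]:
    """Apply the real block matrix [Re(W); Im(W)] to x."""
    real_rows = [[w_ij.real for w_ij in row] for row in w]
    imag_rows = [[w_ij.imag for w_ij in row] for row in w]
    return [sum(a * x_j for a, x_j in zip(row, x))
            for row in real_rows + imag_rows]


# Arbitrary illustrative values, not device data.
w = [[1 + 2j, -1j], [0.5 + 0j, 3 - 1j]]
x = [2.0, -1.0]

y = complex_matvec(w, x)
# Reading out real and imaginary parts separately gives 2k real outputs
# from a single k-by-k complex transform.
unfolded = [v.real for v in y] + [v.imag for v in y]
stacked = stacked_real_matvec(w, x)
print(unfolded, stacked)
```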