Researcher profile

Mo Zou

Mo Zou contributes to research discovery and scholarly infrastructure.

Researcher

Affiliation not imported

Open to collaborate

Trust snapshot

Quick read

Trust 17 - Unverified
Verification L1
Unclaimed author
4 works
0 followers
4 topics
4 close collaborators

Published work

4 published items

Preprint, 2026, arXiv

Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates

The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. A straightforward hardwiring of gpt-oss 120B would require fabricating photomask sets valued at over 6 billion dollars, rendering this straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 photomask layers are homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x that of GPU/WSE), 36 tokens/J (1,047x/283x that of GPU/WSE), 13,232 mm² total die area, and an estimated NRE of $59.46M-$123.5M at 5 nm technology. Analysis shows that HNLPU achieved 41.7-80.4x improvement in cost-effectiveness and 357x reduction in carbon footprint compared to OpenAI-scale H100 clusters, under an annual weight updating assumption.

Preprint, 2022, arXiv

Alleviating Datapath Conflicts and Design Centralization in Graph Analytics Acceleration

Previous graph analytics accelerators have achieved large throughput improvements by alleviating irregular off-chip memory accesses. However, on-chip datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, the Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. In addition, a novel High-throughput Graph analytics accelerator, HiGraph, is proposed by deploying the MDP-network to address each issue in practice. Experiments show that, compared with the state-of-the-art accelerator, HiGraph achieves up to 2.2x speedup (1.5x on average) as well as better scalability.

Preprint, 2022, arXiv

Characterizing and Understanding Distributed GNN Training on GPUs

Graph neural networks (GNNs) have been demonstrated to be powerful models in many domains for their effectiveness in learning over graphs. To scale GNN training to large graphs, a widely adopted approach is distributed training, which accelerates training using multiple computing nodes. Maximizing performance is essential, but the execution of distributed GNN training remains only preliminarily understood. In this work, we provide an in-depth analysis of distributed GNN training on GPUs, revealing several significant observations and providing useful guidelines for both software optimization and hardware optimization.

Preprint, 2022, arXiv

Characterizing and Understanding HGNNs on GPUs

Heterogeneous graph neural networks (HGNNs) deliver powerful capacity in heterogeneous graph representation learning. The execution of HGNNs is usually accelerated by GPUs. Therefore, characterizing and understanding the execution pattern of HGNNs on GPUs is important for both software and hardware optimizations. Unfortunately, there has been no detailed characterization of HGNN workloads on GPUs. In this paper, we characterize HGNN workloads in the inference phase and explore the execution of HGNNs on GPUs, to disclose the execution semantics and execution pattern of HGNNs. Based on this characterization and exploration, we propose several useful guidelines for both software and hardware optimizations for the efficient execution of HGNNs on GPUs.