Source author record

Junsung Kim

Junsung Kim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: Artificial Intelligence; cond-mat.supr-con; Distributed, Parallel, and Cluster Computing; Hardware Architecture; Machine Learning

Catalog footprint

What is connected

2 works

5 topics

4 close collaborators

Preprint, 2026, arXiv

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.
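The abstract's claim that decode-phase attention is memory-bound can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the function names, the FLOP and byte counting model (one new query token per step, K and V each read once, 2-byte elements) and the hardware figures in the usage lines are all illustrative assumptions.

```python
def decode_attention_intensity(context_len, head_dim,
                               bytes_per_elem=2, q_heads_per_kv_head=1):
    """Approximate FLOPs per byte of KV cache read for one decode step.

    Assumed counting model (not from the paper): a single new query token,
    QK^T and PV each cost 2 * context_len * head_dim FLOPs per query head,
    and K and V are each read from memory exactly once.
    """
    flops = 4 * context_len * head_dim * q_heads_per_kv_head
    kv_bytes = 2 * context_len * head_dim * bytes_per_elem
    return flops / kv_bytes


def ridge_point(peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Roofline ridge point: the intensity at which compute and memory balance."""
    return peak_flops_per_s / mem_bandwidth_bytes_per_s


def is_memory_bound(intensity, ridge):
    """A kernel whose intensity sits below the ridge point is bandwidth-limited."""
    return intensity < ridge


# Illustrative placeholder hardware numbers, not a vendor specification.
ridge = ridge_point(peak_flops_per_s=1e15, mem_bandwidth_bytes_per_s=3e12)
mha = decode_attention_intensity(context_len=1_000_000, head_dim=128)
gqa = decode_attention_intensity(context_len=1_000_000, head_dim=128,
                                 q_heads_per_kv_head=8)
print(mha, gqa, round(ridge, 1), is_memory_bound(gqa, ridge))
```

Under these assumptions the intensity does not depend on context length: it stays at about one FLOP per byte (or a small multiple when several query heads share a KV head), far below the ridge point of a compute-heavy accelerator. Longer contexts therefore raise latency roughly in proportion to the bytes that must be streamed, which is why additional memory bandwidth matters more than additional compute in this regime.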

Preprint, 2019, arXiv

Lifted electron pocket and reversed orbital occupancy imbalance in FeSe

The FeSe nematic phase has been the focus of recent research on iron-based superconductors (IBSs) due to its unique properties. A number of electronic structure studies have been performed to find the origin of the phase. However, these attempts produced conflicting results and caused additional controversies. Here, we report results from angle-resolved photoemission and X-ray absorption spectroscopy studies on FeSe detwinned with a piezo stack. We have fully resolved the band dispersions and their orbital characters near the Brillouin zone corner, which reveal the absence of a Fermi pocket at the Y point in the 1Fe Brillouin zone. In addition, the occupation imbalance between the dxz and dyz orbitals is found to be opposite to that of iron pnictides, which is consistent with the identified band characters. These results settle controversial issues in the FeSe nematic phase and shed light on the origin of nematic phases in IBSs.