Source author record

Haowei Lu

Haowei Lu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Distributed, Parallel, and Cluster Computing; eess.SP; eess.SY; Machine Learning; Systems and Control

Catalog footprint

What is connected

2 works

5 topics

4 close collaborators


Preprint, 2022, arXiv

A self-organizing multi-agent system for distributed voltage regulation

This paper presents a distributed voltage regulation method based on multi-agent system control and network self-organization for a large distribution network. The network autonomously organizes itself into small subnetworks through the epsilon decomposition of the sensitivity matrix, and agents group themselves into these subnetworks with the communication links being autonomously determined. Each subnetwork controls its voltage by locating the closest local distributed generation and optimizing their outputs. This simplifies and reduces the size of the optimization problem and the interaction requirements. This approach also facilitates adaptive grouping of the network by self-reorganizing to maintain a stable state in response to time-varying network requirements and changes. The effectiveness of the proposed approach is validated through simulations on a model of a real heavily-meshed secondary distribution network. Simulation results and comparisons with other methods demonstrate the ability of the subnetworks to autonomously and independently regulate the voltage and to adapt to unpredictable network conditions over time, thereby enabling autonomous and flexible distribution networks.
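The abstract does not spell out how the epsilon decomposition is computed. As a rough illustration of the general idea, the sketch below drops every coupling in a sensitivity matrix whose magnitude is at or below a threshold and treats the remaining connected groups of nodes as subnetworks. The function name `epsilon_decomposition`, the symmetric treatment of couplings, and the matrix values used afterwards are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def epsilon_decomposition(S: List[List[float]], eps: float) -> List[List[int]]:
    """Group node indices into subnetworks by ignoring weak couplings.

    S is a square sensitivity matrix (hypothetical values in the usage below).
    A link i-j is kept when |S[i][j]| or |S[j][i]| exceeds eps; treating the
    coupling as undirected is an assumption of this sketch.
    """
    n = len(S)
    # Build the adjacency list of the thresholded (undirected) graph.
    adj = [
        [j for j in range(n)
         if j != i and max(abs(S[i][j]), abs(S[j][i])) > eps]
        for i in range(n)
    ]

    seen = [False] * n
    groups: List[List[int]] = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search collects one connected component.
        stack = [start]
        seen[start] = True
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        groups.append(sorted(component))
    return groups


# Illustrative 4-node matrix: nodes 0-1 and 2-3 are strongly coupled in pairs.
S_example = [
    [1.00, 0.80, 0.05, 0.01],
    [0.70, 1.00, 0.02, 0.03],
    [0.04, 0.02, 1.00, 0.60],
    [0.01, 0.03, 0.50, 1.00],
]
subnetworks = epsilon_decomposition(S_example, eps=0.1)
```

With the example matrix and `eps=0.1`, the two strongly coupled pairs end up in separate groups. Raising the threshold splits the network into smaller groups and lowering it merges them, which is one way to picture the adaptive regrouping the abstract describes.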

Preprint, 2022, arXiv

Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.
