Researcher profile

Ziyu Wang

Ziyu Wang contributes to research discovery and scholarly infrastructure.

Researcher; affiliation not imported; open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging); Verification L1; Unclaimed author
19 works
0 followers
13 topics
4 close collaborators

Published work

19 published items

preprint, 2026, arXiv

PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning

Personalized healthcare decisions require reasoning about how physiological and behavioral variables influence an individual patient over time. Existing temporal causal discovery methods are poorly matched to this setting: cohort-level models provide stable but non-personalized structures, while per-patient discovery is unreliable because individual trajectories are short, noisy, irregular, and non-stationary. This creates a fundamental gap between population-level causal modeling and the patient-specific, time-varying mechanisms needed for intervention reasoning. We introduce PerCaM-Health, a framework for learning personalized dynamic causal graphs from longitudinal health data. The framework learns a knowledge-guided population temporal graph, then conservatively adapts and evolves it using patient-specific temporal evidence and rolling-window updates, producing interpretable and auditable graph sequences. By coupling these graphs with temporal structural equations, the framework enables patient-level counterfactual queries, such as estimating short-horizon outcome changes under hypothetical behavioral interventions. Experiments on a semi-synthetic dynamic health benchmark show that PerCaM-Health improves graph recovery, dynamic edge tracking, and intervention direction accuracy compared to cohort-level, per-patient, and non-personalized temporal baselines. These results demonstrate that jointly modeling personalization and temporal evolution yields more reliable causal structure and intervention reasoning.

preprint, 2026, arXiv

The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents

Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.

preprint, 2022, arXiv

Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning

Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio and meanwhile maintains musicality). To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.

preprint, 2022, arXiv

C$^2$SP-Net: Joint Compression and Classification Network for Epilepsy Seizure Prediction

Recent developments in brain-machine interface technology have made seizure prediction possible. However, the communication of large volumes of electrophysiological signals between sensors and processing apparatus, and the related computation, become two major bottlenecks for seizure prediction systems due to the constrained bandwidth and limited computation resources, especially for wearable and implantable medical devices. Although compressive sensing (CS) can be adopted to compress the signals to reduce the communication bandwidth requirement, it needs a complex reconstruction procedure before the signal can be used for seizure prediction. In this paper, we propose C$^2$SP-Net to jointly solve compression, prediction, and reconstruction with a single neural network. A plug-and-play in-sensor compression matrix is constructed to reduce the transmission bandwidth requirement. The compressed signal can be used for seizure prediction without additional reconstruction steps. Reconstruction of the original signal can also be carried out in high fidelity. Prediction accuracy, sensitivity, false prediction rate, and reconstruction quality of the proposed framework are evaluated under various compression ratios. The experimental results illustrate that our model outperforms the competitive state-of-the-art baselines by a large margin in prediction accuracy. In particular, our proposed method produces an average loss of 0.35% in prediction accuracy with a compression ratio ranging from 1/2 to 1/16.
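To make the compression ratios quoted above concrete, the following is a minimal sketch of the measurement step in generic compressive sensing, where a signal of length N is projected to M = N x ratio values. It is not the C$^2$SP-Net implementation: the function name `compress`, the random Gaussian matrix (standing in for the learned in-sensor matrix), and the synthetic sine-wave signal are all assumptions made for illustration.

```python
import numpy as np

def compress(signal, ratio, seed=0):
    """Project a 1-D signal of length N down to M = int(N * ratio) measurements."""
    n = signal.shape[0]
    m = int(n * ratio)
    rng = np.random.default_rng(seed)
    # Assumption: a random Gaussian matrix stands in for the learned compression matrix.
    phi = rng.standard_normal((m, n)) / np.sqrt(m)
    return phi @ signal

# Hypothetical 256-sample window; a real system would use recorded EEG data.
x = np.sin(np.linspace(0, 8 * np.pi, 256))
for ratio in (1 / 2, 1 / 4, 1 / 8, 1 / 16):
    y = compress(x, ratio)
    print(f"ratio {ratio}: {x.shape[0]} samples -> {y.shape[0]} measurements")
```

At a ratio of 1/16, only 16 values per 256-sample window would need to leave the sensor, which is the bandwidth saving the abstract refers to.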

preprint, 2022, arXiv

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Recently, MLP-like vision models have achieved promising performances on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield a good representation power for deep recognition models. However, existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens to be mixed. Thus, customary information fusion procedures are not effective enough. To this end, this paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion. Critically, we propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted. Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K dataset without extra training data, performing favorably against the state-of-the-art vision MLP models. When the number of parameters is reduced to 26M, it still achieves 82.7% top-1 accuracy, surpassing the existing MLP-like models with a similar capacity. The code is available at https://github.com/ziyuwwang/DynaMixer.
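The idea of generating a mixing matrix from the token contents can be sketched roughly as follows. This is a simplified illustration, not the released DynaMixer code: the function names, the weight shapes, the row-wise softmax, and the random weights are assumptions, and the multi-segment fusion mechanism is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_mix(tokens, w_reduce, w_mix):
    """Mix tokens with a matrix computed from the tokens themselves.

    tokens:   (n, dim) token features
    w_reduce: (dim, d) projection that reduces each token to d dimensions
    w_mix:    (n * d, n * n) projection that produces the mixing logits
    """
    n, _ = tokens.shape
    reduced = tokens @ w_reduce                      # (n, d), cheaper than using full dim
    logits = (reduced.reshape(-1) @ w_mix).reshape(n, n)
    mixing = softmax(logits, axis=-1)                # each row sums to 1
    return mixing @ tokens, mixing

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, dim, reduced_dim = 4, 8, 2
tokens = rng.standard_normal((n_tokens, dim))
w_reduce = rng.standard_normal((dim, reduced_dim))
w_mix = rng.standard_normal((n_tokens * reduced_dim, n_tokens * n_tokens))
mixed, mixing = dynamic_mix(tokens, w_reduce, w_mix)
print(mixed.shape, mixing.sum(axis=1))
```

In a static mixer the matrix `mixing` would be a fixed learned parameter; here it changes with every input because it is a function of `tokens`.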

preprint, 2022, arXiv

Generative Deformable Radiance Fields for Disentangled Image Synthesis of Topology-Varying Objects

3D-aware generative models have demonstrated their superb performance to generate 3D neural radiance fields (NeRF) from a collection of monocular 2D images even for topology-varying object categories. However, these methods still lack the capability to separately control the shape and appearance of the objects in the generated radiance fields. In this paper, we propose a generative model for synthesizing radiance fields of topology-varying objects with disentangled shape and appearance variations. Our method generates deformable radiance fields, which builds the dense correspondence between the density fields of the objects and encodes their appearances in a shared template field. Our disentanglement is achieved in an unsupervised manner without introducing extra labels to previous 3D-aware GAN training. We also develop an effective image inversion scheme for reconstructing the radiance field of an object in a real monocular image and manipulating its shape and appearance. Experiments show that our method can successfully learn the generative model from unstructured monocular images and well disentangle the shape and appearance for objects (e.g., chairs) with large topological variance. The model trained on synthetic data can faithfully reconstruct the real object in a given single image and achieve high-quality texture and shape editing results.

preprint, 2022, arXiv

Kubric: A scalable dataset generator

Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details. But collecting, processing and annotating real data at scale is difficult, expensive, and frequently raises additional privacy, fairness and legal concerns. Synthetic data is a powerful tool with the potential to address these shortcomings: 1) it is cheap, 2) supports rich ground-truth annotations, 3) offers full control over data, and 4) can circumvent or mitigate problems regarding bias, privacy and licensing. Unfortunately, software tools for effective data generation are less mature than those for architecture design and training, which leads to fragmented generation efforts. To address these problems we introduce Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines, generating TBs of data. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation. We release Kubric, the used assets, all of the generation code, as well as the rendered datasets for reuse and modification.

preprint, 2022, arXiv

NeReF: Neural Refractive Field for Fluid Surface Reconstruction and Implicit Representation

Existing neural reconstruction schemes such as Neural Radiance Field (NeRF) are largely focused on modeling opaque objects. We present a novel neural refractive field (NeReF) to recover the wavefront of transparent fluids by simultaneously estimating the surface position and normal of the fluid front. Unlike prior art that treats the reconstruction target as a single layer of the surface, NeReF is specifically formulated to recover a volumetric normal field with its corresponding density field. A query ray will be refracted by NeReF according to its accumulated refractive point and normal, and we employ the correspondences and uniqueness of the refracted ray for NeReF optimization. We show NeReF, as a global optimization scheme, can more robustly tackle refraction distortions detrimental to traditional methods for correspondence matching. Furthermore, the continuous NeReF representation of the wavefront enables view synthesis as well as normal integration. We validate our approach on both synthetic and real data and show it is particularly suitable for sparse multi-view acquisition. We hence build a small light field array and experiment on various surface shapes to demonstrate high fidelity NeReF reconstruction.

preprint, 2022, arXiv

Recent Advances in Tunable Metasurfaces: Materials, Design and Applications

Metasurfaces, a two-dimensional (2D) form of metamaterials constituted by planar meta-atoms, exhibit exotic abilities to freely tailor electromagnetic (EM) waves. Over the past decade, tunable metasurfaces have come to the frontier in the field of nanophotonics, with tremendous effort focused on developing and integrating various active materials into metasurfaces. As a result, tunable/reconfigurable metasurfaces with multi-functionalities triggered by various external stimuli have been successfully demonstrated, opening a new avenue to dynamically manipulate and control EM waves for photonic applications in demand. In this review, we first outline the progress of tunable metasurface development in the last decade and highlight representative works from the perspectives of active materials development, design methodologies and application-driven exploration. Then, we elaborate on the active tuning mechanisms and relevant active materials. Next, we discuss recent achievements in theory as well as machine learning (ML) assisted design methodologies to sustain the development of this field. After that, we summarize and describe typical application areas of the tunable metasurfaces. We conclude this review by analyzing existing challenges and presenting our perspectives on future directions and opportunities in this vibrant and fast-developing field.

preprint, 2022, arXiv

SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization

Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, continuously growing and redundant template features lead to an inefficient inference. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly reduce the redundancy of memory features. Different from the previous methods which only detect feature redundancy between frames, SWEM merges both intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm. Further, adaptive weights for frame features endow SWEM with the flexibility to represent hard samples, improving the discrimination of templates. Besides, the proposed method maintains a fixed number of template features in memory, which ensures the stable inference complexity of the VOS system. Extensive experiments on commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3% $\mathcal{J}\&\mathcal{F}$ on DAVIS 2017 validation dataset) of SWEM. Code is available at: https://github.com/lmm077/SWEM.

preprint, 2022, arXiv

Tailoring topological transition of anisotropic polaritons by interface engineering in biaxial crystals

Polaritons in polar biaxial crystals with extreme anisotropy offer a promising route to manipulate nanoscale light-matter interactions. The dynamical modulation of their dispersion is of great significance for future integrated nano-optics but remains challenging. Here, we report a momentum-directed strategy, a coupling between the modes with extra momentum supported by the interface and in-plane hyperbolic polaritons, to tailor topological transitions of anisotropic polaritons in biaxial crystals. We experimentally demonstrate such tailored polaritons at the interface of heterostructures between graphene and α-phase molybdenum trioxide (α-MoO3). The interlayer coupling can be electrically modulated by changing the Fermi level in graphene, enabling a dynamic topological transition. More interestingly, we found that the topological transition occurs at a constant Fermi level when tuning the thickness of α-MoO3. The momentum-directed strategy implemented by interface engineering offers new insights for optical topological transitions, which may shed new light on programmable polaritonics, energy transfer and neuromorphic photonics.

preprint, 2021, arXiv

Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings

Cycle-consistent training is widely used for jointly learning a forward and inverse mapping between two domains of interest without the cumbersome requirement of collecting matched pairs within each domain. In this regard, the implicit assumption is that there exists (at least approximately) a ground-truth bijection such that a given input from either domain can be accurately reconstructed from successive application of the respective mappings. But in many applications no such bijection can be expected to exist and large reconstruction errors can compromise the success of cycle-consistent training. As one important instance of this limitation, we consider practically-relevant situations where there exists a many-to-one or surjective mapping between domains. To address this regime, we develop a conditional variational autoencoder (CVAE) approach that can be viewed as converting surjective mappings to implicit bijections whereby reconstruction errors in both directions can be minimized, and as a natural byproduct, realistic output diversity can be obtained in the one-to-many direction. As theoretical motivation, we analyze a simplified scenario whereby minima of the proposed CVAE-based energy function align with the recovery of ground-truth surjective mappings. On the empirical side, we consider a synthetic image dataset with known ground-truth, as well as a real-world application involving natural language generation from knowledge graphs and vice versa, a prototypical surjective case. For the latter, our CVAE pipeline can capture such many-to-one mappings during cycle training while promoting textural diversity for graph-to-text tasks. Our code is available at github.com/QipengGuo/CycleGT. (A condensed version of this paper has been accepted to AISTATS 2021. This version contains additional content and updates.)

preprint, 2021, arXiv

RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning

Offline methods for reinforcement learning have a potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real-world, including cost, safety, or ethical concerns. In this paper, we propose a benchmark called RL Unplugged to evaluate and compare offline RL methods. RL Unplugged includes data from a diverse range of domains including games (e.g., Atari benchmark) and simulated motor control problems (e.g., DM Control Suite). The datasets include domains that are partially or fully observable, use continuous or discrete actions, and have stochastic vs. deterministic dynamics. We propose detailed evaluation protocols for each domain in RL Unplugged and provide an extensive analysis of supervised learning and offline RL methods using these protocols. We will release data for all our tasks and open-source all algorithms presented in this paper. We hope that our suite of benchmarks will increase the reproducibility of experiments and make it possible to study challenging tasks with a limited computational budget, thus making RL research both more systematic and more accessible across the community. Moving forward, we view RL Unplugged as a living benchmark suite that will evolve and grow with datasets contributed by the research community and ourselves. Our project page is available on https://git.io/JJUhd.

preprint, 2020, arXiv

A Wasserstein Minimum Velocity Approach to Learning Unnormalized Models

Score matching provides an effective approach to learning flexible unnormalized models, but its scalability is limited by the need to evaluate a second-order derivative. In this paper, we present a scalable approximation to a general family of learning objectives including score matching, by observing a new connection between these objectives and Wasserstein gradient flows. We present applications with promise in learning neural density estimators on manifolds, and training implicit variational and Wasserstein auto-encoders with a manifold-valued prior.
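For context, the second-order term mentioned above appears in the standard score matching objective. The form below is the usual textbook statement (up to an additive constant), not a formula taken from this paper; the symbols $q_\theta$ for the unnormalized model and $p_{\text{data}}$ for the data distribution are notation chosen here for illustration.

```latex
J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[
  \operatorname{tr}\!\left(\nabla_x^2 \log q_\theta(x)\right)
  + \tfrac{1}{2}\left\lVert \nabla_x \log q_\theta(x) \right\rVert_2^2
\right]
```

The trace of the Hessian is the costly part: computing it exactly generally takes one additional backward pass per input dimension, which is the scalability limit the abstract describes.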

preprint, 2020, arXiv

Hyperparameter Selection for Offline Reinforcement Learning

Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios. However, existing hyperparameter selection methods for offline RL break the offline assumption by evaluating policies corresponding to each hyperparameter setting in the environment. This online execution is often infeasible and hence undermines the main aim of offline RL. Therefore, in this work, we focus on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data. Through large-scale empirical evaluation we show that: 1) offline RL algorithms are not robust to hyperparameter choices, 2) factors such as the offline RL algorithm and method for estimating Q values can have a big impact on hyperparameter selection, and 3) when we control those factors carefully, we can reliably rank policies across hyperparameter choices, and therefore choose policies which are close to the best policy in the set. Overall, our results present an optimistic view that offline hyperparameter selection is within reach, even in challenging tasks with pixel observations, high dimensional action spaces, and long horizon.

preprint, 2020, arXiv

Learning Interpretable Representation for Controllable Polyphonic Music Generation

While deep generative models have become the leading methods for algorithmic composition, it remains a challenging problem to control the generation process because the latent variables of most deep-learning models lack good interpretability. Inspired by the content-style disentanglement idea, we design a novel architecture, under the VAE framework, that effectively learns two interpretable latent factors of polyphonic music: chord and texture. The current model focuses on learning 8-beat long piano composition segments. We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications, including compositional style transfer, texture variation, and accompaniment arrangement. Both objective and subjective evaluations show that our method achieves a successful disentanglement and high quality controlled music generation.

preprint, 2020, arXiv

PIANOTREE VAE: Structured Representation Learning for Polyphonic Music

The dominant approach for music representation learning involves the deep unsupervised model family variational autoencoder (VAE). However, most, if not all, viable attempts on this problem have largely been limited to monophonic music. Normally composed of richer modality and more complex musical structures, the polyphonic counterpart has yet to be addressed in the context of music representation learning. In this work, we propose the PianoTree VAE, a novel tree-structured extension of the VAE designed for polyphonic music learning. The experiments support the validity of the PianoTree VAE via (i) semantically meaningful latent codes for polyphonic segments; (ii) more satisfactory reconstruction alongside decent geometry learned in the latent space; and (iii) the model's benefits to the variety of downstream music generation.

preprint, 2020, arXiv

POP909: A Pop-song Dataset for Music Arrangement Generation

Music arrangement generation is a subtask of automatic music generation, which involves reconstructing and re-conceptualizing a piece with new compositional techniques. Such a generation process inevitably requires reference from the original melody, chord progression, or other structural information. Despite some promising models for arrangement, they lack more refined data to achieve better evaluations and more practical results. In this paper, we propose POP909, a dataset which contains multiple versions of the piano arrangements of 909 popular songs created by professional musicians. The main body of the dataset contains the vocal melody, the lead instrument melody, and the piano accompaniment for each song in MIDI format, which are aligned to the original audio files. Furthermore, we provide the annotations of tempo, beat, key, and chords, where the tempo curves are hand-labeled and others are done by MIR algorithms. Finally, we conduct several baseline experiments with this dataset using standard deep music generation algorithms.

preprint, 2020, arXiv

Scaling data-driven robotics with reward sketching and batch reinforcement learning

We present a framework for data-driven robotics that makes use of a large dataset of recorded robot experience and scales to several tasks using learned reward functions. We show how to apply this framework to accomplish three different object manipulation tasks on a real robot platform. Given demonstrations of a task together with task-agnostic recorded experience, we use a special form of human annotation as supervision to learn a reward function, which enables us to deal with real-world tasks where the reward signal cannot be acquired directly. Learned rewards are used in combination with a large dataset of experience from different tasks to learn a robot policy offline using batch RL. We show that using our approach it is possible to train agents to perform a variety of challenging manipulation tasks including stacking rigid objects and handling cloth.