Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
29 works
0 followers
19 topics
4 close collaborators

Published work

29 published item(s)

preprint · 2026 · arXiv

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis over underexplored data environments. This burden is especially salient in complex financial analytics, where relevant evidence is rarely pre-specified. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. We introduce DataClawBench, a benchmark for exploratory real-world financial data analysis under limited prior guidance. DataClawBench contains approximately 2.06 million real-world records across enterprise, industry, and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones that diagnose exploration and reasoning failures beyond outcome accuracy. A systematic evaluation of eight advanced LLMs under the OpenClaw agent reveals that exploratory data analysis breaks agent reliability: more exploration does not reliably translate into task-relevant progress or correct final answers.

preprint · 2022 · arXiv

A Time-domain Real-valued Generalized Wiener Filter for Multi-channel Neural Separation Systems

Frequency-domain beamformers have been successful in a wide range of multi-channel neural separation systems in the past years. However, the operations in conventional frequency-domain beamformers are typically independently defined and complex-valued, which results in two drawbacks: the former does not fully utilize the advantage of end-to-end optimization, and the latter may introduce numerical instability during the training phase. Motivated by the recent success in end-to-end neural separation systems, in this paper we propose the time-domain real-valued generalized Wiener filter (TD-GWF), a linear filter defined on a 2-D learnable real-valued signal transform. TD-GWF splits the transformed representation into groups and performs a minimum mean-square error (MMSE) estimation on all available channels on each of the groups. We show how TD-GWF can be connected to conventional filter-and-sum beamformers when a certain signal transform and number of groups are specified. Moreover, given the recent success of sequential neural beamforming frameworks, we show how TD-GWF can be applied in such frameworks to perform iterative beamforming and separation to obtain an overall performance gain. Comprehensive experimental results show that TD-GWF performs consistently better than conventional frequency-domain beamformers in the sequential neural beamforming pipeline with various neural network architectures, microphone array scenarios, and task configurations.

preprint · 2022 · arXiv

An Information-theoretical Secured Byzantine-fault Tolerance Consensus in Quantum Key Distribution Network

Quantum key distribution (QKD) networks are expected to provide information-theoretically secure (ITS) communication over long distances. QKD networks based on a trusted-relay architecture are now the most widely used scheme in practice. However, it is unrealistic to assume that all relays are fully trustworthy in complex networks. In the past, only a few studies have theoretically analyzed passive eavesdropping attacks by dishonest relays and the corresponding defense methods. However, we have found that active attacks by dishonest relays can be more threatening. Considering both passive and active attacks, we treat dishonest relays as Byzantine nodes and analyze the upper limit on the number of Byzantine nodes that the QKD network can accommodate. In this paper, we propose an ITS Byzantine-fault tolerance (BFT) QKD network scheme to achieve end-to-end key distribution based on point-to-point QKD links. To ensure consistency and provide BFT ability in the QKD network, we design an ITSBFT-consensus protocol for this network scheme. To ensure the information-theoretic security of consensus, we design a temporary signature scheme based on point-to-point QKD link keys. To prevent Byzantine nodes from disrupting the execution process of key distribution, we design an end-to-end key distribution scheme combined with consensus. We theoretically analyze the proposed ITSBFT-QKD network scheme from four aspects: QKD key distribution security, temporary signature security, consensus security, and leader election fairness. The simulation results demonstrate the feasibility and performance of the scheme.

preprint · 2022 · arXiv

Analysis of Diffractive Neural Networks for Seeing Through Random Diffusers

Imaging through diffusive media is a challenging problem, where the existing solutions heavily rely on digital computers to reconstruct distorted images. We provide a detailed analysis of a computer-free, all-optical imaging method for seeing through random, unknown phase diffusers using diffractive neural networks, covering different deep learning-based training strategies. By analyzing various diffractive networks designed to image through random diffusers with different correlation lengths, a trade-off between the image reconstruction fidelity and distortion reduction capability of the diffractive network was observed. During its training, random diffusers with a range of correlation lengths were used to improve the diffractive network's generalization performance. Increasing the number of random diffusers used in each epoch reduced the overfitting of the diffractive network's imaging performance to known diffusers. We also demonstrated that the use of additional diffractive layers improved the generalization capability to see through new, random diffusers. Finally, we introduced deliberate misalignments in training to 'vaccinate' the network against random layer-to-layer shifts that might arise due to the imperfect assembly of the diffractive networks. These analyses provide a comprehensive guide in designing diffractive networks to see through random diffusers, which might profoundly impact many fields, such as biomedical imaging, atmospheric physics, and autonomous driving.

preprint · 2022 · arXiv

FRA-RIR: Fast Random Approximation of the Image-source Method

The training of modern speech processing systems often requires a large amount of simulated room impulse response (RIR) data in order to allow the systems to generalize well in real-world, reverberant environments. However, simulating realistic RIR data typically requires accurate physical modeling, and accelerating such a simulation process typically requires certain computational platforms such as a graphics processing unit (GPU). In this paper, we propose FRA-RIR, a fast random approximation method of the widely used image-source method (ISM), to efficiently generate realistic RIR data without specific computational devices. FRA-RIR replaces the physical simulation in the standard ISM by a series of random approximations, which significantly speeds up the simulation process and enables its application in on-the-fly data generation pipelines. Experiments show that FRA-RIR is not only significantly faster than other existing ISM-based RIR simulation tools on standard computational platforms, but also improves the performance of speech denoising systems evaluated on real-world RIR when trained with simulated RIR. A Python implementation of FRA-RIR is available online at https://github.com/yluo42/FRA-RIR.

preprint · 2022 · arXiv

Improving Choral Music Separation through Expressive Synthesized Data from Sampled Instruments

Choral music separation refers to the task of extracting tracks of voice parts (e.g., soprano, alto, tenor, and bass) from mixed audio. The lack of datasets has impeded research on this topic, as previous work has only been able to train and evaluate models on a few minutes of choral music data due to copyright issues and dataset collection difficulties. In this paper, we investigate the use of synthesized training data for the source separation task on real choral music. We make three contributions. First, we provide an automated pipeline for synthesizing choral music data from sampled instrument plugins with controllable options for instrument expressiveness. This produces an 8.2-hour-long choral music dataset from the JSB Chorales Dataset, and one can easily synthesize additional data. Second, we conduct an experiment to evaluate multiple separation models on available choral music separation datasets from previous work. To the best of our knowledge, this is the first experiment to comprehensively evaluate choral music separation. Third, experiments demonstrate that the synthesized choral data is of sufficient quality to improve the model's performance on real choral music datasets. This provides additional experimental statistics and data support for the choral music separation study.

preprint · 2022 · arXiv

Massively Parallel Universal Linear Transformations using a Wavelength-Multiplexed Diffractive Optical Network

We report deep learning-based design of a massively parallel broadband diffractive neural network for all-optically performing a large group of arbitrarily-selected, complex-valued linear transformations between an input and output field-of-view, each with N_i and N_o pixels, respectively. This broadband diffractive processor is composed of N_w wavelength channels, each of which is uniquely assigned to a distinct target transformation. A large set of arbitrarily-selected linear transformations can be individually performed through the same diffractive network at different illumination wavelengths, either simultaneously or sequentially (wavelength scanning). We demonstrate that such a broadband diffractive network, regardless of its material dispersion, can successfully approximate N_w unique complex-valued linear transforms with a negligible error when the number of diffractive neurons (N) in its design matches or exceeds 2 x N_w x N_i x N_o. We further report that the spectral multiplexing capability (N_w) can be increased by increasing N; our numerical analyses confirm these conclusions for N_w > 180, which can be further increased to e.g., ~2000 depending on the upper bound of the approximation error. Massively parallel, wavelength-multiplexed diffractive networks will be useful for designing high-throughput intelligent machine vision systems and hyperspectral processors that can perform statistical inference and analyze objects/scenes with unique spectral properties.
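As a rough illustration of the neuron-count condition quoted in this abstract (N at least 2 x N_w x N_i x N_o), the sketch below evaluates the threshold for a hypothetical configuration. The function name and the example sizes are assumptions chosen for illustration and do not come from the paper.

```python
def min_diffractive_neurons(n_w: int, n_i: int, n_o: int) -> int:
    """Return the threshold 2 * N_w * N_i * N_o quoted in the abstract.

    n_w: number of wavelength channels (one target transformation each)
    n_i: number of input field-of-view pixels
    n_o: number of output field-of-view pixels
    """
    if min(n_w, n_i, n_o) <= 0:
        raise ValueError("all sizes must be positive")
    return 2 * n_w * n_i * n_o


# Hypothetical example: 8 wavelength channels, 25 input and 25 output pixels.
threshold = min_diffractive_neurons(n_w=8, n_i=25, n_o=25)
print(f"at least {threshold} diffractive neurons")
```

Because the threshold is linear in each factor, doubling the number of wavelength channels doubles the number of neurons needed in this simple count.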

preprint · 2022 · arXiv

Node Representation Learning in Graph via Node-to-Neighbourhood Mutual Information Maximization

The key to learning informative node representations in graphs lies in how to gain contextual information from the neighbourhood. In this work, we present a simple yet effective self-supervised node representation learning strategy that directly maximizes the mutual information between the hidden representations of nodes and their neighbourhood, which can be theoretically justified by its link to graph smoothing. Following InfoNCE, our framework is optimized via a surrogate contrastive loss, where the positive selection underpins the quality and efficiency of representation learning. To this end, we propose a topology-aware positive sampling strategy, which samples positives from the neighbourhood by considering the structural dependencies between nodes and thus enables positive selection upfront. In the extreme case when only one positive is sampled, we fully avoid expensive neighbourhood aggregation. Our methods achieve promising performance on various node classification datasets. It is also worth mentioning that, by applying our loss function to MLP-based node encoders, our methods can be orders of magnitude faster than existing solutions. Our code and supplementary materials are available at https://github.com/dongwei156/n2n.
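To make the InfoNCE-style contrastive loss mentioned above more concrete, here is a minimal generic sketch for one node with a single sampled positive. It is not the authors' implementation (their code is at the linked repository); the dot-product similarity, the temperature value, and the toy vectors are all assumptions for illustration.

```python
import math


def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Generic InfoNCE loss for one anchor: -log softmax of the positive score.

    anchor:    representation of the node
    positive:  representation of one sampled neighbour (the positive)
    negatives: representations of other nodes (the negatives)
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy vectors: the loss is small when the positive aligns with the anchor.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned, misaligned)
```

With only one positive per node, the loss needs no aggregation over the whole neighbourhood, which is the efficiency argument the abstract makes.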

preprint · 2022 · arXiv

On the Use of Deep Mask Estimation Module for Neural Source Separation Systems

Most recent neural source separation systems rely on a masking-based pipeline where a set of multiplicative masks is estimated from and applied to a signal representation of the input mixture. The estimation of such masks, in almost all network architectures, is done by a single layer followed by an optional nonlinear activation function. However, recent literature has investigated the use of a deep mask estimation module and observed performance improvement compared to a shallow mask estimation module. In this paper, we analyze the role of such a deeper mask estimation module by connecting it to a recently proposed unsupervised source separation method, and empirically show that the deep mask estimation module is an efficient approximation of the so-called overseparation-grouping paradigm with the conventional shallow mask estimation layers.

preprint · 2022 · arXiv

To image, or not to image: Class-specific diffractive cameras with all-optical erasure of undesired objects

Privacy protection is a growing concern in the digital era, with machine vision techniques widely used throughout public and private settings. Existing methods address this growing problem by, e.g., encrypting camera images or obscuring/blurring the imaged information through digital algorithms. Here, we demonstrate a camera design that performs class-specific imaging of target objects with instantaneous all-optical erasure of other classes of objects. This diffractive camera consists of transmissive surfaces structured using deep learning to perform selective imaging of target classes of objects positioned at its input field-of-view. After their fabrication, the thin diffractive layers collectively perform optical mode filtering to accurately form images of the objects that belong to a target data class or group of classes, while instantaneously erasing objects of the other data classes at the output field-of-view. Using the same framework, we also demonstrate the design of class-specific permutation cameras, where the objects of a target data class are pixel-wise permuted for all-optical class-specific encryption, while the other objects are irreversibly erased from the output image. The success of class-specific diffractive cameras was experimentally demonstrated using terahertz (THz) waves and 3D-printed diffractive layers that selectively imaged only one class of the MNIST handwritten digit dataset, all-optically erasing the other handwritten digits. This diffractive camera design can be scaled to different parts of the electromagnetic spectrum, including, e.g., the visible and infrared wavelengths, to provide transformative opportunities for privacy-preserving digital cameras and task-specific data-efficient imaging.

preprint · 2021 · arXiv

Atomic-Scale Probing of Heterointerface Phonon Bridges in Nitride Semiconductor

Interface phonon modes that are generated by several atomic layers at the heterointerface play a major role in the interface thermal conductance for nanoscale high-power devices such as nitride-based high-electron-mobility transistors and light-emitting diodes. Here we measure the local phonon spectra across AlN/Si and AlN/Al interfaces using atomically resolved vibrational electron energy-loss spectroscopy in a scanning transmission electron microscope. At the AlN/Si interface, we observe various localized phonon modes, of which the extended and interfacial modes act as bridges to connect the bulk AlN modes and bulk Si modes, and are expected to boost the inelastic phonon transport and thus substantially contribute to interface thermal conductance. In comparison, no such phonon bridge is observed at the AlN/Al interface, for which partially extended modes dominate the interface thermal conductivity. This work provides valuable insights into understanding the interfacial thermal transport in nitride semiconductors and useful guidance for thermal management via interface engineering.

preprint · 2021 · arXiv

Cascadable all-optical NAND gates using diffractive networks

Owing to its potential advantages such as scalability, low latency and power efficiency, optical computing has seen rapid advances over the last decades. A core unit of a potential all-optical processor would be the NAND gate, which can be cascaded to perform an arbitrary logical operation. Here, we present the design and analysis of cascadable all-optical NAND gates using diffractive neural networks. We encoded the logical values at the input and output planes of a diffractive NAND gate using the relative optical power of two spatially-separated apertures. Based on this architecture, we numerically optimized the design of a diffractive neural network composed of 4 passive layers to all-optically perform NAND operation using the diffraction of light, and cascaded these diffractive NAND gates to perform complex logical functions by successively feeding the output of one diffractive NAND gate into another. We demonstrated the cascadability of our diffractive NAND gates by using identical diffractive designs to all-optically perform AND and OR operations, as well as a half-adder. Cascadable all-optical NAND gates composed of spatially-engineered passive diffractive layers can serve as a core component of various optical computing platforms.
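The claim that NAND gates can be cascaded into AND, OR, and a half-adder can be checked with ordinary Boolean logic. The sketch below uses the standard textbook constructions; it models only the logic, not the optics, and the specific wiring used in the paper is not stated in the abstract, so the gate arrangements here are assumptions.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on bits 0/1."""
    return 0 if (a and b) else 1


def and_gate(a: int, b: int) -> int:
    """AND built from two NANDs: NAND followed by a NAND used as an inverter."""
    x = nand(a, b)
    return nand(x, x)


def or_gate(a: int, b: int) -> int:
    """OR built from three NANDs: invert both inputs, then NAND them."""
    return nand(nand(a, a), nand(b, b))


def half_adder(a: int, b: int) -> tuple:
    """Half-adder from five NANDs; returns (sum_bit, carry_bit)."""
    n = nand(a, b)
    sum_bit = nand(nand(a, n), nand(b, n))  # XOR from four NANDs
    carry_bit = nand(n, n)                  # AND from the shared NAND
    return sum_bit, carry_bit


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), half_adder(a, b))
```

Every function calls only `nand`, which mirrors the idea of reusing one identical gate design throughout a cascade.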

preprint · 2021 · arXiv

Characterization of exhaled e-cigarette aerosols in a vape shop using a field-portable holographic on-chip microscope

The past decade marked a drastic increase in the usage of electronic cigarettes (e-cigs). The adverse health impact of secondhand exposure due to exhaled e-cig particles has raised significant concerns, demanding further research on the characteristics of these particles. In this work, we report direct volatility measurements on exhaled e-cig aerosols using a field-portable device (termed c-Air) enabled by deep learning and lens-free holographic microscopy; for this analysis, we performed a series of field experiments in a vape shop where customers used/vaped their e-cig products. During four days of experiments, we periodically sampled the indoor air with intervals of ~15 minutes and collected the exhaled particles with c-Air. Time-lapse inline holograms of the collected particles were recorded by c-Air and reconstructed using a convolutional neural network yielding phase-recovered microscopic images of the particles. Volumetric decay of individual particles due to evaporation was used as an indicator of the volatility of each aerosol. Volatility dynamics quantified through c-Air experiments showed that indoor vaping increased the volatility of particles as well as the percentage of volatile and semi-volatile particles in air. The reported methodology and findings can guide further studies on volatility characterization of e-cig emission and regulations on indoor vaping.

preprint · 2021 · arXiv

Computational Imaging Without a Computer: Seeing Through Random Diffusers at the Speed of Light

Imaging through diffusers presents a challenging problem with various digital image reconstruction solutions demonstrated to date using computers. We present a computer-free, all-optical image reconstruction method to see through random diffusers at the speed of light. Using deep learning, a set of diffractive surfaces are designed/trained to all-optically reconstruct images of objects that are covered by random phase diffusers. We experimentally demonstrated this concept using coherent THz illumination and all-optically reconstructed objects distorted by unknown, random diffusers, never used during training. Unlike digital methods, all-optical diffractive reconstructions do not require power except for the illumination light. This diffractive solution to see through diffusers can be extended to other wavelengths, and might fuel various applications in biomedical imaging, astronomy, atmospheric sciences, oceanography, security, robotics, among others.

preprint · 2021 · arXiv

Dual-Path Modeling for Long Recording Speech Separation in Meetings

Continuous speech separation (CSS) is the task of separating the speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model the long dependency in speech and thus leads to sub-optimum performance. The recently proposed dual-path modeling could be a remedy to this problem, thanks to its capability of jointly modeling the cross-window dependency and the local-window processing. In this work, we further extend the dual-path modeling framework for the CSS task. A transformer-based dual-path system is proposed, which integrates transform layers for global modeling. The proposed models are applied to LibriCSS, a real recorded multi-talk dataset, and consistent WER reduction can be observed in the ASR evaluation for separated speech. Also, a dual-path transformer equipped with convolutional layers is proposed. It significantly reduces the computation amount by 30% with better WER evaluation. Furthermore, online processing dual-path models are investigated, which show a 10% relative WER reduction compared to the baseline.

preprint · 2021 · arXiv

High-current CNT films grown directly on commercially available 2.5D substrates for low-voltage field-emission electron sources

Carbon nanotube (CNT) based electronic devices are promising for beyond-silicon solid-state electronics and vacuum micro-nano-electronics. Despite rapid progress in CNT field-effect transistor related solid-state electronics, the development of CNT-based vacuum nanoelectronic devices is substantially blocked by the longstanding challenge of achieving high-current field-emission (FE) electron sources at low operating voltage. In addition to the CNTs' properties, FE characteristics are also affected by substrate morphology and interface state. This work demonstrates high-current FE characteristics at relatively low operating voltage by using CNT films grown directly on commercially available 2.5D substrates with matched feature size and improved interface contact. Simulation results indicate that commercially available 2.5D substrates, including nickel foam (NiF) and carbon cloth (CC), with appropriate feature size would dramatically help to enhance emission current at a relatively lower voltage. A modified fabrication process results in improved contact between the CNTs and the underlying 2.5D substrates. A twenty times higher emission current density with a halved turn-on electric field, achieved by CNTs grown directly on randomly picked NiF, shows the potential of 2.5D substrates with good contact in improving FE characteristics. Finally, a high emission current (6 mA) with an approximately 75 percent decrease in turn-on electric field was realized by matching the feature size of the 2.5D substrate with that of the CNTs, bringing us significantly closer to reliable high-current and low-voltage FE electron sources for practical applications.

preprint · 2021 · arXiv

Low half-wave-voltage, ultra-high bandwidth thin-film LiNbO3 modulator based on hybrid waveguide and periodic capacitively loaded electrodes

A novel thin-film LiNbO3 (TFLN) electro-optic modulator is proposed and demonstrated. A LiNbO3-silica hybrid waveguide is adopted to maintain low optical loss for an electrode spacing as narrow as 3 μm, resulting in a record low half-wave-voltage length product of only 1.7 V·cm. Capacitively loaded traveling-wave electrodes (CL-TWEs) are employed to reduce the microwave loss, while a quartz substrate is used in place of a silicon substrate to achieve velocity matching. The fabricated TFLN modulator with a 5-mm-long modulation region exhibits a half-wave voltage of 3.4 V and merely 1.3 dB roll-off in electro-optic response up to 67 GHz, and a 3-dB modulation bandwidth over 110 GHz is predicted.
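The figures in this abstract are mutually consistent: a half-wave voltage of 3.4 V over a 5 mm modulation region gives the quoted voltage-length product. The short sketch below only reproduces that arithmetic; the function name is a hypothetical helper, not something from the paper.

```python
import math


def half_wave_voltage_length_product(v_pi_volts: float, length_cm: float) -> float:
    """Return the half-wave-voltage length product in V*cm."""
    return v_pi_volts * length_cm


# Values quoted in the abstract: 3.4 V over a 5-mm (0.5 cm) modulation region.
product = half_wave_voltage_length_product(v_pi_volts=3.4, length_cm=0.5)
print(f"{product:.1f} V*cm")
assert math.isclose(product, 1.7)
```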

preprint · 2021 · arXiv

Observation of optical gyromagnetic properties in a magneto-plasmonic metamaterial

Metamaterials with artificial optical properties have attracted significant research interest. In particular, artificial magnetic resonances in non-unity permeability tensor at optical frequencies in metamaterials have been reported. However, only non-unity diagonal elements of the permeability tensor have been demonstrated to date. A gyromagnetic permeability tensor with non-zero off-diagonal elements has not been observed at the optical frequencies. Here we report the observation of gyromagnetic properties in the near-infrared wavelength range in a magneto-plasmonic metamaterial. The non-zero off-diagonal permeability tensor element causes the transverse magneto-optical Kerr effect (TMOKE) under s-polarized incidence that otherwise vanishes if the permeability tensor is not gyromagnetic. By retrieving the permeability tensor elements from reflection, transmission, and TMOKE spectra, we show that the effective off-diagonal permeability tensor elements reach the 10^-3 level at the resonance wavelength (~900 nm) of the split-ring resonators that is at least two orders of magnitude higher than that of magneto-optical materials at the same wavelength. The artificial gyromagnetic permeability is attributed to the change in the local electric field direction modulated by the split-ring resonators. Our study demonstrates the possibility of engineering the permeability and permittivity tensors in metamaterials at arbitrary frequencies, thereby promising a variety of applications of next-generation nonreciprocal photonic devices, magneto-plasmonic sensors, and active metamaterials.

preprint · 2021 · arXiv

Ultrafast Parallel LiDAR with Time-encoding and Spectral Scanning: Breaking the Time-of-flight Limit

Light detection and ranging (LiDAR) has been widely used in autonomous driving and large-scale manufacturing. Although state-of-the-art scanning LiDAR can perform long-range three-dimensional imaging, the frame rate is limited by both round-trip delay and the beam steering speed, hindering the development of high-speed autonomous vehicles. For hundred-meter level ranging applications, a several-time speedup is highly desirable. Here, we uniquely combine fiber-based encoders with wavelength-division multiplexing devices to implement all-optical time-encoding on the illumination light. Using this method, parallel detection and fast inertia-free spectral scanning can be achieved simultaneously with single-pixel detection. As a result, the frame rate of a scanning LiDAR can be multiplied with scalability. We demonstrate a 4.4-fold speedup for a maximum 75-m detection range, compared with a time-of-flight-limited laser ranging system. This approach has the potential to improve the velocity of LiDAR-based autonomous vehicles to the regime of hundred kilometers per hour and open up a new paradigm for ultrafast-frame-rate LiDAR imaging.
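The round-trip delay that limits a scanning LiDAR can be estimated from the range alone. The sketch below assumes a deliberately simple model in which only one pulse is in flight at a time, so the sequential measurement rate is the inverse of the round-trip time; the function names are hypothetical, and applying the quoted 4.4-fold factor to this rate is an illustrative assumption rather than the paper's own calculation.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact SI value


def round_trip_time_s(range_m: float) -> float:
    """Time for light to travel to a target at range_m and back."""
    return 2.0 * range_m / SPEED_OF_LIGHT_M_PER_S


def max_sequential_rate_hz(range_m: float) -> float:
    """Measurement rate when each pulse must return before the next is sent."""
    return 1.0 / round_trip_time_s(range_m)


baseline_hz = max_sequential_rate_hz(75.0)  # 75-m range from the abstract
print(f"round trip: {round_trip_time_s(75.0) * 1e9:.0f} ns")
print(f"time-of-flight-limited rate: {baseline_hz / 1e6:.2f} MHz")
print(f"with a 4.4-fold speedup: {4.4 * baseline_hz / 1e6:.2f} MHz")
```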

preprint · 2020 · arXiv

An End-to-end Architecture of Online Multi-channel Speech Separation

Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered as a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by a neural network post filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimum performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way, an attentional selection module is proposed, where an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves comparable performance in an offline evaluation with the original separate processing-based pipeline, while producing remarkable improvements in an online evaluation.

preprint · 2020 · arXiv

Continuous speech separation: dataset and analysis

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies on speech separation use pre-segmented signals of artificially mixed speech utterances which are mostly fully overlapped, and the algorithms are evaluated based on signal-to-distortion ratio or similar performance metrics. However, in natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components. In addition, the signal-based metrics have very weak correlations with automatic speech recognition (ASR) accuracy. We think that not only does this make it hard to assess the practical relevance of the tested algorithms, it also hinders researchers from developing systems that can be readily applied to real scenarios. In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a continuous audio stream that contains multiple utterances that are partially overlapped by a varying degree. A new real recorded dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones. A Kaldi-based ASR evaluation protocol is also established by using a well-trained multi-conditional acoustic model. By using this dataset, several aspects of a recently proposed speaker-independent CSS algorithm are investigated. The dataset and evaluation scripts are available to facilitate the research in this direction.

preprint · 2020 · arXiv

Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation

Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods. Unlike the time-frequency domain approaches, the time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we propose the dual-path recurrent neural network (DPRNN), a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations iteratively, where the input length can be made proportional to the square root of the original sequence length in each operation. Experiments show that by replacing 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a 20 times smaller model than the previous best system.
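The square-root scaling described above can be illustrated with the chunking step alone. This is a minimal sketch, not the DPRNN implementation: it assumes 50% overlap between chunks and a chunk size near sqrt(2L), neither of which is stated in the abstract, and the helper names are hypothetical. No RNN is applied here; the point is only that both the chunk length (intra-chunk pass) and the number of chunks (inter-chunk pass) end up close to sqrt(2L).

```python
import math


def choose_chunk_size(length: int) -> int:
    """Pick an even chunk size close to sqrt(2 * length) (assumed heuristic)."""
    return max(2, 2 * round(math.sqrt(2 * length) / 2))


def segment(sequence, chunk_size: int):
    """Split a sequence into 50%-overlapping chunks, zero-padding the tail."""
    hop = chunk_size // 2
    padded = list(sequence)
    if len(padded) < chunk_size:
        padded += [0.0] * (chunk_size - len(padded))
    remainder = (len(padded) - chunk_size) % hop
    if remainder:
        padded += [0.0] * (hop - remainder)
    starts = range(0, len(padded) - chunk_size + 1, hop)
    return [padded[s:s + chunk_size] for s in starts]


length = 32000  # e.g. a hypothetical sequence of 32000 time steps
chunk_size = choose_chunk_size(length)
chunks = segment([0.0] * length, chunk_size)
print(f"sequence length: {length}")
print(f"intra-chunk length: {chunk_size}, inter-chunk length: {len(chunks)}")
```

Each pass therefore sees a few hundred steps instead of tens of thousands, which is what makes recurrent layers practical on such inputs.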

preprint2020arXiv

End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based beamforming techniques satisfy these requirements by definition, while for deep learning-based end-to-end systems those constraints are not fully addressed. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation. Based on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves the separation performance with fixed geometry array configuration, further proving the effectiveness of the proposed paradigm in the general problem of multi-microphone speech separation.

preprint2020arXiv

Low energy magnons in the chiral ferrimagnet $\text{Cu}_2\text{OSeO}_3$: a coarse-grained approach

We report a comprehensive neutron scattering study of low energy magnetic excitations in the breathing pyrochlore helimagnetic $\text{Cu}_2\text{OSeO}_3$. Fully documenting the four lowest energy magnetic modes that leave the ferrimagnetic configuration of the "strong tetrahedra" intact ($|\hbar\omega|<13$ meV), we find gapless quadratic dispersion at the $\Gamma$ point for energies above 0.2 meV, two doublets separated by 1.6(2) meV at the $R$ point, and a bounded continuum at the $X$ point. Our constrained rigid spin cluster model relates these features to Dzyaloshinskii-Moriya (DM) interactions and the incommensurate helical ground state. Combining conventional spin wave theory with a spin cluster form-factor accurately reproduces the measured equal time structure factor through multiple Brillouin zones. An effective spin Hamiltonian describing the complex anisotropic inter-cluster interactions is obtained.

preprint2020arXiv

Real-time binaural speech separation with preserved spatial cues

Deep learning speech separation algorithms have achieved great success in improving the quality and intelligibility of separated speech from mixed audio. Most previous methods focused on generating a single-channel output for each of the target speakers, hence discarding the spatial cues needed for the localization of sound sources in space. However, preserving the spatial information is important in many applications that aim to accurately render the acoustic scene such as in hearing aids and augmented reality (AR). Here, we propose a speech separation algorithm that preserves the interaural cues of separated sound sources and can be implemented with low latency and high fidelity, therefore enabling a real-time modification of the acoustic scene. Based on the time-domain audio separation network (TasNet), a single-channel time-domain speech separation system that can be implemented in real-time, we propose a multi-input-multi-output (MIMO) end-to-end extension of TasNet that takes binaural mixed audio as input and simultaneously separates target speakers in both channels. Experimental results show that the proposed end-to-end MIMO system is able to significantly improve the separation performance and keep the perceived location of the modified sources intact in various acoustic scenes.

preprint2020arXiv

Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss

Many recent source separation systems are designed to separate a fixed number of sources out of a mixture. In cases where the source activation patterns are unknown, such systems have to either adjust the number of outputs or distinguish invalid outputs from valid ones. Iterative separation methods have gained much attention in the community as they can flexibly decide the number of outputs; however, (1) they typically rely on long-term information to determine the stopping time for the iterations, which makes them hard to operate in a causal setting; (2) they lack a "fault tolerance" mechanism when the estimated number of sources is different from the actual number. In this paper, we propose a simple training method, the auxiliary autoencoding permutation invariant training (A2PIT), to alleviate the two issues. A2PIT assumes a fixed number of outputs and uses an auxiliary autoencoding loss to force the invalid outputs to be copies of the input mixture, and detects invalid outputs in a fully unsupervised way during the inference phase. Experimental results show that A2PIT is able to improve the separation performance across various numbers of speakers and effectively detect the number of speakers in a mixture.
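
The inference-time detection step described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: because invalid outputs are trained to copy the input mixture, an output that closely matches the mixture is treated as invalid. The cosine-similarity measure, the 0.95 threshold, and the function names are all hypothetical choices made for this example.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D signals."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def flag_valid_outputs(mixture, outputs, threshold=0.95):
    """Return True for outputs that do NOT look like a copy of the mixture.

    Assumption: similarity measure and threshold are illustrative only.
    """
    return [cosine_similarity(out, mixture) < threshold for out in outputs]

def count_sources(mixture, outputs, threshold=0.95):
    """Estimate the number of active sources as the number of valid outputs."""
    return sum(flag_valid_outputs(mixture, outputs, threshold))

if __name__ == "__main__":
    t = np.arange(1600) / 1600.0
    s1 = np.sin(2 * np.pi * 5 * t)
    s2 = np.sin(2 * np.pi * 13 * t)
    mixture = s1 + s2
    # A three-output system facing a two-source mixture: the third
    # output is assumed to have collapsed to a copy of the mixture.
    outputs = [s1, s2, mixture.copy()]
    print(flag_valid_outputs(mixture, outputs))
    print(count_sources(mixture, outputs))
```

Because the check needs only the mixture and the outputs, it requires no labels at inference time, which is consistent with the unsupervised detection the abstract describes.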

preprint2020arXiv

Terahertz Pulse Shaping Using Diffractive Surfaces

Recent advances in deep learning have been providing non-intuitive solutions to various inverse problems in optics. At the intersection of machine learning and optics, diffractive networks merge wave-optics with deep learning to design task-specific elements to all-optically perform various tasks such as object classification and machine vision. Here, we present a diffractive network, which is used to shape an arbitrary broadband pulse into a desired optical waveform, forming a compact pulse engineering system. We experimentally demonstrate the synthesis of square pulses with different temporal widths by manufacturing passive diffractive layers that collectively control both the spectral amplitude and the phase of an input terahertz pulse. Our results constitute the first demonstration of direct pulse shaping in the terahertz spectrum, where a complex-valued spectral modulation function directly acts on terahertz frequencies. Furthermore, a Lego-like physical transfer learning approach is presented to illustrate pulse-width tunability by replacing part of an existing network with newly trained diffractive layers, demonstrating its modularity. This learning-based diffractive pulse engineering framework can find broad applications in, e.g., communications, ultra-fast imaging and spectroscopy.

preprint2019arXiv

Fragmentation and isomerization of polycyclic aromatic hydrocarbons in the interstellar medium: coronene as a case study

Aims. Due to the limitations of current computational technology, the fragmentation and isomerization products of vibrationally-excited polycyclic aromatic hydrocarbon (PAH) molecules and their derivatives are poorly studied. In this work, we investigate the intermediate products of PAHs and their derivatives as well as the gas-phase reactions relevant to the interstellar medium, with coronene as a case study. Methods. Based on the semi-empirical method of PM3 as implemented in the CP2K program, molecular dynamics simulations are performed to model the major processes (e.g., vibrations, fragmentations, and isomerizations) of coronene and its derivatives (e.g., methylated coronene, hydrogenated coronene, dehydrogenated coronene, nitrogen-substituted coronene, and oxygen-substituted coronene) at temperatures of 3000 K and 4000 K. Results. We find that the anharmonic effects are crucial for the simulation of vibrational excitation. For the molecules studied here, H2, CO, HCN, and CH2 are the major fragments. Following the dissociation of these small units, most of the molecules could maintain their ring structures, but a few molecules would break completely into carbon chains. The transformation from hexagon to pentagon or heptagon may occur and the heteroatomic substitutions (e.g., N- or O-substitutions) facilitate the transformation.

preprint2019arXiv

Visually Constructing the Chemical Structure of a Single Molecule by Scanning Raman Picoscopy

The strong spatial confinement of a nanocavity plasmonic field has made it possible to visualize the inner structure of a single molecule and even to distinguish its vibrational modes in real space. With such ever-improved spatial resolution, it is anticipated that full vibrational imaging of a molecule could be achieved to reveal molecular structural details. Here we demonstrate full Raman images of individual vibrational modes on the Ångström level for a single Mg-porphine molecule, revealing distinct characteristics of each vibrational mode in real space. Furthermore, by exploiting the underlying interference effect and Raman fingerprint database, we propose a new methodology for structural determination, coined as scanning Raman picoscopy, to show how such ultrahigh-resolution spectromicroscopic vibrational images can be used to visually assemble the chemical structure of a single molecule through a simple Lego-like building process.