Source author record

Kun Qian

Kun Qian appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language eess.AS Sound Machine Learning Computer Vision Artificial Intelligence cond-mat.mtrl-sci Databases Distributed, Parallel, and Cluster Computing eess.IV eess.SP Information Retrieval Networking and Internet Architecture physics.app-ph physics.comp-ph quant-ph

Catalog footprint

What is connected

18works

16topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

3D Primitives are a Spatial Language for VLMs

Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce \textbf{\textsc{SpatialBabel}}, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six \emph{scene-code languages} (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to $5.7\times$ across languages. Second, we propose \textbf{Code-CoT} (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to $+6.4$\% on primitive scenes and real-photo CV-Bench-3D accuracy by $+5.0$\% for VLMs with strong coding capabilities. Third, we propose \textbf{S$^{3}$-FT} (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive-reconstructions into structured annotations and fine-tuning on the result, with \emph{no human labels and no teacher model}. Training on primitive images alone, S$^3$-FT improves Qwen3-VL-8B by $+4.6$ to $+8.6$\% on SpatialBabel-Primitive-QA, $+9.7$\% on CV-Bench-2D, and $+17$\% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.

preprint2026arXiv

LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries

Personalization today is fundamentally platform-centric: services build user representations from the behavioral fragments they observe. Yet no platform can construct a complete picture of the user, as competitive incentives, legal constraints, user privacy concerns, and epistemic limits create persistent data barriers. This paper argues for a shift from platform-centric personalization to user-governed personalization, where only the user can integrate fragmented contexts across platforms and the offline world. The key asymmetry lies in data access: only users can aggregate their own cross-platform and offline information. Large language model (LLM) agents make such integration practically feasible for the first time by enabling reasoning over heterogeneous personal data and transforming users' cross-context information into actionable personalization capabilities. We provide proof-of-concept evidence that users equipped with cross-platform data exports and an off-the-shelf LLM agent can outperform single-platform personalization baselines. We conclude by outlining a research agenda for building scalable user-governed personalization systems.

preprint2023arXiv

Unicron: Economizing Self-Healing LLM Training at Scale

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.

preprint2022arXiv

Annotation Inconsistency and Entity Bias in MultiWOZ

MultiWOZ is one of the most popular multi-domain task-oriented dialog datasets, containing 10K+ annotated dialogs covering eight domains. It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG), and end-to-end (E2E) dialog modeling. In this work, we identify an overlooked issue with dialog state annotation inconsistencies in the dataset, where a slot type is tagged inconsistently across similar dialogs leading to confusion for DST modeling. We propose an automated correction for this issue, which is present in a whopping 70% of the dialogs. Additionally, we notice that there is significant entity bias in the dataset (e.g., "cambridge" appears in 50% of the destination cities in the train domain). The entity bias can potentially lead to named entity memorization in generative models, which may go unnoticed as the test set suffers from a similar entity bias as well. We release a new test set with all entities replaced with unseen entities. Finally, we benchmark joint goal accuracy (JGA) of the state-of-the-art DST baselines on these modified versions of the data. Our experiments show that the annotation inconsistency corrections lead to 7-10% improvement in JGA. On the other hand, we observe a 29% drop in JGA when models are evaluated on the new test set with unseen entities.

preprint2022arXiv

Audio Self-supervised Learning: A Survey

Inspired by the humans' cognitive ability to generalise knowledge and skills, Self-Supervised Learning (SSL) targets at discovering general representations from large-scale data without requiring human annotations, which is an expensive and time consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarising the knowledge in audio SSL are currently missing. To fill this gap, in the present work, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarise the empirical works that exploit the audio modality in multi-modal SSL frameworks, and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions on the development of audio SSL.

preprint2022arXiv

Learning a Better Initialization for Soft Prompts via Meta-Learning

Prompt tuning (PT) is an effective approach to adapting pre-trained language models to downstream tasks. Without a good initialization, prompt tuning doesn't perform well under few-shot settings. So pre-trained prompt tuning (PPT) is proposed to initialize prompts by leveraging pre-training data. We propose MetaPT (Meta-learned Prompt Tuning) to further improve PPT's initialization by considering latent structure within the pre-training data. Specifically, we introduce the structure by first clustering pre-training data into different auxiliary tasks with unsupervised methods. Then we use these tasks to pre-train prompts with a meta-learning algorithm. Such a process can make prompts learn a better initialization by discovering commonalities among these auxiliary tasks. We evaluate our method on seven downstream tasks. Our MetaPT achieves better and more stable performance than the state-of-the-art method.

preprint2022arXiv

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling, that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new optimization scheme, memory replay back-propagation (MRBP), which promotes long-range back-propagation through time with a significantly reduced memory requirement. Experimental results show that Memformer has achieved comparable performance compared to the baselines by using 8.1x less memory space and 3.2x faster on inference. Analysis of the attention pattern shows that our external memory slots can encode and retain important information through timesteps.

preprint2022arXiv

Quantum Deep Learning for Mutant COVID-19 Strain Prediction

New COVID-19 epidemic strains like Delta and Omicron with increased transmissibility and pathogenicity emerge and spread across the whole world rapidly while causing high mortality during the pandemic period. Early prediction of possible variants (especially spike protein) of COVID-19 epidemic strains based on available mutated SARS-CoV-2 RNA sequences may lead to early prevention and treatment. Here, combining the advantage of quantum and quantum-inspired algorithms with the wide application of deep learning, we propose a development tool named DeepQuantum, and use this software to realize the goal of predicting spike protein variation structure of COVID-19 epidemic strains. In addition, this hybrid quantum-classical model for the first time achieves quantum-inspired blur convolution similar to classical depthwise convolution and also successfully applies quantum progressive training with quantum circuits, both of which guarantee that our model is the quantum counterpart of the famous style-based GAN. The results state that the fidelities of random generating spike protein variation structure are always beyond 96% for Delta, 94% for Omicron. The training loss curve is more stable and converges better with multiple loss functions compared with the corresponding classical algorithm. At last, evidences that quantum-inspired algorithms promote the classical deep learning and hybrid models effectively predict the mutant strains are strong.

preprint2021arXiv

$\boldsymbolγ$-Net: Superresolving SAR Tomographic Inversion via Deep Learning

Synthetic aperture radar tomography (TomoSAR) has been extensively employed in 3-D reconstruction in dense urban areas using high-resolution SAR acquisitions. Compressive sensing (CS)-based algorithms are generally considered as the state of the art in super-resolving TomoSAR, in particular in the single look case. This superior performance comes at the cost of extra computational burdens, because of the sparse reconstruction, which cannot be solved analytically and we need to employ computationally expensive iterative solvers. In this paper, we propose a novel deep learning-based super-resolving TomoSAR inversion approach, $\boldsymbolγ$-Net, to tackle this challenge. $\boldsymbolγ$-Net adopts advanced complex-valued learned iterative shrinkage thresholding algorithm (CV-LISTA) to mimic the iterative optimization step in sparse reconstruction. Simulations show the height estimate from a well-trained $\boldsymbolγ$-Net approaches the Cramér-Rao lower bound while improving the computational efficiency by 1 to 2 orders of magnitude comparing to the first-order CS-based methods. It also shows no degradation in the super-resolution power comparing to the state-of-the-art second-order TomoSAR solvers, which are much more computationally expensive than the first-order methods. Specifically, $\boldsymbolγ$-Net reaches more than $90\%$ detection rate in moderate super-resolving cases at 25 measurements at 6dB SNR. Moreover, simulation at limited baselines demonstrates that the proposed algorithm outperforms the second-order CS-based method by a fair margin. Test on real TerraSAR-X data with just 6 interferograms also shows high-quality 3-D reconstruction with high-density detected double scatterers.

preprint2021arXiv

Deep Attention-based Representation Learning for Heart Sound Classification

Cardiovascular diseases are the leading cause of deaths and severely threaten human health in daily life. On the one hand, there have been dramatically increasing demands from both the clinical practice and the smart home application for monitoring the heart status of subjects suffering from chronic cardiovascular diseases. On the other hand, experienced physicians who can perform an efficient auscultation are still lacking in terms of number. Automatic heart sound classification leveraging the power of advanced signal processing and machine learning technologies has shown encouraging results. Nevertheless, human hand-crafted features are expensive and time-consuming. To this end, we propose a novel deep representation learning method with an attention mechanism for heart sound classification. In this paradigm, high-level representations are learnt automatically from the recorded heart sound data. Particularly, a global attention pooling layer improves the performance of the learnt representations by estimating the contribution of each unit in feature maps. The Heart Sounds Shenzhen (HSS) corpus (170 subjects involved) is used to validate the proposed method. Experimental results validate that, our approach can achieve an unweighted average recall of 51.2% for classifying three categories of heart sounds, i. e., normal, mild, and moderate/severe annotated by cardiologists with the help of Echocardiography.

preprint2020arXiv

An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety

The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore in responding to fight against and reduce the impact of this global health crisis. In this study, we focus on developing some potential use-cases of intelligent speech analysis for COVID-19 diagnosed patients. In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety. For this purpose, two established acoustic feature sets and support vector machines are utilised. Our experiments show that an average accuracy of .69 obtained estimating the severity of illness, which is derived from the number of days in hospitalisation. We hope that this study can foster an extremely fast, low-cost, and convenient way to automatically detect the COVID-19 disease.

preprint2020arXiv

Broadband Free Space Impedance in $\mathrm{Co_2Z}$ Hexaferrites by Substitution of Quadrivalent Heavy Transition Metal Ions for Miniaturized RF Devices

Polycrystalline samples of Z-type hexaferrites, having nominal compositions $\mathrm{Ba_3Co_{2+x}Fe_{24-2x}M_xO_{41}}$ where M = $\mathrm{Ir^{4+}, Hf^{4+}, Mo^{4+}}$ and x=0 and 0.05, were processed via ceramic processing protocols in pursuit of low magnetic and dielectric losses as well as equivalent permittivity and permeability. Fine process control was conducted to ensure optimal magnetic properties. Organic dispersants (i.e., isobutylene and maleic anhydride) were employed to achieve maximum densities. Crystallographic structure, characterized by X-ray diffraction, revealed that doping with $\mathrm{Ir^{4+}, Hf^{4+}, Mo^{4+}}$ did not adversely affect the crystal structure and phase purity of the Z-type hexaferrite. The measured microwave and magnetic properties show that the resonant frequency shifts depending on the specific dopant allowing for tunability of the operational frequency and bandwidth. The frequency bandwidth in which permittivity and permeability are very near equal (i.e., ~400 MHz for $\mathrm{Mo^{4+}}$ (x), where x=0.05 doping) is shown to occur at frequencies between 0.2 and 1.0 GHz depending on dopant type. These results give rise to low loss at 650 MHz, with considerable size reduction of an order of magnitude, while maintaining the characteristic impedance of free space (i.e., 377 $\mathrmΩ$). These results allow for miniaturization and optimized band-pass performance of magnetodielectric materials for communication devices such as antenna and radomes that can be engineered to operate over desired frequency ranges using cost effective and volumetric processing methodologies.

preprint2020arXiv

COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis

At the time of writing, the world population is suffering from more than 10,000 registered COVID-19 disease epidemic induced deaths since the outbreak of the Corona virus more than three months ago now officially known as SARS-CoV-2. Since, tremendous efforts have been made worldwide to counter-steer and control the epidemic by now labelled as pandemic. In this contribution, we provide an overview on the potential for computer audition (CA), i.e., the usage of speech and sound analysis by artificial intelligence to help in this scenario. We first survey which types of related or contextually significant phenomena can be automatically assessed from speech or sound. These include the automatic recognition and monitoring of breathing, dry and wet coughing or sneezing sounds, speech under cold, eating behaviour, sleepiness, or pain to name but a few. Then, we consider potential use-cases for exploitation. These include risk assessment and diagnosis based on symptom histograms and their development over time, as well as monitoring of spread, social distancing and its effects, treatment and recovery, and patient wellbeing. We quickly guide further through challenges that need to be faced for real-life usage. We come to the conclusion that CA appears ready for implementation of (pre-)diagnosis and monitoring tools, and more generally provides rich and significant, yet so far untapped potential in the fight against COVID-19 spread.

preprint2020arXiv

deepSELF: An Open Source Deep Self End-to-End Learning Framework

We introduce an open-source toolkit, i.e., the deep Self End-to-end Learning Framework (deepSELF), as a toolkit of deep self end-to-end learning framework for multi-modal signals. To the best of our knowledge, it is the first public toolkit assembling a series of state-of-the-art deep learning technologies. Highlights of the proposed deepSELF toolkit include: First, it can be used to analyse a variety of multi-modal signals, including images, audio, and single or multi-channel sensor data. Second, we provide multiple options for pre-processing, e.g., filtering, or spectrum image generation by Fourier or wavelet transformation. Third, plenty of topologies in terms of NN, 1D/2D/3D CNN, and RNN/LSTM/GRU can be customised and a series of pretrained 2D CNN models, e.g., AlexNet, VGGNet, ResNet can be used easily. Last but not least, above these features, deepSELF can be flexibly used not only as a single model but also as a fusion of such.

preprint2020arXiv

MeDaS: An open-source platform as service to help break the walls between medicine and informatics

In the past decade, deep learning (DL) has achieved unprecedented success in numerous fields including computer vision, natural language processing, and healthcare. In particular, DL is experiencing an increasing development in applications for advanced medical image analysis in terms of analysis, segmentation, classification, and furthermore. On the one hand, tremendous needs that leverage the power of DL for medical image analysis are arising from the research community of a medical, clinical, and informatics background to jointly share their expertise, knowledge, skills, and experience. On the other hand, barriers between disciplines are on the road for them often hampering a full and efficient collaboration. To this end, we propose our novel open-source platform, i.e., MeDaS -- the MeDical open-source platform as Service. To the best of our knowledge, MeDaS is the first open-source platform proving a collaborative and interactive service for researchers from a medical background easily using DL related toolkits, and at the same time for scientists or engineers from information sciences to understand the medical knowledge side. Based on a series of toolkits and utilities from the idea of RINV (Rapid Implementation aNd Verification), our proposed MeDaS platform can implement pre-processing, post-processing, augmentation, visualization, and other phases needed in medical image analysis. Five tasks including the subjects of lung, liver, brain, chest, and pathology, are validated and demonstrated to be efficiently realisable by using MeDaS.

preprint2020arXiv

Prediction of mechanical properties of non-equiatomic high-entropy alloy by atomistic simulation and machine learning

High-entropy alloys (HEAs) with multiple constituent elements have been extensively studied in the past 20 years due to their promising engineering application. Previous experimental and computational studies of HEAs focused mainly on equiatomic or near equiatomic HEAs. However, there is probably far more treasure in those non-equiatomic HEAs with carefully designed composition. In this study, molecular dynamics (MD) simulation combined with machine learning (ML) methods were used to predict the mechanical properties of non-equiatomic CuFeNiCrCo HEAs. A database was established based on a tensile test of 900 HEA single-crystal samples by MD simulation. We investigated and compared eight ML models for the learning tasks, ranging from shallow models to deep models. It was found that the kernel-based extreme learning machine (KELM) model outperformed others for the prediction of yield stress and Young's modulus. The accuracy of the KELM model was further verified by the large-sized polycrystal HEA samples.

preprint2020arXiv

Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction

Better machine understanding of pedestrian behaviors enables faster progress in modeling interactions between agents such as autonomous vehicles and humans. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects. Previous methods modeled these interactions by using a variety of aggregation methods that integrate different learned pedestrians states. We propose the Social Spatio-Temporal Graph Convolutional Neural Network (Social-STGCNN), which substitutes the need of aggregation methods by modeling the interactions as a graph. Our results show an improvement over the state of art by 20% on the Final Displacement Error (FDE) and an improvement on the Average Displacement Error (ADE) with 8.5 times less parameters and up to 48 times faster inference speed than previously reported methods. In addition, our model is data efficient, and exceeds previous state of the art on the ADE metric with only 20% of the training data. We propose a kernel function to embed the social interactions between pedestrians within the adjacency matrix. Through qualitative analysis, we show that our model inherited social behaviors that can be expected between pedestrians trajectories. Code is available at https://github.com/abduallahmohamed/Social-STGCNN.

preprint2016arXiv

Isolating Mice and Elephant in Data Centers

Data centers traffic is composed by numerous latency-sensitive "mice" flows, which is consisted of only several packets, and a few throughput-sensitive "elephant" flows, which occupy more than 80% of overall load. Generally, the short-lived "mice" flows induce transient congestion and the long-lived "elephant" flows cause persistent congestion. The network congestion is a major performance inhibitor. Conventionally, the hop-by-hop and end-to-end flow control mechanisms are employed to relief transient and persistent congestion, respectively. However, in face of the mixture of elephants and mice, we find the hybrid congestion control scheme including hop-by-hop and end-to-end flow control mechanisms suffers from serious performance impairments. As a step further, our in-depth analysis reveals that the hybrid scheme performs poor at latency of mice and throughput of elephant. Motivated by this understanding, we argue for isolating mice and elephants in different queues, such that the hop-by-hop and end-to-end flow control mechanisms are independently imposed to short-lived and long-lived flows, respectively. Our solution is readily-deployable and compatible with current commodity network devices and can leverage various congestion control mechanisms. Extensive simulations show that our proposal of isolation can simultaneously improve the latency of mice by at least 30% and the link utilization to almost 100%.

Kun Qian

What is connected

Connect this record

See the researcher in context

Building this map preview

18 published item(s)

3D Primitives are a Spatial Language for VLMs

LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries

Unicron: Economizing Self-Healing LLM Training at Scale

Annotation Inconsistency and Entity Bias in MultiWOZ

Audio Self-supervised Learning: A Survey

Learning a Better Initialization for Soft Prompts via Meta-Learning

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Quantum Deep Learning for Mutant COVID-19 Strain Prediction

$\boldsymbolγ$-Net: Superresolving SAR Tomographic Inversion via Deep Learning

Deep Attention-based Representation Learning for Heart Sound Classification

An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety

Broadband Free Space Impedance in $\mathrm{Co_2Z}$ Hexaferrites by Substitution of Quadrivalent Heavy Transition Metal Ions for Miniaturized RF Devices

COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis

deepSELF: An Open Source Deep Self End-to-End Learning Framework

MeDaS: An open-source platform as service to help break the walls between medicine and informatics

Prediction of mechanical properties of non-equiatomic high-entropy alloy by atomistic simulation and machine learning

Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction

Isolating Mice and Elephant in Data Centers