Source author record

Tong Zhang

Tong Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint

What is connected

161 works

43 topics

4 close collaborators

preprint 2026 arXiv

Anomaly-Preference Image Generation

Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

preprint 2026 arXiv

Code as Agent Harness

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make the harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

preprint 2026 arXiv

Immunological Density Shapes Recovery Trajectories in Long COVID

Post-acute sequelae of SARS-CoV-2 infection (Long COVID) frequently persists for months, yet drivers of clinical remission remain incompletely defined. Here we analyzed 97,564 longitudinal PASC assessments from 13,511 participants with linked vaccination histories to disentangle passive temporal progression from vaccine-associated change. Using a clinically validated threshold (PASC $\geq 12$), trajectories separated into three phenotypes: Protected (persistently sub-threshold), Refractory (persistently symptomatic), and Responders (transitioning from symptomatic to recovered). Across the full cohort, symptom severity increased modestly with elapsed time ($r=0.0521$, $P=1.26\times10^{-59}$), whereas cumulative vaccination showed an inverse association with severity ($r=-0.0434$, $P=5.95\times10^{-42}$). In summary, baseline Long COVID severity appears clinically deterministic. In the absence of intervention, symptoms typically persist without spontaneous resolution. Recovery is primarily associated with repeated immunization.

preprint 2026 arXiv

Indoor Fluid Antenna Systems Enabled by Layout-Specific Modeling and Group Relative Policy Optimization

The fluid antenna system (FAS) revolutionizes wireless communications by utilizing position-flexible antennas that dynamically optimize channel conditions and mitigate multipath fading. This innovation is particularly valuable in indoor environments, in which signal propagation is severely degraded due to structural obstructions and complex multipath reflections. In this paper, we investigate the channel modeling and the joint optimization of antenna positioning, beamforming, and power allocation for indoor FAS. In particular, we propose a layout-specific channel model, and employ the novel group relative policy optimization (GRPO) algorithm for tackling the optimization problem. Compared to the state-of-the-art Sionna model, our model achieves an 83.3% reduction in computation time with an approximately 3 dB increase in root-mean-square error (RMSE). When simplified to a two-ray model, our model allows for a closed-form antenna position solution with near-optimal performance. For the joint optimization problem, our GRPO algorithm outperforms proximal policy optimization (PPO) and other baselines in sum-rate, while requiring only 50.8% of the computational resources of PPO, thanks to its group advantage estimation. Simulation results show that increasing either the group size or trajectory length in GRPO does not yield significant improvements in sum-rate, suggesting that these parameters can be selected conservatively without sacrificing performance.
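The abstract credits GRPO's savings to group advantage estimation but does not spell out the formula. As a rough illustration only, the sketch below uses the commonly described form, in which each sampled trajectory's reward is normalized by the mean and standard deviation of its own group, so no separate value network is needed. The function name, the `eps` stabilizer, and the sample rewards are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds the scalar returns (for instance,
    sum-rates) of several trajectories sampled for the same state or prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (assumed) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative rewards for a group of three sampled trajectories.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Trajectories that beat their group average receive a positive advantage and the rest a negative one, which is what lets the policy update proceed without a learned critic.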

preprint 2026 arXiv

Learning-based Multi-View Stereo: A Survey

3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.

preprint 2026 arXiv

Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.

preprint 2026 arXiv

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.

preprint 2026 arXiv

Orchard: An Open-Source Agentic Modeling Framework

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

preprint 2026 arXiv

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

Improving the reasoning abilities of Large Language Models (LLMs) has been an active research topic recently. However, most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Existing training frameworks that try to incorporate process signals to optimize LLMs also rely heavily on tedious additional steps such as MCTS or training a separate reward model, which harms training efficiency. Moreover, the intuition behind the design of process signals lacks rigorous theoretical support, leaving the understanding of the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that can be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL, which is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL can turn the outcome reward into process supervision signals, which helps better guide exploration during RL optimization. Our experimental results demonstrate that PRL not only improves the average performance of LLMs' reasoning ability, measured by average@n, but also broadens the reasoning boundary by improving the pass@n metric. Extensive experiments show that the effectiveness of PRL can be verified and generalized.
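For orientation, the reward-maximization-plus-KL-penalty objective mentioned in the abstract is usually written in the following standard form. This is the generic textbook formulation, not the paper's own derivation; the symbols $\pi$ (policy), $\pi_{\mathrm{ref}}$ (reference model), $r$ (outcome reward), $\beta$ (penalty weight) and $\mathcal{D}$ (prompt distribution) are notation assumed here.

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta \,
  \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```

The abstract describes PRL as decomposing an objective of this kind across intermediate reasoning steps; how the per-step rewards are defined is left to the paper.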

preprint 2026 arXiv

Recursive Multi-Agent Systems

Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thought generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2$\times$-2.4$\times$ end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and data are provided at https://recursivemas.github.io.

preprint 2026 arXiv

Self-DACE++: Robust Low-Light Enhancement via Efficient Adaptive Curve Estimation

In this paper, we present Self-DACE++, an improved unsupervised and lightweight framework for Low-Light Image Enhancement (LLIE), building upon our previous Self-Reference Deep Adaptive Curve Estimation (Self-DACE). To better address the trade-off between computational efficiency and restoration quality, Self-DACE++ introduces enhanced Adaptive Adjustment Curves (AACs). These curves, governed by minimal trainable parameters, flexibly adjust the dynamic range while preserving the color fidelity, structural integrity, and naturalness of the enhanced images. To achieve an extremely lightweight architecture without sacrificing performance, we propose a randomized order training strategy coupled with a network fusion mechanism, which compresses the model into an efficient iterative inference structure. Furthermore, we formulate a physics-grounded objective function based on Retinex theory and incorporate a dedicated denoising module to effectively estimate and suppress latent noise in dark regions. Extensive qualitative and quantitative evaluations on multiple real-world benchmark datasets demonstrate that Self-DACE++ outperforms existing state-of-the-art methods, delivering superior enhancement quality with real-time inference capability. The code is available at https://github.com/John-Wendell/Self-DACE.

preprint 2026 arXiv

Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.

preprint 2026 arXiv

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

preprint 2025 arXiv

Fully First-Order Methods for Decentralized Bilevel Optimization

This paper focuses on decentralized stochastic bilevel optimization (DSBO) where agents only communicate with their neighbors. We propose Decentralized Stochastic Gradient Descent and Ascent with Gradient Tracking (DSGDA-GT), a novel algorithm that only requires first-order oracles that are much cheaper than second-order oracles widely adopted in existing works. We further provide a finite-time convergence analysis showing that for $n$ agents collaboratively solving the DSBO problem, the sample complexity of finding an $\epsilon$-stationary point in our algorithm is $\mathcal{O}(n^{-1}\epsilon^{-7})$, which matches the currently best-known results of the single-agent counterpart with linear speedup. The numerical experiments demonstrate both the communication and training efficiency of our algorithm.

preprint 2024 arXiv

AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets

We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora -- comprising abstracts, introductions, and conclusions -- we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.

preprint 2023 arXiv

An Indoor Environment Sensing and Localization System via mmWave Phased Array

An indoor layout sensing and localization system in the 60 GHz millimeter wave (mmWave) band, named mmReality, is elaborated in this paper. The mmReality system consists of one transmitter and one mobile receiver, each with a phased array and a single radio frequency (RF) chain. To reconstruct the room layout, the pilot signal is delivered from the transmitter to the receiver via different pairs of transmission and receiving beams, so that the signals at all antenna elements can be resolved. Then, spatial smoothing and the two-dimensional multiple signal classification (MUSIC) algorithm are applied to detect the angles-of-arrival (AoAs) and angles-of-departure (AoDs) of the rays from the transmitter to the receiver. Moreover, the technique of multi-carrier ranging is adopted to measure the distance of each propagation path. By synthesizing the above geometrical parameters, the location of the receiver relative to the transmitter can be pinpointed, and both line-of-sight (LoS) and non-line-of-sight (NLoS) paths can be determined. Therefore, the room layout can be reconstructed by moving the receiver and repeating the above measurement in different locations of the room. Finally, we show that the reconstructed room layout can be utilized to locate a mobile device according to its AoA spectrum, even with a single access point.

preprint 2022 arXiv

A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition

Symbolic Music Emotion Recognition (SMER) aims to predict music emotion from symbolic data, such as MIDI and MusicXML. Previous work mainly focused on learning better representations via (masked) language model pre-training but ignored the intrinsic structure of the music, which is extremely important to the emotional expression of music. In this paper, we present a simple multi-task framework for SMER, which combines the emotion recognition task with other emotion-related auxiliary tasks derived from the intrinsic structure of the music. The results show that our multi-task framework can be adapted to different models. Moreover, the labels of the auxiliary tasks are easy to obtain, which means our multi-task methods do not require manually annotated labels other than emotion. Experiments conducted on two publicly available datasets (EMOPIA and VGMIDI) show that our methods perform better on the SMER task. Specifically, accuracy increases by 4.17 absolute points to 67.58 on the EMOPIA dataset, and by 1.97 absolute points to 55.85 on the VGMIDI dataset. Ablation studies also show the effectiveness of the multi-task methods designed in this paper.

preprint 2022 arXiv

A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning

Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are either model-based or lack worst-case theoretical guarantees beyond linear MDPs. This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees. We introduce novel proof techniques to show that under suitable conditions, the worst-case regret of our posterior sampling method matches the best known results of optimization-based methods. In the linear MDP setting, the regret of our algorithm scales linearly with the dimension, as compared to a quadratic dependence for the existing posterior sampling-based exploration algorithms.
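As background for readers unfamiliar with the starting point, the sketch below shows classical Thompson Sampling on a plain (non-contextual) Bernoulli bandit with Beta posteriors. It is not the model-free posterior sampling method the paper proposes; the function name, the uniform Beta(1, 1) prior, and the example arm means are assumptions chosen for illustration.

```python
import random


def thompson_bernoulli(true_means, rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and return pull counts per arm.

    `true_means` are the (hypothetical) success probabilities of each arm;
    they are used only to simulate rewards, never by the decision rule.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one plausible mean per arm from its Beta posterior
        # (the +1 terms encode the assumed uniform Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(k)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(thompson_bernoulli([0.1, 0.9], rounds=2000))
```

Exploration comes entirely from posterior randomness: arms with few observations have wide posteriors and are occasionally sampled as best, while well-explored poor arms are chosen less and less often.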

preprint 2022 arXiv

Accelerating Edge Intelligence via Integrated Sensing and Communication

Realizing edge intelligence consists of sensing, communication, training, and inference stages. Conventionally, the sensing and communication stages are executed sequentially, which results in an excessive amount of dataset generation and uploading time. This paper proposes to accelerate edge intelligence via integrated sensing and communication (ISAC). As such, the sensing and communication stages are merged so as to make the best use of the wireless signals for the dual purpose of dataset generation and uploading. However, ISAC also introduces additional interference between sensing and communication functionalities. To address this challenge, this paper proposes a classification error minimization formulation to design the ISAC beamforming and time allocation. The globally optimal solution is derived via the rank-1 guaranteed semidefinite relaxation, and performance analysis is performed to quantify the ISAC gain over that of conventional edge intelligence. Simulation results are provided to verify the effectiveness of the proposed ISAC-assisted edge intelligence system. Interestingly, we find that ISAC is always beneficial when the duration of generating a sample is more than the duration of uploading a sample. Otherwise, the ISAC gain can vanish or even be negative. Nevertheless, we still derive a sufficient condition under which a positive ISAC gain is feasible.

preprint 2022 arXiv

Algebraic threefolds of general type with small volume

It is known that the optimal Noether inequality $\mathrm{vol}(X) \ge \frac{4}{3}p_g(X) - \frac{10}{3}$ holds for every $3$-fold $X$ of general type with $p_g(X) \ge 11$. In this paper, we give a complete classification of $3$-folds $X$ of general type with $p_g(X) \ge 11$ satisfying the above equality by giving the explicit structure of a relative canonical model of $X$. This model coincides with the canonical model of $X$ when $p_g(X) \ge 23$. We also establish the second and third optimal Noether inequalities for $3$-folds $X$ of general type with $p_g(X) \ge 11$. These results answer two open questions raised by J. Chen, M. Chen and C. Jiang, and in dimension three an open question raised by J. Chen and C. Lai. A novel phenomenon shows that there is a one-to-one correspondence between the three Noether inequalities and three possible residues of $p_g(X)$ modulo $3$.

preprint 2022 arXiv

Asymptotic Statistical Analysis of $f$-divergence GAN

Generative Adversarial Networks (GANs) have achieved great success in data generation. However, their statistical properties are not fully understood. In this paper, we consider the statistical behavior of the general $f$-divergence formulation of GAN, which includes the Kullback--Leibler divergence that is closely related to the maximum likelihood principle. We show that for parametric generative models that are correctly specified, all $f$-divergence GANs with the same discriminator classes are asymptotically equivalent under suitable regularity conditions. Moreover, with an appropriately chosen local discriminator, they become equivalent to the maximum likelihood estimate asymptotically. For generative models that are misspecified, GANs with different $f$-divergences converge to different estimators, and thus cannot be directly compared. However, it is shown that for some commonly used $f$-divergences, the original $f$-GAN is not optimal in that one can achieve a smaller asymptotic variance when the discriminator training in the original $f$-GAN formulation is replaced by logistic regression. The resulting estimation method is referred to as Adversarial Gradient Estimation (AGE). Empirical studies are provided to support the theory and to demonstrate the advantage of AGE over the original $f$-GANs under model misspecification.
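For reference, the $f$-divergence family the abstract refers to has the following standard definition, stated here as general background rather than as the paper's own notation. The densities $p$ and $q$ and the generator function $f$ are symbols assumed for this note.

```latex
D_f(P \,\|\, Q) \;=\; \int q(x)\, f\!\left( \frac{p(x)}{q(x)} \right) dx,
\qquad f \text{ convex},\quad f(1) = 0 .
% Choosing f(t) = t \log t recovers the Kullback--Leibler divergence:
% D_f(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx = \mathrm{KL}(P \,\|\, Q).
```

Different choices of $f$ give different members of the family, which is why the abstract can treat the Kullback--Leibler case and other GAN objectives within one analysis.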

preprint 2022 arXiv

Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint

Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarially robust image classification are provided to support our theory.

preprint 2022 arXiv

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, then a task-oriented network is fine-tuned from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through an empirical study of INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model struggles to capture concepts' semantics and is sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves the caption quality and highlights the importance of concept-aware representation learning. With these findings, we further propose Dual Concept Detection (DCD) to inject concept knowledge into the model during training. DCD is an auxiliary task that requires a caption model to learn the correspondence between video content and concepts and the co-occurrence relations between concepts. Experiments on MSR-VTT and VATEX demonstrate the effectiveness of DCD, and the visualization results further reveal the necessity of learning concept-aware representations.

preprint2022arXiv

Consecutive Pretraining: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain

Currently, under supervised learning, pretraining a model on a large-scale natural scene dataset and then fine-tuning it on a small amount of task-specific labeled data is the paradigm that dominates knowledge transfer learning. It has become the consensus solution for task-aware model training in the remote sensing domain (RSD). Unfortunately, due to the different categories of imaging data and the stiff challenges of data annotation, there is no sufficiently large and uniform remote sensing dataset to support large-scale pretraining in RSD. Moreover, pretraining models on large-scale natural scene datasets by supervised learning and then directly fine-tuning on diverse downstream tasks seems to be a crude method, which is easily affected by inevitable labeling noise, severe domain gaps and task-aware discrepancies. Thus, in this paper, considering self-supervised pretraining and the powerful vision transformer (ViT) architecture, a concise and effective knowledge transfer learning strategy called ConSecutive PreTraining (CSPT) is proposed based on the idea of not stopping pretraining in natural language processing (NLP), which can gradually bridge the domain gap and transfer knowledge from the natural scene domain to the RSD. The proposed CSPT can also release the huge potential of unlabeled data for task-aware model training. Finally, extensive experiments are carried out on twelve datasets in RSD involving three types of downstream tasks (scene classification, object detection and land cover classification) and two types of imaging data (optical and SAR). The results show that by utilizing the proposed CSPT for task-aware model training, almost all downstream tasks in RSD can outperform the previous method of supervised pretraining-then-fine-tuning and even surpass the state-of-the-art (SOTA) performance without expensive labeling or careful model design.

preprint2022arXiv

Dimension Independent Generalization of DP-SGD for Overparameterized Smooth Convex Optimization

This paper considers the generalization performance of differentially private convex learning. We demonstrate that the convergence analysis of Langevin algorithms can be used to obtain new generalization bounds with differential privacy guarantees for DP-SGD. More specifically, by using some recently obtained dimension-independent convergence results for stochastic Langevin algorithms with convex objective functions, we obtain $O(n^{-1/4})$ privacy guarantees for DP-SGD with the optimal excess generalization error of $\tilde{O}(n^{-1/2})$ for certain classes of overparameterized smooth convex optimization problems. This improves on previous DP-SGD results for such problems, which contain explicit dimension dependencies that make the resulting generalization bounds unsuitable for overparameterized models used in practical applications.
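For readers unfamiliar with DP-SGD, the following is a minimal sketch of one update in its commonly described form: clip each per-example gradient, sum, add Gaussian noise, then average. The function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, not taken from the paper, and the sketch does no privacy accounting.

```python
import math
import random


def dp_sgd_step(theta, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update (illustrative sketch, no privacy accounting)."""
    dim = len(theta)
    summed = [0.0] * dim
    for grad in per_example_grads:
        # Clip each per-example gradient to L2 norm at most clip_norm.
        norm = math.sqrt(sum(v * v for v in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += grad[j] * scale
    n = len(per_example_grads)
    # Add Gaussian noise with std noise_multiplier * clip_norm to the sum, then average.
    noisy_mean = [
        (summed[j] + rng.gauss(0.0, noise_multiplier * clip_norm)) / n
        for j in range(dim)
    ]
    return [theta[j] - lr * noisy_mean[j] for j in range(dim)]


# Hypothetical usage: two parameters, two examples.
rng = random.Random(0)
new_theta = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.1, 0.0]], lr=0.5,
                        clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping bounds the influence of any single example, which is what allows the added noise to be calibrated independently of the data.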

preprint2022arXiv

Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums

Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their practice and their theoretical analysis. For instance, it is not known what schedules of SGD achieve best convergence, even for simple problems such as optimizing quadratic objectives. In this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. The condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that can approximate Eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light on the success of cosine decay for such situations. For other situations, the proposed schedulers are superior to cosine decay.
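The two baseline schedules named in the abstract, step decay and cosine decay, can be sketched as below. This shows only their standard textbook forms and does not implement Eigencurve itself; the default values (`base_lr=0.1`, a tenfold drop every 30 epochs) are illustrative assumptions.

```python
import math


def step_decay(epoch, base_lr=0.1, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * (drop ** (epoch // every))


def cosine_decay(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Anneal from base_lr to min_lr along half a cosine period."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Compare the two schedules at a few epochs of a hypothetical 90-epoch run.
for epoch in (0, 30, 60, 89):
    print(epoch, step_decay(epoch), cosine_decay(epoch, 90))
```

Step decay is piecewise constant, while cosine decay shrinks the step size smoothly, slowly at first and near the end, and fastest in the middle.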

preprint2022arXiv

Elastic RAID: When RAID Meets SSDs with Built-in Transparent Compression

This paper studies how RAID (redundant array of independent disks) could take full advantage of modern SSDs (solid-state drives) with built-in transparent compression. In current practice, RAID users are forced to choose a specific RAID level (e.g., RAID 10 or RAID 5) with a fixed storage cost vs. speed performance trade-off. The commercial market is witnessing the emergence of a new family of SSDs that can internally perform hardware-based lossless compression on each 4KB LBA (logical block address) block, transparent to host OS and user applications. Beyond straightforwardly reducing the RAID storage cost, such modern SSDs make it possible to relieve RAID users from being locked into a fixed storage cost vs. speed performance trade-off. The key idea is simple: RAID systems opportunistically leverage higher-than-expected runtime user data compressibility to enable dynamic RAID level conversion to improve the speed performance without compromising the effective storage capacity. This paper presents design techniques to enable and optimize the practical implementation of such elastic RAID systems. For the purpose of demonstration, we implemented a Linux software-based elastic RAID prototype that supports dynamic conversion between RAID 5 and RAID 10. Compared with a baseline software-based RAID 5, under sufficient runtime data compressibility that enables the conversion from RAID 5 to RAID 10 over 60% of user data, the elastic RAID could improve the 4KB random write IOPS (IO per second) by 42% and 4KB random read IOPS in degraded mode by 46%, while maintaining the same effective storage capacity.

preprint2022arXiv

Exploiting Hybrid Semantics of Relation Paths for Multi-hop Question Answering Over Knowledge Graphs

Answering natural language questions on knowledge graphs (KGQA) remains a great challenge in terms of understanding complex questions via multi-hop reasoning. Previous efforts usually exploit large-scale entity-related text corpora or knowledge graph (KG) embeddings as auxiliary information to facilitate answer selection. However, the rich semantics implied in off-the-shelf relation paths between entities is far from well explored. This paper proposes improving multi-hop KGQA by exploiting relation paths' hybrid semantics. Specifically, we integrate explicit textual information and implicit KG structural features of relation paths based on a novel rotate-and-scale entity link prediction framework. Extensive experiments on three existing KGQA datasets demonstrate the superiority of our method, especially in multi-hop scenarios. Further investigation confirms our method's systematic coordination between questions and relation paths to identify answer entities.

preprint2022arXiv

Exploring Geometric Consistency for Monocular 3D Object Detection

This paper investigates the geometric consistency for monocular 3D object detection, which suffers from the ill-posed depth estimation. We first conduct a thorough analysis to reveal how existing methods fail to consistently localize objects when different geometric shifts occur. In particular, we design a series of geometric manipulations to diagnose existing detectors and then illustrate their vulnerability to consistently associate the depth with object apparent sizes and positions. To alleviate this issue, we propose four geometry-aware data augmentation approaches to enhance the geometric consistency of the detectors. We first modify some commonly used data augmentation methods for 2D images so that they can maintain geometric consistency in 3D spaces. We demonstrate such modifications are important. In addition, we propose a 3D-specific image perturbation method that employs the camera movement. During the augmentation process, the camera system with the corresponding image is manipulated, while the geometric visual cues for depth recovery are preserved. We show that by using the geometric consistency constraints, the proposed augmentation techniques lead to improvements on the KITTI and nuScenes monocular 3D detection benchmarks with state-of-the-art results. In addition, we demonstrate that the augmentation methods are well suited for semi-supervised training and cross-dataset generalization.

preprint2022arXiv

Fast Rates in Pool-Based Batch Active Learning

We consider a batch active learning scenario where the learner adaptively issues batches of points to a labeling oracle. Sampling labels in batches is highly desirable in practice due to the smaller number of interactive rounds with the labeling oracle (often human beings). However, batch active learning typically pays the price of a reduced adaptivity, leading to suboptimal results. In this paper we propose a solution which requires a careful trade-off between the informativeness of the queried points and their diversity. We theoretically investigate batch active learning in the practically relevant scenario where the unlabeled pool of data is available beforehand (pool-based active learning). We analyze a novel stage-wise greedy algorithm and show that, as a function of the label complexity, the excess risk of this algorithm matches the known minimax rates in standard statistical learning settings. Our results also exhibit a mild dependence on the batch size. These are the first theoretical results that employ careful trade-offs between informativeness and diversity to rigorously quantify the statistical performance of batch active learning in the pool-based scenario.

preprint2022arXiv

IDEA: Interpretable Dynamic Ensemble Architecture for Time Series Prediction

We enhance the accuracy and generalization of univariate time series point prediction by an explainable ensemble on the fly. We propose an Interpretable Dynamic Ensemble Architecture (IDEA), in which interpretable base learners give predictions independently with sparse communication as a group. The model is composed of several sequentially stacked groups connected by group backcast residuals and recurrent input competition. Ensemble driven by end-to-end training both horizontally and vertically brings state-of-the-art (SOTA) performances. Forecast accuracy improves by 2.6% over the best statistical benchmark on the TOURISM dataset and 2% over the best deep learning benchmark on the M4 dataset. The architecture enjoys several advantages, being applicable to time series from various domains, explainable to users with specialized modular structure and robust to changes in task distribution.

preprint2022arXiv

Leverage Your Local and Global Representations: A New Self-Supervised Learning Strategy

Self-supervised learning (SSL) methods aim to learn view-invariant representations by maximizing the similarity between the features extracted from different crops of the same image regardless of cropping size and content. In essence, this strategy ignores the fact that two crops may truly contain different image information, e.g., background and small objects, and thus tends to restrain the diversity of the learned representations. In this work, we address this issue by introducing a new self-supervised learning strategy, LoGo, that explicitly reasons about Local and Global crops. To achieve view invariance, LoGo encourages similarity between global crops from the same image, as well as between a global and a local crop. However, to correctly encode the fact that the content of smaller crops may differ entirely, LoGo promotes two local crops to have dissimilar representations, while being close to global crops. Our LoGo strategy can easily be applied to existing SSL methods. Our extensive experiments on a variety of datasets and using different self-supervised learning frameworks validate its superiority over existing approaches. Noticeably, we achieve better results than supervised models on transfer learning when using only 1/10 of the data.
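The attract and repel structure described above can be sketched as a toy objective over pre-computed crop features. This is a simplified illustration and not the paper's actual loss: it assumes plain cosine similarity for every pair, and the names `logo_style_loss` and `repel_weight` are hypothetical.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def logo_style_loss(global_feats, local_feats, repel_weight=1.0):
    """Pull global-global and global-local pairs together, push local-local pairs apart."""
    attract_gg = mean(1.0 - cosine(u, v) for u, v in combinations(global_feats, 2))
    attract_gl = mean(1.0 - cosine(g, l) for g, l in product(global_feats, local_feats))
    repel_ll = mean(cosine(u, v) for u, v in combinations(local_feats, 2))
    return attract_gg + attract_gl + repel_weight * repel_ll


# Hypothetical 2-D features: two identical global crops, two orthogonal local crops.
loss = logo_style_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the global crops agree and the local crops are already dissimilar, so the only remaining cost comes from the local crop that disagrees with the global crops.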

preprint2022arXiv

Minimax Regret Optimization for Robust Machine Learning under Distribution Shift

In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e. distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespread optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.
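The difference between minimizing worst-case risk and minimizing worst-case regret can be seen in a small worked example. The sketch below assumes a finite set of candidate models and a finite set of test distributions with known risks, and it measures regret against the best candidate in that same set. These are simplifications made for illustration, and the risk values are made up.

```python
def pick_dro(risks):
    """Choose the model with the smallest worst-case risk."""
    return min(risks, key=lambda m: max(risks[m]))


def pick_mro(risks):
    """Choose the model with the smallest worst-case regret.

    Regret on a distribution is the model's risk minus the best risk any
    candidate achieves on that distribution.
    """
    num_dists = len(next(iter(risks.values())))
    best = [min(r[k] for r in risks.values()) for k in range(num_dists)]
    return min(risks, key=lambda m: max(r - b for r, b in zip(risks[m], best)))


# Hypothetical risks of two models under two test distributions.
risks = {"model_a": [0.30, 0.30], "model_b": [0.05, 0.35]}
print(pick_dro(risks))  # model_a: worst-case risk 0.30 beats 0.35
print(pick_mro(risks))  # model_b: worst-case regret of about 0.05 beats 0.25
```

Here the second distribution is hard for every candidate, so the worst-case risk criterion is dominated by it and selects `model_a`, even though `model_a` is far from the best achievable on the first distribution.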

preprint2022arXiv

MulT: An End-to-End Multitask Learning Transformer

We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks, including depth estimation, semantic segmentation, reshading, surface normal estimation, 2D keypoint detection, and edge detection. Based on the Swin transformer model, our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads. At the heart of our approach is a shared attention mechanism modeling the dependencies across the tasks. We evaluate our model on several multitask benchmarks, showing that our MulT framework outperforms both the state-of-the-art multitask convolutional neural network models and all the respective single task transformer models. Our experiments further highlight the benefits of sharing attention across all the tasks, and demonstrate that our MulT model is robust and generalizes well to new domains. Our project website is at https://ivrl.github.io/MulT/.

preprint2022arXiv

Near Optimal Stochastic Algorithms for Finite-Sum Unbalanced Convex-Concave Minimax Optimization

This paper considers stochastic first-order algorithms for convex-concave minimax problems of the form $\min_{\mathbf{x}}\max_{\mathbf{y}} f(\mathbf{x}, \mathbf{y})$, where $f$ can be represented as the average of $n$ individual components which are $L$-average smooth. For the $\mu_x$-strongly-convex-$\mu_y$-strongly-concave setting, we propose a new method which can find an $\varepsilon$-saddle point of the problem in $\tilde{\mathcal O} \big(\sqrt{n(\sqrt{n}+\kappa_x)(\sqrt{n}+\kappa_y)}\log(1/\varepsilon)\big)$ stochastic first-order complexity, where $\kappa_x\triangleq L/\mu_x$ and $\kappa_y\triangleq L/\mu_y$. This upper bound is near optimal with respect to $\varepsilon$, $n$, $\kappa_x$ and $\kappa_y$ simultaneously. In addition, the algorithm is easily implemented and works well in practice. Our methods can be extended to solve more general unbalanced convex-concave minimax problems and the corresponding upper complexity bounds are also near optimal.

preprint2022arXiv

Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions

We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.
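The weighted ridge regression at the core of the algorithm can be illustrated in the scalar-feature case. The weighting rule used here, full weight while the uncertainty is below a threshold `alpha` and weight `alpha / uncertainty` above it, is an assumption made for illustration, and the function name and default values are hypothetical. The sketch covers only the estimator, not action selection or regret guarantees.

```python
import math


def weighted_ridge_1d(xs, ys, lam=1.0, alpha=0.5):
    """Scalar-feature weighted ridge regression with uncertainty-capped weights."""
    gram = lam      # regularized weighted Gram matrix (a scalar in 1-D)
    moment = 0.0    # weighted sum of x * y
    weights = []
    for x, y in zip(xs, ys):
        # Uncertainty of x under the current estimate: |x| / sqrt(gram) in 1-D.
        uncertainty = abs(x) / math.sqrt(gram)
        # Down-weight observations whose uncertainty exceeds the threshold.
        w = 1.0 if uncertainty <= alpha else alpha / uncertainty
        gram += w * x * x
        moment += w * x * y
        weights.append(w)
    return moment / gram, weights


# Hypothetical data: rewards roughly 2 * x, with one heavily corrupted reward.
theta, weights = weighted_ridge_1d([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 10.0, 2.0])
```

Down-weighting high-uncertainty observations limits how far a single corrupted reward can move the estimate in a direction that has not yet been explored well.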

preprint2022arXiv

Noether-Severi inequality and equality for irregular threefolds of general type

We prove the optimal Noether-Severi inequality that $\mathrm{vol}(X) \ge \frac{4}{3} \chi(\omega_{X})$ for all smooth and irregular $3$-folds $X$ of general type over $\mathbb{C}$. For those $3$-folds $X$ attaining the equality, we completely describe their canonical models and show that the topological fundamental group $\pi_1(X) \simeq \mathbb{Z}^2$. As a corollary, we obtain for the same $X$ another optimal inequality that $\mathrm{vol}(X) \ge \frac{4}{3}h^0_a(X, K_X)$ where $h^0_a(X, K_X)$ stands for the continuous rank of $K_X$, and we show that $X$ attains this equality if and only if $\mathrm{vol}(X) = \frac{4}{3}\chi(\omega_{X})$.

preprint2022arXiv

On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data

Existing theory predicts that data heterogeneity will degrade the performance of the Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the simple FedAvg algorithm converges very well. This paper explains the seemingly unreasonable effectiveness of FedAvg that contradicts the previous theoretical predictions. We find that the key assumption of bounded gradient dissimilarity in previous theoretical analyses is too pessimistic to characterize data heterogeneity in practical applications. For a simple quadratic problem, we demonstrate there exist regimes where large gradient dissimilarity does not have any negative impact on the convergence of FedAvg. Motivated by this observation, we propose a new quantity, average drift at optimum, to measure the effects of data heterogeneity, and explicitly use it to present a new theoretical analysis of FedAvg. We show that the average drift at optimum is nearly zero across many real-world federated training tasks, whereas the gradient dissimilarity can be large. Our new analysis suggests FedAvg can have identical convergence rates in homogeneous and heterogeneous data settings, and hence leads to a better understanding of its empirical success.
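A minimal sketch of FedAvg on one-dimensional quadratics makes the setting concrete. It assumes each client holds $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$, runs a few exact gradient steps locally, and the server averages the results with equal weights. The function name and parameter values are illustrative, and the sketch is not the paper's analysis or its drift quantity.

```python
def fedavg_quadratic(centers, curvatures, rounds=50, local_steps=5, lr=0.1, x0=0.0):
    """FedAvg on clients with f_i(x) = 0.5 * a_i * (x - b_i)^2 and exact local gradients."""
    x = x0
    for _ in range(rounds):
        local_models = []
        for a, b in zip(curvatures, centers):
            xi = x  # each client starts from the current global model
            for _ in range(local_steps):
                xi -= lr * a * (xi - b)  # local gradient step
            local_models.append(xi)
        x = sum(local_models) / len(local_models)  # server averages client models
    return x


# Clients disagree strongly on their minimizers (-1 and 3) but share one curvature.
x_same_curvature = fedavg_quadratic(centers=[-1.0, 3.0], curvatures=[1.0, 1.0])
# With different curvatures the averaged iterate settles away from the global minimizer 2.2.
x_mixed_curvature = fedavg_quadratic(centers=[-1.0, 3.0], curvatures=[1.0, 4.0])
```

In the first call the client gradients at any shared point differ by a large amount, yet the averaged iterate converges to the minimizer of the average objective, $x = 1$. In the second call the local updates contract at different rates, so the fixed point of the averaging need not coincide with the global minimizer.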

preprint2022arXiv

OneDConv: Generalized Convolution For Transform-Invariant Representation

Convolutional Neural Networks (CNNs) have exhibited their great power in a variety of vision tasks. However, the lack of transform-invariant property limits their further applications in complicated real-world scenarios. In this work, we propose a novel generalized one-dimensional convolutional operator (OneDConv), which dynamically transforms the convolution kernels based on the input features in a computationally and parametrically efficient manner. The proposed operator can extract the transform-invariant features naturally. It improves the robustness and generalization of convolution without sacrificing the performance on common images. The proposed OneDConv operator can substitute the vanilla convolution, thus it can be incorporated into current popular convolutional architectures and trained end-to-end readily. On several popular benchmarks, OneDConv outperforms the original convolution operation and other proposed models on both canonical and distorted images.

preprint2022arXiv

OpenMedIA: Open-Source Medical Image Analysis Toolbox and Benchmark under Heterogeneous AI Computing Platforms

In this paper, we present OpenMedIA, an open-source toolbox library containing a rich set of deep learning methods for medical image analysis under heterogeneous Artificial Intelligence (AI) computing platforms. Various medical image analysis methods, including 2D/3D medical image classification, segmentation, localisation, and detection, have been included in the toolbox with PyTorch and/or MindSpore implementations under heterogeneous NVIDIA and Huawei Ascend computing systems. To the best of our knowledge, OpenMedIA is the first open-source algorithm library providing compared PyTorch and MindSpore implementations and results on several benchmark datasets. The source codes and models are available at https://git.openi.org.cn/OpenMedIA.

preprint2022arXiv

Optimizing Latent Space Directions For GAN-based Local Image Editing

Generative Adversarial Network (GAN) based localized image editing can suffer from ambiguity between semantic attributes. We thus present a novel objective function to evaluate the locality of an image edit. By introducing the supervision from a pre-trained segmentation network and optimizing the objective function, our framework, called Locally Effective Latent Space Direction (LELSD), is applicable to any dataset and GAN architecture. Our method is also computationally fast and exhibits a high extent of disentanglement, which allows users to interactively perform a sequence of edits on an image. Our experiments on both GAN-generated and real images qualitatively demonstrate the high quality and advantages of our method.

preprint2022arXiv

Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.

preprint2022arXiv

RC-MVSNet: Unsupervised Multi-View Stereo with Neural Rendering

Finding accurate correspondences among different views is the Achilles' heel of unsupervised Multi-View Stereo (MVS). Existing methods are built upon the assumption that corresponding pixels share similar photometric features. However, multi-view images in real scenarios observe non-Lambertian surfaces and experience occlusions. In this work, we propose a novel approach with neural rendering (RC-MVSNet) to solve such ambiguity issues of correspondences among views. Specifically, we impose a depth rendering consistency loss to constrain the geometry features close to the object surface to alleviate occlusions. Concurrently, we introduce a reference view synthesis loss to generate consistent supervision, even for non-Lambertian surfaces. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that our RC-MVSNet approach achieves state-of-the-art performance over unsupervised MVS frameworks and competitive performance to many supervised methods. The code is released at https://github.com/Boese0601/RC-MVSNet

preprint2022arXiv

Real-space Observation of Incommensurate Spin Density Wave and Coexisting Charge Density Wave on Cr(001) surface

In itinerant magnetic systems, a spin density wave (SDW) state can be induced by Fermi surface nesting and electron-electron interaction. It may intertwine with other orders such as charge density wave (CDW), while their relation is still yet to be understood. Here via spin-polarized scanning tunneling microscopy, we directly observed long-range spin modulation on Cr(001) surface, which corresponds to the well-known incommensurate SDW of bulk Cr. It displays 6.0 nm in-plane period and anti-phase behavior between adjacent (001) planes. Meanwhile, we simultaneously observed the coexisting CDW with half the period of SDW. Such SDW/CDW have highly correlated domain structures and are in-phase. Surprisingly, the CDW displays a contrast inversion around a density-of-states dip at -22 meV, indicating an anomalous CDW gap opened below EF. These observations support that the CDW is a secondary order driven by SDW. Our work is not only a real-space characterization of incommensurate SDW, but also provides insights on how SDW and CDW coexist.

preprint2022arXiv

Relative Severi inequality for fibrations of maximal Albanese dimension over curves

Let $f: X \to B$ be a relatively minimal fibration of maximal Albanese dimension from a variety $X$ of dimension $n \ge 2$ to a curve $B$ defined over an algebraically closed field of characteristic zero. We prove that $K_{X/B}^n \ge 2n! \chi_f$, which was conjectured by Barja in [2]. Via the strategy outlined in [5], it also leads to a new proof of the Severi inequality for varieties of maximal Albanese dimension. Moreover, when the equality holds and $\chi_f > 0$, we prove that the general fiber $F$ of $f$ has to satisfy the Severi equality that $K_F^{n-1} = 2(n-1)! \chi(F, \omega_F)$. We also prove some sharper results of the same type under extra assumptions.

preprint2022arXiv

Secure Rate-Splitting for MIMO Broadcast Channel with Imperfect CSIT and a Jammer

In this paper, we investigate the secure rate-splitting for the two-user multiple-input multiple-output (MIMO) broadcast channel with imperfect channel state information at the transmitter (CSIT) and a multiple-antenna jammer, where each receiver has an equal number of antennas and the jammer has perfect channel state information (CSI). Specifically, we design a secure rate-splitting multiple-access strategy, where the security of split private and common messages is ensured by precoder design with joint nulling and aligning the leakage information, regarding different antenna configurations. Moreover, we show that the sum-secure degrees-of-freedom (SDoF) achieved by secure rate-splitting is optimal and outperforms that by conventional zero-forcing. Therefore, we reveal the sum-SDoF of the two-user MIMO broadcast channel with imperfect CSIT and a jammer, and validate the superiority of rate-splitting for security purposes in this scenario with an emphasis on MIMO.

preprint2022arXiv

Siamese Labels Auxiliary Learning

In deep learning, auxiliary training has been widely used to assist the training of models. During the training phase, using auxiliary modules to assist training can improve the performance of the model. During the testing phase, auxiliary modules can be removed, so the test parameters are not increased. In this paper, we propose a novel auxiliary training method, Siamese Labels Auxiliary Learning (SiLa). Unlike Deep Mutual Learning (DML), SiLa emphasizes auxiliary learning and can be easily combined with DML. In general, the main contributions of this paper are: (1) we propose SiLa, which improves the performance of common models without increasing test parameters; (2) we compare SiLa with DML and prove that SiLa can improve the generalization of the model; (3) we apply SiLa to Dynamic Neural Networks and prove that SiLa can be used for various types of network structures.

preprint2022arXiv

The DoF Region of Two-User MIMO Broadcast Channel with Delayed Imperfect-Quality CSIT

The channel state information at the transmitter (CSIT) plays an important role in the performance of wireless networks. The CSIT model can be delayed and imperfect-quality, since the feedback link has a delay and the channel state information (CSI) feedback has distortion. In this paper, we thus characterize the degrees-of-freedom (DoF) region of the two-user multiple-input multiple-output (MIMO) broadcast channel with delayed imperfect-quality CSIT, where the antenna configurations can be arbitrary. The converse proof of DoF region is based on the enhancement of physically degraded channel. The achievability proof of DoF region is through a novel transmission scheme design, where the duration of each phase and the amount of transmitted symbols are configured based on the imperfect-quality of delayed CSIT. As a result, we show that the DoF region with delayed imperfect-quality CSIT is located between the DoF region with no CSIT and the DoF region with delayed CSIT.

preprint2022arXiv

Time Series Generation with Masked Autoencoder

This paper shows that masked autoencoder with extrapolator (ExtraMAE) is a scalable self-supervised model for time series generation. ExtraMAE randomly masks some patches of the original time series and learns temporal dynamics by recovering the masked patches. Our approach has two core designs. First, ExtraMAE is self-supervised. Supervision allows ExtraMAE to effectively and efficiently capture the temporal dynamics of the original time series. Second, ExtraMAE proposes an extrapolator to disentangle two jobs of the decoder: recovering latent representations and mapping them back into the feature space. These unique designs enable ExtraMAE to consistently and significantly outperform state-of-the-art (SoTA) benchmarks in time series generation. The lightweight architecture also makes ExtraMAE fast and scalable. ExtraMAE shows outstanding behavior in various downstream tasks such as time series classification, prediction, and imputation. As a self-supervised generative model, ExtraMAE allows explicit management of the synthetic data. We hope this paper will usher in a new era of time series generation with self-supervised models.

preprint2021arXiv

DeEPCA: Decentralized Exact PCA with Linear Convergence Rate

Due to the rapid growth of smart agents such as weakly connected computational nodes and sensors, developing decentralized algorithms that can perform computations on local agents becomes a major research direction. This paper considers the problem of decentralized principal component analysis (PCA), which is a statistical method widely used for data analysis. We introduce a technique called subspace tracking to reduce the communication cost, and apply it to power iterations. This leads to a decentralized PCA algorithm called DeEPCA, which has a convergence rate similar to that of the centralized PCA, while achieving the best communication complexity among existing decentralized PCA algorithms. DeEPCA is the first decentralized PCA algorithm with the number of communication rounds for each power iteration independent of target precision. Compared to existing algorithms, the proposed method is easier to tune in practice, with an improved overall communication cost. Our experiments validate the advantages of DeEPCA empirically.
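The power iteration that the abstract builds on can be sketched in its plain centralized form for the top principal direction. This is only the baseline: the decentralized averaging and the subspace-tracking step are not reproduced, and the function name and starting vector are illustrative assumptions. The sketch assumes a symmetric positive semidefinite matrix whose top eigenvector is not orthogonal to the starting vector.

```python
import math


def power_iteration(matrix, num_iters=100):
    """Estimate the largest eigenvalue and its eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n  # uniform unit-norm starting vector
    for _ in range(num_iters):
        # Multiply by the matrix, then renormalize.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue estimate for unit-norm v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v


# Hypothetical 2x2 covariance matrix with eigenvalues 2 and 1.
top_value, top_vector = power_iteration([[2.0, 0.0], [0.0, 1.0]])
```

When the covariance matrix is a sum of per-agent matrices, the product in each iteration is a sum of per-agent products, which is the quantity that agents would have to agree on through communication.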

preprint2021arXiv

Evidence of topological nodal lines and surface states in the centrosymmetric superconductor SnTaS2

The discovery of signatures of topological superconductivity in superconducting bulk materials with topological surface states has attracted intensive research interest recently. Utilizing angle-resolved photoemission spectroscopy and first-principles calculations, here, we demonstrate the existence of topological nodal-line states and drumheadlike surface states in the centrosymmetric superconductor SnTaS2, which is a type-II superconductor with a critical transition temperature of about 3 K. The valence bands from Ta 5d orbitals and the conduction bands from Sn 5p orbitals cross each other, forming two nodal lines in the vicinity of the Fermi energy without the inclusion of spin-orbit coupling (SOC), protected by the spatial-inversion symmetry and time-reversal symmetry. The nodal lines are gapped out by SOC. The drumheadlike surface states, a typical characteristic of nodal-line semimetals, are clearly visible near the Fermi level. Our findings indicate that SnTaS2 offers a promising platform for exploring the exotic properties of the topological nodal-line fermions and may aid the study of topological superconductivity.

preprint2021arXiv

Hierarchical Neural Architecture Search via Operator Clustering

Recently, the efficiency of automatic neural architecture design has been significantly improved by gradient-based search methods such as DARTS. However, recent literature has cast doubt on the generalization ability of DARTS, arguing that DARTS performs poorly when the search space is changed, i.e., when a different set of candidate operators is used. Regularization techniques such as early stopping have been proposed to partially solve this problem. In this paper, we tackle this problem from a different perspective by identifying two contributing factors to the collapse of DARTS when the search space changes: (1) the correlation of similar operators incurs unfavorable competition among them and makes their relative importance scores unreliable and (2) the optimization complexity gap between the proxy search stage and the final training. Based on these findings, we propose a new hierarchical search algorithm. With its operator clustering and optimization complexity match, the algorithm can consistently find high-performance architectures across various search spaces. For all five variants of the popular cell-based search spaces, the proposed algorithm always obtains a state-of-the-art architecture with the best accuracy on CIFAR-10, CIFAR-100 and ImageNet over other well-established DARTS-like algorithms. Code is available at https://github.com/susan0199/StacNAS.

preprint2021arXiv

Nondiscriminatory Treatment: a straightforward framework for multi-human parsing

Multi-human parsing aims to segment every body part of every human instance. Nearly all state-of-the-art methods follow the "detection first" or "segmentation first" pipelines. In contrast, we present an end-to-end and box-free pipeline from a new and more human-intuitive perspective. At training time, we directly do instance segmentation on humans and parts. More specifically, we introduce a notion of "indiscriminate objects with categories" which treats humans and parts without distinction and regards them both as instances with categories. In the mask prediction, each binary mask is obtained by a combination of prototypes shared among all human and part categories. At inference time, we design a brand-new grouping post-processing method that relates each part instance with one single human instance and groups them together to obtain the final human-level parsing result. We name our method Nondiscriminatory Treatment between Humans and Parts for Human Parsing (NTHP). Experiments show that our network outperforms state-of-the-art methods by a large margin on the MHP v2.0 and PASCAL-Person-Part datasets.

preprint2021arXiv

On Secure Degrees of Freedom of the MIMO Interference Channel with Local Output Feedback

This paper studies the problem of sum-secure degrees of freedom (SDoF) of the (M,M,N,N) multiple-input multiple-output (MIMO) interference channel with local output feedback, so as to build an information-theoretic foundation and provide practical transmission schemes for 6G-enabled vehicles-to-vehicles (V2V). For this problem, we propose two novel transmission schemes, i.e., the interference decoding scheme and the interference alignment scheme, and thus establish a sum-SDoF lower bound. In particular, to optimize the phase duration, we analyze the security and decoding constraints and formulate a linear-fractional optimization problem. Furthermore, we show that the derived sum-SDoF lower bound is the sum-SDoF for M <= N/2, N=M, and 2N <= M antenna configurations, and reveal that for a fixed N, the optimal M to maximize the sum-SDoF is not less than 2N. Through simulations, we examine the secure sum-rate performance of proposed transmission schemes, and reveal that using local output feedback can lead to a higher secure sum-rate than that by using delayed channel state information at the transmitter.

preprint2021arXiv

The design of the Ali CMB Polarization Telescope receiver

Ali CMB Polarization Telescope (AliCPT-1) is the first CMB degree-scale polarimeter to be deployed on the Tibetan plateau at 5,250m above sea level. AliCPT-1 is a 90/150 GHz 72 cm aperture, two-lens refracting telescope cooled down to 4 K. Alumina lenses, 800mm in diameter, image the CMB in a 33.4° field of view on a 636mm wide focal plane. The modularized focal plane consists of dichroic polarization-sensitive Transition-Edge Sensors (TESes). Each module includes 1,704 optically active TESes fabricated on a 150mm diameter silicon wafer. Each TES array is read out with a microwave multiplexing readout system capable of a multiplexing factor up to 2,048. Such a large multiplexing factor has allowed the practical deployment of tens of thousands of detectors, enabling the design of a receiver that can operate up to 19 TES arrays for a total of 32,376 TESes. AliCPT-1 leverages the technological advancements in the detector design from multiple generations of previously successful feedhorn-coupled polarimeters, and in the instrument design from BICEP-3, but applied on a larger scale. The cryostat receiver is currently under integration and testing. During the first deployment year, the focal plane will be populated with up to 4 TES arrays. Further TES arrays will be deployed in the following years, fully populating the focal plane with 19 arrays on the fourth deployment year. Here we present the AliCPT-1 receiver design, and how the design has been optimized to meet the experimental requirements.

preprint2021arXiv

Towards Unbiased COVID-19 Lesion Localisation and Segmentation via Weakly Supervised Learning

Despite tremendous efforts, it is very challenging to generate a robust model to assist in the accurate quantification assessment of COVID-19 on chest CT images. Due to the nature of blurred boundaries, the supervised segmentation methods usually suffer from annotation biases. To support unbiased lesion localisation and to minimise the labeling costs, we propose a data-driven framework supervised by only image-level labels. The framework can explicitly separate potential lesions from original images, with the help of a generative adversarial network and a lesion-specific decoder. Experiments on two COVID-19 datasets demonstrate the effectiveness of the proposed framework and its superior performance to several existing methods.

preprint2020arXiv

Accelerated Dual-Averaging Primal-Dual Method for Composite Convex Minimization

Dual averaging-type methods are widely used in industrial machine learning applications due to their ability to promote solution structure (e.g., sparsity) efficiently. In this paper, we propose a novel accelerated dual-averaging primal-dual algorithm for minimizing a composite convex function. We also derive a stochastic version of the proposed method which solves empirical risk minimization, and its advantages in handling sparse data are demonstrated both theoretically and empirically.

preprint2020arXiv

Achievable DoF Regions of Three-User MIMO Broadcast Channel with Delayed CSIT

For the two-user multiple-input multiple-output (MIMO) broadcast channel with delayed channel state information at the transmitter (CSIT) and arbitrary antenna configurations, all the degrees-of-freedom (DoF) regions are obtained. However, for the three-user MIMO broadcast channel with delayed CSIT and arbitrary antenna configurations, the DoF region of order-2 messages is still unclear and only a partial achievable DoF region of order-1 messages is obtained, where the order-2 messages and order-1 messages are desired by two receivers and one receiver, respectively. In this paper, for the three-user MIMO broadcast channel with delayed CSIT and arbitrary antenna configurations, we first design transmission schemes for order-2 messages and order-1 messages. Next, we propose to analyze the achievable DoF region of transmission scheme by transformation approach. In particular, we transform the decoding condition of transmission scheme w.r.t. phase duration into the achievable DoF region w.r.t. achievable DoF, through achievable DoF tuple expression connecting phase duration and achievable DoF. As a result, the DoF region of order-2 messages is characterized and an achievable DoF region of order-1 messages is completely expressed. Besides, for order-1 messages, we derive the sufficient condition, under which the proposed achievable DoF region is the DoF region.

preprint2020arXiv

Age-of-Information-based Scheduling in Multiuser Uplinks with Stochastic Arrivals: A POMDP Approach

In this paper, we consider a multiuser uplink status update system, where a monitor aims to timely collect randomly generated status updates from multiple end nodes through a shared wireless channel. We adopt the recently proposed metric, termed age of information (AoI), to quantify the information timeliness and freshness. Due to the random generation of the status updates at the end node side, the monitor only grasps a partial knowledge of the status update arrivals. Under such a practical scenario, we aim to address a fundamental multiuser scheduling problem: how to schedule the end nodes to minimize the network-wide AoI? To solve this problem, we formulate it as a partially observable Markov decision process (POMDP), and develop a dynamic programming (DP) algorithm to obtain the optimal scheduling policy. By noting that the optimal policy is computationally prohibitive, we further design a low-complexity myopic policy that only minimizes the one-step expected reward. Simulation results show that the performance of the myopic policy can approach that of the optimal policy, and is better than that of the baseline policy.
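The myopic policy mentioned above can be sketched as a one-step greedy rule. The following is a minimal illustration under simplifying assumptions that are not taken from the paper: the monitor holds, for each node, a belief (probability) that polling it delivers a fresh update, a successful delivery resets that node's age to 1, and every other age grows by 1. The function names are hypothetical.

```python
def expected_next_total_age(ages, beliefs, scheduled):
    """Expected sum of ages after one slot if node `scheduled` is polled.

    Assumption: with probability beliefs[i] the polled node delivers a fresh
    update and its age resets to 1; otherwise its age grows by 1 like the rest.
    """
    total = 0.0
    for i, (age, belief) in enumerate(zip(ages, beliefs)):
        if i == scheduled:
            total += belief * 1 + (1 - belief) * (age + 1)
        else:
            total += age + 1
    return total

def myopic_schedule(ages, beliefs):
    """Pick the node that minimizes the expected total age one step ahead."""
    return min(range(len(ages)),
               key=lambda i: expected_next_total_age(ages, beliefs, i))

# Node 2 is the stalest, but the monitor believes it rarely has a fresh update,
# so the one-step rule prefers node 0 here.
choice = myopic_schedule([5, 2, 9], [0.9, 0.9, 0.3])
```

Under these assumptions the rule amounts to choosing the node with the largest product of belief and current age, which is why it is cheap compared with solving the full POMDP by dynamic programming.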

preprint2020arXiv

Bidirectional Generative Modeling Using Adversarial Gradient Estimation

This paper considers the general $f$-divergence formulation of bidirectional generative modeling, which includes VAE and BiGAN as special cases. We present a new optimization method for this formulation, where the gradient is computed using an adversarially learned discriminator. In our framework, we show that different divergences induce similar algorithms in terms of gradient evaluation, except with different scaling. Therefore this paper gives a general recipe for a class of principled $f$-divergence based generative modeling methods. Theoretical justifications and extensive empirical studies are provided to demonstrate the advantage of our approach over existing methods.

preprint2020arXiv

Black-Box Adversarial Attack with Transferable Model-based Embedding

We present a new method for black-box adversarial attack. Unlike previous methods that combined transfer-based and score-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. The method produces adversarial perturbations with high-level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attack across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction in the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate.

preprint2020arXiv

CATCH: Context-based Meta Reinforcement Learning for Transferrable Architecture Search

Neural Architecture Search (NAS) achieved many breakthroughs in recent years. In spite of its remarkable progress, many algorithms are restricted to particular search spaces. They also lack efficient mechanisms to reuse knowledge when confronting multiple tasks. These challenges preclude their applicability, and motivate our proposal of CATCH, a novel Context-bAsed meTa reinforcement learning (RL) algorithm for transferrable arChitecture searcH. The combination of meta-learning and RL allows CATCH to efficiently adapt to new tasks while being agnostic to search spaces. CATCH utilizes a probabilistic encoder to encode task properties into latent context variables, which then guide CATCH's controller to quickly "catch" top-performing networks. The contexts also assist a network evaluator in filtering inferior candidates and speed up learning. Extensive experiments demonstrate CATCH's universality and search efficiency over many other widely-recognized algorithms. It is also capable of handling cross-domain architecture search as competitive networks on ImageNet, COCO, and Cityscapes are identified. This is the first work to our knowledge that proposes an efficient transferrable NAS solution while maintaining robustness across various settings.

preprint2020arXiv

Deformable Slice-to-Volume Registration for Motion Correction in Fetal Body MRI

In in-utero MRI, motion correction for fetal body and placenta poses a particular challenge due to the presence of local non-rigid transformations of organs caused by bending and stretching. The existing slice-to-volume registration (SVR) reconstruction methods are widely employed for motion correction of fetal brain that undergoes only rigid transformation. However, for reconstruction of fetal body and placenta, rigid registration cannot resolve the issue of misregistrations due to deformable motion, resulting in degradation of features in the reconstructed volume. We propose a Deformable SVR (DSVR), a novel approach for non-rigid motion correction of fetal MRI based on a hierarchical deformable SVR scheme to allow high resolution reconstruction of the fetal body and placenta. Additionally, a robust scheme for structure-based rejection of outliers minimises the impact of registration errors. The improved performance of DSVR in comparison to SVR and patch-to-volume registration (PVR) methods is quantitatively demonstrated in simulated experiments and 20 fetal MRI datasets from 28-31 weeks gestational age (GA) range with varying degree of motion corruption. In addition, we present qualitative evaluation of 100 fetal body cases from 20-34 weeks GA range.

preprint2020arXiv

Discovery of oscillations above 200 keV in a black hole X-ray binary with Insight-HXMT

Low-frequency quasi-periodic oscillations (LFQPOs) are commonly found in black hole X-ray binaries, and their origin is still under debate. The properties of LFQPOs at high energies (above 30 keV) are closely related to the nature of the accretion flow in the innermost regions, and thus play a crucial role in critically testing various theoretical models. The Hard X-ray Modulation Telescope (Insight-HXMT) is capable of detecting emissions above 30 keV, and is therefore an ideal instrument to do so. Here we report the discovery of LFQPOs above 200 keV in the new black hole MAXI J1820+070 in the X-ray hard state, which allows us to understand the behaviours of LFQPOs at hundreds of kiloelectronvolts. The phase lag of the LFQPO is constant around zero below 30 keV, and becomes a soft lag (that is, the high-energy photons arrive first) above 30 keV. The soft lag gradually increases with energy and reaches ~0.9s in the 150-200 keV band. The detection at energies above 200 keV, the large soft lag and the energy-related behaviors of the LFQPO pose a great challenge for most currently existing models, but suggest that the LFQPO probably originates from the precession of a small-scale jet.

preprint2020arXiv

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

A standard approach in large scale machine learning is distributed stochastic gradient training, which requires the computation of aggregated stochastic gradients over multiple nodes on a network. Communication is a major bottleneck in such applications, and in recent years, compressed stochastic gradient methods such as QSGD (quantized SGD) and sparse SGD have been proposed to reduce communication. It was also shown that error compensation can be combined with compression to achieve better convergence in a scheme where each node compresses its local stochastic gradient and broadcasts the result to all other nodes over the network in a single pass. However, such a single pass broadcast approach is not realistic in many practical implementations. For example, under the popular parameter server model for distributed learning, the worker nodes need to send the compressed local gradients to the parameter server, which performs the aggregation. The parameter server has to compress the aggregated stochastic gradient again before sending it back to the worker nodes. In this work, we provide a detailed analysis of this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server. We show that the error-compensated stochastic gradient algorithm admits three very nice properties: 1) it is compatible with an arbitrary compression technique; 2) it admits a better convergence rate than non-error-compensated stochastic gradient methods such as QSGD and sparse SGD; 3) it admits linear speedup with respect to the number of workers. An empirical study is also conducted to validate our theoretical results.
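The two-pass scheme described above, with error compensation on both the workers and the parameter server, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the top-k compressor is just one possible choice, and the names `top_k` and `double_pass_step` are hypothetical.

```python
def top_k(vec, k):
    """One possible compressor: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def double_pass_step(x, grads, worker_err, server_err, lr, k):
    """One synchronous step with compression on both passes.

    grads[i] is worker i's local gradient; worker_err[i] and server_err hold
    whatever earlier compressions dropped, so it can be re-sent later.
    """
    n, d = len(grads), len(x)
    sent, new_worker_err = [], []
    for g, e in zip(grads, worker_err):
        v = [gi + ei for gi, ei in zip(g, e)]          # add back the old residual
        c = top_k(v, k)                                # pass 1: worker -> server
        sent.append(c)
        new_worker_err.append([vi - ci for vi, ci in zip(v, c)])
    u = [sum(c[j] for c in sent) / n + server_err[j] for j in range(d)]
    q = top_k(u, k)                                    # pass 2: server -> workers
    new_server_err = [uj - qj for uj, qj in zip(u, q)]
    new_x = [xj - lr * qj for xj, qj in zip(x, q)]
    return new_x, new_worker_err, new_server_err

# Toy step with two workers, three coordinates, and top-1 compression.
x0 = [1.0, -2.0, 0.5]
g = [[0.3, -1.0, 0.2], [0.1, 0.4, -0.7]]
x1, w_err, s_err = double_pass_step(x0, g, [[0.0] * 3, [0.0] * 3], [0.0] * 3, 0.1, 1)
```

A useful way to read the sketch: the applied update plus the stored residuals always adds up to the uncompressed averaged gradient, so compression delays information rather than discarding it.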

preprint2020arXiv

End-to-end Learning for Inter-Vehicle Distance and Relative Velocity Estimation in ADAS with a Monocular Camera

Inter-vehicle distance and relative velocity estimations are two basic functions for any ADAS (Advanced driver-assistance systems). In this paper, we propose a monocular camera-based inter-vehicle distance and relative velocity estimation method based on end-to-end training of a deep neural network. The key novelty of our method is the integration of multiple visual clues provided by any two time-consecutive monocular frames, which include deep feature clue, scene geometry clue, as well as temporal optical flow clue. We also propose a vehicle-centric sampling mechanism to alleviate the effect of perspective distortion in the motion field (i.e. optical flow). We implement the method by a light-weight deep neural network. Extensive experiments are conducted which confirm the superior performance of our method over other state-of-the-art methods, in terms of estimation accuracy, computational speed, and memory footprint.

preprint2020arXiv

Graph Inference Learning for Semi-supervised Classification

In this work, we address semi-supervised classification of graph data, where the categories of those unlabeled nodes are inferred from labeled nodes as well as graph structures. Recent works often solve this problem via advanced graph convolution in a conventionally supervised manner, but the performance could degrade significantly when labeled data is scarce. To this end, we propose a Graph Inference Learning (GIL) framework to boost the performance of semi-supervised node classification by learning the inference of node labels on graph topology. To bridge the connection between two nodes, we formally define a structure relation by encapsulating node attributes, between-node paths, and local topological structures together, which can make the inference conveniently deduced from one node to another node. For learning the inference process, we further introduce meta-optimization on structure relations from training nodes to validation nodes, such that the learnt graph inference capability can be better self-adapted to testing nodes. Comprehensive evaluations on four benchmark datasets (including Cora, Citeseer, Pubmed, and NELL) demonstrate the superiority of our proposed GIL when compared against state-of-the-art methods on the semi-supervised node classification task.

preprint2020arXiv

Graph Wasserstein Correlation Analysis for Movie Retrieval

Movie graphs play an important role in bridging heterogeneous modalities of videos and texts in human-centric retrieval. In this work, we propose Graph Wasserstein Correlation Analysis (GWCA) to deal with the core issue therein, i.e., cross heterogeneous graph comparison. Spectral graph filtering is introduced to encode graph signals, which are then embedded as probability distributions in a Wasserstein space, called graph Wasserstein metric learning. Such a seamless integration of graph signal filtering together with metric learning results in a surprising consistency of both learning processes, in which the goal of metric learning is just to optimize signal filters or vice versa. Further, we derive the solution of the graph comparison model as a classic generalized eigenvalue decomposition problem, which has an exact closed-form solution. Finally, GWCA together with movie/text graph generation is unified into the framework of movie retrieval to evaluate our proposed method. Extensive experiments on the MovieGraphs dataset demonstrate the effectiveness of our GWCA as well as the entire framework.

preprint2020arXiv

Guided Learning of Nonconvex Models through Successive Functional Gradient Optimization

This paper presents a framework of successive functional gradient optimization for training nonconvex models such as neural networks, where training is driven by mirror descent in a function space. We provide a theoretical analysis and empirical study of the training method derived from this framework. It is shown that the method leads to better performance than that of standard training techniques.

preprint2020arXiv

Instance-Aware Graph Convolutional Network for Multi-Label Classification

Graph convolutional neural network (GCN) has effectively boosted the multi-label image recognition task by introducing label dependencies based on statistical label co-occurrence of data. However, in previous methods, label correlation is computed based on statistical information of data and therefore the same for all samples, and this makes graph inference on labels insufficient to handle huge variations among numerous image instances. In this paper, we propose an instance-aware graph convolutional neural network (IA-GCN) framework for multi-label classification. As a whole, two fused branches of sub-networks are involved in the framework: a global branch modeling the whole image and a region-based branch exploring dependencies among regions of interests (ROIs). For label diffusion of instance-awareness in graph convolution, rather than using the statistical label correlation alone, an image-dependent label correlation matrix (LCM), fusing both the statistical LCM and an individual one of each image instance, is constructed for graph inference on labels to inject adaptive information of label-awareness into the learned features of the model. Specifically, the individual LCM of each image is obtained by mining the label dependencies based on the scores of labels about detected ROIs. In this process, considering the contribution differences of ROIs to multi-label classification, variational inference is introduced to learn adaptive scaling factors for those ROIs by considering their complex distribution. Finally, extensive experiments on MS-COCO and VOC datasets show that our proposed approach outperforms existing state-of-the-art methods.

preprint2020arXiv

LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition

In recent years, deep learning has achieved great success in many natural language processing tasks including named entity recognition. The shortcoming is that a large amount of manually-annotated data is usually required. Previous studies have demonstrated that active learning can reduce the cost of data annotation, but there is still plenty of room for improvement. In real applications we found that existing uncertainty-based active learning strategies have two shortcomings. Firstly, these strategies prefer to choose long sequences explicitly or implicitly, which increases the annotation burden of annotators. Secondly, some strategies need to invade the model and modify it to generate additional information for sample selection, which increases the workload of the developer and the training/prediction time of the model. In this paper, we first examine traditional active learning strategies in the specific case of BiLSTM-CRF, which has been widely used in named entity recognition, on several typical datasets. Then we propose an uncertainty-based active learning strategy called Lowest Token Probability (LTP), which combines the input and output of CRF to select informative instances. LTP is a simple and powerful strategy that does not favor long sequences and does not need to invade the model. We test LTP on multiple datasets, and the experiments show that LTP performs slightly better than traditional strategies with clearly fewer annotation tokens on both sentence-level accuracy and entity-level F1-score. Related code has been released at https://github.com/HIT-ICES/AL-NER
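The selection rule suggested by the name Lowest Token Probability can be sketched as follows. This is only an illustration of the idea, not the released code: it assumes that a per-token probability for each predicted label is already available from the tagger, and the names `ltp_score` and `select_for_annotation` are hypothetical.

```python
def ltp_score(token_probs):
    """Score a sentence by the probability of its least confident token."""
    return min(token_probs)

def select_for_annotation(pool, budget):
    """Return the ids of the `budget` sentences with the lowest LTP score.

    `pool` maps a sentence id to the list of per-token probabilities
    (assumed to be supplied by the sequence tagger).
    """
    ranked = sorted(pool, key=lambda sid: ltp_score(pool[sid]))
    return ranked[:budget]

pool = {
    "a": [0.90, 0.95, 0.99],
    "b": [0.99, 0.40, 0.98, 0.97, 0.99],
    "c": [0.60, 0.70],
}
picked = select_for_annotation(pool, budget=2)
```

Because the score is a minimum rather than a sum over tokens, a long sentence is not favored merely for being long, which matches the motivation given in the abstract.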

preprint2020arXiv

MAP Inference via L2-Sphere Linear Program Reformulation

Maximum a posteriori (MAP) inference is an important task for graphical models. Due to complex dependencies among variables in realistic models, finding an exact solution for MAP inference is often intractable. Thus, many approximation methods have been developed, among which the linear programming (LP) relaxation based methods show promising performance. However, one major drawback of LP relaxation is that it may yield fractional solutions. Instead of presenting a tighter relaxation, in this work we propose a continuous but equivalent reformulation of the original MAP inference problem, called LS-LP. We add the L2-sphere constraint onto the original LP relaxation, leading to an intersected space with the local marginal polytope that is equivalent to the space of all valid integer label configurations. Thus, LS-LP is equivalent to the original MAP inference problem. We propose a perturbed alternating direction method of multipliers (ADMM) algorithm to optimize the LS-LP problem, by adding a sufficiently small perturbation epsilon onto the objective function and constraints. We prove that the perturbed ADMM algorithm globally converges to the epsilon-Karush-Kuhn-Tucker (epsilon-KKT) point of the LS-LP problem. The convergence rate is also analyzed. Experiments on several benchmark datasets from Probabilistic Inference Challenge (PIC 2011) and OpenGM 2 show competitive performance of our proposed method against state-of-the-art MAP inference methods.
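Why intersecting a relaxation with a sphere can recover exactly the integer points is easy to see in a short derivation. The abstract does not state the sphere's center or radius, so the choice below (center at one half in every coordinate, squared radius n/4) is an assumption made for illustration.

```latex
% Assumption: the sphere is centered at (1/2) 1 with squared radius n/4.
\[
x \in [0,1]^n
\;\Longrightarrow\;
\Bigl\lVert x - \tfrac{1}{2}\mathbf{1} \Bigr\rVert_2^2
= \sum_{i=1}^{n} \Bigl( x_i - \tfrac{1}{2} \Bigr)^2
\le \frac{n}{4},
\]
\[
\text{with equality if and only if } \Bigl( x_i - \tfrac{1}{2} \Bigr)^2 = \tfrac{1}{4}
\text{ for every } i, \text{ i.e. } x \in \{0,1\}^n .
\]
```

Since the local marginal polytope already confines each variable to the interval [0, 1], adding such a sphere constraint leaves only binary points of the polytope, which is consistent with the equivalence claimed in the abstract.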

preprint2020arXiv

MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation

Many recently proposed methods for Neural Architecture Search (NAS) can be formulated as bilevel optimization. For efficient implementation, its solution requires approximations of second-order methods. In this paper, we demonstrate that gradient errors caused by such approximations lead to suboptimality, in the sense that the optimization procedure fails to converge to a (locally) optimal solution. To remedy this, this paper proposes MiLeNAS, a mixed-level reformulation for NAS that can be optimized efficiently and reliably. It is shown that even when using a simple first-order method on the mixed-level formulation, MiLeNAS can achieve a lower validation error for NAS problems. Consequently, architectures obtained by our method achieve consistently higher accuracies than those obtained from bilevel optimization. Moreover, MiLeNAS proposes a framework beyond DARTS. It is upgraded via model size-based search and early stopping strategies to complete the search process in around 5 hours. Extensive experiments within the convolutional architecture search space validate the effectiveness of our approach.

preprint2020arXiv

Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks

This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs), which can be used to analyze neural network training. In this framework, a DNN is represented by probability measures and functions over its features (that is, the function values of the hidden units over the training data) in the continuous limit, instead of the neural network parameters as most existing studies have done. This new representation overcomes the degenerate situation where all the hidden units essentially have only one meaningful hidden unit in each middle layer, and further leads to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via suitable re-parameterization. Moreover, we construct a non-linear dynamics called neural feature flow, which captures the evolution of an over-parameterized DNN trained by Gradient Descent. We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures. Furthermore, we show, for Res-Net, when the neural feature flow process converges, it reaches a global minimal solution under suitable conditions. Our analysis leads to the first global convergence proof for over-parameterized neural network training with more than $3$ layers in the mean-field regime.

preprint2020arXiv

NetML: A Challenge for Network Traffic Analytics

Classifying network traffic is the basis for important network applications. Prior research in this area has faced challenges on the availability of representative datasets, and many of the results cannot be readily reproduced. Such a problem is exacerbated by emerging data-driven machine learning based approaches. To address this issue, we provide three open datasets containing almost 1.3M labeled flows in total, with flow features and anonymized raw packets, for the research community. We focus on broad aspects in network traffic analysis, including both malware detection and application classification. We release the datasets in the form of an open challenge called NetML and implement several machine learning methods including random-forest, SVM and MLP. As we continue to grow NetML, we expect the datasets to serve as a common platform for AI driven, reproducible research on network flow analytics.

preprint2020arXiv

Observation of discrete conventional Caroli-de Gennes-Matricon states in the vortex core of single-layer FeSe/SrTiO3

Using low-temperature scanning tunneling microscopy (STM), we studied the vortex states of single-layer FeSe film on SrTiO3 (100) substrate, and the local behaviors of superconductivity at sample boundaries. We clearly observed multiple discrete Caroli-de Gennes-Matricon (CdGM) states in the vortex core, and quantitative analysis shows that their energies closely follow the formula E = μΔ^2/E_F, where μ is a half integer and Δ is the mean superconducting gap over the Fermi surface. Meanwhile, a fully gapped spectrum without states near zero bias is observed at the [110](Fe) oriented boundary of 1 ML and 2 ML FeSe films, and at the atomic step edge of 1 ML FeSe. Together with theoretical calculations, our results indicate an s-wave pairing without sign change in the high-TC FeSe/SrTiO3 superconductor.

preprint2020arXiv

Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python

We describe a new library named picasso, which implements a unified framework of pathwise coordinate optimization for a variety of sparse learning problems (e.g., sparse linear regression, sparse logistic regression, sparse Poisson regression and scaled sparse linear regression) combined with efficient active set selection strategies. Besides, the library allows users to choose different sparsity-inducing regularizers, including the convex $\ell_1$, nonconvex MCP and SCAD regularizers. The library is coded in C++ and has user-friendly R and Python wrappers. Numerical experiments demonstrate that picasso can scale up to large problems efficiently.

preprint2020arXiv

Synthetic Learning: Learn From Distributed Asynchronized Discriminator GAN Without Sharing Medical Image Data

In this paper, we propose a data privacy-preserving and communication-efficient distributed GAN learning framework named Distributed Asynchronized Discriminator GAN (AsynDGAN). Our proposed framework aims to train a central generator that learns from distributed discriminators, and to use the generated synthetic images solely to train the segmentation model. We validate the proposed framework on the application of health entities learning problem which is known to be privacy sensitive. Our experiments show that our approach: 1) could learn the real image's distribution from multiple datasets without sharing the patient's raw data; 2) is more efficient and requires lower bandwidth than other distributed deep learning methods; 3) achieves higher performance compared to the model trained by one real dataset, and almost the same performance compared to the model trained by all real datasets; 4) has provable guarantees that the generator could learn the distributed distribution in an all important fashion thus is unbiased.

preprint2020arXiv

Tencent ML-Images: A Large-Scale Multi-Label Image Database for Visual Representation Learning

In existing visual representation learning tasks, deep convolutional neural networks (CNNs) are often trained on images annotated with single tags, such as ImageNet. However, a single tag cannot describe all important contents of one image, and some useful visual information may be wasted during training. In this work, we propose to train CNNs from images annotated with multiple tags, to enhance the quality of visual representation of the trained CNN model. To this end, we build a large-scale multi-label image database with 18M images and 11K categories, dubbed Tencent ML-Images. We efficiently train the ResNet-101 model with multi-label outputs on Tencent ML-Images, taking 90 hours for 60 epochs, based on a large-scale distributed deep learning framework, i.e., TFplus. The good quality of the visual representation of the Tencent ML-Images checkpoint is verified through three transfer learning tasks, including single-label image classification on ImageNet and Caltech-256, object detection on PASCAL VOC 2007, and semantic segmentation on PASCAL VOC 2012. The Tencent ML-Images database, the checkpoints of ResNet-101, and all the training code have been released at https://github.com/Tencent/tencent-ml-images. It is expected to promote other vision tasks in the research and industry community.

preprint2020arXiv

Towards Purely Unsupervised Disentanglement of Appearance and Shape for Person Images Generation

There has been a fair amount of research interest in exploring the disentanglement of appearance and shape from human images. Most existing endeavours pursue this goal by either using training images with annotations or regulating the training process with external clues such as human skeleton, body segmentation or cloth patches. In this paper, we aim to address this challenge in a more unsupervised manner: we require neither annotations nor external task-specific clues. To this end, we formulate an encoder-decoder-like network to extract both the shape and appearance features from input images at the same time, and train the parameters by three losses: feature adversarial loss, color consistency loss and reconstruction loss. The feature adversarial loss mainly imposes little to no mutual information between the extracted shape and appearance features, while the color consistency loss encourages the invariance of person appearance conditioned on different shapes. More importantly, our unsupervised framework utilizes learned shape features as masks which are applied to the input itself in order to obtain clean appearance features. (Unsupervised learning has many interpretations in different tasks. To be clear, in this paper, we refer to unsupervised learning as learning without task-specific human annotations, pairs or any form of weak supervision.) Without using fixed input human skeleton, our network better preserves the conditional human posture while requiring less supervision. Experimental results on DeepFashion and Market1501 demonstrate that the proposed method achieves clean disentanglement and is able to synthesize novel images of comparable quality with state-of-the-art weakly-supervised or even supervised methods.

preprint2020arXiv

UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders

In this paper, we propose the first framework (UCNet) to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection methods treat the saliency detection task as a point estimation problem, and produce a single saliency map following a deterministic learning pipeline. Inspired by the saliency data labeling process, we propose a probabilistic RGB-D saliency detection network via conditional variational autoencoders to model human annotation uncertainty and generate multiple saliency maps for each input image by sampling in the latent space. With the proposed saliency consensus process, we are able to generate an accurate saliency map based on these multiple predictions. Quantitative and qualitative evaluations on six challenging benchmark datasets against 18 competing algorithms demonstrate the effectiveness of our approach in learning the distribution of saliency maps, leading to a new state-of-the-art in RGB-D saliency detection.

preprint2020arXiv

Walk-Steered Convolution for Graph Classification

Graph classification is a fundamental but challenging issue for numerous real-world applications. Despite recent great progress in image/video classification, convolutional neural networks (CNNs) cannot yet cater to graphs well because of graphical non-Euclidean topology. In this work, we propose a walk-steered convolutional (WSC) network to assemble the essential success of standard convolutional neural networks as well as the powerful representation ability of random walk. Instead of deterministic neighbor searching used in previous graphical CNNs, we construct multi-scale walk fields (a.k.a. local receptive fields) with random walk paths to depict subgraph structures and advocate graph scalability. To express the internal variations of a walk field, Gaussian mixture models are introduced to encode principal components of walk paths therein. As an analogy to a standard convolution kernel on image, Gaussian models implicitly coordinate those unordered vertices/nodes and edges in a local receptive field after projecting to the gradient space of Gaussian parameters. We further stack graph coarsening upon Gaussian encoding by using dynamic clustering, such that high-level semantics of graph can be well learned like the conventional pooling on image. The experimental results on several public datasets demonstrate the superiority of our proposed WSC method over many state-of-the-arts for graph classification.

preprint2019arXiv

Overview to the Hard X-ray Modulation Telescope (Insight-HXMT) Satellite

As China's first X-ray astronomical satellite, the Hard X-ray Modulation Telescope (HXMT), which was dubbed Insight-HXMT after its launch on June 15, 2017, is a wide-band (1-250 keV) slat-collimator-based X-ray astronomy satellite with the capability of all-sky monitoring in 0.2-3 MeV. It was designed to perform pointing, scanning and gamma-ray burst (GRB) observations and, based on the Direct Demodulation Method (DDM), the image of the scanned sky region can be reconstructed. Here we give an overview of the mission and its progress, including payload, core sciences, ground calibration/facility, ground segment, data archive, software, in-orbit performance, calibration, background model, observations and some preliminary results.

preprint2016arXiv

Chromatic Effect for THz Generation in a Novel Wave-front Tilt Scheme

Deriving single- or few-cycle terahertz (THz) pulses from intense femtosecond lasers through cascaded optical rectification in electro-optic crystals is a crucial technique in cutting-edge time-resolved spectroscopy to characterize micro-scale structures and ultrafast dynamics. In the past decade, the lithium niobate (LN) crystal implementation of the wave-front tilt scheme has been prevalently used, while painstaking efforts have been invested in order to achieve higher THz conversion efficiency. In this research we developed a brand new type of LN crystal possessing dual-face-cut and Brewster coupling, and conducted experimental and simulative investigation systematically to optimize the multi-dimensionally entangled parameters in THz generation, predicting that an extreme conversion efficiency of 10% is potentially promising at the THz absorption coefficient of 0.5 cm$^{-1}$. More remarkably, we discovered for the first time that the chirp of the driving laser pulse plays a decisive role in the wave-front tilt scheme, and the THz generation efficiency could be enhanced tremendously by applying an appropriate chirp.

preprint2016arXiv

Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level

This paper reports the performances of shallow word-level convolutional neural networks (CNN), our earlier work (2015), on the eight datasets with relatively large training data that were used for testing the very deep character-level CNN in Conneau et al. (2016). Our findings are as follows. The shallow word-level CNNs achieve better error rates than the error rates reported in Conneau et al., though the results should be interpreted with some consideration due to the unique pre-processing of Conneau et al. The shallow word-level CNN uses more parameters and therefore requires more storage than the deep character-level CNN; however, the shallow word-level CNN computes much faster.

preprint2016arXiv

Efficient Distributed Learning with Sparsity

We propose a novel, efficient approach for distributed sparse learning in high-dimensions, where observations are randomly partitioned across machines. Computationally, at each round our method only requires the master machine to solve a shifted $\ell_1$-regularized M-estimation problem, and other workers to compute the gradient. In respect of communication, the proposed approach provably matches the estimation error bound of centralized methods within constant rounds of communications (ignoring logarithmic factors). We conduct extensive experiments on both simulated and real world datasets, and demonstrate encouraging performances on high-dimensional regression and classification tasks.

preprint2016arXiv

Isolating Mice and Elephant in Data Centers

Data center traffic is composed of numerous latency-sensitive "mice" flows, each consisting of only several packets, and a few throughput-sensitive "elephant" flows, which occupy more than 80% of the overall load. Generally, the short-lived "mice" flows induce transient congestion and the long-lived "elephant" flows cause persistent congestion. Network congestion is a major performance inhibitor. Conventionally, hop-by-hop and end-to-end flow control mechanisms are employed to relieve transient and persistent congestion, respectively. However, in the face of a mixture of elephants and mice, we find that the hybrid congestion control scheme combining hop-by-hop and end-to-end flow control mechanisms suffers from serious performance impairments. Going a step further, our in-depth analysis reveals that the hybrid scheme performs poorly on the latency of mice and the throughput of elephants. Motivated by this understanding, we argue for isolating mice and elephants in different queues, such that the hop-by-hop and end-to-end flow control mechanisms are independently imposed on short-lived and long-lived flows, respectively. Our solution is readily deployable, is compatible with current commodity network devices, and can leverage various congestion control mechanisms. Extensive simulations show that our proposal of isolation can simultaneously improve the latency of mice by at least 30% and the link utilization to almost 100%.

preprint2016arXiv

Learning Sparse Low-Threshold Linear Classifiers

We consider the problem of learning a non-negative linear classifier with a $1$-norm of at most $k$, and a fixed threshold, under the hinge-loss. This problem generalizes the problem of learning a $k$-monotone disjunction. We prove that we can learn efficiently in this setting, at a rate which is linear in both $k$ and the size of the threshold, and that this is the best possible rate. We provide an efficient online learning algorithm that achieves the optimal rate, and show that in the batch case, empirical risk minimization achieves this rate as well. The rates we show are tighter than the uniform convergence rate, which grows with $k^2$.

preprint2016arXiv

Perfect Memory Context Trees in time series modeling

The Stochastic Context Tree (SCOT) is a useful tool for studying infinite random sequences generated by an m-Markov Chain (m-MC). It captures the phenomenon that the probability distribution of the next state sometimes depends on less than m of the preceding states. This allows compressing the information needed to describe an m-MC. The SCOT construction has been earlier used under various names: VLMC, VOMC, PST, CTW. In this paper we study the possibility of reducing the m-MC to a 1-MC on the leaves of the SCOT. Such context trees are called perfect-memory. We give various combinatorial characterizations of perfect-memory context trees and an efficient algorithm to find the minimal perfect-memory extension of a SCOT.

preprint2016arXiv

Slope inequality for families of curves over surfaces

In this paper, we investigate the general notion of the slope for families of curves $f: X \to Y$. The main result is an answer to the above question when $\dim Y = 2$, and we prove a lower bound for this new slope in this case over fields of any characteristic. Both the notion and the slope inequality are compatible with the theory for $\dim Y = 0, 1$ in a very natural way, and this gives a strong evidence that the slope for an $n$-fold fibration of curves $f: X \to Y$ may be $K_{X/Y}^n / \mathrm{ch}_{n-1}(f_* ω_{X/Y})$. Rather than the usual stability methods, the whole proof of the slope inequality here is based on a completely new method using characteristic $p>0$ geometry. A simpler version of this method yields a new proof of the slope inequality when $\dim Y = 1$.

preprint2016arXiv

Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings

One-hot CNN (convolutional neural network) has been shown to be effective for text categorization (Johnson & Zhang, 2015). We view it as a special case of a general framework which jointly trains a linear model with a non-linear feature generator consisting of `text region embedding + pooling'. Under this framework, we explore a more sophisticated region embedding method using Long Short-Term Memory (LSTM). LSTM can embed text regions of variable (and possibly large) sizes, whereas the region size needs to be fixed in a CNN. We seek effective and efficient use of LSTM for this purpose in the supervised and semi-supervised settings. The best results were obtained by combining region embeddings in the form of LSTM and convolution layers trained on unlabeled data. The results indicate that on this task, embeddings of text regions, which can convey complex concepts, are more useful than embeddings of single words in isolation. We report performances exceeding the previous best results on four benchmark datasets.

preprint2016arXiv

Towards More Efficient SPSD Matrix Approximation and CUR Matrix Decomposition

Symmetric positive semi-definite (SPSD) matrix approximation methods have been extensively used to speed up large-scale eigenvalue computation and kernel learning methods. The standard sketch based method, which we call the prototype model, produces relatively accurate approximations, but is inefficient on large square matrices. The Nyström method is highly efficient, but can only achieve low accuracy. In this paper we propose a novel model that we call the fast SPSD matrix approximation model. The fast model is nearly as efficient as the Nyström method and as accurate as the prototype model. We show that the fast model can potentially solve eigenvalue problems and kernel learning problems in linear time with respect to the matrix size $n$ to achieve $1+ε$ relative-error, whereas both the prototype model and the Nyström method cost at least quadratic time to attain a comparable error bound. Empirical comparisons among the prototype model, the Nyström method, and our fast model demonstrate the superiority of the fast model. We also contribute new understandings of the Nyström method. The Nyström method is a special instance of our fast model and is an approximation to the prototype model. Our technique can be straightforwardly applied to make the CUR matrix decomposition more efficient to compute without much affecting the accuracy.
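
The relationship among the three models can be made concrete with a short set of formulas. These are a reconstruction in common sketching notation, not quoted from the abstract: $K$ is the $n \times n$ SPSD matrix, $P$ is an assumed $n \times c$ column-selection matrix, $C = KP$, $W = P^\top K P$, and $S$ is an assumed $n \times s$ sketching matrix. All three models approximate $K \approx C\,U\,C^\top$ and differ only in the choice of $U$:

```latex
\begin{aligned}
U_{\text{prototype}} &= C^{\dagger} K \,(C^{\dagger})^{\top}, \\
U_{\text{Nystr\"om}} &= W^{\dagger} = (P^{\top} K P)^{\dagger}, \\
U_{\text{fast}}      &= (S^{\top} C)^{\dagger}\,(S^{\top} K S)\,(C^{\top} S)^{\dagger}.
\end{aligned}
% Special case S = P:  S^T C = P^T K P = W  and  S^T K S = W,  so
% U_fast = W^+ W W^+ = W^+ = U_Nystrom.
```

Read this way, the prototype model touches all of $K$, the Nyström method touches only the $c \times c$ block $W$, and the fast model sits in between by touching the $s \times s$ sketch $S^\top K S$.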

preprint2015arXiv

Adjusting Leverage Scores by Row Weighting: A Practical Approach to Coherent Matrix Completion

Low-rank matrix completion is an important problem with extensive real-world applications. When observations are uniformly sampled from the underlying matrix entries, existing methods all require the matrix to be incoherent. This paper provides the first working method for coherent matrix completion under the standard uniform sampling model. Our approach is based on the weighted nuclear norm minimization idea proposed in several recent works, and our key contribution is a practical method to compute the weighting matrices so that the leverage scores become more uniform after weighting. Under suitable conditions, we are able to derive theoretical results, showing the effectiveness of our approach. Experiments on synthetic data show that our approach recovers highly coherent matrices with high precision, whereas the standard unweighted method fails even on noise-free data.

preprint2015arXiv

Effective Use of Word Order for Text Categorization with Convolutional Neural Networks

Convolutional neural network (CNN) is a neural network that can make use of the internal structure of data such as the 2D structure of image data. This paper studies CNN on text categorization to exploit the 1D structure (namely, word order) of text data for accurate prediction. Instead of using low-dimensional word vectors as input as is often done, we directly apply CNN to high-dimensional text data, which leads to directly learning embedding of small text regions for use in classification. In addition to a straightforward adaptation of CNN from image to text, a simple but new variation which employs bag-of-word conversion in the convolution layer is proposed. An extension to combine multiple convolution layers is also explored for higher accuracy. The experiments demonstrate the effectiveness of our approach in comparison with state-of-the-art methods.

preprint2015arXiv

Improved Analyses of the Randomized Power Method and Block Lanczos Method

The power method and block Lanczos method are popular numerical algorithms for computing the truncated singular value decomposition (SVD) and eigenvalue decomposition problems. Especially in the literature of randomized numerical linear algebra, the power method is widely applied to improve the quality of randomized sketching, and relative-error bounds have been well established. Recently, Musco & Musco (2015) proposed a block Krylov subspace method that fully exploits the intermediate results of the power iteration to accelerate convergence. They showed spectral gap-independent bounds which are stronger than the power method by order-of-magnitude. This paper offers novel error analysis techniques and significantly improves the bounds of both the randomized power method and the block Lanczos method. This paper also establishes the first gap-independent bound for the warm-start block Lanczos method.

preprint2015arXiv

On the Duality Gap Convergence of ADMM Methods

This paper provides a duality gap convergence analysis for the standard ADMM as well as a linearized version of ADMM. It is shown that under appropriate conditions, both methods achieve linear convergence. However, the standard ADMM achieves a faster accelerated convergence rate than that of the linearized ADMM. A simple numerical example is used to illustrate the difference in convergence behavior.

preprint2015arXiv

Optimal computational and statistical rates of convergence for sparse nonconvex learning problems

We provide theoretical analysis of the statistical and computational properties of penalized $M$-estimators that can be formulated as the solution to a possibly nonconvex optimization problem. Many important estimators fall in this category, including least squares regression with nonconvex regularization, generalized linear models with nonconvex regularization and sparse elliptical random design regression. For these problems, it is intractable to calculate the global solution due to the nonconvex formulation. In this paper, we propose an approximate regularization path-following method for solving a variety of learning problems with nonconvex objective functions. Under a unified analytic framework, we simultaneously provide explicit statistical and computational rates of convergence for any local solution attained by the algorithm. Computationally, our algorithm attains a global geometric rate of convergence for calculating the full regularization path, which is optimal among all first-order algorithms. Unlike most existing methods that only attain geometric rates of convergence for one single regularization parameter, our algorithm calculates the full regularization path with the same iteration complexity. In particular, we provide a refined iteration complexity bound to sharply characterize the performance of each stage along the regularization path. Statistically, we provide sharp sample complexity analysis for all the approximate local solutions along the regularization path. In particular, our analysis improves upon existing results by providing a more refined sample complexity bound as well as an exact support recovery result for the final estimator. These results show that the final estimator attains an oracle statistical property due to the usage of nonconvex penalty.

preprint2015arXiv

Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding

This paper presents a new semi-supervised framework with convolutional neural networks (CNNs) for text categorization. Unlike the previous approaches that rely on word embeddings, our method learns embeddings of small text regions from unlabeled data for integration into a supervised CNN. The proposed scheme for embedding learning is based on the idea of two-view semi-supervised learning, which is intended to be useful for the task of interest even though the training is done on unlabeled data. Our models achieve better results than previous approaches on sentiment classification and topic classification tasks.

preprint2015arXiv

Sparse Nonlinear Regression: Parameter Estimation and Asymptotic Inference

We study parameter estimation and asymptotic inference for sparse nonlinear regression. More specifically, we assume the data are given by $y = f( x^\top β^* ) + ε$, where $f$ is nonlinear. To recover $β^*$, we propose an $\ell_1$-regularized least-squares estimator. Unlike classical linear regression, the corresponding optimization problem is nonconvex because of the nonlinearity of $f$. In spite of the nonconvexity, we prove that under mild conditions, every stationary point of the objective enjoys an optimal statistical rate of convergence. In addition, we provide an efficient algorithm that provably converges to a stationary point. We also assess the uncertainty of the obtained estimator. Specifically, based on any stationary point of the objective, we construct valid hypothesis tests and confidence intervals for the low dimensional components of the high-dimensional parameter $β^*$. Detailed numerical results are provided to back up our theory.

preprint2015arXiv

Stochastic Optimization with Importance Sampling

Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling can guarantee that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD (actually, stochastic mirror descent) with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve higher expected dual value at each dual coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.
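
The core idea of the abstract above, sampling examples non-uniformly and reweighting so the estimate stays unbiased, can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the function names are hypothetical, and using per-example gradient norms as sampling scores is an assumption made here for the demo.

```python
import math
import random


def sampling_probs(scores):
    """Turn non-negative per-example scores into sampling probabilities."""
    total = sum(scores)
    return [s / total for s in scores]


def draw_index(probs, rng):
    """Draw one example index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def importance_weighted_grad(grads, probs, i):
    """Single-sample estimate of the mean gradient: g_i / (n * p_i).

    Dividing by n * p_i undoes the sampling bias, so the estimate has the
    same expectation as under uniform sampling.
    """
    n = len(grads)
    return [g / (n * probs[i]) for g in grads[i]]


# Toy per-example gradients (assumed data, for illustration only).
grads = [[1.0, 0.0], [0.0, 4.0], [3.0, 3.0]]
scores = [math.sqrt(sum(g * g for g in row)) for row in grads]
probs = sampling_probs(scores)

rng = random.Random(0)
i = draw_index(probs, rng)
estimate = importance_weighted_grad(grads, probs, i)
```

With scores proportional to the gradient norms, every possible single-sample estimate has the same norm, which is the intuition behind the variance reduction.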

preprint2014arXiv

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, known as regularized empirical risk minimization. We propose and analyze a new proximal stochastic gradient method, which uses a multi-stage scheme to progressively reduce the variance of the stochastic gradient. While each iteration of this algorithm has similar cost as the classical stochastic gradient method (or incremental gradient method), we show that the expected objective value converges to the optimum at a geometric rate. The overall complexity of this method is much lower than both the proximal full gradient method and the standard proximal stochastic gradient method.
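
The multi-stage scheme described above can be sketched for one concrete instance. The following is a minimal illustration under stated assumptions, not the authors' implementation: the smooth components are taken to be ridge-regularized squared losses, the simple convex term is an $\ell_1$ penalty (whose proximal mapping is soft-thresholding), and the step size, inner-loop length, stage count and synthetic data are all arbitrary choices for the demo.

```python
import random


def soft_threshold(v, t):
    """Proximal mapping of t * |.| applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def grad_i(x, a, b, mu):
    """Gradient of f_i(x) = 0.5 * (a.x - b)^2 + 0.5 * mu * ||x||^2."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [r * aj + mu * xj for aj, xj in zip(a, x)]


def objective(x, A, y, mu, lam):
    """Average smooth loss plus the l1 penalty."""
    n = len(A)
    loss = sum(0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - b) ** 2
               for a, b in zip(A, y)) / n
    return (loss + 0.5 * mu * sum(xj * xj for xj in x)
            + lam * sum(abs(xj) for xj in x))


def prox_svrg(A, y, mu, lam, eta, m, stages, seed=0):
    """Multi-stage proximal stochastic gradient with variance reduction."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x_tilde = [0.0] * d
    for _ in range(stages):
        # Full gradient of the smooth part at the stage snapshot.
        snap = [grad_i(x_tilde, A[i], y[i], mu) for i in range(n)]
        g_full = [sum(g[j] for g in snap) / n for j in range(d)]
        x = list(x_tilde)
        x_sum = [0.0] * d
        for _ in range(m):
            i = rng.randrange(n)
            g_cur = grad_i(x, A[i], y[i], mu)
            # Variance-reduced direction: correct the stochastic gradient
            # with the same component's gradient at the snapshot.
            v = [g_cur[j] - snap[i][j] + g_full[j] for j in range(d)]
            x = [soft_threshold(x[j] - eta * v[j], eta * lam) for j in range(d)]
            x_sum = [x_sum[j] + x[j] for j in range(d)]
        # Next snapshot: average of the inner iterates.
        x_tilde = [s / m for s in x_sum]
    return x_tilde


# Synthetic sparse regression problem (assumed data, for illustration only).
data_rng = random.Random(1)
w_true = [1.0, -2.0, 0.0, 0.0, 3.0]
A = [[data_rng.gauss(0, 1) for _ in w_true] for _ in range(50)]
y = [sum(a * w for a, w in zip(row, w_true)) + 0.1 * data_rng.gauss(0, 1)
     for row in A]
mu, lam = 0.01, 0.01
L_max = max(sum(a * a for a in row) for row in A) + mu

initial_obj = objective([0.0] * len(w_true), A, y, mu, lam)
x_hat = prox_svrg(A, y, mu, lam, eta=0.1 / L_max, m=2 * len(A), stages=20)
final_obj = objective(x_hat, A, y, mu, lam)
```

Each inner step costs one component gradient plus a proximal step, as in the plain stochastic method; the only extra work is one full gradient per stage.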

preprint2014arXiv

Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling

Stochastic Gradient Descent (SGD) is a popular optimization method which has been applied to many important machine learning tasks such as Support Vector Machines and Deep Neural Networks. In order to parallelize SGD, minibatch training is often employed. The standard approach is to uniformly sample a minibatch at each step, which often leads to high variance. In this paper we propose a stratified sampling strategy, which divides the whole dataset into clusters with low within-cluster variance; we then take examples from these clusters using a stratified sampling technique. It is shown that the convergence rate can be significantly improved by the algorithm. Encouraging experimental results confirm the effectiveness of the proposed method.
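
A small sketch can illustrate why sampling by stratum helps. It is an illustration under assumptions, not the paper's procedure: the clusters are taken as given (the abstract builds them to have low within-cluster variance), the per-cluster allocation is simply proportional to cluster size, and both function names are hypothetical.

```python
import random


def stratified_minibatch(clusters, batch_size, rng):
    """Sample a minibatch cluster by cluster.

    `clusters` is a list of lists of example indices. Returns a list of
    (index, weight) pairs; the weights make the weighted sum an unbiased
    estimate of the full-data mean.
    """
    n = sum(len(members) for members in clusters)
    batch = []
    for members in clusters:
        if not members:
            continue
        # Proportional allocation, at least one draw per non-empty cluster.
        b_k = max(1, round(batch_size * len(members) / n))
        b_k = min(b_k, len(members))
        weight = (len(members) / n) / b_k
        for idx in rng.sample(members, b_k):
            batch.append((idx, weight))
    return batch


def weighted_estimate(values, batch):
    """Weighted sum of per-example quantities (e.g. gradients or losses)."""
    return sum(w * values[i] for i, w in batch)


# Toy data: two clusters with zero within-cluster variance.
values = [1.0] * 6 + [10.0] * 4
clusters = [list(range(6)), list(range(6, 10))]
batch = stratified_minibatch(clusters, batch_size=5, rng=random.Random(0))
estimate = weighted_estimate(values, batch)
true_mean = sum(values) / len(values)
```

In this toy case the stratified estimate matches the true mean for any draw, because all of the variance lies between clusters; a uniform minibatch of the same size would fluctuate with how many draws land in each cluster.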

preprint2014arXiv

Adaptive Stochastic Alternating Direction Method of Multipliers

The Alternating Direction Method of Multipliers (ADMM) has been studied for years. The traditional ADMM algorithm needs to compute, at each iteration, an (empirical) expected loss function on all training examples, resulting in a computational complexity proportional to the number of training examples. To reduce the time complexity, stochastic ADMM algorithms were proposed to replace the expected function with a random loss function associated with one uniformly drawn example plus a Bregman divergence. The Bregman divergence, however, is derived from a simple second order proximal function, the half squared norm, which could be a suboptimal choice. In this paper, we present a new family of stochastic ADMM algorithms with optimal second order proximal functions, which produce a new family of adaptive subgradient methods. We theoretically prove that their regret bounds are as good as the bounds which could be achieved by the best proximal function that can be chosen in hindsight. Encouraging empirical results on a variety of real-world datasets confirm the effectiveness and efficiency of the proposed algorithms.

preprint2014arXiv

Communication Efficient Distributed Optimization using an Approximate Newton-type Method

We present a novel Newton-type method for distributed optimization, which is particularly well suited for stochastic optimization and learning problems. For quadratic objectives, the method enjoys a linear rate of convergence which provably improves with the data size, requiring an essentially constant number of iterations under reasonable assumptions. We provide theoretical and empirical evidence of the advantages of our method compared to other approaches, such as one-shot parameter averaging and ADMM.

preprint2014arXiv

Compensating the electron beam energy spread by the natural transverse gradient of laser undulator in all-optical x-ray light sources

All-optical ideas offer the potential to dramatically cut the size and cost of x-ray light sources to the university-laboratory scale, through the combination of the laser-plasma accelerator and the laser undulator. However, the large longitudinal energy spread of the electron beam from a laser-plasma accelerator may hinder the way to high brightness of these all-optical light sources. In this paper, we propose that the beam energy spread effect can be significantly compensated by the natural transverse gradient of a laser undulator when the electron beam is properly dispersed transversely. Theoretical analysis and numerical simulations on conventional laser-Compton scattering sources and high-gain all-optical x-ray free-electron lasers with the electron beams from laser-plasma accelerators are presented.

preprint2014arXiv

Effective Bounds of Linear Series on Algebraic Varieties and Arithmetic Varieties

In this paper, we prove effective upper bounds for effective sections of line bundles on projective varieties and hermitian line bundles on arithmetic varieties in terms of the volumes. They are effective versions of the Hilbert--Samuel formula and the arithmetic Hilbert--Samuel formula.

preprint2014arXiv

Enhanced Precision Through Multiple Reads for LDPC Decoding in Flash Memories

Multiple reads of the same Flash memory cell with distinct word-line voltages provide enhanced precision for LDPC decoding. In this paper, the word-line voltages are optimized by maximizing the mutual information (MI) of the quantized channel. The enhanced precision from a few additional reads allows FER performance to approach that of full-precision soft information and enables an LDPC code to significantly outperform a BCH code. A constant-ratio constraint provides a significant simplification in the optimization with no noticeable loss in performance. For a well-designed LDPC code, the quantization that maximizes the mutual information also minimizes the frame error rate in our simulations. However, for an example LDPC code with a high error floor caused by small absorbing sets, the MMI quantization does not provide the lowest frame error rate. The best quantization in this case introduces more erasures than would be optimal for the channel MI in order to mitigate the absorbing sets of the poorly designed code. The paper also identifies a trade-off in LDPC code design when decoding is performed with multiple precision levels; the best code at one level of precision will typically not be the best code at a different level of precision.

preprint2014arXiv

Experimental demonstration of longitudinal beam phase space linearizer in a free-electron laser facility by corrugated structures

Removal of the residual linear energy chirp and intrinsic nonlinear energy curvature in the relativistic electron beam from a radiofrequency linear accelerator is of paramount importance for efficient lasing of a high-gain free-electron laser. Recently, it was theoretically and experimentally demonstrated that the longitudinal wakefield excited by the electrons themselves in a corrugated structure allows for precise control of the electron beam phase space. In this Letter, we report the first utilization of a corrugated structure as a beam linearizer in the operation of a seeded free-electron laser driven by a 140 MeV linear accelerator, where a gain of ~10,000 over spontaneous emission was achieved at the second harmonic of the 1047 nm seed laser, and a free-electron laser bandwidth narrowing of about 50% was observed, in good agreement with the theoretical expectations.

preprint2014arXiv

Joint Multi-Cell Resource Allocation Using Pure Binary-Integer Programming for LTE Uplink

Due to its high system capacity requirement, 3GPP Long Term Evolution (LTE) is likely to adopt a frequency reuse factor of 1, at the cost of severe inter-cell interference (ICI). One strategy for combating ICI is network cooperation in resource allocation (RA). For LTE uplink RA, the requirement that all subcarriers be allocated adjacently complicates the RA problem greatly. This paper investigates the joint multi-cell RA problem for LTE uplink. We model the uplink RA and ICI mitigation problem using pure binary-integer programming (BIP), with integrated consideration of all users' channel state information (CSI). The advantage of the pure BIP model is that it can be solved by a branch-and-bound search (BBS) algorithm or other BIP solving algorithms, rather than resorting to exhaustive search. The system-level simulation results show that it yields 14.83% and 22.13% gains over single-cell optimal RA in average spectrum efficiency and 5th percentile of user throughput, respectively.

preprint2014arXiv

Learning Nonlinear Functions Using Regularized Greedy Forest

We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees such as Adaboost for exponential loss and Friedman's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting (and Adaboost with exponential loss) on many datasets.

preprint2014arXiv

Quasiparticle scattering from topological crystalline insulator SnTe (001) surface states

Recently, the topological classification of electronic states has been extended to a new class of matter known as topological crystalline insulators. Similar to topological insulators, topological crystalline insulators also have spin-momentum locked surface states; but they only exist on specific crystal planes that are protected by crystal reflection symmetry. Here, we report an ultra-low temperature scanning tunneling microscopy and spectroscopy study on topological crystalline insulator SnTe nanoplates grown by molecular beam epitaxy. We observed quasiparticle interference patterns on the SnTe (001) surface that can be interpreted in terms of electron scattering from the four Fermi pockets of the topological crystalline insulator surface states in the first surface Brillouin zone. A quantitative analysis of the energy dispersion of the quasiparticle interference intensity shows two high energy features related to the crossing point beyond the Lifshitz transition when the two neighboring low energy surface bands near the point merge. A comparison between the experimental and computed quasiparticle interference patterns reveals possible spin texture of the surface states.

preprint2014arXiv

Random design analysis of ridge regression

This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the "out-of-sample" prediction error, as opposed to the "in-sample" (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.

preprint2014arXiv

Randomized Dual Coordinate Ascent with Arbitrary Sampling

We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.

preprint2014arXiv

Single-shot measurement of free-electron laser polarization at SDUV-FEL

In this paper, a division-of-amplitude photopolarimeter (DOAP) for measuring the polarization state of free-electron laser (FEL) pulses is described. The incident FEL beam is divided into four separate beams, and the four Stokes parameters can be measured in a single shot. In the crossed-planar undulator experiment at the Shanghai deep ultraviolet FEL test facility, this DOAP instrument, constructed in house, responds accurately and promptly while the polarization state of fully coherent FEL pulses is switched, which helps confirm the crossed-planar undulator technique for short-wavelength FELs.

preprint2014arXiv

Three-dimensional manipulation of electron beam phase space for seeding soft x-ray free-electron lasers

In this letter, a simple technique is proposed to induce strong density modulation into the electron beam with small energy modulation. By using the combination of a transversely dispersed electron beam and a wave-front tilted seed laser, three-dimensional manipulation of the electron beam phase space can be utilized to significantly enhance the micro-bunching of seeded free-electron laser schemes, which will improve the performance and extend the short-wavelength range of a single-stage seeded free-electron laser. Theoretical analysis and numerical simulations demonstrate the capability of the proposed technique in a soft x-ray free-electron laser.

preprint2013arXiv

Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov (2007).

preprint2013arXiv

Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization

We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an inner-outer iteration procedure. We analyze the runtime of the framework and obtain rates that improve state-of-the-art results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.

preprint2013arXiv

Aggregation of Affine Estimators

We consider the problem of aggregating a general collection of affine estimators for fixed design regression. Relevant examples include some commonly used statistical estimators such as least squares, ridge and robust least squares estimators. Dalalyan and Salmon (2012) have established that, for this problem, exponentially weighted (EW) model selection aggregation leads to sharp oracle inequalities in expectation, but similar bounds in deviation were not previously known. While results indicate that the same aggregation scheme may not satisfy sharp oracle inequalities with high probability, we prove a weaker notion of oracle inequality for EW that holds with high probability. Moreover, using a generalization of the newly introduced $Q$-aggregation scheme, we also prove sharp oracle inequalities that hold with high probability. Finally, we apply our results to universal aggregation and show that our proposed estimator leads simultaneously to all the best known bounds for aggregation, including $\ell_q$-aggregation, $q \in (0,1)$, with high probability.

preprint2013arXiv

Compressed Counting Meets Compressed Sensing

Compressed sensing (sparse signal recovery) has been a popular and important research topic in recent years. By observing that natural signals are often nonnegative, we propose a new framework for nonnegative signal recovery using Compressed Counting (CC). CC is a technique built on maximally-skewed p-stable random projections originally developed for data stream computations. Our recovery procedure is computationally very efficient in that it requires only one linear scan of the coordinates. Our analysis demonstrates that, when 0 < p <= 0.5, it suffices to use M = O(C/eps^p log N) measurements so that all coordinates will be recovered within eps additive precision, in one scan of the coordinates. The constant C = 1 when p -> 0 and C = pi/2 when p = 0.5. In particular, when p -> 0 the required number of measurements is essentially M = K log N, where K is the number of nonzero coordinates of the signal.

preprint2013arXiv

Electric Field Tuning of the Surface Band Structure of Topological Insulator Sb2Te3 Thin Films

We measured the response of the surface state spectrum of epitaxial Sb2Te3 thin films to applied gate electric fields by low temperature scanning tunneling microscopy. The gate dependent shift of the Fermi level and the screening effect from bulk carriers vary as a function of film thickness. We observed a gap opening at the Dirac point for films thinner than four quintuple layers, due to the coupling of the top and bottom surfaces. Moreover, the top surface state band gap of the three quintuple layer films was found to be tunable by back gate, indicating the possibility of observing a topological phase transition in this system. Our results are well explained by an effective model of 3D topological insulator thin films with structure inversion asymmetry, indicating that three quintuple layer Sb2Te3 films are topologically nontrivial and belong to the quantum spin Hall insulator class.

preprint2013arXiv

Fast Polarization Switching Demonstration Using Crossed-Planar Undulator in a Seeded Free Electron Laser

Fast polarization switching of light sources is required over a wide spectral range to investigate the symmetry of matter. In this Letter, we report the first experimental demonstration of the crossed-planar undulator technique at a seeded free-electron laser, which holds great promise for the full control and fast switching of the polarization of short-wavelength radiation. In the experiment, the polarization state of the coherent radiation at the 2nd harmonic of the seed laser is switched successfully. The experiment results confirm the theory, and pave the way for applying the crossed-planar undulator technique for the seeded X-ray free electron lasers.

preprint2013arXiv

FEL Polarization Control Studies on Dalian Coherent Light Source

The polarization switching of a free-electron laser (FEL) is of great importance to the user scientific community. In this paper, we investigate the generation of FEL radiation with controllable polarization using two well-known approaches for the Dalian coherent light source, i.e., the crossed planar undulator and the elliptical permanent undulator. In order to perform a fair comparative study, a one-dimensional time-dependent FEL code has been developed, in which the imperfection effects of an elliptical permanent undulator are taken into account. Comprehensive simulation results indicate that the residual beam energy chirp and the intrinsic FEL gain may contribute to the degradation of the polarization performance for the crossed planar undulator. The elliptical permanent undulator is not very sensitive to the undulator errors and beam imperfections. Meanwhile, with proper configurations of the main planar undulators and an additional elliptical permanent undulator section, circularly polarized FEL pulses with energy exceeding 100 $μ$J could be achieved at the Dalian coherent light source.

preprint2013arXiv

Geography of irregular Gorenstein 3-folds

In this paper, we study the explicit geography problem of irregular Gorenstein minimal 3-folds of general type. We generalize the classical Noether-Castelnuovo inequalities for irregular surfaces to irregular 3-folds according to the Albanese dimension.

preprint2013arXiv

Gradient Hard Thresholding Pursuit for Sparsity-Constrained Optimization

Hard Thresholding Pursuit (HTP) is an iterative greedy selection procedure for finding sparse solutions of underdetermined linear systems. This method has been shown to have strong theoretical guarantees and impressive numerical performance. In this paper, we generalize HTP from compressive sensing to a generic problem setup of sparsity-constrained convex optimization. The proposed algorithm iterates between a standard gradient descent step and a hard thresholding step with or without debiasing. We prove that our method enjoys strong guarantees analogous to those of HTP in terms of rate of convergence and parameter estimation accuracy. Numerical evidence shows that our method is superior to the state-of-the-art greedy selection methods in sparse logistic regression and sparse precision matrix estimation tasks.
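
The alternation described above, a gradient step followed by hard thresholding, can be sketched in a few lines. This is only an illustrative sketch and not the authors' implementation: it assumes a least-squares loss, a fixed step size, and no debiasing step, and the function names (`grad_least_squares`, `hard_threshold`, `grahtp`) and the toy data are hypothetical.

```python
def grad_least_squares(A, y, x):
    """Gradient of 0.5 * ||A x - y||^2, i.e. A^T (A x - y)."""
    residual = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - y_i
                for row, y_i in zip(A, y)]
    return [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v_i if i in keep else 0.0 for i, v_i in enumerate(v)]


def grahtp(A, y, k, step, iters=200):
    """Alternate a gradient step with hard thresholding (no debiasing)."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        g = grad_least_squares(A, y, x)
        x = hard_threshold([x_i - step * g_i for x_i, g_i in zip(x, g)], k)
    return x


# Toy problem (assumed data): a diagonal design with two large and two
# small coefficients; we ask for a 2-sparse solution.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 2.0]]
y = [3.0, 0.2, -2.0, 0.1]
x_hat = grahtp(A, y, k=2, step=0.2)
print(x_hat)
```

On this toy problem the iterates settle on the two large coordinates and zero the two small ones. A debiasing variant would additionally re-minimize the loss over the selected support after each thresholding step.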

preprint2013arXiv

High-dimensional Joint Sparsity Random Effects Model for Multi-task Learning

Joint sparsity regularization in multi-task learning has attracted much attention in recent years. The traditional convex formulation employs the group Lasso relaxation to achieve joint sparsity across tasks. Although this approach leads to a simple convex formulation, it suffers from several issues due to the looseness of the relaxation. To remedy this problem, we view jointly sparse multi-task learning as a specialized random effects model, and derive a convex relaxation approach that involves two steps. The first step learns the covariance matrix of the coefficients using a convex formulation which we refer to as sparse covariance coding; the second step solves a ridge regression problem with a sparse quadratic regularizer based on the covariance matrix obtained in the first step. It is shown that this approach produces an asymptotically optimal quadratic regularizer in the multitask learning setting when the number of tasks approaches infinity. Experimental results demonstrate that the convex formulation obtained via the proposed model significantly outperforms group Lasso (and related multi-stage formulations

preprint2013arXiv

Introduction to the Special Issue on Sparsity and Regularization Methods

Traditional statistical inference considers relatively small data sets and the corresponding theoretical analysis focuses on the asymptotic behavior of a statistical estimator when the number of samples approaches infinity. However, many data sets encountered in modern applications have dimensionality significantly larger than the number of training data available, and for such problems the classical statistical tools become inadequate. In order to analyze high-dimensional data, new statistical methodology and the corresponding theory have to be developed.

preprint2013arXiv

Learning Pairwise Graphical Models with Nonlinear Sufficient Statistics

We investigate a generic problem of learning pairwise exponential family graphical models with pairwise sufficient statistics defined by a global mapping function, e.g., Mercer kernels. This subclass of pairwise graphical models allows us to flexibly capture complex interactions among variables beyond pairwise product. We propose two $\ell_1$-norm penalized maximum likelihood estimators to learn the model parameters from i.i.d. samples. The first one is a joint estimator which estimates all the parameters simultaneously. The second one is a node-wise conditional estimator which estimates the parameters individually for each node. For both estimators, we show that under proper conditions the extra flexibility gained in our model comes at almost no cost of statistical and computational efficiency. We demonstrate the advantages of our model over state-of-the-art methods on synthetic and real datasets.

preprint2013arXiv

Proposal for High-harmonic EEHG Lasing at Shanghai Deep Ultra-Violet Free-electron Laser

The echo-enabled harmonic generation (EEHG) free-electron laser (FEL) has already been demonstrated at lower harmonics, and the first lasing at the third harmonic has been achieved at the Shanghai deep ultra-violet FEL (SDUV-FEL). However, the main advantage of EEHG over other seeded FELs, its much higher harmonic up-conversion efficiency, becomes evident only at much higher harmonics. In this paper, we investigate the possibility of EEHG lasing at the 10th harmonic of the seed laser at SDUV-FEL; both physical designs and numerical simulations have been studied carefully. Two proposals for EEHG at the 10th harmonic are studied, i.e., with seed lasers of the same color and of two different colors. The simulation results indicate that both approaches are candidates for EEHG lasing at the 10th harmonic at SDUV-FEL, while coherent synchrotron radiation does not affect the performance of the EEHG-FEL but only slightly shifts the central radiation frequency.

preprint2013arXiv

Relative Noether inequality on fibered surfaces

We prove effective upper bounds on the global sections of nef line bundles of small generic degree over a fibered surface over a field of any characteristic. The result can be viewed as a relative version of the classical Noether inequality for surfaces. As a consequence, we give a new proof of the slope inequality for fibered surfaces without using any stability method. The treatment is essentially different from those of Xiao, Cornalba--Harris and Moriwaki. We also study the geography problem of surfaces in positive characteristic and show that the Severi inequality is true for surfaces of general type in positive characteristic whose Albanese map is generically finite. Moreover, the geography of surfaces with Albanese fibrations is studied.

preprint2013arXiv

Scanning Tunneling Microscopy of Gate Tunable Topological Insulator Bi2Se3 Thin Films

Electrical field control of the carrier density of topological insulators (TI) has greatly expanded the possible practical use of these materials. However, the combination of low temperature local probe studies and a gate tunable TI device remains challenging. We have overcome this limitation by scanning tunneling microscopy and spectroscopy measurements on in-situ molecular beam epitaxy growth of Bi2Se3 films on SrTiO3 substrates with pre-patterned electrodes. Using this gating method, we are able to shift the Fermi level of the top surface states by 250 meV on a 3 nm thick Bi2Se3 device. We report field effect studies of the surface state dispersion, band gap, and electronic structure at the Fermi level.

preprint2013arXiv

Severi inequality for varieties of maximal Albanese dimension

Let $X$ be a projective, normal, minimal and Gorenstein $n$-dimensional complex variety of general type. Suppose $X$ is of maximal Albanese dimension. We prove that $K^n_X \ge 2 n! χ(K_X)$

preprint2013arXiv

Sparse Recovery with Very Sparse Compressed Counting

Compressed sensing (sparse signal recovery) often encounters nonnegative data (e.g., images). Recently we developed the methodology of using (dense) Compressed Counting for recovering nonnegative K-sparse signals. In this paper, we adopt very sparse Compressed Counting for nonnegative signal recovery. Our design matrix is sampled from a maximally-skewed p-stable distribution (0 < p < 1), and we sparsify the design matrix so that on average a (1-g)-fraction of the entries become zero. The idea is related to very sparse stable random projections (Li et al 2006 and Li 2007), the prior work for estimating summary statistics of the data. In our theoretical analysis, we show that, when p -> 0, it suffices to use M = K/(1-exp(-gK)) log N measurements, so that all coordinates can be recovered in one scan of the coordinates. If g = 1 (i.e., dense design), then M = K log N. If g = 1/K or 2/K (i.e., very sparse design), then M = 1.58K log N or M = 1.16K log N. This means the design matrix can indeed be very sparse at only a minor inflation of the sample complexity. Interestingly, as p -> 1, the required number of measurements is essentially M = 2.7K log N, provided g = 1/K. It turns out that this result is a general worst-case bound.
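
The constants quoted above can be checked numerically. The sketch below reads the p -> 0 bound as M = K/(1 - exp(-gK)) log N, so the multiplier on K log N is 1/(1 - exp(-gK)). The function names and the choice K = 100 are illustrative assumptions, not part of the paper.

```python
import math


def inflation_factor(g, K):
    """Multiplier on K log N in the p -> 0 bound: 1 / (1 - exp(-g * K))."""
    return 1.0 / (1.0 - math.exp(-g * K))


def required_measurements(g, K, N):
    """M = K / (1 - exp(-g * K)) * log N (natural logarithm assumed)."""
    return inflation_factor(g, K) * K * math.log(N)


K = 100  # assumed sparsity level, for illustration only
for g in (1.0, 2.0 / K, 1.0 / K):
    print(f"g = {g:.4f}: M is about {inflation_factor(g, K):.2f} K log N")
```

With g = 1/K and g = 2/K the multiplier evaluates to about 1.58 and 1.16, matching the figures in the abstract, and with g = 1 it is essentially 1.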

preprint2013arXiv

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees that are comparable to or better than those of SGD. This analysis justifies the effectiveness of SDCA for practical applications.
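
To make the SDCA update concrete, here is a minimal sketch for one special case: squared loss with an L2 regularizer, where the coordinate-wise dual maximization has a closed form. The objective scaling, the function name `sdca_ridge`, and the toy data are assumptions chosen for illustration; the paper covers a much broader class of losses.

```python
import random


def sdca_ridge(X, y, lam, epochs=200, seed=0):
    """SDCA for (1/n) * sum_i 0.5 * (w.x_i - y_i)^2 + (lam / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n   # one dual variable per training example
    w = [0.0] * d       # primal iterate, kept equal to sum_i alpha_i x_i / (lam * n)
    rng = random.Random(seed)
    for _ in range(epochs * n):
        i = rng.randrange(n)   # sample one dual coordinate uniformly
        x_i = X[i]
        pred = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        sq_norm = sum(x_ij * x_ij for x_ij in x_i)
        # Closed-form maximizer of the dual objective along coordinate i.
        delta = (y[i] - pred - alpha[i]) / (1.0 + sq_norm / (lam * n))
        alpha[i] += delta
        w = [w_j + delta * x_ij / (lam * n) for w_j, x_ij in zip(w, x_i)]
    return w


# Toy one-dimensional problem (assumed data).
X = [[1.0], [2.0]]
y = [1.0, 2.0]
w = sdca_ridge(X, y, lam=0.5)
# Ridge closed form for comparison: w* = X^T y / (X^T X + lam * n) = 5 / 6.
print(w[0], 5.0 / 6.0)
```

Each update touches a single example, which is what makes the per-iteration cost comparable to SGD while the dual variables provide a natural stopping criterion through the duality gap.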

preprint2013arXiv

Thermoelectric imaging of structural disorder in epitaxial graphene

Heat is a familiar form of energy transported from a hot side to a colder side of an object, but not a notion associated with microscopic measurements of electronic properties. A temperature difference within a material causes charge carriers, electrons or holes, to diffuse along the temperature gradient inducing a thermoelectric voltage. Here we show that local thermoelectric measurements can yield high sensitivity imaging of structural disorder on the atomic and nanometre scales. The thermopower measurement acts to amplify the variations in the local density of states at the Fermi-level, giving high differential contrast in thermoelectric signals. Using this imaging technique, we uncovered point defects in the first layer of epitaxial graphene, which generate soliton-like domain wall line patterns separating regions of the different interlayer stacking of the second graphene layer.

preprint2012arXiv

A General Framework of Dual Certificate Analysis for Structured Sparse Recovery Problems

This paper develops a general theoretical framework to analyze structured sparse recovery problems using the notion of dual certificate. Although certain aspects of the dual certificate idea have already been used in some previous work, due to the lack of a general and coherent theory, the analysis has so far only been carried out in limited scopes for specific problems. In this context the current paper makes two contributions. First, we introduce a general definition of dual certificate, which we then use to develop a unified theory of sparse recovery analysis for convex programming. Second, we present a class of structured sparsity regularization called structured Lasso for which calculations can be readily performed under our theoretical framework. This new theory includes many seemingly loosely related previous works as special cases; it also implies new results that improve existing ones even for standard formulations such as L1 regularization.

preprint2012arXiv

A General Theory of Concave Regularization for High Dimensional Sparse Estimation Problems

Concave regularization methods provide natural procedures for sparse recovery. However, they are difficult to analyze in the high dimensional setting. Only recently have a few sparse recovery results been established for some specific local solutions obtained via specialized numerical procedures. Still, the fundamental relationship between these solutions, such as whether they are identical or how they relate to the global minimizer of the underlying nonconvex formulation, is unknown. The current paper fills this conceptual gap by presenting a general theoretical framework showing that under appropriate conditions, the global solution of nonconvex regularization leads to desirable recovery performance; moreover, under suitable conditions, the global solution corresponds to the unique sparse local solution, which can be obtained via different numerical procedures. Under this unified framework, we present an overview of existing results and discuss their connections. The unified view of this work leads to a more satisfactory treatment of concave high dimensional sparse estimation procedures, and serves as a guideline for developing further numerical procedures for concave regularization.

preprint2012arXiv

A Proximal-Gradient Homotopy Method for the Sparse Least-Squares Problem

We consider solving the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem in the context of sparse recovery, for applications such as compressed sensing. The standard proximal gradient method, also known as iterative soft-thresholding when applied to this problem, has low computational cost per iteration but a rather slow convergence rate. Nevertheless, when the solution is sparse, it often exhibits fast linear convergence in the final stage. We exploit the local linear convergence using a homotopy continuation strategy, i.e., we solve the $\ell_1$-LS problem for a sequence of decreasing values of the regularization parameter, and use an approximate solution at the end of each stage to warm start the next stage. Although similar strategies have been studied in the literature, there has been no theoretical analysis of their global iteration complexity. This paper shows that under suitable assumptions for sparse recovery, the proposed homotopy strategy ensures that all iterates along the homotopy solution path are sparse. Therefore the objective function is effectively strongly convex along the solution path, and geometric convergence at each stage can be established. As a result, the overall iteration complexity of our method is $O(\log(1/ε))$ for finding an $ε$-optimal solution, which can be interpreted as a global geometric rate of convergence. We also present empirical results to support our theoretical analysis.
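
The homotopy strategy described above, decreasing the regularization parameter in stages and warm-starting each stage, can be sketched as follows. This is a simplified illustration rather than the paper's algorithm: it assumes the objective 0.5 * ||Ax - y||^2 + lam * ||x||_1, a user-supplied Lipschitz constant `L`, and a fixed number of inner iterations per stage in place of an adaptive stopping rule. The names `homotopy_l1_ls`, `shrink`, and `inner_iters` are hypothetical.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return [math.copysign(max(abs(v_i) - t, 0.0), v_i) for v_i in v]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def ista_step(A, y, x, lam, L):
    """One proximal gradient (iterative soft-thresholding) step with step 1/L."""
    residual = [p - y_i for p, y_i in zip(matvec(A, x), y)]
    grad = rmatvec(A, residual)
    return soft_threshold([x_j - g_j / L for x_j, g_j in zip(x, grad)], lam / L)


def homotopy_l1_ls(A, y, lam_target, L, shrink=0.7, inner_iters=50):
    """Solve a sequence of l1-LS problems with decreasing lam, warm-started."""
    if lam_target <= 0.0:
        raise ValueError("lam_target must be positive")
    x = [0.0] * len(A[0])
    lam = max(abs(c) for c in rmatvec(A, y))  # x = 0 is optimal at this value
    while True:
        lam = max(shrink * lam, lam_target)
        for _ in range(inner_iters):
            x = ista_step(A, y, x, lam, L)  # warm start from the previous stage
        if lam <= lam_target:
            return x


# Toy problem (assumed data): identity design, so L = 1 and the solution is
# the soft-thresholded observation.
A = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [3.0, 0.1, -2.0, 0.05]
x_hat = homotopy_l1_ls(A, y, lam_target=0.5, L=1.0)
print(x_hat)
```

Starting from the largest useful value of lam keeps the early iterates sparse, which is the property the abstract relies on to obtain geometric convergence at each stage.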

preprint2012arXiv

A Spectral Algorithm for Learning Hidden Markov Models

Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.

preprint2012arXiv

Analysis of a randomized approximation scheme for matrix multiplication

This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors.

preprint2012arXiv

Deviation optimal learning using greedy Q-aggregation

Given a finite family of functions, the goal of model selection aggregation is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection aggregation in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.

preprint2012arXiv

Discussion of "Is Bayes Posterior just Quick and Dirty Confidence?" by D. A. S. Fraser

Discussion of "Is Bayes Posterior just Quick and Dirty Confidence?" by D. A. S. Fraser [arXiv:1112.5582]

preprint2012arXiv

LDPC Decoding with Limited-Precision Soft Information in Flash Memories

This paper investigates the application of low-density parity-check (LDPC) codes to Flash memories. Multiple cell reads with distinct word-line voltages provide limited-precision soft information for the LDPC decoder. The values of the word-line voltages (also called reference voltages) are optimized by maximizing the mutual information (MI) between the input and output of the multiple-read channel. Constraining the maximum mutual-information (MMI) quantization to enforce a constant-ratio constraint provides a significant simplification with no noticeable loss in performance. Our simulation results suggest that for a well-designed LDPC code, the quantization that maximizes the mutual information will also minimize the frame error rate. However, care must be taken to design the code to perform well in the quantized channel. An LDPC code designed for a full-precision Gaussian channel may perform poorly in the quantized setting. Our LDPC code designs provide an example where quantization increases the importance of absorbing sets thus changing how the LDPC code should be optimized. Simulation results show that small increases in precision enable the LDPC code to significantly outperform a BCH code with comparable rate and block length (but without the benefit of the soft information) over a range of frame error rates.
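
As a rough illustration of choosing read thresholds by maximizing mutual information, the sketch below uses a deliberately simplified channel: a single-level cell with two equiprobable stored levels at -1 and +1 and additive Gaussian noise, read with three symmetric thresholds (-q, 0, q). This channel model, the grid search, and all function names are assumptions made for illustration and are not the paper's noise model or optimization procedure.

```python
import math


def gaussian_cdf(t, mean, sigma):
    """P(mean + sigma * Z <= t) for a standard normal Z."""
    return 0.5 * math.erfc((mean - t) / (sigma * math.sqrt(2.0)))


def mutual_information(thresholds, sigma, levels=(-1.0, 1.0)):
    """MI in bits between an equiprobable stored level and the quantized read."""
    edges = [-math.inf] + sorted(thresholds) + [math.inf]
    # p_region[x][r] = P(read falls in region r | stored level x)
    p_region = []
    for mean in levels:
        cdf = [0.0 if e == -math.inf else 1.0 if e == math.inf
               else gaussian_cdf(e, mean, sigma) for e in edges]
        p_region.append([hi - lo for lo, hi in zip(cdf, cdf[1:])])
    p_x = 1.0 / len(levels)
    mi = 0.0
    for r in range(len(edges) - 1):
        p_r = sum(p_x * row[r] for row in p_region)
        for row in p_region:
            if row[r] > 0.0:
                mi += p_x * row[r] * math.log2(row[r] / p_r)
    return mi


def best_symmetric_three_reads(sigma, grid_size=200, q_max=1.0):
    """Grid-search q for thresholds (-q, 0, q); returns (best_q, best_mi)."""
    candidates = [q_max * (i + 1) / grid_size for i in range(grid_size)]
    return max(((q, mutual_information([-q, 0.0, q], sigma)) for q in candidates),
               key=lambda pair: pair[1])


sigma = 0.5  # assumed noise standard deviation
hard_mi = mutual_information([0.0], sigma)
best_q, three_read_mi = best_symmetric_three_reads(sigma)
print(f"one read: {hard_mi:.3f} bits, three reads at q = {best_q:.3f}: {three_read_mi:.3f} bits")
```

The three-read quantizer refines the single hard-decision read, so its mutual information can only be higher; the grid search simply picks the refinement that gains the most under the assumed noise model.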

preprint2012arXiv

Local Measurements of the Superconducting Pairing Symmetry in CuxBi2Se3

Topological superconductors represent a newly predicted phase of matter that is topologically distinct from conventional superconducting condensates of Cooper pairs. As a manifestation of their topological character, topological superconductors support solid-state realizations of Majorana fermions at their boundaries. The recently discovered superconductor Cu$_x$Bi$_2$Se$_3$ has been theoretically proposed as an odd-parity superconductor in the time-reversal-invariant topological superconductor class, and point-contact spectroscopy measurements have reported the observation of zero-bias conductance peaks corresponding to Majorana states in this material. Here we report scanning tunneling spectroscopy (STS) measurements of the superconducting energy gap in Cu$_x$Bi$_2$Se$_3$ as a function of spatial position and applied magnetic field. The tunneling spectrum shows that the density of states at the Fermi level is fully gapped without any in-gap states. The spectrum is well described by the Bardeen-Cooper-Schrieffer (BCS) theory with a momentum-independent order parameter, which suggests that Cu$_{0.2}$Bi$_2$Se$_3$ is a classical s-wave superconductor, contrary to previous expectations and measurements.

preprint2012arXiv

Mutual-Information Optimized Quantization for LDPC Decoding of Accurately Modeled Flash Data

High-capacity NAND flash memories use multi-level cells (MLCs) to store multiple bits per cell and achieve high storage densities. Higher densities cause increased raw bit error rates (BERs), which demand powerful error-correcting codes. Low-density parity-check (LDPC) codes are a well-known class of capacity-approaching codes in AWGN channels. However, LDPC codes traditionally use soft information while the flash read channel provides only hard information. Low-resolution soft information may be obtained by performing multiple reads per cell with distinct word-line voltages. We select the values of these word-line voltages to maximize the mutual information between the input and output of the equivalent multiple-read channel under any specified noise model. Our results show that, for a given quantization level, maximum mutual-information (MMI) quantization provides better soft information for LDPC decoding than the constant-pdf-ratio quantization approach. We also show that adjusting the LDPC code degree distribution for the quantized setting provides a significant performance improvement.

preprint2012arXiv

Partial Gaussian Graphical Model Estimation

This paper studies the partial estimation of Gaussian graphical models from high-dimensional empirical observations. We derive a convex formulation for this problem using $\ell_1$-regularized maximum-likelihood estimation, which can be solved via a block coordinate descent algorithm. Statistical estimation performance can be established for our method. The proposed approach has competitive empirical performance compared to existing methods, as demonstrated by various experiments on synthetic and real datasets.

preprint2012arXiv

Polarization control proposal for Shanghai deep ultraviolet free electron laser

In this paper, a fully coherent radiation option with controllable polarization is proposed for the Shanghai deep ultraviolet free electron laser (FEL) test facility. Intensive start-to-end simulation suggests that the two crossed planar undulators, which generate the horizontally and vertically linearly polarized FEL radiation respectively, should be placed as close together as possible to avoid degrading the polarization performance of the final combined FEL radiation. With a phase shifter between the two crossed radiators, Fourier-transform-limited output radiation with pulse energy on the order of 100 nJ, a full pulse length of 5 ps, and a degree of circular polarization above 90% could be achieved.

preprint2012arXiv

Proximal Stochastic Dual Coordinate Ascent

We introduce a proximal version of the dual coordinate ascent method. We demonstrate how the derived algorithmic framework can be used for numerous regularized loss minimization problems, including $\ell_1$ regularization and structured output SVM. The convergence rates we obtain match, and sometimes improve, state-of-the-art results.

preprint2012arXiv

Status of polarization control experiment at Shanghai deep ultraviolet free electron laser

A polarization control experiment utilizing a pair of crossed undulators has been proposed for the Shanghai deep ultraviolet free electron laser test facility. Numerical simulations indicate that, with the electromagnetic phase shifter located between the two crossed planar undulators, fully coherent radiation with pulse energy on the order of 100 nJ, a pulse length of 5 picoseconds, and a degree of circular polarization above 90% could be generated. The physical design study and the preparation status of the experiment are presented in this paper.

preprint2012arXiv

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes

Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD without such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after $T$ rounds, the suboptimality of the last SGD iterate scales as $O(\log(T)/\sqrt{T})$ for non-smooth convex objective functions, and $O(\log(T)/T)$ in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in Rakhlin et al. (2011) is not as simple to implement). Finally, we provide some experimental illustrations.
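
As a rough illustration of what an averaging scheme "computed on-the-fly" can look like, the sketch below runs SGD on a noisy one-dimensional non-smooth objective, |w - 1|, and maintains a weighted running average with a single recursive update per step. The polynomial-decay weights, the parameter `eta`, the step size `step_scale / sqrt(t)`, and all function names are assumptions made for this example; they are not taken from the abstract and should not be read as the paper's exact scheme.

```python
import math
import random


def sgd_with_running_average(subgrad, w0, T, eta=3.0, step_scale=1.0, seed=0):
    """Run T steps of SGD and keep a weighted running average of the iterates.

    The average is updated recursively, so no past iterates are stored.
    The weight (eta + 1) / (t + eta) is an illustrative choice that puts
    more mass on recent iterates than a uniform average would.
    """
    rng = random.Random(seed)
    w = w0
    avg = w0
    for t in range(1, T + 1):
        g = subgrad(w, rng)                      # noisy subgradient at w
        w = w - step_scale / math.sqrt(t) * g    # step size ~ 1/sqrt(t)
        a = (eta + 1.0) / (t + eta)              # equals 1 at t = 1
        avg = (1.0 - a) * avg + a * w            # on-the-fly averaging
    return w, avg


def noisy_abs_subgrad(w, rng):
    """Subgradient of |w - 1| plus zero-mean Gaussian noise (hypothetical test problem)."""
    sign = 1.0 if w > 1.0 else (-1.0 if w < 1.0 else 0.0)
    return sign + rng.gauss(0.0, 1.0)


last_iterate, averaged = sgd_with_running_average(noisy_abs_subgrad, w0=5.0, T=5000)
print("last iterate:", last_iterate, "averaged:", averaged)
```

Because the update touches only the current iterate and the previous average, memory use stays constant in `T`, which is the practical appeal of an on-the-fly scheme over one that needs a stored suffix of iterates.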

preprint2011arXiv

A tail inequality for quadratic forms of subgaussian random vectors

We prove an exponential probability tail inequality for positive semidefinite quadratic forms in a subgaussian random vector. The bound is analogous to one that holds when the vector has independent Gaussian entries.

preprint2011arXiv

Dimension-free tail inequalities for sums of random matrices

We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds.

preprint2011arXiv

Efficient Optimal Learning for Contextual Bandits

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives a reward for the action taken. We provide the first efficient algorithm with optimal regret. Our algorithm uses a cost-sensitive classification learner as an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.

preprint2011arXiv

Multi-stage Convex Relaxation for Feature Selection

A number of recent works have studied the effectiveness of feature selection using Lasso. It is known that under the restricted isometry property (RIP), Lasso does not generally lead to the exact recovery of the set of nonzero coefficients, due to the looseness of convex relaxation. This paper considers the feature selection property of nonconvex regularization, where the solution is given by a multi-stage convex relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure recovers the set of nonzero coefficients without suffering from the bias of Lasso relaxation, which complements the parameter estimation results for this procedure.

preprint2011arXiv

Power-Law Decay of Standing Waves on the Surface of Topological Insulators

We propose a general theory on the standing waves (quasiparticle interference pattern) caused by the scattering of surface states off step edges in topological insulators, in which the extremal points on the constant energy contour of surface band play the dominant role. Experimentally we image the interference patterns on both Bi$_2$Te$_3$ and Bi$_2$Se$_3$ films by measuring the local density of states using a scanning tunneling microscope. The observed decay indices of the standing waves agree excellently with the theoretical prediction: In Bi$_2$Se$_3$, only a single decay index of -3/2 exists; while in Bi$_2$Te$_3$ with strongly warped surface band, it varies from -3/2 to -1/2 and finally to -1 as the energy increases. The -1/2 decay indicates that the suppression of backscattering due to time-reversal symmetry does not necessarily lead to a spatial decay rate faster than that in the conventional two-dimensional electron system. Our formalism can also explain the characteristic scattering wave vectors of the standing wave caused by non-magnetic impurities on Bi$_2$Te$_3$.

preprint2011arXiv

Sparse Recovery with Orthogonal Matching Pursuit under RIP

This paper presents a new analysis for the orthogonal matching pursuit (OMP) algorithm. It is shown that if the restricted isometry property (RIP) is satisfied at sparsity level $O(\bar{k})$, then OMP can recover a $\bar{k}$-sparse signal in 2-norm. For compressed sensing applications, this result implies that in order to uniformly recover a $\bar{k}$-sparse signal in $\mathbb{R}^d$, only $O(\bar{k} \ln d)$ random projections are needed. This analysis improves earlier results on OMP that depend on stronger conditions such as mutual incoherence that can only be satisfied with $\Omega(\bar{k}^2 \ln d)$ random projections.
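
For readers unfamiliar with the algorithm being analyzed, the greedy loop of OMP can be sketched as follows. This is a minimal NumPy illustration and not code from the paper: the function name `omp`, the fixed iteration count `n_iters`, and the random test problem are assumptions chosen for the example, and the sketch makes no claim about when recovery succeeds.

```python
import numpy as np


def omp(A, y, n_iters):
    """Orthogonal matching pursuit: greedily build a support, refit by least squares.

    A is an (n, d) design matrix, y an (n,) observation vector, and n_iters
    the number of greedy selections (a hypothetical stopping rule).
    """
    n, d = A.shape
    support = []
    residual = y.astype(float).copy()
    coef = np.zeros(0)
    for _ in range(n_iters):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect a chosen column
        j = int(np.argmax(correlations))
        support.append(j)
        # Refit all selected coefficients jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(d)
    x[support] = coef
    return x


# Illustrative compressed-sensing style problem with fewer rows than columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [1.5, -2.0, 1.0]
x_hat = omp(A, A @ x_true, n_iters=3)
print("2-norm recovery error:", np.linalg.norm(x_hat - x_true))
```

The joint least-squares refit is what distinguishes OMP from plain matching pursuit: after each selection the residual is orthogonal to every chosen column, so a column is not picked for the same reason twice.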

preprint2011arXiv

Spectral Methods for Learning Multivariate Latent Tree Structure

This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite-sample bounds for exact recovery of the tree structure reveal certain natural dependencies on statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.

preprint2011arXiv

Truncated Power Method for Sparse Eigenvalue Problems

This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most $k$ non-zero components. We propose a simple yet effective solution called the truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest $k$-subgraph problem. Extensive experiments on several synthetic and real-world large-scale datasets demonstrate the competitive empirical performance of our method.
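
The idea behind a truncated power iteration can be sketched in a few lines: take an ordinary power-method step, keep only the $k$ largest-magnitude entries, and renormalize. The code below is a minimal NumPy sketch under that reading of the abstract; the function names, the fixed iteration count, and the toy matrix are assumptions for illustration, and details such as initialization and stopping criteria in the paper may differ.

```python
import numpy as np


def truncate(v, k):
    """Zero out all but the k largest-magnitude entries of v."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out


def truncated_power_method(A, k, x0, n_iters=100):
    """Power iteration with a hard k-sparse truncation after every multiply.

    A is a symmetric (d, d) matrix, x0 a nonzero float starting vector.
    The result depends on x0, since the underlying problem is nonconvex.
    """
    x = truncate(x0, k)
    x = x / np.linalg.norm(x)
    for _ in range(n_iters):
        y = truncate(A @ x, k)       # power step, then keep top-k entries
        norm = np.linalg.norm(y)
        if norm == 0.0:              # degenerate case: nothing left to normalize
            break
        x = y / norm
    return x


# Toy matrix whose dominant eigenvector is supported on the first two coordinates.
u = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
A = 0.1 * np.eye(6) + 5.0 * np.outer(u, u)
x0 = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
x = truncated_power_method(A, k=2, x0=x0)
print("estimated sparse eigenvector:", x)
```

Each iteration costs one matrix-vector product plus a partial sort, so the sketch scales like the ordinary power method; the price is sensitivity to the starting vector, which is why the choice of `x0` matters in practice.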

preprint2010arXiv

Agnostic Active Learning Without Constraints

We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.

preprint2010arXiv

Landau Quantization of Massless Dirac Fermions in Topological Insulator

The recent theoretical prediction and experimental realization of topological insulators (TI) has generated intense interest in this new state of quantum matter. The surface states of a three-dimensional (3D) TI such as Bi$_2$Te$_3$, Bi$_2$Se$_3$ and Sb$_2$Te$_3$ consist of a single massless Dirac cone. Crossing of the two surface state branches with opposite spins in the materials is fully protected by the time reversal (TR) symmetry at the Dirac points, which cannot be destroyed by any TR invariant perturbation. Recent advances in thin-film growth have permitted this unique two-dimensional electron system (2DES) to be probed by scanning tunneling microscopy (STM) and spectroscopy (STS). The intriguing TR symmetry protected topological states were revealed in STM experiments where the backscattering induced by non-magnetic impurities was forbidden. Here we report the Landau quantization of the topological surface states in Bi$_2$Se$_3$ in a magnetic field by using STM/STS. The direct observation of the discrete Landau levels (LLs) strongly supports the 2D nature of the topological states and gives direct proof of the nondegenerate structure of LLs in TI. We demonstrate the linear dispersion of the massless Dirac fermions by the square-root dependence of LLs on magnetic field. The formation of LLs implies the high mobility of the 2DES, which has been predicted to lead to the topological magneto-electric effect of the TI.

preprint2010arXiv

Robust Matrix Decomposition with Outliers

Suppose a given observation matrix can be decomposed as the sum of a low-rank matrix and a sparse matrix (outliers), and the goal is to recover these individual components from the observed sum. Such additive decompositions have applications in a variety of numerical problems including system identification, latent variable graphical modeling, and principal components analysis. We study conditions under which recovering such a decomposition is possible via a combination of $\ell_1$ norm and trace norm minimization. We are specifically interested in the question of how many outliers are allowed so that convex programming can still achieve accurate recovery, and we obtain stronger recovery guarantees than previous studies. Moreover, we do not assume that the spatial pattern of outliers is random, which stands in contrast to related analyses under such assumptions via matrix completion.

preprint2009arXiv

Experimental demonstration of the topological surface states protected by the time-reversal symmetry

We report direct imaging of standing waves of the nontrivial surface states of the topological insulator Bi$_2$Te$_3$ by using a low-temperature scanning tunneling microscope. The interference fringes are caused by the scattering of the topological states off Ag impurities and step edges on the Bi$_2$Te$_3$(111) surface. By studying the voltage-dependent standing wave patterns, we determine the energy dispersion $E(k)$, which confirms the Dirac cone structure of the topological states. We further show that, very different from the conventional surface states, the backscattering of the topological states by nonmagnetic impurities is completely suppressed. The absence of backscattering is a spectacular manifestation of the time-reversal symmetry, which offers a direct proof of the topological nature of the surface states.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arxivconfidence 95%

external id: arxiv:2605.02439:author:5:tong-zhang

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.02438:author:5:tong-zhang

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.06654:author:3:tong-zhang

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.18747:author:40:tong-zhang

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2604.25917:author:9:tong-zhang