Researcher profile

Yue Cao

Yue Cao contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
38 works
0 followers
12 topics
4 close collaborators

Published work

38 published item(s)

preprint, 2026, arXiv

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.
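
As a rough illustration of the entropic optimal transport formulation mentioned in the abstract above, the following minimal sketch runs plain Sinkhorn iterations between two small histograms. It is a generic textbook solver, not the OT-Bridge Editor itself; the function name `sinkhorn`, the regularization strength `eps`, and the toy cost matrix are assumptions made for this example.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropic OT between histograms a and b via Sinkhorn iterations.

    Returns the transport plan P (list of rows) whose row sums approach a
    and whose column sums match b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps); larger eps gives a smoother plan.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: moving mass between two bins costs 1, staying put costs 0.
plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup most of the mass stays on the zero-cost diagonal, and the entropic term spreads a small remainder across the off-diagonal entries.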

preprint, 2024, arXiv

Plethystic Murnaghan-Nakayama rule via vertex operators

Based on the vertex operator realization of the Schur functions, a determinant-type plethystic Murnaghan--Nakayama rule is obtained and utilized to derive a general formula of the expansion coefficients of $s_ν$ in the plethysm product $(p_{n}\circ h_{k})s_μ$. Meanwhile, the equivalence between our algebraic rule and the combinatorial one is also established. As an application, we provide a simple way to compute the generalized Waring formula.

preprint, 2024, arXiv

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

Large-scale text-to-image diffusion models achieve unprecedented success in image generation and editing. However, how to extend such success to video editing is unclear. Recent initial attempts at video editing require significant text-to-video data and computation resources for training, which is often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video. At the core of our method is a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code is made available at https://github.com/baaivision/vid2vid-zero.

preprint, 2023, arXiv

High-throughput combinatorial approach expedites the synthesis of a lead-free relaxor ferroelectric system

Developing novel lead-free ferroelectric materials is crucial for next-generation microelectronic technologies that are energy efficient and environment friendly. However, materials discovery and property optimization are typically time-consuming due to the limited throughput of traditional synthesis methods. In this work, we use a high-throughput combinatorial synthesis approach to fabricate lead-free ferroelectric superlattices and solid solutions of (Ba0.7Ca0.3)TiO3 (BCT) and Ba(Zr0.2Ti0.8)O3 (BZT) phases with continuous variation of composition and layer thickness. High-resolution X-ray diffraction (XRD) and analytical scanning transmission electron microscopy (STEM) demonstrate high film quality and well-controlled compositional gradients. Ferroelectric and dielectric property measurements identify the optimal property point achieved at the morphotropic phase boundary (MPB) with a composition of 48BZT-52BCT. Displacement vector maps reveal that ferroelectric domain sizes are tunable by varying {BCT-BZT}N superlattice geometry. This high-throughput synthesis approach can be applied to many other material systems to expedite new materials discovery and properties optimization, allowing for the exploration of a large area of phase space within a single growth.

preprint, 2023, arXiv

Surveys of clumps, cores, and condensations in Cygnus-X: Searching for circumstellar disks

To investigate whether disk-mediated accretion is the primary mechanism in high-mass star formation, we have established a survey of a large sample of massive dense cores within a giant molecular cloud. We used high angular resolution ($\sim 1.8''$) observations with SMA to study the dust emission and molecular line emission of about 50 massive dense cores in Cygnus-X. At a typical distance of 1.4 kpc for Cygnus-X, these massive dense cores are resolved into $\sim 2000$ au condensations. We combined the CO outflow emission and gas kinematics traced by several high-density tracers to search for disk candidates. We extracted hundreds of dust condensations from the SMA 1.3 mm dust continuum emission. The CO data show bipolar or unipolar outflow signatures toward 49 dust condensations. Among them, only 27 sources are detected in dense gas tracers, which reveal the gas kinematics, and nine sources show evidence of rotating envelopes, suggesting the existence of embedded accretion disks. The position-velocity diagrams along the velocity gradient of all rotating condensations suggest that four condensations may host Keplerian-like disks. A detailed investigation of the 27 sources detected in dense gas tracers suggests that the nine disk candidates are at earlier evolutionary stages compared to the remaining 18 sources. Non-detection of rotating disks in our sample may be due to several factors, including an unknown inclination angle of the rotation axis and an early evolutionary stage of the central source, and the latter could be important, considering that young and powerful outflows could confuse the observational evidence for rotation. The detection rate of disk candidates in our sample is 1/3, which confirms that disk accretion is a viable mechanism for high-mass star formation, although it may not be the only one.

preprint, 2022, arXiv

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Recently, open-vocabulary image classification by vision language pre-training has achieved remarkable results: the model can classify arbitrary categories without seeing additional annotated images of those categories. However, it is still unclear how to make open-vocabulary recognition work well on broader vision problems. This paper targets open-vocabulary semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. However, semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation operates on pixels, whereas CLIP operates on whole images. To remedy the discrepancy in processing granularity, we reject the prevalent one-stage FCN based framework and advocate a two-stage semantic segmentation framework, with the first stage extracting generalizable mask proposals and the second stage leveraging an image based CLIP model to perform open-vocabulary classification on the masked image crops which are generated in the first stage. Our experimental results show that this two-stage framework can achieve better performance than FCN when trained only on the COCO Stuff dataset and evaluated on other datasets without fine-tuning. Moreover, this simple framework also surpasses the previous state of the art in zero-shot semantic segmentation by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset, and +8.9 hIoU on the COCO Stuff dataset. With its simplicity and strong performance, we hope this framework will serve as a baseline to facilitate future research. The code is publicly available at https://github.com/MendelXu/zsseg.baseline.

preprint, 2022, arXiv

Contrastive Information Transfer for Pre-Ranking Systems

Real-world search and recommender systems usually adopt a multi-stage ranking architecture, including matching, pre-ranking, ranking, and re-ranking. Previous works mainly focus on the ranking stage while very few focus on the pre-ranking stage. In this paper, we focus on the information transfer from the ranking stage to the pre-ranking stage. We propose a new Contrastive Information Transfer (CIT) framework to transfer useful information from the ranking model to the pre-ranking model. We train the pre-ranking model to distinguish the positive pair of representations from a set of positive and negative pairs with a contrastive objective. As a consequence, the pre-ranking model can make full use of the rich information in the ranking model's representations. The CIT framework also has the advantage of alleviating selection bias and improving the performance of recall metrics, which is crucial for pre-ranking models. We conduct extensive experiments including offline datasets and online A/B testing. Experimental results show that CIT achieves better results than competitive models. In addition, a strict online A/B test at one of the world's largest e-commerce platforms shows that the proposed model achieves a 0.63% improvement in CTR and a 1.64% improvement in VBR. The proposed model has now been deployed online and serves the main traffic of this system, contributing to remarkable business growth.
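
The contrastive objective described in the abstract above can be sketched as an InfoNCE-style loss in which each pre-ranking (student) representation should match the ranking (teacher) representation of the same item rather than those of other items. The abstract does not state the exact loss, so the cosine similarity, the temperature value, and the use of in-batch negatives are assumptions made for illustration, and `info_nce` is a hypothetical helper name.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)


def info_nce(student, teacher, temperature=0.1):
    """Mean InfoNCE loss: student[i] should match teacher[i], not teacher[j != i]."""
    total = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        # Numerically stable log-sum-exp over the positive and all negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(student)


teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # student agrees with the teacher
swapped = [[0.1, 0.9], [0.9, 0.1]]   # student confuses the two items
loss_aligned = info_nce(aligned, teacher)
loss_swapped = info_nce(swapped, teacher)
```

The aligned student yields the lower loss, which is the signal that would pull pre-ranking representations toward the ranking model's representations during training.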

preprint, 2022, arXiv

Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy is improved by +1.5 mIoU / +1.1 mAP to 61.4 mIoU / 64.2 mAP on ADE20K semantic segmentation and COCO object detection, respectively, creating new records on both benchmarks. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.

preprint, 2022, arXiv

Correlation-Aware Deep Tracking

Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminatively model the tracked targets and distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. In contrast to the Siamese-like feature extraction, our network deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it is able to suppress non-target features, resulting in instance-varying feature extraction. The output features of the search image can be directly used for predicting target locations without extra correlation step. Moreover, our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods. Extensive experiments show our method achieves the state-of-the-art results while running at real-time. Our feature networks also can be applied to existing tracking pipelines seamlessly to raise the tracking performance. Code will be available.

preprint, 2022, arXiv

Deep Reinforcement Learning-Based Long-Range Autonomous Valet Parking for Smart Cities

To reduce congestion in the city center and increase each user's quality of experience (QoE), this paper presents a long-range autonomous valet parking (LAVP) framework in which an autonomous vehicle (AV) deployed in the city picks up and drops off users at their required spots and then drives autonomously to a car park outside the city center. In this framework, we aim to minimize the overall distance traveled by the AV while guaranteeing that all users are served, i.e., picked up and dropped off at their required spots, by optimizing the AV's path planning and the number of serving time slots. To this end, we first propose a learning-based algorithm, named Double-Layer Ant Colony Optimization (DL-ACO), to solve this problem iteratively. Then, to make real-time decisions in a dynamic environment (i.e., the AV may pick up and drop off users at different locations), we further present a deep reinforcement learning (DRL) based algorithm, namely a deep Q network (DQN). The experimental results show that both the DL-ACO and DQN-based algorithms achieve considerable performance.

preprint, 2022, arXiv

Global spherically symmetric solutions to degenerate compressible Navier-Stokes equations with large data and far field vacuum

We consider the initial-boundary value problem (IBVP) for the isentropic compressible Navier-Stokes equations (CNS) in the domain exterior to a ball in $\mathbb R^d$ $(d=2\ \text{or} \ 3)$. When the viscosity coefficients are given as a constant multiple of the mass density $ρ$, based on some analysis of the nonlinear structure of this system, we prove the global existence of the unique spherically symmetric classical solution for (large) initial data with spherical symmetry and far field vacuum in some inhomogeneous Sobolev spaces. Moreover, the solutions we obtain have conserved total mass and finite total energy. The density $ρ$ remains positive in the domain considered but decays to zero in the far field, which is consistent with the facts that the total mass is conserved, and CNS is a model of non-dilute fluids where $ρ$ is bounded away from the vacuum. To prove the existence, on the one hand, we consider a well-designed reformulated structure by introducing some new variables, which can transfer the degeneracies of the time evolution and the viscosity to the possible singularity of some special source terms. On the other hand, it is observed that, for the spherically symmetric flow, the radial projection of the so-called effective velocity $\boldsymbol{v} =U+\nabla φ(ρ)$ ($U$ is the velocity of the fluid, and $φ(ρ)$ is a function of $ρ$ defined via the shear viscosity coefficient $μ(ρ)$: $φ'(ρ)=2μ(ρ)/ρ^2$) satisfies a damped transport equation, which makes it possible to obtain its upper bound. Then, combined with the BD entropy estimates, one can obtain the required uniform a priori estimates of the solution. It is worth pointing out that the framework for the well-posedness theory established here can be applied to the shallow water equations.

preprint, 2022, arXiv

iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade. Visual learning through image-text alignment, however, has emerged to show promising performance, especially for zero-shot recognition. We believe that these two learning tasks are complementary, and suggest combining them for better visual learning. We propose a deep fusion method with three adaptations that effectively bridge two learning tasks, rather than shallow fusion through naive multi-task learning. First, we modify the previous common practice in image classification, a linear classifier, with a cosine classifier which shows comparable performance. Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder as a meta network to generate category classifier weights. The learnt text encoder is shared between image classification and image-text alignment. Third, we enrich each class name with a description to avoid confusion between classes and make the classification method closer to the image-text alignment. We prove that this deep fusion approach performs better on a variety of visual recognition tasks and setups than the individual learning or shallow fusion approach, from zero-shot/few-shot image classification, such as the Kornblith 12-dataset benchmark, to downstream tasks of action recognition, semantic segmentation, and object detection in fine-tuning and open-vocabulary settings. The code will be available at https://github.com/weiyx16/iCAR.

preprint, 2022, arXiv

Incorporating Semi-Supervised and Positive-Unlabeled Learning for Boosting Full Reference Image Quality Assessment

Full-reference (FR) image quality assessment (IQA) evaluates the visual quality of a distorted image by measuring its perceptual difference from a pristine-quality reference, and has been widely used in low-level vision tasks. Pairwise labeled data with mean opinion scores (MOS) are required to train an FR-IQA model, but they are time-consuming and cumbersome to collect. In contrast, unlabeled data can be easily collected from an image degradation or restoration process, which makes it attractive to exploit unlabeled training data to boost FR-IQA performance. Moreover, due to the distribution inconsistency between labeled and unlabeled data, outliers may occur in unlabeled data, further increasing the training difficulty. In this paper, we propose incorporating semi-supervised and positive-unlabeled (PU) learning to exploit unlabeled data while mitigating the adverse effect of outliers. Particularly, by treating all labeled data as positive samples, PU learning is leveraged to identify negative samples (i.e., outliers) from unlabeled data. Semi-supervised learning (SSL) is further deployed to exploit positive unlabeled data by dynamically generating pseudo-MOS. We adopt a dual-branch network including reference and distortion branches. Furthermore, spatial attention is introduced in the reference branch to concentrate more on the informative regions, and sliced Wasserstein distance is used for robust difference map computation to address the misalignment issues caused by images recovered by GAN models. Extensive experiments show that our method performs favorably against state-of-the-art methods on the benchmark datasets PIPAL, KADID-10k, TID2013, LIVE and CSIQ.

preprint, 2022, arXiv

Network of Star Formation: Fragmentation controlled by scale-dependent turbulent pressure and accretion onto the massive cores revealed in the Cygnus-X GMC complex

Molecular clouds have complex density structures produced by processes including turbulence and gravity. We propose a triangulation-based method to dissect the density structure of a molecular cloud and study the interactions between dense cores and their environments. In our approach, a Delaunay triangulation is constructed, which consists of edges connecting these cores. Starting from this construction, we study the physical connections between neighboring dense cores and the ambient environment in a systematic fashion. We apply our method to the Cygnus-X massive GMC complex and find that the core separation is related to the mean surface density by $Σ_{\rm edge} \propto l_{\rm core}^{-0.28}$, which can be explained by fragmentation controlled by a scale-dependent turbulent pressure (where the pressure is a function of scale, e.g. $p\sim l^{2/3}$). We also find that the masses of low-mass cores ($M_{\rm core} < 10\, M_{\odot}$) are determined by fragmentation, whereas massive cores ($M_{\rm core} > 10\, M_{\odot}$) grow mostly through accretion. The transition from fragmentation to accretion coincides with the transition from a log-normal core mass function (CMF) to a power-law CMF. By constructing surface density profiles measured along edges that connect neighboring cores, we find evidence that the massive cores have accreted a significant fraction of gas from their surroundings and thus depleted the gas reservoir. Our analysis reveals a picture where cores form through fragmentation controlled by scale-dependent turbulent pressure support, followed by accretion onto the massive cores, and the method can be applied to different regions to achieve deeper understandings in the future.

preprint, 2022, arXiv

On Data Scaling in Masked Image Modeling

An important goal of self-supervised learning is to enable model pre-training to benefit from almost unlimited data. However, one method that has recently become popular, namely masked image modeling (MIM), is suspected to be unable to benefit from larger data. In this work, we break this misconception through extensive experiments, with data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion, and training lengths ranging from 125K iterations to 500K iterations. Our study reveals that: (i) Masked image modeling is also demanding on larger data. We observed that very large models got over-fitted with relatively small data; (ii) The length of training matters. Large models trained with masked image modeling can benefit from more data with longer training; (iii) The validation loss in pre-training is a good indicator to measure how well the model performs for fine-tuning on multiple tasks. This observation allows us to pre-evaluate pre-trained models in advance without having to make costly trial-and-error assessments of downstream tasks. We hope that our findings will advance the understanding of masked image modeling in terms of scaling ability.

preprint, 2022, arXiv

Pre-Trained Neural Language Models for Automatic Mobile App User Feedback Answer Generation

Studies show that developers' answers to mobile app users' feedback on app stores can increase the apps' star rating. To help app developers generate answers that are related to the users' issues, recent studies develop models to generate the answers automatically. Aims: The app response generation models use deep neural networks and require training data. Pre-Trained neural language Models (PTM) used in Natural Language Processing (NLP) take advantage of the information they learned from large corpora in an unsupervised manner, and can reduce the amount of required training data. In this paper, we evaluate PTMs to generate replies to mobile app user feedback. Method: We train a Transformer model from scratch and fine-tune two PTMs to evaluate the generated responses, which are compared to RRGEN, a current app response model. We also evaluate the models with different portions of the training data. Results: The results on a large dataset evaluated by automatic metrics show that PTMs obtain lower scores than the baselines. However, our human evaluation confirms that PTMs can generate more relevant and meaningful responses to the posted feedback. Moreover, the performance of PTMs drops less than that of other models when the amount of training data is reduced to 1/3. Conclusion: PTMs are useful in generating responses to app reviews and are more robust to the amount of training data provided. However, their prediction time is 19 times that of RRGEN. This study can provide new avenues for research in adapting PTMs for analyzing mobile app user feedback. Index Terms: mobile app user feedback analysis, neural pre-trained language models, automatic answer generation

preprint, 2022, arXiv

Revealing the Dark Secrets of Masked Image Modeling

Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, the visualizations and the experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, but supervised models tend to focus locally at lower layers but more globally at higher layers. That may be the reason why MIM helps Vision Transformers that have a very large receptive field to optimize. Using MIM, the model can maintain a large diversity on attention heads in all layers. But for supervised models, the diversity on attention heads almost disappears from the last three layers and less diversity harms the fine-tuning performance. From the experiments, we find that MIM models can perform significantly better on geometric and motion tasks with weak semantics or fine-grained classification tasks, than their supervised counterparts. Without bells and whistles, a standard MIM pre-trained SwinV2-L could achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For the semantic understanding datasets where the categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.

preprint, 2022, arXiv

Sampling Is All You Need on Modeling Long-Term User Behaviors for CTR Prediction

Rich user behavior data has been proven to be of great value for Click-Through Rate (CTR) prediction applications, especially in industrial recommender, search, or advertising systems. However, it's non-trivial for real-world systems to make full use of long-term user behaviors due to the strict requirements of online serving time. Most previous works adopt the retrieval-based strategy, where a small number of user behaviors are retrieved first for subsequent attention. However, the retrieval-based methods are sub-optimal and cause some information loss, and it's difficult to balance the effectiveness and efficiency of the retrieval algorithm. In this paper, we propose SDIM (Sampling-based Deep Interest Modeling), a simple yet effective sampling-based end-to-end approach for modeling long-term user behaviors. We sample from multiple hash functions to generate hash signatures of the candidate item and each item in the user behavior sequence, and obtain the user interest by directly gathering behavior items associated with the candidate item with the same hash signature. We show theoretically and experimentally that the proposed method performs on par with standard attention-based models on modeling long-term user behaviors, while being substantially faster. We also introduce the deployment of SDIM in our system. Specifically, we decouple the behavior sequence hashing, which is the most time-consuming part, from the CTR model by designing a separate module named BSE (Behavior Sequence Encoding). BSE is latency-free for the CTR server, enabling us to model extremely long user behaviors. Both offline and online experiments are conducted to demonstrate the effectiveness of SDIM. SDIM has now been deployed online in the search system of the Meituan app.
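
The hash-and-gather step described in the abstract above can be sketched as follows. The abstract does not name the hash family or the aggregation rule, so the random-hyperplane (SimHash) signatures, the mean pooling of collided items, and the averaging over hash functions are all assumptions made for illustration; `make_hashes`, `signature`, and `sampled_interest` are hypothetical helper names.

```python
import random


def make_hashes(dim, n_hashes, bits_per_hash, seed=0):
    """Sample n_hashes hash functions, each made of bits_per_hash random hyperplanes."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_hash)]
            for _ in range(n_hashes)]


def signature(vec, planes):
    """SimHash signature: one sign bit per hyperplane."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes)


def sampled_interest(candidate, behaviors, hashes):
    """Pool behavior items whose signature collides with the candidate's."""
    dim = len(candidate)
    interest = [0.0] * dim
    for planes in hashes:
        target = signature(candidate, planes)
        matched = [b for b in behaviors if signature(b, planes) == target]
        if matched:
            for d in range(dim):
                # Mean of collided items, averaged over all hash functions.
                interest[d] += sum(b[d] for b in matched) / len(matched) / len(hashes)
    return interest


hashes = make_hashes(dim=2, n_hashes=8, bits_per_hash=2)
candidate = [1.0, 0.0]
behaviors = [[1.0, 0.0], [0.9, -0.05], [-1.0, 0.0]]
interest = sampled_interest(candidate, behaviors, hashes)
```

Because similar vectors collide more often than dissimilar ones, the pooled result weights behaviors roughly by their similarity to the candidate without computing attention scores over the whole sequence. The behavior signatures do not depend on the candidate, which is what allows them to be precomputed in a separate module.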

preprint, 2022, arXiv

Self-supervised Learning from 100 Million Medical Images

Building accurate and robust artificial intelligence systems for medical image assessment requires not only the research and design of advanced deep learning models but also the creation of large and curated sets of annotated training examples. Constructing such datasets, however, is often very costly -- due to the complex nature of annotation tasks and the high level of expertise required for the interpretation of medical images (e.g., expert radiologists). To counter this limitation, we propose a method for self-supervised learning of rich image features based on contrastive learning and online feature clustering. For this purpose we leverage large training datasets of over 100,000,000 medical images of various modalities, including radiography, computed tomography (CT), magnetic resonance (MR) imaging and ultrasonography. We propose to use these features to guide model training in supervised and hybrid self-supervised/supervised regime on various downstream tasks. We highlight a number of advantages of this strategy on challenging image assessment problems in radiography, CT and MR: 1) Significant increase in accuracy compared to the state-of-the-art (e.g., AUC boost of 3-7% for detection of abnormalities from chest radiography scans and hemorrhage detection on brain CT); 2) Acceleration of model convergence during training by up to 85% compared to using no pretraining (e.g., 83% when training a model for detection of brain metastases in MR scans); 3) Increase in robustness to various image augmentations, such as intensity variations, rotations or scaling reflective of data variation seen in the field.

preprint, 2022, arXiv

SimMIM: A Simple Framework for Masked Image Modeling

This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as block-wise masking and tokenization via discrete VAE or clustering. To understand what lets the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs of each component yield very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting raw pixels of RGB values by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G), with which, using $40\times$ less data than in previous practice, we achieve state-of-the-art results on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/SimMIM.
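
Components 1) and 2) in the abstract above, random patch masking and direct pixel regression, can be sketched in a few lines. This is a simplified single-channel illustration rather than the released SimMIM code: the L1 form of the regression loss, the restriction of the loss to masked pixels, the mask ratio, and the helper names `random_patch_mask` and `masked_l1` are assumptions, and the image size is assumed to be divisible by the patch size.

```python
import random


def random_patch_mask(height, width, patch, mask_ratio, seed=0):
    """Return a height x width 0/1 mask in which whole patches are masked at random."""
    rows, cols = height // patch, width // patch
    n_patches = rows * cols
    n_masked = round(n_patches * mask_ratio)
    masked = set(random.Random(seed).sample(range(n_patches), n_masked))
    return [[1 if (y // patch) * cols + (x // patch) in masked else 0
             for x in range(width)]
            for y in range(height)]


def masked_l1(pred, target, mask):
    """Mean absolute error computed over masked pixels only."""
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


# A 64 x 64 image with 32-pixel patches has 4 patches; half of them are masked.
mask = random_patch_mask(64, 64, patch=32, mask_ratio=0.5)
pred = [[0.0] * 64 for _ in range(64)]
target = [[1.0] * 64 for _ in range(64)]
loss = masked_l1(pred, target, mask)
```

Restricting the loss to masked pixels keeps the task one of prediction rather than reconstruction of visible content, which is one common way to set up such an objective.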

preprint · 2022 · arXiv

Swin Transformer V2: Scaling Up Capacity and Resolution

Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger for labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Our training is also much more efficient than that of Google's billion-level visual models, consuming 40 times less labelled data and 40 times less training time. Code is available at https://github.com/microsoft/Swin-Transformer.
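Two of the techniques named in the Swin Transformer V2 abstract, cosine attention and log-spaced coordinates, can be sketched in a few lines of NumPy. The sketch below is a single-head illustration under stated assumptions: the function names, the fixed temperature `tau=0.1` and the optional additive `bias` argument are hypothetical choices, and the residual-post-norm layout and the network that would produce the bias are omitted.

```python
import numpy as np

def log_spaced(delta):
    """Map relative offsets to log space: sign(d) * log(1 + |d|)."""
    return np.sign(delta) * np.log1p(np.abs(delta))

def cosine_attention(q, k, v, tau=0.1, bias=None):
    """Attention whose logits are cosine similarities divided by tau.

    q: (n, d), k: (m, d), v: (m, d_v); `tau` is an assumed fixed value.
    """
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = qn @ kn.T / tau
    if bias is not None:
        logits = logits + bias
    # Numerically stable softmax over the key dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the queries and keys are normalised, the logits are bounded by 1/tau regardless of activation magnitude, which is one way to read the stability claim. Log spacing compresses large offsets, so extrapolating to a larger window covers a much smaller input range than it would with linear coordinates.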

preprint · 2021 · arXiv

ALMA observations of NGC 6334S. II. Subsonic and Transonic Narrow Filaments in a High-mass Star Formation Cloud

We present a study of narrow filaments toward a massive infrared dark cloud, NGC 6334S, using the Atacama Large Millimeter/submillimeter Array (ALMA). Thirteen gas filaments are identified using the H$^{13}$CO$^{+}$ line, while a single continuum filament is revealed by the continuum emission. The filaments present a compact radial distribution with a median filament width of $\sim$0.04 pc, narrower than the previously proposed 'quasi-universal' 0.1 pc filament width. The higher spatial resolution observations and higher-density gas tracer tend to identify even narrower and lower mass filaments. The filament widths are roughly twice the size of embedded cores. The gas filaments are largely supported by thermal motions. The nonthermal motions are predominantly subsonic and transonic in both identified gas filaments and embedded cores, which may imply that stars are likely born in environments of low turbulence. A fraction of embedded objects show a narrower velocity dispersion compared with their corresponding natal filaments, which may indicate that the turbulent dissipation is taking place in these embedded cores. The physical properties (mass, mass per unit length, gas kinematics, and width) of gas filaments are analogous to those of narrow filaments found in low- to high-mass star-forming regions. The more evolved sources are found to be farther away from the filaments, a situation that may have resulted from the relative motions between the YSOs and their natal filaments.

preprint · 2021 · arXiv

ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase Generation

We propose ParaSCI, the first large-scale paraphrase dataset in the scientific field, including 33,981 paraphrase pairs from ACL (ParaSCI-ACL) and 316,063 pairs from arXiv (ParaSCI-arXiv). Digging into characteristics and common patterns of scientific papers, we construct this dataset through intra-paper and inter-paper methods, such as collecting citations to the same paper or aggregating definitions by scientific terms. To take advantage of partially paraphrased sentences, we propose PDBERT as a general paraphrase discovery method. The major advantages of paraphrases in ParaSCI lie in their prominent length and textual diversity, which are complementary to existing paraphrase datasets. ParaSCI obtains satisfactory results on human evaluation and downstream tasks, especially long paraphrase generation.

preprint · 2021 · arXiv

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first task directly applies contrastive learning at the pixel level. We additionally propose a pixel-to-propagation consistency task that produces better results, even surpassing the state-of-the-art approaches by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2 mIoU when transferred to Pascal VOC object detection (C4), COCO object detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50 backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the previous best methods built on instance-level contrastive learning. Moreover, the pixel-level pretext tasks are found to be effective for pre-training not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods. These results demonstrate the strong potential of defining pretext tasks at the pixel level, and suggest a new path forward in unsupervised visual representation learning. Code is available at https://github.com/zdaxie/PixPro.

preprint · 2021 · arXiv

Surveys of Clumps, Cores, and Condensations in the Cygnus X: II. Radio Properties of the Massive Dense Cores

We have carried out a high-sensitivity and high-resolution radio continuum study towards a sample of 47 massive dense cores (MDCs) in the Cygnus X star-forming complex using the Karl G. Jansky Very Large Array, aiming to detect and characterize the radio emission associated with star-forming activities down to ~0.01 pc scales. We have detected 64 radio sources within or closely around the full width at half-maximum (FWHM) of the MDCs, of which 37 are reported for the first time. The majority of the detected radio sources are associated with dust condensations embedded within the MDCs, and they are mostly weak and compact. We are able to build spectral energy distributions for 8 sources. Two of them indicate non-thermal emission and the other six indicate thermal free-free emission. We have determined that most of the radio sources are ionized jets or winds originating from massive young stellar objects, whereas only a few sources are likely to be ultra-compact HII regions. Further quantitative analyses indicate that the radio luminosity of the detected radio sources increases along the evolution path of the MDCs.

preprint · 2021 · arXiv

The DR21(OH) Trident -- Resolving the Massive Ridge into Three Entangled Fibers As the Initial Condition of Cluster Formation

The DR21(OH) ridge, the central part of a high-mass star and cluster forming hub-filament system, is resolved spatially and kinematically into three nearly parallel fibers (f1, f2, and f3) with a roughly north-south orientation, using the observations of molecular transitions of H$^{13}$CO$^+$ (1-0), N$_2$H$^+$ (1-0), and NH$_2$D (1$_{1,1}$-1$_{0,1}$) with the Combined Array for Research in Millimeter Astronomy. These fibers are all mildly supersonic ($\sigma_{\rm V}$ about 2 times the sound speed), having lengths around 2 pc and widths about 0.1 pc, and they entangle and conjoin in the south where the most active high-mass star formation takes place. They all have line masses 1 - 2 orders of magnitude higher than their low-mass counterparts and are gravitationally unstable both radially and axially. However, only f1 exhibits high-mass star formation all the way along the fiber, whereas f2 and f3 show no signs of significant star formation in their northern parts. A large velocity gradient increasing from north to south is seen in f3, and can be well reproduced with a model of free-fall motion toward the most massive and active dense core in the region, which corroborates the global collapse of the ridge and suggests that the disruptive effects of the tidal forces may explain the inefficiency of star formation in f2 and f3. On larger scales, some of the lower-density, peripheral filaments are likely to be the outer extensions of the fibers, and provide hints on the origin of the ridge.

preprint · 2020 · arXiv

A Closer Look at Local Aggregation Operators in Point Cloud Analysis

Recent advances in network architectures for point cloud processing are mainly driven by new designs of local aggregation operators. However, the impact of these operators on network performance has not been carefully investigated, due to the different overall network architectures and implementation details in each solution. Meanwhile, most operators are only applied in shallow architectures. In this paper, we revisit the representative local aggregation operators and study their performance using the same deep residual architecture. Our investigation reveals that despite the different designs of these operators, all of them make surprisingly similar contributions to the network performance under the same network input and feature numbers, and result in state-of-the-art accuracy on standard benchmarks. This finding stimulates us to rethink the necessity of sophisticated design of local aggregation operators for point cloud processing. To this end, we propose a simple local aggregation operator without learnable weights, named Position Pooling (PosPool), which performs similarly to or slightly better than existing sophisticated operators. In particular, a simple deep residual network with PosPool layers achieves outstanding performance on all benchmarks, and outperforms the previous state-of-the-art methods on the challenging PartNet datasets by a large margin (7.4 mIoU). The code is publicly available at https://github.com/zeliu98/CloserLook3D

preprint · 2020 · arXiv

Complete Strain Mapping of Nanosheets of Tantalum Disulfide

Quasi-two-dimensional (quasi-2D) materials hold promise for future electronics because of their unique band structures that result in electronic and mechanical properties sensitive to crystal strains in all three dimensions. Quantifying crystal strain is a prerequisite to correlating it with the performance of the device, and calls for high resolution but spatially resolved rapid characterization methods. Here we show that using fly-scan nano X-ray diffraction we can accomplish a tensile strain sensitivity below 0.001% with a spatial resolution of better than 80 nm over a spatial extent of 100 $\mu$m on quasi-2D flakes of 1T-TaS2. Coherent diffraction patterns were collected from a $\sim$100 nm thick sheet of 1T-TaS2 by scanning a 12 keV focused X-ray beam across the sample and rotating it. We demonstrate that the strain distribution around micron and sub-micron sized 'bubbles' that are present in the sample may be reconstructed from these images. The experiments use state-of-the-art synchrotron instrumentation, and will allow rapid and non-intrusive strain mapping of thin film samples and electronic devices based on quasi-2D materials.

preprint · 2020 · arXiv

Disentangled Non-Local Neural Networks

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, such as semantic segmentation on Cityscapes, ADE20K and PASCAL Context, object detection on COCO, and action recognition on Kinetics.

preprint · 2020 · arXiv

DR 21 South Filament: a Parsec-sized Dense Gas Accretion Flow onto the DR 21 Massive Young Cluster

The DR21 south filament (DR21SF) is a unique component of the giant network of filamentary molecular clouds in the northern region of the Cygnus X complex. Unlike the highly fragmented and actively star-forming environment in which it resides, DR21SF exhibits a coherent profile in the column density map with very few star formation signposts, even though the previously reported linear density of the filament is an order of magnitude higher than the thermal stable threshold. We derive the size (3.6 pc by 0.13 pc), temperature (10 to 15 K), and mass (1048 $M_\odot$) of DR21SF from Shanghai 65 m TianMa Radio Telescope (TMRT) observations of NH$_3$ (1, 1) and (2, 2) inversion lines in conjunction with the column density map from our previous work. Star-forming sites are identified along the filament where the gas temperature shows an excess. We find clear gradients in radial velocity and intrinsic line-width along the spine of the filament. The gradients can be well interpreted with a scenario of an accretion flow feeding DR 21 at a mass transfer rate of $1.1 \times 10^{-3}$ $M_\odot$ yr$^{-1}$. Based on the analysis of its kinematic temperature, intrinsic line-width and mass distribution, we conclude that DR21SF is in an overall trans-critical status, which indicates an early evolutionary stage.

preprint · 2020 · arXiv

Memory Enhanced Global-Local Aggregation for Video Object Detection

How do humans recognize an object in a piece of video? Due to the deteriorated quality of a single frame, it may be hard for people to identify an occluded object in this frame by just utilizing information within one image. We argue that there are two important cues for humans to recognize objects in videos: the global semantic information and the local localization information. Recently, many methods have adopted self-attention mechanisms to enhance the features in the key frame with either global semantic information or local localization information. In this paper we introduce the memory enhanced global-local aggregation (MEGA) network, which is among the first attempts to take full consideration of both global and local information. Furthermore, empowered by a novel and carefully-designed Long Range Memory (LRM) module, our proposed MEGA enables the key frame to access much more content than any previous method. Enhanced by these two sources of information, our method achieves state-of-the-art performance on the ImageNet VID dataset. Code is available at https://github.com/Scalsol/mega.pytorch.

preprint · 2020 · arXiv

Negative Margin Matters: Understanding Margin in Few-shot Classification

This paper introduces a negative margin loss to metric learning based few-shot learning methods. The negative margin loss significantly outperforms regular softmax loss, and achieves state-of-the-art accuracy on three standard few-shot classification benchmarks with few bells and whistles. These results are contrary to the common practice in the metric learning field, that the margin is zero or positive. To understand why the negative margin loss performs well for few-shot classification, we analyze the discriminability of learned features w.r.t. different margins for training and novel classes, both empirically and theoretically. We find that although negative margin reduces the feature discriminability for training classes, it may also avoid falsely mapping samples of the same novel class to multiple peaks or clusters, and thus benefit the discrimination of novel classes. Code is available at https://github.com/bl0/negative-margin.few-shot.
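A margin-based softmax loss of the kind discussed in this abstract can be sketched as follows. This is an illustrative formulation only, assuming cosine similarity between features and class weights; the function name, the default `margin=-0.3` and the default `scale=10.0` are hypothetical values, not settings taken from the paper.

```python
import numpy as np

def margin_softmax_loss(features, weights, labels, margin=-0.3, scale=10.0):
    """Cross-entropy over scaled cosine logits with a margin on the true class.

    features: (n, d), weights: (classes, d), labels: (n,) integer class ids.
    The true-class logit becomes scale * (cos - margin), so a negative
    margin raises it instead of lowering it.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = f @ w.T
    rows = np.arange(len(labels))
    logits[rows, labels] -= margin
    logits *= scale
    # Numerically stable log-softmax.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

With a positive margin the true class has to win by an extra gap, which tightens training-class clusters. A negative margin relaxes that requirement, consistent with the abstract's observation that discriminability on training classes drops while novel classes may benefit.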

preprint · 2020 · arXiv

NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results

This paper reviews the NTIRE 2020 challenge on real image denoising with focus on the newly introduced dataset, the proposed methods and their results. The challenge is a new version of the previous NTIRE 2019 challenge on real image denoising that was based on the SIDD benchmark. This challenge is based on newly collected validation and testing image datasets, and is hence named SIDD+. This challenge has two tracks for quantitatively evaluating image denoising performance in (1) the Bayer-pattern rawRGB and (2) the standard RGB (sRGB) color spaces. Each track had ~250 registered participants. A total of 22 teams, proposing 24 methods, competed in the final phase of the challenge. The proposed methods by the participating teams represent the current state-of-the-art performance in image denoising targeting real noisy images. The newly collected SIDD+ datasets are publicly available at: https://bit.ly/siddplus_data.

preprint · 2020 · arXiv

Parametric Instance Classification for Unsupervised Visual Feature Learning

This paper presents parametric instance classification (PIC) for unsupervised visual feature learning. Unlike the state-of-the-art approaches which do instance discrimination in a dual-branch non-parametric fashion, PIC directly performs a one-branch parametric instance classification, revealing a simple framework similar to supervised classification and without the need to address the information leakage issue. We show that the simple PIC framework can be as effective as the state-of-the-art approaches, i.e. SimCLR and MoCo v2, by adapting several common component settings used in the state-of-the-art approaches. We also propose two novel techniques to further improve effectiveness and practicality of PIC: 1) a sliding-window data scheduler, instead of the previous epoch-based data scheduler, which addresses the extremely infrequent instance visiting issue in PIC and improves the effectiveness; 2) a negative sampling and weight update correction approach to reduce the training time and GPU memory consumption, which also enables application of PIC to almost unlimited training images. We hope that the PIC framework can serve as a simple baseline to facilitate future study.

preprint · 2020 · arXiv

Quantifying and Leveraging Predictive Uncertainty for Medical Image Assessment

The interpretation of medical images is a challenging task, often complicated by the presence of artifacts, occlusions, limited contrast and more. Most notable is the case of chest radiography, where there is a high inter-rater variability in the detection and classification of abnormalities. This is largely due to inconclusive evidence in the data or subjective definitions of disease appearance. An additional example is the classification of anatomical views based on 2D Ultrasound images. Often, the anatomical context captured in a frame is not sufficient to recognize the underlying anatomy. Current machine learning solutions for these problems are typically limited to providing probabilistic predictions, relying on the capacity of underlying models to adapt to limited information and the high degree of label noise. In practice, however, this leads to overconfident systems with poor generalization on unseen data. To account for this, we propose a system that learns not only the probabilistic estimate for classification, but also an explicit uncertainty measure which captures the confidence of the system in the predicted output. We argue that this approach is essential to account for the inherent ambiguity characteristic of medical images from different radiologic exams including computed radiography, ultrasonography and magnetic resonance imaging. In our experiments we demonstrate that sample rejection based on the predicted uncertainty can significantly improve the ROC-AUC for various tasks, e.g., by 8% to 0.91 with an expected rejection rate of under 25% for the classification of different abnormalities in chest radiographs. In addition, we show that using uncertainty-driven bootstrapping to filter the training data, one can achieve a significant increase in robustness and accuracy.
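The sample-rejection experiment described in this abstract can be illustrated with a small NumPy sketch: drop the cases with the highest predicted uncertainty, then measure ROC-AUC on what remains. The function names and the 25% rejection rate used as a default are assumptions for illustration, and the uncertainty values are taken as given rather than learned.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def reject_most_uncertain(scores, labels, uncertainty, reject_rate=0.25):
    """Keep the (1 - reject_rate) fraction of samples with lowest uncertainty."""
    n_keep = len(scores) - int(reject_rate * len(scores))
    keep = np.argsort(uncertainty)[:n_keep]
    return scores[keep], labels[keep]
```

If the uncertainty estimate is informative, the rejected cases are disproportionately the misranked ones, so the AUC on the retained set rises. Rejected cases would then need another route, such as review by a human reader.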

preprint · 2020 · arXiv

RepPoints V2: Verification Meets Regression for Object Detection

Verification and regression are two general methodologies for prediction in neural networks. Each has its own strengths: verification can be easier to infer accurately, and regression is more efficient and applicable to continuous target variables. Hence, it is often beneficial to carefully combine them to take advantage of their benefits. In this paper, we apply this philosophy to improve state-of-the-art object detection, specifically RepPoints. Though RepPoints provides high performance, we find that its heavy reliance on regression for object localization leaves room for improvement. We introduce verification tasks into the localization prediction of RepPoints, producing RepPoints v2, which provides consistent improvements of about 2.0 mAP over the original RepPoints on the COCO object detection benchmark using different backbones and training methods. RepPoints v2 also achieves 52.1 mAP on COCO test-dev with a single model. Moreover, we show that the proposed approach can more generally elevate other object detection frameworks as well as applications such as instance segmentation. The code is available at https://github.com/Scalsol/RepPointsV2.

preprint · 2020 · arXiv

Unpaired Learning of Deep Image Denoising

We investigate the task of learning blind image denoising networks from an unpaired set of clean and noisy images. Such a problem setting is generally practical and valuable, considering that it is feasible to collect unpaired noisy and clean images in most real-world applications. We further assume that the noise can be signal dependent but is spatially uncorrelated. In order to facilitate unpaired learning of the denoising network, this paper presents a two-stage scheme incorporating self-supervised learning and knowledge distillation. For self-supervised learning, we suggest a dilated blind-spot network (D-BSN) to learn denoising solely from real noisy images. Due to the spatial independence of noise, we adopt a network stacking 1x1 convolution layers to estimate the noise level map for each image. Both the D-BSN and the image-specific noise model (CNN_est) can be jointly trained via maximizing the constrained log-likelihood. Given the output of D-BSN and the estimated noise level map, improved denoising performance can be further obtained based on Bayes' rule. As for knowledge distillation, we first apply the learned noise models to clean images to synthesize a paired set of training images, and use the real noisy images and the corresponding denoising results in the first stage to form another paired set. Then, the ultimate denoising model can be distilled by training an existing denoising network using these two paired sets. Experiments show that our unpaired learning method performs favorably on both synthetic noisy images and real-world noisy photographs in terms of quantitative and qualitative evaluation.

preprint · 2020 · arXiv

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at https://github.com/jackroos/VL-BERT.