Researcher profile

Xi Yin

Xi Yin contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
10 works
0 followers
12 topics
4 close collaborators

Published work

10 published items

preprint · 2022 · arXiv

Bootstrapping the Ising Model on the Lattice

We study the statistical Ising model of spins on the infinite lattice using a bootstrap method that combines spin-flip identities with positivity conditions, including reflection positivity and Griffiths inequalities, to derive rigorous two-sided bounds on spin correlators through semi-definite programming. For the 2D Ising model on the square lattice, the bootstrap bounds based on correlators supported in a 13-site diamond-shaped region determine the nearest-spin correlator to within a small window, which for a wide range of coupling and magnetic field is narrower than the precision attainable with Monte Carlo methods. We also report preliminary results of the bootstrap bounds for the 3D Ising model on the cubic lattice.

preprint · 2022 · arXiv

Exact quantization and analytic continuation

In this paper we give a streamlined derivation of the exact quantization condition (EQC) on the quantum periods of the Schrödinger problem in one dimension with a general polynomial potential, based on Wronskian relations. We further generalize the EQC to potentials with a regular singularity, describing spherically symmetric quantum mechanical systems in a given angular momentum sector. We show that the thermodynamic Bethe ansatz (TBA) equations that govern the quantum periods undergo nontrivial monodromies as the angular momentum is analytically continued between integer values in the complex plane. The TBA equations together with the EQC are checked numerically against Hamiltonian truncation at real angular momenta and couplings, and are used to explore the analytic continuation of the spectrum on the complex angular momentum plane in examples.

preprint · 2022 · arXiv

MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration

Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun [11]. We made substantial modifications to make the game richer by introducing audio and enabling new interactions. We trained RL agents with different objectives to navigate the game and interact with 13 objects and characters. This allows us to automatically extract a large collection of diverse videos and associated audio. We sample 375K video clips (3.2s each) and collect text descriptions from human annotators. Each video has additional annotations that are extracted automatically from the game engine, such as accurate semantic maps for each frame and templated textual descriptions. Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation. We benchmark representative approaches on tasks involving video-audio-text retrieval and generation. Our dataset and code are released at: https://mugen-org.github.io/.

preprint · 2022 · arXiv

On The S-Matrix of Ising Field Theory in Two Dimensions

We explore the analytic structure of the non-perturbative S-matrix in arguably the simplest family of massive non-integrable quantum field theories: the Ising field theory (IFT) in two dimensions, which may be viewed as the Ising CFT deformed by its two relevant operators, or equivalently, the scaling limit of the Ising model in a magnetic field. Our strategy is that of collider physics: we employ the Hamiltonian truncation method (TFFSA) to extract the scattering phase of the lightest particles in the elastic regime, and combine it with S-matrix bootstrap methods based on unitarity and analyticity assumptions to determine the analytic continuation of the 2 to 2 S-matrix element to the complex s-plane. Focusing primarily on the "high temperature" regime in which the IFT interpolates between a weakly coupled massive fermion theory and the E8 affine Toda theory, we numerically determine 3-particle amplitudes, follow the evolution of poles and certain resonances of the S-matrix, and exclude the possibility of unknown wide resonances up to reasonably high energies.

preprint · 2022 · arXiv

Proactive Image Manipulation Detection

Image manipulation detection algorithms are often trained to discriminate between images manipulated with particular Generative Models (GMs) and genuine/real images, yet generalize poorly to images manipulated with GMs unseen in training. Conventional detection algorithms receive an input image passively. By contrast, we propose a proactive scheme for image manipulation detection. Our key enabling technique is to estimate a set of templates which, when added onto the real image, lead to more accurate manipulation detection. That is, a template-protected real image and its manipulated version are better discriminated than the original real image and its manipulated version. These templates are estimated using certain constraints based on the desired properties of templates. For image manipulation detection, our proposed approach outperforms the prior work by an average precision of 16% for CycleGAN and 32% for GauGAN. Our approach is generalizable to a variety of GMs showing an improvement over prior work by an average precision of 10% averaged across 12 GMs. Our code is available at https://www.github.com/vishal3477/proactive_IMD.

preprint · 2021 · arXiv

VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
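
The idea of a matching loss over unordered tags can be illustrated with a minimal sketch. This is not VIVO's implementation: the function name `matching_loss`, the dictionary-based probabilities, and the brute-force search over permutations (standing in for the Hungarian algorithm, which finds the same optimum efficiently) are all assumptions made for illustration.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple


def matching_loss(
    pred_probs: Sequence[Dict[str, float]],
    target_tags: Sequence[str],
) -> Tuple[float, Tuple[str, ...]]:
    """Return the lowest total negative log-likelihood over all ways of
    assigning the unordered target tags to the masked positions.

    pred_probs[k] is a hypothetical predicted distribution over tags for
    masked position k. Brute force is used here for clarity only; it is
    exponential in the number of tags.
    """
    best_loss = math.inf
    best_order: Tuple[str, ...] = ()
    for order in itertools.permutations(target_tags):
        # Sum of per-position cross-entropy terms for this assignment.
        loss = -sum(math.log(p[tag]) for p, tag in zip(pred_probs, order))
        if loss < best_loss:
            best_loss, best_order = loss, order
    return best_loss, best_order


# Two masked positions; the ground-truth tag set has no inherent order.
preds = [{"dog": 0.9, "ball": 0.1}, {"dog": 0.2, "ball": 0.8}]
loss, order = matching_loss(preds, ["ball", "dog"])
```

Because the loss is taken at the best assignment, the model is not penalized for predicting the correct tags in a different order from the one in which they happen to be listed.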

preprint · 2020 · arXiv

Hashing-based Non-Maximum Suppression for Crowded Object Detection

In this paper, we propose an algorithm, named hashing-based non-maximum suppression (HNMS), to efficiently suppress the non-maximum boxes for object detection. Non-maximum suppression (NMS) is an essential component to suppress the boxes at closely located locations with similar shapes. The time cost tends to be huge when the number of boxes becomes large, especially for crowded scenes. The basic idea of HNMS is to first map each box to a discrete code (hash cell) and then remove the boxes with lower confidences if they are in the same cell. Considering the intersection-over-union (IoU) as the metric, we propose a simple yet effective hashing algorithm, named IoUHash, which guarantees that the boxes within the same cell are close enough by a lower IoU bound. For two-stage detectors, we replace NMS in the region proposal network with HNMS, and observe significant speed-up with comparable accuracy. For one-stage detectors, HNMS is used as a pre-filter to speed up the suppression by a large margin. Extensive experiments are conducted on the CARPK, SKU-110K, and CrowdHuman datasets to demonstrate the efficiency and effectiveness of HNMS. Code is released at https://github.com/microsoft/hnms.git.
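
The "hash each box to a cell, keep the most confident box per cell" idea can be sketched as follows. This is a simplified illustration, not the paper's IoUHash: the log-scale size quantization, the parameters `alpha` and `beta`, and the function names are assumptions, and no IoU lower bound is claimed for this particular cell layout.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)
Cell = Tuple[int, int, int, int]


def hash_cell(box: Box, alpha: float = 0.7, beta: float = 0.5) -> Cell:
    """Map a box to a discrete cell (hypothetical scheme).

    Width and height are quantized on a log scale with ratio 1/alpha, and
    the center is quantized on a grid whose spacing is beta times the
    quantized size. Requires width, height > 0 and 0 < alpha < 1.
    """
    cx, cy, w, h = box
    step = -math.log(alpha)
    i = math.floor(math.log(w) / step)
    j = math.floor(math.log(h) / step)
    cell_w = math.exp(i * step) * beta
    cell_h = math.exp(j * step) * beta
    return (i, j, math.floor(cx / cell_w), math.floor(cy / cell_h))


def hash_suppress(
    boxes: Sequence[Box],
    scores: Sequence[float],
    alpha: float = 0.7,
    beta: float = 0.5,
) -> List[int]:
    """Return indices of boxes kept: the highest-scoring box in each cell."""
    best: Dict[Cell, int] = {}
    for idx, box in enumerate(boxes):
        cell = hash_cell(box, alpha, beta)
        if cell not in best or scores[idx] > scores[best[cell]]:
            best[cell] = idx
    return sorted(best.values())


boxes = [(50.0, 50.0, 20.0, 20.0), (50.5, 50.5, 20.5, 20.5), (200.0, 200.0, 20.0, 20.0)]
scores = [0.9, 0.8, 0.7]
kept = hash_suppress(boxes, scores)
```

Each box is visited once, so the cost grows linearly with the number of boxes rather than with the number of box pairs. In this simplified sketch, near-duplicate boxes that straddle a cell boundary land in different cells and are not merged.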

preprint · 2020 · arXiv

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-art results on six well-established vision-language understanding and generation tasks.