Source author record

Limin Wang

Limin Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, cond-mat.mtrl-sci, cond-mat.str-el, physics.flu-dyn, cond-mat.supr-con, physics.comp-ph, astro-ph, Computation and Language, cond-mat.mes-hall, cond-mat.soft, Graphics, physics.chem-ph

Catalog footprint

What is connected

50 works

12 topics

4 close collaborators


preprint · 2024 · arXiv

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
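
The abstract says only that ViCLIP is learned "via contrastive learning". As a rough illustration of what a video-text contrastive objective can look like, here is a minimal sketch of a symmetric InfoNCE-style loss in plain Python. The function names, the temperature value of 0.07, and the symmetric averaging are assumptions made for illustration; they are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss: pair i of videos matches pair i of texts."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Similarity matrix: rows are videos, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: the correct text for video i is column i.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: transpose and repeat.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)
```

With this sketch, a batch whose video and text embeddings are correctly paired should give a much lower loss than the same batch with the texts shuffled.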

preprint · 2024 · arXiv

Recovering 3D Human Mesh from Monocular Images: A Survey

Estimating human pose and shape from monocular images is a long-standing problem in computer vision. Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention. With the same goal of obtaining well-aligned and physically plausible mesh results, two paradigms have been developed to overcome challenges in the 2D-to-3D lifting process: i) an optimization-based paradigm, where different data terms and regularization terms are exploited as optimization objectives; and ii) a regression-based paradigm, where deep learning techniques are embraced to solve the problem in an end-to-end fashion. Meanwhile, continuous efforts are devoted to improving the quality of 3D mesh labels for a wide range of datasets. Though remarkable progress has been achieved in the past decade, the task is still challenging due to flexible body motions, diverse appearances, complex environments, and insufficient in-the-wild annotations. To the best of our knowledge, this is the first survey that focuses on the task of monocular 3D human mesh recovery. We start with the introduction of body models and then elaborate recovery frameworks and training objectives by providing in-depth analyses of their strengths and weaknesses. We also summarize datasets, evaluation metrics, and benchmark results. Open issues and future directions are discussed in the end, hoping to motivate researchers and facilitate their research in this area. A regularly updated project page can be found at https://github.com/tinatiansjz/hmr-survey.

preprint · 2024 · arXiv

VideoChat: Chat-Centric Video Understanding

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything

preprint · 2022 · arXiv

AdaMixer: A Fast-Converging Query-Based Object Detector

Traditional object detectors employ the dense paradigm of scanning over locations and scales in an image. The recent query-based object detectors break this convention by decoding image features with a set of learnable queries. However, this paradigm still suffers from slow convergence, limited performance, and design complexity of extra networks between backbone and decoder. In this paper, we find that the key to these issues is the adaptability of decoders for casting queries to varying objects. Accordingly, we propose a fast-converging query-based detector, named AdaMixer, by improving the adaptability of query-based decoding processes in two aspects. First, each query adaptively samples features over space and scales based on estimated offsets, which allows AdaMixer to efficiently attend to the coherent regions of objects. Then, we dynamically decode these sampled features with an adaptive MLP-Mixer under the guidance of each query. Thanks to these two critical designs, AdaMixer enjoys architectural simplicity without requiring dense attentional encoders or explicit pyramid networks. On the challenging MS COCO benchmark, AdaMixer with ResNet-50 as the backbone, with 12 training epochs, reaches up to 45.0 AP on the validation set along with 27.9 APs in detecting small objects. With the longer training scheme, AdaMixer with ResNeXt-101-DCN and Swin-S reaches 49.5 and 51.3 AP. Our work sheds light on a simple, accurate, and fast converging architecture for query-based object detectors. The code is made available at https://github.com/MCG-NJU/AdaMixer

preprint · 2022 · arXiv

APP-Net: Auxiliary-point-based Push and Pull Operations for Efficient Point Cloud Classification

Aggregating neighbor features is essential for point cloud classification. In existing work, each point in the cloud may inevitably be selected as a neighbor of multiple aggregation centers, as all centers gather neighbor features from the whole point cloud independently. Thus each point has to participate in the calculation repeatedly and generates redundant duplicates in memory, leading to intensive computation costs and memory consumption. Meanwhile, to pursue higher accuracy, previous methods often rely on a complex local aggregator to extract fine geometric representation, which further slows down the classification pipeline. To address these issues, we propose a new local aggregator of linear complexity for point cloud classification, coined as APP. Specifically, we introduce an auxiliary container as an anchor to exchange features between the source point and the aggregating center. Each source point pushes its feature to only one auxiliary container, and each center point pulls features from only one auxiliary container. This avoids the re-computation issue of each source point. To facilitate the learning of the local structure of the point cloud, we use an online normal estimation module to provide explainable geometric information that enhances the modeling capability of APP. The resulting network is more efficient than all previous baselines by a clear margin while consuming less memory. Experiments on both synthetic and real datasets demonstrate that APP-Net reaches comparable accuracies to other networks. It can process more than 10,000 samples per second with less than 10GB of memory on a single GPU. We will release the code at https://github.com/MCG-NJU/APP-Net.
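
The push and pull idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how points are assigned to auxiliary containers or how pushed features are combined, so the grid-cell assignment (`cell_of`), the mean aggregation, and the zero fallback for empty containers are all hypothetical choices. What the sketch does preserve is the property from the abstract that each source point is touched once and each center reads one container, giving cost linear in the number of points plus centers.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Hypothetical container assignment: the grid cell that contains the point."""
    return tuple(int(c // cell_size) for c in point)


def push_pull(points, features, centers, cell_size=1.0):
    """Push every source feature into one container, then let each center pull from one."""
    sums = {}
    counts = defaultdict(int)
    # Push: each source point contributes to exactly one auxiliary container.
    for p, f in zip(points, features):
        key = cell_of(p, cell_size)
        if key not in sums:
            sums[key] = [0.0] * len(f)
        sums[key] = [s + x for s, x in zip(sums[key], f)]
        counts[key] += 1
    dim = len(features[0])
    pulled = []
    # Pull: each center reads the mean feature of exactly one container.
    for c in centers:
        key = cell_of(c, cell_size)
        if key in sums:
            pulled.append([s / counts[key] for s in sums[key]])
        else:
            pulled.append([0.0] * dim)  # assumption: empty containers yield zeros
    return pulled
```

For example, two points falling in the same cell with features `[1.0]` and `[3.0]` give any center in that cell the pulled feature `[2.0]`, without either point being revisited per center.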

preprint · 2022 · arXiv

Cross-Architecture Self-supervised Video Representation Learning

In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL consists of a 3D CNN and a video transformer which are used in parallel to generate diverse positive pairs for contrastive learning. This allows the model to learn strong representations from such diverse yet meaningful pairs. Furthermore, we introduce a temporal self-supervised learning module able to explicitly predict an edit distance between two video sequences in temporal order. This enables the model to learn a rich temporal representation that strongly complements the video-level representation learned by the CACL. We evaluate our method on the tasks of video retrieval and action recognition on the UCF101 and HMDB51 datasets, where our method achieves excellent performance, surpassing state-of-the-art methods such as VideoMoCo and MoCo+BE by a large margin. The code is made available at https://github.com/guoshengcv/CACL.
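
The abstract does not define which edit distance the temporal module predicts. As one plausible reading, the sketch below computes the standard Levenshtein distance between two sequences of frame indices, which could serve as a supervision target when one sequence is a shuffled version of the other. The function name and the choice of Levenshtein distance are assumptions, not details confirmed by the source.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    # prev[j] holds the distance between the processed prefix of seq_a and seq_b[:j].
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        curr = [i]
        for j, b in enumerate(seq_b, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Under this reading, a clip with frames in the original order `[0, 1, 2, 3]` has distance 0 to itself, and swapping the two middle frames gives a distance of 2.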

preprint · 2022 · arXiv

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries. This task is challenging because only overall labels indicating the video events are provided for training. However, an event might be labeled but not appear in one of the modalities, which results in a modality-specific noisy label problem. In this work, we propose a training strategy to identify and remove modality-specific noisy labels dynamically. It is motivated by two key observations: 1) networks tend to learn clean samples first; and 2) a labeled event would appear in at least one modality. Specifically, we sort the losses of all instances within a mini-batch individually in each modality, and then select noisy samples according to the relationships between intra-modal and inter-modal losses. We also propose a simple but valid noise ratio estimation method by calculating the proportion of instances whose confidence is below a preset threshold. Our method makes large improvements over the previous state of the art (e.g., from 60.0% to 63.8% in segment-level visual metric), which demonstrates the effectiveness of our approach. Code and trained models are publicly available at https://github.com/MCG-NJU/JoMoLD.
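
The noise ratio estimate described above is simple enough to sketch directly: it is the fraction of instances whose confidence falls below a preset threshold. The second helper is a deliberate simplification. The abstract says noisy samples are selected from the relationship between intra-modal and inter-modal losses but does not give the rule, so `flag_highest_loss` merely flags the highest-loss instances in one modality up to the estimated ratio. Both function names and the 0.5 default threshold are hypothetical.

```python
def estimate_noise_ratio(confidences, threshold=0.5):
    """Fraction of instances whose confidence is below the preset threshold."""
    if not confidences:
        return 0.0
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)


def flag_highest_loss(losses, noise_ratio):
    """Simplified stand-in: flag the top noise_ratio share of instances by loss."""
    k = int(round(noise_ratio * len(losses)))
    # Sort instance indices from highest to lowest loss within the mini-batch.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])
```

For a mini-batch with confidences `[0.9, 0.2, 0.8, 0.1]`, the estimated ratio is 0.5, so two of four instances would be flagged under this simplified rule.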

preprint · 2022 · arXiv

Logit Normalization for Long-tail Object Detection

Real-world data exhibiting skewed distributions pose a serious challenge to existing object detectors. Moreover, the samplers in detectors lead to shifted training label distributions, while the tremendous proportion of background to foreground samples severely harms foreground classification. To mitigate these issues, in this paper, we propose Logit Normalization (LogN), a simple technique to self-calibrate the classified logits of detectors in a similar way to batch normalization. In general, our LogN is training- and tuning-free (i.e., it requires no extra training or tuning process), model- and label distribution-agnostic (i.e., it generalizes to different kinds of detectors and datasets), and also plug-and-play (i.e., it can be applied directly without any bells and whistles). Extensive experiments on the LVIS dataset demonstrate superior performance of LogN to state-of-the-art methods with various detectors and backbones. We also provide in-depth studies on different aspects of our LogN. Further experiments on ImageNet-LT reveal its competitiveness and generalizability. Our LogN can serve as a strong baseline for long-tail object detection and is expected to inspire future research in this field. Code and trained models will be publicly available at https://github.com/MCG-NJU/LogN.
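
The abstract states only that LogN calibrates logits "in a similar way to batch normalization" and gives no formula. The sketch below shows one way to read that phrase: standardize each class's logit using the mean and variance of that class's logits over a batch of predictions, so that classes with systematically inflated scores lose their offset. This per-class standardization, the `eps` value, and the function name are assumptions for illustration and should not be taken as the paper's actual procedure.

```python
import math


def normalize_logits(logits, eps=1e-5):
    """Standardize each class column of a [num_predictions][num_classes] logit table."""
    n = len(logits)
    num_classes = len(logits[0])
    # Per-class statistics gathered from the model's own predictions.
    means = [sum(row[k] for row in logits) / n for k in range(num_classes)]
    variances = [
        sum((row[k] - means[k]) ** 2 for row in logits) / n
        for k in range(num_classes)
    ]
    return [
        [(row[k] - means[k]) / math.sqrt(variances[k] + eps) for k in range(num_classes)]
        for row in logits
    ]
```

In this reading no parameters are learned, which is at least consistent with the abstract's description of the method as training- and tuning-free.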

preprint · 2022 · arXiv

MixFormer: End-to-End Tracking with Iterative Mixed Attention

Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows target-specific discriminative features to be extracted and extensive communication to take place between the target and search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer sets a new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.

preprint · 2022 · arXiv

OCSampler: Compressing Videos to One Clip with Single-step Sampling

In this paper, we propose a framework named OCSampler to explore a compact yet effective video representation with one short clip for efficient video recognition. Recent works prefer to formulate frame sampling as a sequential decision task by selecting frames one by one according to their importance, while we present a new paradigm of learning instance-specific video condensation policies to select informative frames for representing the entire video in a single step. Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially. Accordingly, these policies are derived from a lightweight skim network together with a simple yet effective policy network within one step. Moreover, we extend the proposed method with a frame number budget, enabling the framework to produce correct predictions with high confidence using as few frames as possible. Experiments on four benchmarks, i.e., ActivityNet, Mini-Kinetics, FCVID, and Mini-Sports1M, demonstrate the effectiveness of our OCSampler over previous methods in terms of accuracy, theoretical computational expense, and actual inference speed. We also evaluate its generalization power across different classifiers, sampled frames, and search spaces. In particular, we achieve 76.9% mAP and 21.7 GFLOPs on ActivityNet with an impressive throughput: 123.9 Videos/s on a single TITAN Xp GPU.

preprint · 2022 · arXiv

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. The main challenge of this task is perceiving various temporal variations of diverse event boundaries. To this end, this paper presents an effective and end-to-end learnable framework (DDM-Net). To tackle the diversity and complicated semantics of event boundaries, we make three notable improvements. First, we construct a feature bank to store multi-level features of space and time, prepared for difference calculation at multiple scales. Second, to alleviate inadequate temporal modeling of previous methods, we present dense difference maps (DDM) to comprehensively characterize the motion pattern. Finally, we exploit progressive attention on multi-level DDM to jointly aggregate appearance and motion clues. As a result, DDM-Net achieves significant boosts of 14% and 8% on the Kinetics-GEBD and TAPOS benchmarks, respectively, and outperforms the top-1 winner solution of LOVEU Challenge@CVPR 2021 without bells and whistles. The state-of-the-art result demonstrates the effectiveness of richer motion representation and more sophisticated aggregation in handling the diversity of generic event boundary detection. The code is made available at https://github.com/MCG-NJU/DDM.

preprint · 2022 · arXiv

Structured Sparse R-CNN for Direct Scene Graph Generation

Scene graph generation (SGG) aims to detect object pairs with their relations in an image. Existing SGG approaches often use multi-stage pipelines to decompose this task into object detection, relation graph construction, and dense or dense-to-sparse relation prediction. Instead, from a perspective on SGG as a direct set prediction, this paper presents a simple, sparse, and unified framework, termed Structured Sparse R-CNN. The key to our method is a set of learnable triplet queries and a structured triplet detector which could be jointly optimized from the training set in an end-to-end manner. Specifically, the triplet queries encode the general prior for object pairs with their relations, and provide an initial guess of scene graphs for subsequent refinement. The triplet detector presents a cascaded architecture to progressively refine the detected scene graphs with the customized dynamic heads. In addition, to relieve the training difficulty of our method, we propose a relaxed and enhanced training strategy based on knowledge distillation from a Siamese Sparse R-CNN. We perform experiments on several datasets: Visual Genome and Open Images V4/V6, and the results demonstrate that our method achieves state-of-the-art performance. In addition, we also perform in-depth ablation studies to provide insights on our structured modeling in triplet detector design and training strategies. The code and models are made available at https://github.com/MCG-NJU/Structured-Sparse-RCNN.

preprint · 2022 · arXiv

Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Generic event boundary detection (GEBD) is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. In this paper, we present a local context modeling and global boundary decoding approach for the GEBD task. A local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence. Based on them, a global boundary decoding sub-network is exploited to decode event boundaries from a global view. Our proposed method achieves an 85.13% F1-score on the Kinetics-GEBD testing set, a boost of more than 22% in F1-score over the baseline method. The code is available at https://github.com/JackyTown/GEBD_Challenge_CVPR2022.

preprint · 2022 · arXiv

Task-specific Inconsistency Alignment for Domain Adaptive Object Detection

Detectors trained with massive labeled data often exhibit dramatic performance degradation in some particular scenarios with data distribution gap. To alleviate this problem of domain shift, conventional wisdom typically concentrates solely on reducing the discrepancy between the source and target domains via attached domain classifiers, yet ignoring the difficulty of such transferable features in coping with both classification and localization subtasks in object detection. To address this issue, in this paper, we propose Task-specific Inconsistency Alignment (TIA), by developing a new alignment mechanism in separate task spaces, improving the performance of the detector on both subtasks. Specifically, we add a set of auxiliary predictors for both classification and localization branches, and exploit their behavioral inconsistencies as finer-grained domain-specific measures. Then, we devise task-specific losses to align such cross-domain disagreement of both subtasks. By optimizing them individually, we are able to well approximate the category- and boundary-wise discrepancies in each task space, and therefore narrow them in a decoupled manner. TIA demonstrates superior results on various scenarios to the previous state-of-the-art methods. It is also observed that both the classification and localization capabilities of the detector are sufficiently strengthened, further demonstrating the effectiveness of our TIA method. Code and trained models are publicly available at https://github.com/MCG-NJU/TIA.

preprint · 2021 · arXiv

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Current video representations heavily rely on learning from manually annotated video datasets which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture this correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational issue imposed by the huge number of pair instance classes and design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization to fine-tune on downstream tasks, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51, compared with the existing state-of-the-art self-supervised training methods. In addition, our CPD model yields a new state of the art for zero-shot action recognition on UCF101 by directly utilizing the learnt visual-textual embeddings. The code will be made available at https://github.com/MCG-NJU/CPD-Video.

preprint · 2020 · arXiv

Actions as Moving Points

The existing action tubelet detectors often depend on heuristic anchor design and placement, which might be computationally expensive and sub-optimal for precise localization. In this paper, we present a conceptually simple, computationally efficient, and more precise action tubelet detection framework, termed as MovingCenter Detector (MOC-detector), by treating an action instance as a trajectory of moving points. Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches: (1) Center Branch for instance center detection and action recognition, (2) Movement Branch for movement estimation at adjacent frames to form trajectories of moving points, (3) Box Branch for spatial extent detection by directly regressing bounding box size at each estimated center. These three branches work together to generate the tubelet detection results, which could be further linked to yield video-level tubes with a matching strategy. Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets. The performance gap is more evident for higher video IoU, demonstrating that our MOC-detector is particularly effective for more precise action detection. We provide the code at https://github.com/MCG-NJU/MOC-Detector.
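
To make the "trajectory of moving points" idea concrete, the sketch below assembles a tubelet from the three kinds of outputs the abstract lists: a detected center, per-frame movement estimates, and per-frame box sizes. It assumes each movement is an (dx, dy) offset from the key-frame center to that frame's center, which is one possible reading and is not spelled out in the abstract. The function name and the (x1, y1, x2, y2) box format are also hypothetical.

```python
def build_tubelet(key_center, movements, box_sizes):
    """Turn one center, per-frame offsets, and per-frame sizes into a list of boxes."""
    tubelet = []
    for (dx, dy), (w, h) in zip(movements, box_sizes):
        # Move the key-frame center to this frame's estimated center.
        cx = key_center[0] + dx
        cy = key_center[1] + dy
        # Regressed width and height define the spatial extent around that center.
        tubelet.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return tubelet
```

For instance, a center at (10, 20) with offsets (-1, 0), (0, 0), (2, 1) and a constant 4 by 6 box yields three boxes whose centers trace the points (9, 20), (10, 20), and (12, 21).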

preprint · 2020 · arXiv

Context-Aware RCNN: A Baseline for Action Detection in Videos

Video action detection approaches usually conduct actor-centric action recognition over RoI-pooled features following the standard pipeline of Faster-RCNN. In this work, we first empirically find that the recognition accuracy is highly correlated with the bounding box size of an actor, and thus higher resolution of actors contributes to better performance. However, video models require dense sampling in time to achieve accurate recognition. To fit in GPU memory, the frames fed to the backbone network must be kept low-resolution, resulting in a coarse feature map in the RoI-Pooling layer. Thus, we revisit RCNN for actor-centric action recognition via cropping and resizing image patches around actors before feature extraction with the I3D deep network. Moreover, we find that expanding actor bounding boxes slightly and fusing the context features can further boost the performance. Consequently, we develop a surprisingly effective baseline (Context-Aware RCNN), and it achieves new state-of-the-art results on two challenging action detection benchmarks, AVA and JHMDB. Our observations challenge the conventional wisdom of the RoI-Pooling based pipeline and encourage researchers to rethink the importance of resolution in actor-centric action recognition. Our approach can serve as a strong baseline for video action detection and is expected to inspire new ideas for this field. The code is available at https://github.com/MCG-NJU/CRCNN-Action.

preprint · 2020 · arXiv

Crystalline symmetry-protected non-trivial topology in prototype compound BaAl$_4$

The BaAl$_4$ prototype crystal structure is the most populous of all structure types, and is the building block for a diverse set of sub-structures including the famous ThCr$_2$Si$_2$ family that hosts high-temperature superconductivity and numerous magnetic and strongly correlated electron systems. The MA$_4$ family of materials (M=Sr, Ba, Eu; A=Al, Ga, In) themselves present an intriguing set of ground states including charge and spin orders, but have largely been considered as uninteresting metals. Using electronic structure calculations, symmetry analysis and topological quantum chemistry techniques, we predict the exemplary compound BaAl$_4$ to harbor a three-dimensional Dirac spectrum with non-trivial topology and possible nodal lines crossing the Brillouin zone, wherein one pair of semi-Dirac points with linear dispersion along the $k_z$ direction and quadratic dispersion along the $k_x/k_y$ direction resides on the rotational axis with $C_{4v}$ point group symmetry. Electrical transport measurements reveal the presence of an extremely large, unsaturating positive magnetoresistance in BaAl$_4$ despite an uncompensated band structure, and quantum oscillations and angle-resolved photoemission spectroscopy measurements confirm the predicted multiband semimetal structure with pockets of Dirac holes and a Van Hove singularity (VHS) remarkably consistent with the theoretical prediction. We thus present BaAl$_4$ as a new topological semimetal, casting its prototype status into a new role as building block for a vast array of new topological materials.

preprint · 2020 · arXiv

Dynamic Sampling Networks for Efficient Action Recognition in Videos

The existing action recognition methods are mainly based on clip-level classifiers such as two-stream CNNs or 3D CNNs, which are trained from randomly selected clips and applied to densely sampled clips during testing. However, this standard setting might be suboptimal for training classifiers and also requires huge computational overhead when deployed in practice. To address these issues, we propose a new framework for action recognition in videos, called Dynamic Sampling Networks (DSN), by designing a dynamic sampling module to improve the discriminative power of learned clip-level classifiers and also increase the inference efficiency during testing. Specifically, DSN is composed of a sampling module and a classification module, whose objectives are, respectively, to learn a sampling policy that selects on the fly which clips to keep and to train a clip-level classifier that performs action recognition based on these selected clips. In particular, given an input video, we train an observation network in an associative reinforcement learning setting to maximize the rewards of the selected clips with a correct prediction. We perform extensive experiments to study different aspects of the DSN framework on four action recognition datasets: UCF101, HMDB51, THUMOS14, and ActivityNet v1.3. The experimental results demonstrate that DSN is able to greatly improve the inference efficiency by using less than half of the clips, while still obtaining slightly better or comparable recognition accuracy relative to the state-of-the-art approaches.

preprint · 2020 · arXiv

Finding Action Tubes with a Sparse-to-Dense Framework

The task of spatial-temporal action detection has attracted increasing attention among researchers. Existing dominant methods solve this problem by relying on short-term information and dense serial-wise detection on individual frames or clips. Despite their effectiveness, these methods make inadequate use of long-term information and are prone to inefficiency. In this paper, we propose, for the first time, an efficient framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner. There are two key characteristics in this framework: (1) both long-term and short-term sampled information are explicitly utilized in our spatiotemporal network, and (2) a new dynamic feature sampling module (DTS) is designed to effectively approximate the tube output while keeping the system tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets, achieving promising results that are competitive with state-of-the-art methods. The proposed sparse-to-dense strategy rendered our framework about 7.6 times more efficient than the nearest competitor.

preprint · 2020 · arXiv

Knowledge Integration Networks for Action Recognition

In this work, we propose Knowledge Integration Networks (referred to as KINet) for video action recognition. KINet is capable of aggregating meaningful context features which are of great importance to identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition, and two auxiliary branches for human parsing and scene recognition which allow the model to encode the knowledge of human and scene for action recognition. We explore two pre-trained models as teacher networks to distill the knowledge of human and scene for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism which contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework where the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves state-of-the-art performance on the large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%. We further demonstrate that our KINet has strong capability by transferring the Kinetics-trained model to UCF-101, where it obtains 97.8% top-1 accuracy.

preprint · 2020 · arXiv

SketchyCOCO: Image Generation from Freehand Scene Sketches

We introduce the first method for automatic image generation from scene-level freehand sketches. Our model allows for controllable image generation by specifying the synthesis goal via freehand sketches. The key contribution is an attribute vector bridged Generative Adversarial Network called EdgeGAN, which supports high visual-quality object-level image content generation without using freehand sketches as training data. We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution. We validate our approach on the tasks of both object-level and scene-level image generation on SketchyCOCO. Through quantitative, qualitative results, human evaluation and ablation studies, we demonstrate the method's capacity to generate realistic complex scene-level images from various freehand sketches.

preprint2020arXiv

TEA: Temporal Excitation and Aggregation for Action Recognition

Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
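
The motion excitation step described above can be illustrated with a minimal sketch. It assumes the features are already spatially pooled to one value per channel per frame, omits the learned convolutions of the actual module, and uses a gate of 2*sigmoid(diff) - 1 so that static channels pass through unchanged; the function name `motion_excitation` and these simplifications are assumptions for illustration, not the authors' implementation.

```python
import math

def motion_excitation(features):
    """Simplified motion excitation.

    features: list of T frames, each a list of C per-channel activations
    (assumed already spatially pooled). Returns a list of the same shape.
    """
    num_frames = len(features)
    excited = []
    for t in range(num_frames):
        if t + 1 < num_frames:
            # Feature-level temporal difference between adjacent frames.
            diff = [nxt - cur for cur, nxt in zip(features[t], features[t + 1])]
        else:
            # No following frame: pad the last difference with zeros.
            diff = [0.0] * len(features[t])
        # Assumed gate: 2 * sigmoid(d) - 1, which is zero when d is zero.
        gate = [2.0 / (1.0 + math.exp(-d)) - 1.0 for d in diff]
        # Residual excitation: channels that change over time are amplified.
        excited.append([x + x * g for x, g in zip(features[t], gate)])
    return excited
```

In this sketch a channel whose activation does not change between frames is left untouched, while a channel with a positive temporal difference is scaled up.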

preprint2020arXiv

V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred to as V4D, to model the evolution of long-range spatio-temporal representation with 4D convolutions, and at the same time, to preserve strong 3D spatio-temporal representation with residual connections. Specifically, we design a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into the existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.

preprint2016arXiv

Actionness Estimation Using Hybrid Fully Convolutional Networks

Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.

preprint2016arXiv

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architectures, such as ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which together attain a high classification accuracy (mAP $93.23\%$) on the testing set and secured first place in the challenge.
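
The two aggregation schemes named above can be sketched as follows. This is an illustrative reading only: the abstract does not specify how the attention weights are produced, so `attention_logits` is a hypothetical input standing in for whatever a learned model would output, and both function names are assumptions.

```python
import math

def top_k_pool(scores, k):
    """Average the k highest snippet scores for each class.

    scores: list of per-snippet score lists, one value per class.
    """
    num_classes = len(scores[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((snippet[c] for snippet in scores), reverse=True)
        top = column[:k]
        pooled.append(sum(top) / len(top))
    return pooled

def attention_pool(scores, attention_logits):
    """Weighted average of snippet scores using softmax attention weights.

    attention_logits: one hypothetical relevance logit per snippet.
    """
    peak = max(attention_logits)
    exps = [math.exp(a - peak) for a in attention_logits]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    num_classes = len(scores[0])
    return [
        sum(w * snippet[c] for w, snippet in zip(weights, scores))
        for c in range(num_classes)
    ]
```

With equal attention logits, `attention_pool` reduces to plain average pooling, and with `k` equal to the number of snippets so does `top_k_pool`.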

preprint2016arXiv

Electron-hole asymmetry, Dirac fermions, and quantum magnetoresistance in BaMnBi2

We report two-dimensional quantum transport and Dirac fermions in BaMnBi2 single crystals. BaMnBi2 is a layered bad metal with highly anisotropic conductivity and magnetic order below 290 K. Magnetotransport properties, nonzero Berry phase, small cyclotron mass, and the first-principles band structure calculations indicate the presence of Dirac fermions in Bi square nets. Quantum oscillations in the Hall channel suggest the presence of both electron and hole pockets, whereas Dirac and parabolic states coexist at the Fermi level.

preprint2016arXiv

Enhanced Thermoelectric Power and Electronic Correlations in RuSe$_2$

We report the electronic structure, electric and thermal transport properties of Ru$_{1-x}$Ir$_{x}$Se$_2$ ($x \leq 0.2$). RuSe$_2$ is a semiconductor that crystallizes in a cubic pyrite unit cell. The Seebeck coefficient of RuSe$_2$ exceeds -200 $μ$V/K around 730 K. Ir substitution results in the suppression of the resistivity and the Seebeck coefficient, suggesting the removal of the peaks in density of states near the Fermi level. Ru$_{0.8}$Ir$_{0.2}$Se$_{2}$ shows a semiconductor-metal crossover at about 30 K. The magnetic field restores the semiconducting behavior. Our results indicate the importance of the electronic correlations in enhanced thermoelectricity of RuSb$_{2}$.

preprint2016arXiv

Pairing of j=3/2 fermions in half-Heusler superconductors

We theoretically consider the superconductivity of the topological half-Heusler semimetals YPtBi and LuPtBi. We show that pairing occurs between j=3/2 fermion states, which leads to qualitative differences from the conventional theory of pairing between j=1/2 states. In particular, this permits Cooper pairs with quintet or septet total angular momentum, in addition to the usual singlet and triplet states. Purely on-site interactions can generate s-wave quintet time-reversal symmetry-breaking states with topologically nontrivial point or line nodes. These local s-wave quintet pairs reveal themselves as d-wave states in momentum space. Furthermore, due to the broken inversion symmetry in these materials, the s-wave singlet state can mix with a p-wave septet state, again with topologically-stable line nodes. Our analysis lays the foundation for understanding the unconventional superconductivity of the half-Heuslers.

preprint2016arXiv

Real-time Action Recognition with Enhanced Motion Vector CNNs

The deep two-stream architecture has exhibited excellent performance on video-based action recognition. The most computationally expensive step in this approach is the calculation of optical flow, which prevents it from running in real time. This paper accelerates this architecture by replacing optical flow with motion vectors, which can be obtained directly from compressed videos without extra calculation. However, motion vectors lack fine structures and contain noisy and inaccurate motion patterns, leading to an evident degradation of recognition performance. Our key insight for relieving this problem is that optical flow and motion vectors are inherently correlated. Transferring the knowledge learned with the optical flow CNN to the motion vector CNN can significantly boost the performance of the latter. Specifically, we introduce three strategies for this: initialization transfer, supervision transfer, and their combination. Experimental results show that our method achieves recognition performance comparable to the state of the art while processing 390.7 frames per second, which is 27 times faster than the original two-stream method.

preprint2016arXiv

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 ($69.4\%$) and UCF101 ($94.2\%$). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
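
A rough sketch of the sparse temporal sampling idea: the video is split into equal-length segments, one frame index is drawn from each, and the per-snippet class scores are combined into a single video-level prediction. Averaging is assumed here as the combining function, and the names `sample_segment_indices` and `segmental_consensus` are hypothetical; the sketch also assumes the video has at least as many frames as segments.

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=None):
    """Draw one random frame index from each of num_segments equal segments."""
    rng = rng or random.Random()
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        # Guard against an empty range when segments are shorter than a frame.
        end = max(start + 1, int((k + 1) * seg_len))
        indices.append(rng.randrange(start, end))
    return indices

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    count = len(snippet_scores)
    return [sum(column) / count for column in zip(*snippet_scores)]
```

Because the number of sampled snippets is fixed by `num_segments`, the cost per video in this sketch does not grow with video length, which is the point of sampling sparsely.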

preprint2016arXiv

Transferring Object-Scene Convolutional Neural Networks for Event Recognition in Still Images

Event recognition in still images is an intriguing problem and has potential for real applications. This paper addresses the problem of event recognition by proposing a convolutional neural network that exploits knowledge of objects and scenes for event classification (OS2E-CNN). Intuitively, it stands to reason that there exists a correlation among the concepts of objects, scenes, and events. We empirically demonstrate that the recognition of objects and scenes substantially contributes to the recognition of events. Meanwhile, we propose an iterative selection method to identify a subset of object and scene classes, which help to more efficiently and effectively transfer their deep representations to event recognition. Specifically, we develop three types of transferring techniques: (1) initialization-based transferring, (2) knowledge-based transferring, and (3) data-based transferring. These newly designed transferring techniques exploit multi-task learning frameworks to incorporate extra knowledge from other networks and additional datasets into the training procedure of event CNNs. These multi-task learning frameworks turn out to be effective in reducing the effect of over-fitting and improving the generalization ability of the learned CNNs. With OS2E-CNN, we design a multi-ratio and multi-scale cropping strategy, and propose an end-to-end event recognition pipeline. We perform experiments on three event recognition benchmarks: the ChaLearn Cultural Event Recognition dataset, the Web Image Dataset for Event Recognition (WIDER), and the UIUC Sports Event dataset. The experimental results show that our proposed algorithm successfully adapts object and scene representations towards the event dataset and that it achieves the current state-of-the-art performance on these challenging datasets.

preprint2015arXiv

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features and deep-learned features. Our method also achieves superior performance to the state of the art on these datasets (HMDB51 65.9%, UCF101 91.5%).
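
The two normalization methods and the trajectory-constrained pooling can be sketched as below. The abstract does not give formulas, so this sketch assumes max-based normalization (dividing by the maximum over all space-time positions per channel, or by the maximum over channels per position), sum pooling along the trajectory, and non-negative activations. The dictionary layout keyed by `(t, y, x)` and all function names are illustrative assumptions.

```python
def spatiotemporal_normalize(fmap):
    """Divide each channel by its maximum over all space-time positions.

    fmap: dict mapping (t, y, x) -> list of C non-negative activations.
    """
    num_channels = len(next(iter(fmap.values())))
    maxima = [max(v[c] for v in fmap.values()) for c in range(num_channels)]
    return {
        pos: [x / m if m > 0 else 0.0 for x, m in zip(vec, maxima)]
        for pos, vec in fmap.items()
    }

def channel_normalize(fmap):
    """Divide each position's vector by its maximum over channels."""
    normalized = {}
    for pos, vec in fmap.items():
        peak = max(vec)
        normalized[pos] = [x / peak if peak > 0 else 0.0 for x in vec]
    return normalized

def trajectory_pool(fmap, trajectory):
    """Sum-pool feature vectors at the positions visited by one trajectory."""
    num_channels = len(next(iter(fmap.values())))
    pooled = [0.0] * num_channels
    for pos in trajectory:
        for c, value in enumerate(fmap[pos]):
            pooled[c] += value
    return pooled
```

A descriptor for one trajectory would then be `trajectory_pool(spatiotemporal_normalize(fmap), trajectory)` or the channel-normalized equivalent.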

preprint2015arXiv

Better Exploiting OS-CNNs for Better Event Recognition in Images

Event recognition from still images is one of the most important problems for image understanding. However, compared with object recognition and scene recognition, event recognition has received much less research attention in the computer vision community. This paper addresses the problem of cultural event recognition in still images and focuses on applying deep learning methods to this problem. In particular, we utilize the successful architecture of Object-Scene Convolutional Neural Networks (OS-CNNs) to perform event recognition. OS-CNNs are composed of object nets and scene nets, which transfer the learned representations from the pre-trained models on large-scale object and scene recognition datasets, respectively. We propose four types of scenarios to explore OS-CNNs for event recognition by treating them as either "end-to-end event predictors" or "generic feature extractors". Our experimental results demonstrate that the global and local representations of OS-CNNs are complementary to each other. Finally, based on our investigation of OS-CNNs, we come up with a solution for the cultural event recognition track at the ICCV ChaLearn Looking at People (LAP) challenge 2015. Our team secured third place in this challenge, and our result is very close to the best performance.

preprint2015arXiv

High-temperature superconductivity stabilized by electron-hole interband coupling in collapsed tetragonal phase of KFe2As2 under high pressure

We report a high-pressure study of simultaneous low-temperature electrical resistivity and Hall effect measurements on high quality single-crystalline KFe2As2 using designer diamond anvil cell techniques with applied pressures up to 33 GPa. In the low pressure regime, we show that the superconducting transition temperature T_c finds a maximum onset value of 7 K near 2 GPa, in contrast to previous reports that find a minimum T_c and reversal of pressure dependence at this pressure. Upon applying higher pressures, this T_c is diminished until a sudden drastic enhancement occurs coincident with a first-order structural phase transition into a collapsed tetragonal phase. The appearance of a distinct superconducting phase above 13 GPa is also accompanied by a sudden reversal of dominant charge carrier sign, from hole- to electron-like, which agrees with our band calculations predicting the emergence of an electron pocket and diminishment of hole pockets upon Fermi surface reconstruction. Our results suggest the high-temperature superconducting phase in KFe2As2 is substantially enhanced by the presence of nested electron and hole pockets, providing the key ingredient of high-T_c superconductivity in iron pnictide superconductors.

preprint2015arXiv

Object-Scene Convolutional Neural Networks for Event Recognition in Images

Event recognition from still images is of great importance for image understanding. However, compared with event recognition in videos, there are far fewer research works on event recognition in images. This paper addresses the issue of event recognition from images and proposes an effective method with deep neural networks. Specifically, we design a new architecture, called Object-Scene Convolutional Neural Network (OS-CNN). This architecture is decomposed into object net and scene net, which extract useful information for event understanding from the perspective of objects and scene context, respectively. Meanwhile, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition. Furthermore, we find that the deep and very-deep networks are complementary to each other. Finally, based on the proposed OS-CNN and comparative study of different network architectures, we come up with a five-stream CNN solution for the cultural event recognition track at the ChaLearn Looking at People (LAP) challenge 2015. Our method obtains a performance of 85.5% and ranks $1^{st}$ in this challenge.

preprint2015arXiv

Places205-VGGNet Models for Scene Recognition

VGGNets have turned out to be effective for object recognition in still images. However, directly adapting the VGGNet models trained on the ImageNet dataset does not yield good performance for scene recognition. This report describes our implementation of training the VGGNets on the large-scale Places205 dataset. Specifically, we train three VGGNet models, namely VGGNet-11, VGGNet-13, and VGGNet-16, by using a Multi-GPU extension of the Caffe toolbox with high computational efficiency. We verify the performance of the trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205. Our trained models achieve state-of-the-art performance on these datasets and are made publicly available.

preprint2015arXiv

Towards Good Practices for Very Deep Two-Stream ConvNets

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First, the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with those very deep models in the image domain (e.g. VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, and thus it is easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures into the video domain. However, this extension is not easy as the size of action recognition datasets is quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, (iv) a high dropout ratio. Meanwhile, we extend the Caffe toolbox into a Multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the dataset of UCF101 and it achieves the recognition accuracy of $91.4\%$.

preprint2014arXiv

Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice

Video based action recognition is one of the important and challenging problems in computer vision research. Bag of Visual Words model (BoVW) with local features has become the most popular method and obtained the state-of-the-art performance on several realistic datasets, such as the HMDB51, UCF50, and UCF101. BoVW is a general pipeline to construct a global representation from a set of local features, which is mainly composed of five steps: (i) feature extraction, (ii) feature pre-processing, (iii) codebook generation, (iv) feature encoding, and (v) pooling and normalization. Many efforts have been made in each step independently in different scenarios and their effect on action recognition is still unknown. Meanwhile, video data exhibits different views of visual pattern, such as static appearance and motion dynamics. Multiple descriptors are usually extracted to represent these different views. Many feature fusion methods have been developed in other areas and their influence on action recognition has never been investigated before. This paper aims to provide a comprehensive study of all steps in BoVW and different fusion methods, and uncover some good practice to produce a state-of-the-art action recognition system. Specifically, we explore two kinds of local features, ten kinds of encoding methods, eight kinds of pooling and normalization strategies, and three kinds of fusion methods. We conclude that every step is crucial for contributing to the final recognition rate. Furthermore, based on our comprehensive study, we propose a simple yet effective representation, called hybrid representation, by exploring the complementarity of different BoVW frameworks and local descriptors. Using this representation, we obtain the state-of-the-art on the three challenging datasets: HMDB51 (61.1%), UCF50 (92.3%), and UCF101 (87.9%).
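
As a concrete illustration of steps (iv) and (v) of the pipeline above, the following sketch uses the simplest choices: hard vector quantization to the nearest codeword, sum pooling into a histogram, and L2 normalization. These are just one combination among the many encoding and normalization methods the study compares, and the function name `encode_bovw` is an assumption.

```python
import math

def encode_bovw(features, codebook):
    """Encode local features as an L2-normalized codeword histogram.

    features: list of local descriptors (lists of floats).
    codebook: list of codewords with the same dimensionality.
    """
    histogram = [0.0] * len(codebook)
    for feature in features:
        # Hard assignment: squared Euclidean distance to every codeword.
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, word))
            for word in codebook
        ]
        histogram[distances.index(min(distances))] += 1.0  # sum pooling
    norm = math.sqrt(sum(h * h for h in histogram))
    return [h / norm for h in histogram] if norm > 0 else histogram
```

The codebook itself (step iii) would typically come from clustering a sample of training descriptors; it is taken as given here.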

preprint2014arXiv

Lattice Boltzmann Model for The Volume-Averaged Navier-Stokes Equations

A numerical method, based on the discrete lattice Boltzmann equation, is presented for solving the volume-averaged Navier-Stokes equations. With a modified equilibrium distribution and an additional forcing term, the volume-averaged Navier-Stokes equations can be recovered from the lattice Boltzmann equation in the limit of small Mach number by the Chapman-Enskog analysis and Taylor expansion. Due to its advantages such as explicit solver and inherent parallelism, the method appears to be more competitive with traditional numerical techniques. Numerical simulations show that the proposed model can accurately reproduce both the linear and nonlinear drag effects of porosity in the fluid flow through porous media.

preprint2013arXiv

A stability condition for turbulence model: From EMMS model to EMMS-based turbulence model

The closure problem of turbulence is still a challenging issue in turbulence modeling. In this work, a stability condition is used to close turbulence. Specifically, we regard single-phase flow as a mixture of turbulent and non-turbulent fluids, separating the structure of turbulence. Subsequently, according to the picture of the turbulent eddy cascade, the energy contained in turbulent flow is decomposed into different parts and then quantified. A turbulence stability condition, similar to the principle of the energy-minimization multi-scale (EMMS) model for gas-solid systems, is formulated to close the dynamic constraint equations of turbulence, allowing the heterogeneous structural parameters of turbulence to be optimized. We call this model the `EMMS-based turbulence model', and use it to construct the corresponding turbulent viscosity coefficient. To validate the EMMS-based turbulence model, it is used to simulate two classical benchmark problems, lid-driven cavity flow and turbulent flow with forced convection in an empty room. The numerical results show that the EMMS-based turbulence model improves the accuracy of turbulence modeling because it considers the principle of compromise in competition between viscosity and inertia.

preprint2013arXiv

Lattice Boltzmann based discrete simulation for gas-solid fluidization

Discrete particle simulation, a combined approach of computational fluid dynamics and discrete methods such as DEM (Discrete Element Method), DSMC (Direct Simulation Monte Carlo), SPH (Smoothed Particle Hydrodynamics), PIC (Particle-In-Cell), etc., is becoming a practical tool for exploring lab-scale gas-solid systems owing to the fast development of parallel computation. However, gas-solid coupling and the corresponding fluid flow solver remain immature. In this work, we propose a modified lattice Boltzmann approach to consider the effect of both the local solid volume fraction and the local relative velocity between particles and fluid, which is different from the traditional volume-averaged Navier-Stokes equations. A time-driven hard sphere algorithm is combined with it to simulate the motion of individual particles, in which particles interact with each other via hard-sphere collisions, and the collision detection and motion of particles are performed at constant time intervals. The EMMS (energy minimization multi-scale) drag is coupled with the lattice Boltzmann based discrete particle simulation to improve the accuracy. Two typical fluidization processes, namely, a single bubble injection at incipient fluidization and particle clustering in a fast fluidized bed riser, are simulated with this approach, with the results showing a good agreement with published correlations and experimental data. The capability of the approach to capture more detailed and intrinsic characteristics of particle-fluid systems is demonstrated. The method can also be used straightforwardly with other solid phase solvers.

preprint2013arXiv

Lattice Boltzmann method for shape optimization of fluid distributor

This paper presents the shape optimization of a flat-type arborescent fluid distributor for the purpose of process intensification. A shape optimization algorithm based on the lattice Boltzmann method (LBM) is proposed with the objective of decreasing the flow resistance of such a distributor under the constraint of constant fluid volume. Prototypes of the initial distributor as well as the optimized one are designed. Fluid distribution and hydraulic characteristics of these distributors are investigated numerically. Results show that the pressure drop of the optimized distributor is between 15.9% and 25.1% lower than that of the initial reference while keeping a uniform flow distribution, demonstrating process intensification in the fluid distributor and suggesting the value of the proposed optimization algorithm for optimal engineering design.

preprint2012arXiv

Large magnetothermopower effect in Dirac materials (Sr/Ca)MnBi2

We report temperature and magnetic field dependence of the thermal transport properties in single crystals of (Sr/Ca)MnBi$_2$ with linear energy dispersion. In SrMnBi$_2$ thermopower is positive, indicating hole-type carriers, and the magnetic field enhances the thermopower significantly. The maximum change of thermopower is about 1600% in 9 T field and at 10 K. A negative thermopower is observed in CaMnBi$_2$ with dominant electron-type carriers and, in contrast, the magnetic field suppresses the absolute value of thermopower. First-principle band structure shows that the chemical potential is close to the Dirac-cone-like points in linear bands. The magnetic field suppresses the apparent Hall carrier density of CaMnBi$_2$ below 50 K. The large magnetothermopower effect in (Sr/Ca)MnBi$_2$ is attributed to the magnetic field shift of chemical potential.

preprint2012arXiv

Magnetic States of the Two-Leg Ladder Alkali Metal Iron Selenides $A$Fe$_2$Se$_3$

Recent neutron scattering experiments addressing the magnetic state of the two-leg ladder selenide compound BaFe$_2$Se$_3$ have unveiled a dominant spin arrangement involving ferromagnetically ordered 2$\times$2 iron-superblocks, that are antiferromagnetically coupled among them (the "block-AFM" state). Using the electronic five-orbital Hubbard model, first principles techniques to calculate the electronic hopping amplitudes between irons, and the real-space Hartree-Fock approximation to handle the many-body effects, here it is shown that the exotic block-AFM state is indeed stable at realistic electronic densities close to $n \sim 6.0$. Another state (the "CX" state) with parallel spins along the rungs and antiparallel along the legs of the ladders is close in energy. This state becomes stable in other portions of the phase diagrams, such as with hole doping, as also found experimentally via neutron scattering applied to KFe$_2$Se$_3$. In addition, the present study unveils other competing magnetic phases that could be experimentally stabilized varying either $n$ chemically or the electronic bandwidth by pressure. Similar results were obtained using two-orbital models, studied here via Lanczos and DMRG techniques. A comparison of the results obtained with the realistic selenides hoppings amplitudes for BaFe$_2$Se$_3$ against those found using the hopping amplitudes for pnictides reveals several qualitative similarities, particularly at intermediate and large Hubbard couplings.

preprint2012arXiv

Two dimensional Dirac fermions and quantum magnetoresistance in CaMnBi$_2$

We report two dimensional Dirac fermions and quantum magnetoresistance in single crystals of CaMnBi$_2$. The non-zero Berry's phase, small cyclotron resonant mass and first-principle band structure suggest the existence of the Dirac fermions in the Bi square nets. The in-plane transverse magnetoresistance exhibits a crossover at a critical field $B^*$ from semiclassical weak-field $B^2$ dependence to the high-field unsaturated linear magnetoresistance ($\sim 120\%$ in 9 T at 2 K) due to the quantum limit of the Dirac fermions. The temperature dependence of $B^*$ satisfies quadratic behavior, which is attributed to the splitting of linear energy dispersion in high field. Our results demonstrate the existence of two dimensional Dirac fermions in CaMnBi$_2$ with Bi square nets.

preprint2011arXiv

Coupling model analysis of interchain coupled chain dynamics of PEO in blends with PMMA

Quasielastic neutron scattering and molecular dynamics simulation data from PEO/PMMA blends show that for short times the self-dynamics of the PEO chain follows the Rouse model, but at longer times past tc=1 to 2 ns it becomes slower and departs from the Rouse model in its dependences on time, momentum transfer, and temperature. To explain the anomalies, others had proposed the random Rouse model (RRM) in which each monomer has a different mobility taken from a broad log-normal distribution. Despite the success of the RRM, Diddens, Brodeck and Heuer [EPL, 95, 56003 (2011)] extracted the distribution of friction coefficients from the MD simulations of a PEO/PMMA blend and found the distribution is much narrower than expected from the RRM. We propose a simpler alternative explanation of the data that uses only the observed crossover of PEO chain dynamics at tc. The present problem is just a special case of a general property of relaxation in interacting systems, which is the crossover from independent relaxation to coupled many-body relaxation at some tc determined by the interaction potential. The generality is brought out vividly by pointing out that the crossover also had been observed by neutron scattering from entangled chain relaxation in monodisperse homo-polymers, and from the segmental α-relaxation of PEO in blends with PMMA. The properties of all the relaxation processes in connection with the crossover are similar, even though the length scales of the relaxation in these systems are widely different.

preprint2011arXiv

One-Fe versus Two-Fe Brillouin Zone of Fe-Based Superconductors: Creation of the Electron Pockets via Translational Symmetry Breaking

We investigate the physical effects of translational symmetry breaking in Fe-based high-temperature superconductors due to alternating anion positions. In the representative parent compounds, including the newly discovered Fe-vacancy-ordered $\mathrm{K_{0.8}Fe_{1.6}Se_2}$, an unusual change of orbital character is found across the one-Fe Brillouin zone upon unfolding the first-principles band structure and Fermi surfaces, suggesting that covering a larger one-Fe Brillouin zone is necessary in experiments. Most significantly, the electron pockets (critical to the magnetism and superconductivity) are found to be created only by the broken symmetry, which strongly advocates its full inclusion in future studies, particularly on the debated nodal structures of the superconducting order parameter.

preprint2010arXiv

Large-Scale DNS of Gas-Solid Flow on Mole-8.5

Direct numerical simulation (DNS) of gas-solid flow is implemented on a multi-scale supercomputing system, Mole-8.5, featuring massively parallel GPU-CPU hybrid computing, for which the lattice Boltzmann method (LBM) is deployed together with the immersed moving boundary (IMB) method and the discrete element method (DEM). A two-dimensional suspension of about 1,166,400 75-micron solid particles distributed in an area of 11.5 cm x 46 cm, and a three-dimensional suspension of 129,024 solid particles in a domain of 0.384 cm x 1.512 cm x 0.384 cm, are fully resolved below the particle scale, and distinct multi-scale heterogeneity is observed. An almost 20-fold speedup is achieved on one Nvidia C2050 GPU over one core of an Intel E5520 CPU in double precision, and nearly ideal scalability is maintained when using up to 672 GPUs. The simulations demonstrate that LB-IMB-DEM modeling with parallel GPU computing may be a promising approach for exploring the fundamental mechanisms and constitutive laws of complex gas-solid flow, which are, so far, poorly understood in both experimental and theoretical studies.

preprint1999arXiv

The Cosmic Microwave Background Bispectrum and Inflation

We derive an expression for the non-Gaussian cosmic-microwave-background (CMB) statistic $I_l^3$ defined recently by Ferreira, Magueijo, and Górski in terms of the slow-roll-inflation parameters $ε$ and $η$. This result shows that a nonzero value of $I_l^3$ in COBE would rule out single-field slow-roll inflation. A sharp change in the slope of the inflaton potential could increase the predicted value of $I_l^3$, but not significantly. This further suggests that it will be difficult to account for such a detection in multiple-field models in which density perturbations are produced by quantum fluctuations in the scalar field driving inflation. An Appendix shows how to evaluate an integral that is needed in our calculation as well as in more general calculations of CMB bispectra.

50 published item(s)

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Recovering 3D Human Mesh from Monocular Images: A Survey

VideoChat: Chat-Centric Video Understanding

AdaMixer: A Fast-Converging Query-Based Object Detector

APP-Net: Auxiliary-point-based Push and Pull Operations for Efficient Point Cloud Classification

Cross-Architecture Self-supervised Video Representation Learning

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Logit Normalization for Long-tail Object Detection

MixFormer: End-to-End Tracking with Iterative Mixed Attention

OCSampler: Compressing Videos to One Clip with Single-step Sampling

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

Structured Sparse R-CNN for Direct Scene Graph Generation

Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Task-specific Inconsistency Alignment for Domain Adaptive Object Detection

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Actions as Moving Points

Context-Aware RCNN: A Baseline for Action Detection in Videos

Crystalline symmetry-protected non-trivial topology in prototype compound BaAl$_4$

Dynamic Sampling Networks for Efficient Action Recognition in Videos

Finding Action Tubes with a Sparse-to-Dense Framework

Knowledge Integration Networks for Action Recognition

SketchyCOCO: Image Generation from Freehand Scene Sketches

TEA: Temporal Excitation and Aggregation for Action Recognition

V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

Actionness Estimation Using Hybrid Fully Convolutional Networks

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

Electron-hole asymmetry, Dirac fermions, and quantum magnetoresistance in BaMnBi2

Enhanced Thermoelectric Power and Electronic Correlations in RuSe$_2$

Pairing of j=3/2 fermions in half-Heusler superconductors

Real-time Action Recognition with Enhanced Motion Vector CNNs

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Transferring Object-Scene Convolutional Neural Networks for Event Recognition in Still Images

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

Better Exploiting OS-CNNs for Better Event Recognition in Images

High-temperature superconductivity stabilized by electron-hole interband coupling in collapsed tetragonal phase of KFe2As2 under high pressure

Object-Scene Convolutional Neural Networks for Event Recognition in Images

Places205-VGGNet Models for Scene Recognition

Towards Good Practices for Very Deep Two-Stream ConvNets

Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice

Lattice Boltzmann Model for The Volume-Averaged Navier-Stokes Equations

A stability condition for turbulence model: From EMMS model to EMMS-based turbulence model

Lattice Boltzmann based discrete simulation for gas-solid fluidization

Lattice Boltzmann method for shape optimization of fluid distributor

Large magnetothermopower effect in Dirac materials (Sr/Ca)MnBi2

Magnetic States of the Two-Leg Ladder Alkali Metal Iron Selenides $A$Fe$_2$Se$_3$

Two dimensional Dirac fermions and quantum magnetoresistance in CaMnBi$_2$

Coupling model analysis of interchain coupled chain dynamics of PEO in blends with PMMA

One-Fe versus Two-Fe Brillouin Zone of Fe-Based Superconductors: Creation of the Electron Pockets via Translational Symmetry Breaking

Large-Scale DNS of Gas-Solid Flow on Mole-8.5

The Cosmic Microwave Background Bispectrum and Inflation