Published work

29 published item(s)

preprint, 2026, arXiv

Towards Robust Sequential Decomposition for Complex Image Editing

Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing can decompose the task into simpler steps but suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we discovered that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.

preprint, 2025, arXiv

Empower Low-Altitude Economy: A Reliability-Aware Dynamic Weighting Allocation for Multi-modal UAV Beam Prediction

The low-altitude economy (LAE) is rapidly expanding, driven by urban air mobility, logistics drones, and aerial sensing, while fast and accurate beam prediction in uncrewed aerial vehicle (UAV) communications is crucial for achieving reliable connectivity. Current research is shifting from single-signal to multi-modal collaborative approaches. However, existing multi-modal methods mostly employ fixed or empirical weights, assuming equal reliability across modalities at any given moment. In practice, the importance of different modalities fluctuates dramatically with UAV motion scenarios, and static weighting amplifies the negative impact of degraded modalities. Furthermore, modal mismatch and weak alignment undermine cross-scenario generalization. To this end, we propose a reliability-aware dynamic weighting scheme applied to a semantic-aware multi-modal beam prediction framework, named SaM2B. Specifically, SaM2B leverages lightweight cues such as environmental visual data, flight posture, and geospatial data to adaptively allocate contributions across modalities at different time points through reliability-aware dynamic weight updates. Moreover, by utilizing cross-modal contrastive learning, we align the "multi-source representation beam semantics" associated with specific beam information to a shared semantic space, thereby enhancing discriminative power and robustness under modal noise and distribution shifts. Experiments on real-world low-altitude UAV datasets show that SaM2B achieves more satisfactory results than baseline methods.

preprint, 2022, arXiv

A New Knowledge Distillation Network for Incremental Few-Shot Surface Defect Detection

Surface defect detection is one of the most essential processes for industrial quality inspection. Deep learning-based surface defect detection methods have shown great potential. However, well-performing models usually require large amounts of training data and can only detect defects that appeared in the training stage. When facing incremental few-shot data, defect detection models inevitably suffer from catastrophic forgetting and misclassification problems. To solve these problems, this paper proposes a new knowledge distillation network, called Dual Knowledge Align Network (DKAN). The proposed DKAN method follows a pretraining-finetuning transfer learning paradigm, and a knowledge distillation framework is designed for fine-tuning. Specifically, an Incremental RCNN is proposed to achieve decoupled stable feature representation of different categories. Under this framework, a Feature Knowledge Align (FKA) loss is designed between class-agnostic feature maps to deal with catastrophic forgetting problems, and a Logit Knowledge Align (LKA) loss is deployed between logit distributions to tackle misclassification problems. Experiments have been conducted on the incremental Few-shot NEU-DET dataset, and results show that DKAN outperforms other methods on various few-shot scenes, by up to 6.65% on the mean Average Precision metric, which demonstrates the effectiveness of the proposed method.
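As a rough illustration of what a loss "between logit distributions" can look like, the sketch below computes a generic temperature-softened KL divergence between teacher and student logits. It is a standard knowledge distillation term, not the LKA loss from the paper; the function names and the default temperature are assumptions made for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,  # hypothetical default, not taken from the paper
) -> float:
    """KL(teacher || student) over the softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The term is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is why it is a common way to discourage a fine-tuned model from drifting away from what the base model learned.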

preprint, 2022, arXiv

AVATAR: Unconstrained Audiovisual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all other prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.

preprint, 2022, arXiv

Axion Echos from the Supernova Graveyard

Stimulated decays of axion dark matter, triggered by a source in the sky, could produce a photon flux along the continuation of the line of sight, pointing backward to the source. The strength of this so-called axion "echo" signal depends on the entire history of the source and could still be strong from sources that are dim today but had a large flux density in the past, such as supernova remnants (SNRs). This echo signal turns out to be most observable in the radio band. We study the sensitivity of radio telescopes such as the Square Kilometer Array (SKA) to echo signals generated by SNRs that have already been observed, and show that SKA could reach axion-photon couplings of order $g_{a\gamma\gamma} \sim \mathcal{O}(10^{-11}) \,\mathrm{GeV}^{-1}$ for axion masses $m_a \lesssim 10^{-5}\;\mathrm{eV}$. In addition, we show projections of the detection reach for signals coming from old SNRs and from newly born supernovae that could be detected in the future. Intriguingly, an observable echo signal could come from old "ghost" SNRs which were very bright in the past but are now so dim that they haven't been observed.

preprint, 2022, arXiv

Beyond Transfer Learning: Co-finetuning for Action Localisation

Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained on large "upstream" datasets for classification, as such labels are easy to collect, and then finetuned on "downstream" tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propose co-finetuning -- simultaneously training a single model on multiple "upstream" and "downstream" tasks. We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data, and also show how we can easily extend our approach to multiple "upstream" datasets to further improve performance. In particular, co-finetuning significantly improves the performance on rare classes in our downstream task, as it has a regularising effect, and enables the network to learn feature representations that transfer between different datasets. Finally, we observe that by co-finetuning with public video classification datasets, we are able to achieve state-of-the-art results for spatio-temporal action localisation on the challenging AVA and AVA-Kinetics datasets, outperforming recent works which develop intricate models.

preprint, 2022, arXiv

Coherent reaction between molecular and atomic Bose-Einstein condensates: integrable model

We solve a model that describes a stimulated conversion between ultracold bosonic atoms and molecules. The reaction is triggered by a linearly time-dependent transition throughout the Feshbach resonance. Our solution predicts a nonexponential dependence, with a dynamic phase transition, of the reaction efficiency on the transition rate. We find that the emerging phase can have a thermalized energy distribution with the temperature defined by the rate of the transition. This phase, however, has strong purely quantum correlations.

preprint, 2022, arXiv

Constraints on Axions from Cosmic Distance Measurements

Axion couplings to photons could induce photon-axion conversion in the presence of magnetic fields in the Universe. This conversion could impact various cosmic distance measurements, such as luminosity distances to type Ia supernovae and angular distances to galaxy clusters, in different ways. In this paper we consider different combinations of the most up-to-date distance measurements to constrain the axion-photon coupling. Employing the conservative cell magnetic field model for the magnetic fields in the intergalactic medium (IGM) and ignoring the conversion in the intracluster medium (ICM), we find the upper bounds on axion-photon couplings to be around $5 \times 10^{-12}$ (nG/$B$) $\sqrt{\mathrm{Mpc/s}}$ GeV$^{-1}$ for axion masses $m_a$ below $10^{-13}$ eV, where $B$ is the strength of the IGM magnetic field, and $s$ is the comoving size of the magnetic domains. When including the conversion in the ICM, the upper bound is lowered and could reach $5 \times 10^{-13}$ GeV$^{-1}$ for $m_a < 5 \times 10^{-12}$ eV. While this stronger bound depends on the ICM modeling, it is independent of the strength of the IGM magnetic field, for which there is no direct evidence yet. These constraints could be placed on firmer footing with an enhanced understanding and control of the astrophysical uncertainties associated with the IGM and ICM. All the bounds are determined by the shape of the Hubble rate as a function of redshift, reconstructable from various distance measurements, and are insensitive to today's Hubble rate, over which there is a tension between early and late cosmological measurements. As an appendix, we discuss the model building challenges of the use of photon-axion conversion to make type Ia supernovae brighter to alleviate the Hubble problem/crisis.

preprint, 2022, arXiv

Do Trajectories Encode Verb Meaning?

Distributional models learn representations of words from text, but are criticized for their lack of grounding, or the linking of text to the non-linguistic world. Grounded language models have had success in learning to connect concrete categories like nouns and adjectives to the world via images and videos, but can struggle to isolate the meaning of the verbs themselves from the context in which they typically occur. In this paper, we investigate the extent to which trajectories (i.e. the position and rotation of objects over time) naturally encode verb semantics. We build a procedurally generated agent-object-interaction dataset, obtain human annotations for the verbs that occur in this data, and compare several methods for representation learning given the trajectories. We find that trajectories correlate as-is with some verbs (e.g., fall), and that additional abstraction via self-supervised pretraining can further capture nuanced differences in verb meaning (e.g., roll vs. slide).

preprint, 2022, arXiv

Galactic rotation curves versus ultralight dark matter: A systematic comparison with SPARC data

We look for and place observational constraints on the imprint of ultralight dark matter (ULDM) soliton cores in rotation-dominated galaxies. Extending previous analyses, we find a conservative constraint which disfavors the soliton-host halo relation found in some numerical simulations over a broad range in the ULDM particle mass $m$. Combining the observational constraints with theoretical arguments for the efficiency of soliton formation via gravitational dynamical relaxation, and assuming that the soliton-halo relation is correct, our results disfavor ULDM from comprising 100% of the total cosmological dark matter in the range $10^{-24}~{\rm eV}\lesssim m\lesssim10^{-20}~{\rm eV}$. The constraints probe the ULDM fraction down to $f\lesssim0.3$ of the total dark matter.

preprint, 2022, arXiv

Learning Audio-Video Modalities from Image Captions

A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image-captioning, where datasets are on the order of millions of samples. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state-of-the-art results for the task of audio retrieval.

preprint, 2022, arXiv

Many-body slow quench dynamics and nonadiabatic characterization of topological phases

Previous studies have shown that the bulk topology of single-particle systems can be captured by the band inversion surface or by the spin inversion surface that emerges in the time-averaged spin polarization. Most of these studies, however, are based on the single-particle picture even though the systems are fermionic and multi-band. Here, we study the many-body quench dynamics of topological systems with all the valence bands fully occupied, and show that the concepts of band inversion surface and spin inversion surface are still valid. More importantly, the many-body quench dynamics is shown to reduce to a nontrivial three-level Landau-Zener model, which can be solved exactly. Based on the analytical results, the topological spin texture revealed by the time-averaged spin polarization can be applied to characterize the bulk topology and thus provides a direct comparison for future experiments.

preprint, 2022, arXiv

Multiview Transformers for Video Recognition

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic/tree/main/scenic/projects/mtv.

preprint, 2022, arXiv

No-go rules for multitime Landau-Zener models

The multitime Landau-Zener (MTLZ) model is a class of exactly solvable quantum many-body models which is a multistate and multitime generalization of the two-state Landau-Zener model. Currently discovered MTLZ models include the "hypercubes", the "fans" and their direct product models. In this work, we prove two no-go rules, named the "no $K_{3,3}$" rule and the "no $1221$" rule, which forbid the existence of exact solutions for models with certain structures of interactions. We further apply these rules to show that for models with no more than $9$ states, besides the models mentioned above there are no other MTLZ models. We also propose a scheme to systematically classify cases that could possibly host MTLZ models. Our work could serve as a guideline to search for new exactly solvable models within the MTLZ class.

preprint, 2022, arXiv

Spin Accumulation and Longitudinal Spin Diffusion of Magnets

We extend to the longitudinal component of the magnetization the spintronics idea that a magnet near equilibrium can be described by two magnetic variables. One is the usual magnetization $\vec{M}$. The other is the non-equilibrium quantity $\vec{m}$, called the spin accumulation, by which the non-equilibrium spin current can be transported. $\vec{M}$ represents a correlated distribution of a very large number of degrees of freedom, as expressed in some equilibrium distribution function for the excitations; we therefore forbid $\vec{M}$ to diffuse, but we permit $\vec{M}$ to decay. On the other hand, we permit $\vec{m}$, due to spin excitations, to both diffuse and decay. For this physical picture, diffusion from a given region occurs by decay of $\vec{M}$ to $\vec{m}$, then by diffusion of $\vec{m}$, and finally by decay of $\vec{m}$ to $\vec{M}$ in another region. This somewhat slows down the diffusion process. Restricting ourselves to the longitudinal variables $M$ and $m$ with equilibrium properties $M_{eq}=M_{0}+\chi_{M\parallel}H$ and $m_{eq}=0$, we argue that the effective energy density must include a new, thermodynamically required exchange constant $\lambda_{M}=-1/\chi_{M\parallel}$. We then develop the macroscopic equations by applying Onsager's irreversible thermodynamics, and use the resulting equations to study the space and time response. At fixed real frequency $\omega$ there is, as usual, a single pair of complex wavevectors $\pm k$ but with an unusual dependence on $\omega$. At fixed real wavevector, there are two decay constants, as opposed to one in the usual case. Extending the idea that non-equilibrium diffusion in other ordered systems involves a non-equilibrium quantity, this work suggests that in a superconductor the order parameter $\Delta$ can decay but not diffuse, but a non-equilibrium gap-like $\delta$, due to pair excitations, can both decay and diffuse.

preprint, 2022, arXiv

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.

preprint, 2021, arXiv

From Machine Learning to Transfer Learning in Laser-Induced Breakdown Spectroscopy: the Case of Rock Analysis for Mars Exploration

With the ChemCam instrument, laser-induced breakdown spectroscopy (LIBS) has successfully contributed to Mars exploration by determining elemental compositions of the soil, crust and rocks. Two newly launched missions, the Chinese Tianwen 1 and the American Perseverance, will further increase the number of LIBS instruments on Mars after the planned landings in spring 2021. Such an unprecedented situation requires a reinforced research effort on the methods of LIBS spectral data treatment. Although matrix effects are a general issue in LIBS, they become accentuated in the case of rock analysis for Mars exploration, because of the large variation of rock composition, leading to the chemical matrix effect, and the difference in morphology between laboratory standard samples (in pressed pellet, glass or ceramics) used to establish calibration models and natural rocks encountered on Mars, leading to the physical matrix effect. The chemical matrix effect has been tackled in the ChemCam project with large sets of laboratory standard samples offering a good representation of various compositions of Mars rocks. The present work deals with the physical matrix effect, which still awaits a satisfactory solution. The approach consists in introducing transfer learning in LIBS data treatment. For the specific case of total alkali-silica (TAS) classification of natural rocks, the results show a significant improvement of the prediction capacity of pellet sample-based models when trained together with suitable information from rocks in a procedure of transfer learning. The correct classification rate of rocks increases from 33.3% with a machine learning model to 83.3% with a transfer learning model.

preprint, 2020, arXiv

Analytical expressions of variable specific yield for layered soils in shallow water table environments

This paper presents analytical expressions of variable specific yield for layered soils in shallow water table environments, introducing two distinct concepts: point specific yield (Syp) and interval average specific yield (Syi). Syp refers to the specific yield as the water table fluctuation approaches zero, and Syi to the specific yield for a finite interval of water table fluctuation. On the basis of the definition of specific yield and the van Genuchten model of soil water retention, analytical and semi-analytical expressions are proposed for Syp and Syi, respectively, in layered soils. The analytical expressions are evaluated and verified against experimental data and by comparison with previous expressions. Analyses indicate that our expressions for Syp and Syi effectively reflect the changes and nonlinear properties caused by soil hydraulic properties and soil layering under shallow water table conditions. The previously conflated concepts of Syp and Syi are also distinguished. The practicality and applicability of the specific yield expressions are comprehensively analyzed for potential applications in subsurface water modeling and management.

preprint, 2020, arXiv

Beam-Domain Secret Key Generation for Multi-User Massive MIMO Networks

Physical-layer key generation (PKG) in multi-user massive MIMO networks faces great challenges due to the long pilot sequences and the high dimension of the channel matrix. To tackle these problems, we propose a novel massive MIMO key generation scheme with pilot reuse based on the beam domain channel model and derive a closed-form expression of the secret key rate. Specifically, we present two algorithms, i.e., a beam-domain based channel probing (BCP) algorithm and an interference neutralization based multi-user beam allocation (IMBA) algorithm, for the purpose of channel dimension reduction and multi-user pilot reuse, respectively. Numerical results verify that the proposed PKG scheme can achieve a secret key rate that approximates the perfect case, and significantly reduce the dimension of the channel estimation and the pilot overhead.

preprint, 2020, arXiv

Integrable multistate Landau-Zener models with parallel energy levels

We discuss solvable multistate Landau-Zener (MLZ) models whose Hamiltonians have commuting partner operators with $\sim 1/\tau$-time-dependent parameters. Many already known solvable MLZ models belong precisely to this class. We derive the integrability conditions on the parameters of such commuting operators, and demonstrate how to use such conditions in order to derive new solvable cases. We show that MLZ models from this class must contain bands of parallel diabatic energy levels. The structure of the scattering matrix and other properties are found to be the same as in the previously discussed completely solvable MLZ Hamiltonians.

preprint, 2020, arXiv

Multi-modal Transformer for Video Retrieval

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.

preprint, 2020, arXiv

Multitime Landau-Zener model: classification of solvable Hamiltonians

We discuss a class of models that generalize the two-state Landau-Zener (LZ) Hamiltonian to both the multistate and multitime evolution. It is already known that the corresponding quantum mechanical evolution can be understood in great detail. Here, we present an approach to classify such solvable models, namely, to identify all their independent families for a given number $N$ of interacting states and prove the absence of such families for some types of interactions. We also discuss how, within a solvable family, one can classify the scattering matrices, i.e., the system's dynamics. Due to the possibility of such a detailed classification, the multitime Landau-Zener (MTLZ) model defines a useful special function of theoretical physics.
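For orientation, the two-state Landau-Zener problem that these models generalize can be written as follows. This is the standard textbook form rather than notation taken from the paper; the symbols $b_1$, $b_2$ (slopes of the diabatic levels) and $g$ (constant coupling) are chosen here for illustration.

```latex
H(t) =
\begin{pmatrix}
  b_1 t & g \\
  g     & b_2 t
\end{pmatrix},
\qquad
P_{\text{diabatic}} = \exp\!\left( -\frac{2\pi g^{2}}{\hbar\,\lvert b_1 - b_2 \rvert} \right),
```

where $P_{\text{diabatic}}$ is the probability that a system evolving from $t \to -\infty$ to $t \to +\infty$ remains in its initial diabatic state. A slow sweep (small $\lvert b_1 - b_2 \rvert$) drives this probability toward zero, which is the adiabatic limit.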

preprint, 2020, arXiv

Speech2Action: Cross-modal Supervision for Action Recognition

Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.

preprint, 2020, arXiv

Sum Secret Key Rate Maximization for TDD Multi-User Massive MIMO Wireless Networks

Physical-layer key generation (PKG) based on channel reciprocity has recently emerged as a new technique to establish secret keys between devices. Most works focus on pairwise communication scenarios with single or small-scale antennas. However, the fifth generation (5G) wireless communications employ massive multiple-input multiple-output (MIMO) to support multiple users simultaneously, bringing serious overhead of reciprocal channel acquisition. This paper presents a multi-user secret key generation in massive MIMO wireless networks. We provide a beam domain channel model, in which different elements represent the channel gains from different transmit directions to different receive directions. Based on this channel model, we analyze the secret key rate and derive a closed-form expression under independent channel conditions. To maximize the sum secret key rate, we provide the optimal conditions for the Kronecker product of the precoding and receiving matrices and propose an algorithm to generate these matrices with pilot reuse. The proposed optimization design can significantly reduce the pilot overhead of the reciprocal channel state information acquisition. Furthermore, we analyze the security under the channel correlation between user terminals (UTs), and propose a low overhead multi-user secret key generation with non-overlapping beams between UTs. Simulation results demonstrate the near optimal performance of the proposed precoding and receiving matrices design and the advantages of the non-overlapping beam allocation.

preprint, 2020, arXiv

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval, the task of searching for content within a corpus of videos using natural language queries. This report summarizes the results of the first edition of the challenge together with the findings of the participants.

preprint, 2020, arXiv

TNT: Target-driveN Trajectory Prediction

Predicting the future behavior of moving agents is essential for real world applications. It is challenging as the intent of the agent and the corresponding behavior are unknown and intrinsically multimodal. Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states. This leads to our target-driven trajectory prediction (TNT) framework. TNT has three stages which are trained end-to-end. It first predicts an agent's potential target states $T$ steps into the future, by encoding its interactions with the environment and the other agents. TNT then generates trajectory state sequences conditioned on targets. A final stage estimates trajectory likelihoods and a final compact set of trajectory predictions is selected. This is in contrast to previous work which models agent intents as latent variables, and relies on test-time sampling to generate diverse trajectories. We benchmark TNT on trajectory prediction of vehicles and pedestrians, where we outperform state-of-the-art on Argoverse Forecasting, INTERACTION, Stanford Drone and an in-house Pedestrian-at-Intersection dataset.

preprint2020arXiv

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind. A major contributing factor has been the prohibitive cost of annotating videos frame-by-frame. In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels, which are significantly easier to annotate. Our method leverages per-frame person detectors which have been trained on large image datasets within a Multiple Instance Learning framework. We show how our method can be applied in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid, using a novel probabilistic variant of MIL in which we estimate the uncertainty of each prediction. Furthermore, we report the first weakly-supervised results on the AVA dataset and state-of-the-art results among weakly-supervised methods on UCF101-24.

preprint2020arXiv

Unsupervised Learning of Object Structure and Dynamics from Videos

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.

preprint2020arXiv

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird-eye images and encode them with convolutional neural networks (ConvNets), our approach operates on a vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also outperforms the state of the art on the Argoverse dataset.