Source author record

Xinyu Zhou

Xinyu Zhou appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision · Machine Learning · Methodology · Robotics · Artificial Intelligence · Computer Science and Game Theory · Cryptography and Security

Catalog footprint

What is connected

16 works

7 topics

4 close collaborators

preprint · 2026 · arXiv

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

preprint · 2026 · arXiv

The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents

Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.

preprint · 2026 · arXiv

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

Multimodal visual object tracking can be divided into several kinds of tasks (e.g. RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.

preprint · 2022 · arXiv

A New Learning Paradigm for Stochastic Configuration Network: SCN+

The learning using privileged information (LUPI) paradigm, which pioneered the teacher-student interaction mechanism, lets learning models use additional information in the training stage. This paper is the first to propose an incremental learning algorithm with the LUPI paradigm for the stochastic configuration network (SCN), named SCN+. This novel algorithm can incorporate privileged information into SCN in the training stage, which provides a new method to train SCN. Moreover, the convergence of the algorithm is studied in this paper. Finally, experimental results indicate that SCN+ indeed performs favorably.

preprint · 2022 · arXiv

A Stochastic Process Model for Time Warping Functions

A time warping function provides a mathematical representation to measure phase variability in functional data. Recent studies have developed various approaches to estimate optimal warping between functions and provide non-Euclidean models. However, a principled, linear, generative model on time warping functions is still under-explored. This is a highly challenging problem because the space of warping functions is non-linear with the conventional Euclidean metric. To address this problem, we propose a stochastic process model for time warping functions, where the key is to define a linear, inner-product structure on the time warping space and then transform the warping functions into a sub-space of the $\mathbb L^2$ Euclidean space. With certain constraints on the warping functions, this transformation is an isometric isomorphism. In the transformed space, we adopt the $\mathbb L^2$ basis in the Hilbert space for representation. This new framework can easily build generative models on time warping by using different types of stochastic processes. It can also be used to conduct statistical inferences such as functional PCA, functional ANOVA, and functional regressions. Furthermore, we demonstrate the effectiveness of this new framework by using it as a new prior in Bayesian registration, and propose an efficient gradient method to address the important maximum a posteriori estimation. We illustrate the new Bayesian method using simulations which properly characterize nonuniform and correlated constraints in the time domain. Finally, we apply the new framework to the famous Berkeley growth data and obtain reasonable results on modeling, resampling, group comparison, and classification analysis.

preprint · 2022 · arXiv

Nanorobot queue: Cooperative treatment of cancer based on team member communication and image processing

Although nanorobots have been used clinically for work such as gastroscopy, photoacoustic tomography has been proposed for controlling nanorobots to deliver drugs at designated delivery points in real time, and there are cases of eliminating "superbacteria" in blood through nanorobots, most of these technologies are immature: they have low efficiency or low accuracy, or they cannot be mass produced. As a result, the most effective way to treat cancer at this stage is through chemotherapy and radiotherapy. Patients suffer and cannot be cured. Therefore, this paper proposes an ideal model of a treatment method that can completely cure cancer: a cooperative treatment method based on a nanorobot queue that uses team member communication and computer vision image classification (target detection).

preprint · 2022 · arXiv

Statistical Depth for Point Process via the Isometric Log-Ratio Transformation

Statistical depth, a useful tool to measure the center-outward rank of multivariate and functional data, is still under-explored in temporal point processes. Recent studies on point process depth proposed a weighted product of two terms: one indicates the depth of the cardinality of the process, and the other characterizes the conditional depth of the temporal events given the cardinality. The second term is a great challenge because of the apparent nonlinear structure of event times, and so far only basic parametric representations such as Gaussian and Dirichlet densities have been adopted in the definitions. However, these simplified forms ignore the underlying distribution of the process events, which makes the methods difficult to interpret and to apply to complicated patterns. To deal with these problems, we propose in this paper a distribution-based approach to the conditional depth via the well-known Isometric Log-Ratio (ILR) transformation on the inter-event times. The new depth, called the ILR depth, is first defined for the homogeneous Poisson process by using the density function on the transformed space. The definition is then extended to any general point process via a time-rescaling transformation. We illustrate the ILR depth using simulations of Poisson and non-Poisson processes and demonstrate its superiority over previous methods. We also thoroughly examine its mathematical properties and asymptotics in large samples. Finally, we apply the ILR depth to a real dataset, and the result clearly shows the effectiveness of the new method.

preprint · 2021 · arXiv

RPPLNS: Pay-per-last-N-shares with a Randomised Twist

"Pay-per-last-$N$-shares" (PPLNS) is one of the most common payout strategies used by mining pools in Proof-of-Work (PoW) cryptocurrencies. As with any payment scheme, it is imperative to study issues of incentive compatibility of miners within the pool. For PPLNS this question has only been partially answered; we know that reasonably-sized miners within a PPLNS pool prefer following the pool protocol over employing specific deviations. In this paper, we present a novel modification to PPLNS where we randomise the protocol in a natural way. We call our protocol "Randomised pay-per-last-$N$-shares" (RPPLNS), and note that the randomised structure of the protocol greatly simplifies the study of its incentive compatibility. We show that RPPLNS maintains the strengths of PPLNS (i.e., fairness, variance reduction, and resistance to pool hopping), while also being robust against a richer class of strategic mining than what has been shown for PPLNS.
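For readers unfamiliar with the baseline scheme, the following is a minimal Python sketch of a plain PPLNS payout, assuming the common formulation in which a block reward is split in proportion to each miner's count among the last N submitted shares. It does not implement the randomised RPPLNS variant, whose details are not given in the abstract. The function name `pplns_payout` and the list-of-miner-ids share log are hypothetical choices made for illustration.

```python
from collections import Counter, deque


def pplns_payout(share_log, n, reward):
    """Split `reward` among miners in proportion to their shares in the last n.

    share_log: miner ids in submission order (hypothetical representation).
    n: window size N, a positive integer.
    reward: block reward to distribute.
    """
    # A deque with maxlen=n keeps only the n most recent shares.
    window = deque(share_log, maxlen=n)
    if not window:
        return {}
    counts = Counter(window)
    total = len(window)
    return {miner: reward * count / total for miner, count in counts.items()}


payouts = pplns_payout(["a", "b", "a", "c", "a", "b"], n=4, reward=8.0)
print(payouts)
```

Shares older than the window earn nothing, which is the property usually credited with discouraging pool hopping.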

preprint · 2020 · arXiv

DPGN: Distribution Propagation Graph Network for Few-shot Learning

Most graph-network-based meta-learning approaches model instance-level relation of examples. We extend this idea further to explicitly model the distribution-level relation of one example to all other examples in a 1-vs-N manner. We propose a novel approach named distribution propagation graph network (DPGN) for few-shot learning. It conveys both the distribution-level relations and instance-level relations in each few-shot learning task. To combine the distribution-level relations and instance-level relations for all examples, we construct a dual complete graph network which consists of a point graph and a distribution graph with each node standing for an example. Equipped with dual graph architecture, DPGN propagates label information from labeled examples to unlabeled examples within several update generations. In extensive experiments on few-shot learning benchmarks, DPGN outperforms state-of-the-art results by a large margin of 5% to 12% under the supervised setting and 7% to 13% under the semi-supervised setting. Code will be released.

preprint · 2020 · arXiv

Learning Delicate Local Representations for Multi-Person Pose Estimation

In this paper, we propose a novel method called Residual Steps Network (RSN). RSN aggregates features with the same spatial size (intra-level features) efficiently to obtain delicate local representations, which retain rich low-level spatial information and result in precise keypoint localization. Additionally, we observe that the output features contribute differently to final performance. To tackle this problem, we propose an efficient attention mechanism, Pose Refine Machine (PRM), to make a trade-off between local and global representations in output features and further refine the keypoint locations. Our approach won 1st place in the COCO Keypoint Challenge 2019 and achieves state-of-the-art results on both COCO and MPII benchmarks, without using extra training data or pretrained models. Our single model achieves 78.6 on COCO test-dev and 93.0 on the MPII test dataset. Ensembled models achieve 79.2 on COCO test-dev and 77.1 on the COCO test-challenge dataset. The source code is publicly available for further research at https://github.com/caiyuanhao1998/RSN/

preprint · 2016 · arXiv

Effective Quantization Methods for Recurrent Neural Networks

Reducing bit-widths of weights, activations, and gradients of a Neural Network can shrink its storage size and memory usage, and also allow for faster training and inference by exploiting bitwise operations. However, previous attempts at quantization of RNNs show considerable performance degradation when using low bit-width weights and activations. In this paper, we propose methods to quantize the structure of gates and interlinks in LSTM and GRU cells. In addition, we propose balanced quantization methods for weights to further reduce performance degradation. Experiments on PTB and IMDB datasets confirm the effectiveness of our methods, as the performance of our models matches or surpasses the previous state-of-the-art of quantized RNN.

preprint · 2016 · arXiv

Exploiting Local Structures with the Kronecker Layer in Convolutional Networks

In this paper, we propose and study a technique to reduce the number of parameters and computation time in convolutional neural networks. We use the Kronecker product to exploit the local structures within convolution and fully-connected layers, by replacing the large weight matrices by combinations of multiple Kronecker products of smaller matrices. Just as the Kronecker product is a generalization of the outer product from vectors to matrices, our method is a generalization of the low rank approximation method for convolutional neural networks. We also introduce combinations of different shapes of Kronecker product to increase modeling capacity. Experiments on SVHN, scene text recognition and ImageNet datasets demonstrate that we can achieve $3.3 \times$ speedup or $3.6 \times$ parameter reduction with less than 1% drop in accuracy, showing the effectiveness and efficiency of our method. Moreover, the computation efficiency of the Kronecker layer makes using larger feature maps possible, which in turn enables us to outperform the previous state-of-the-art on both SVHN (digit recognition) and CASIA-HWDB (handwritten Chinese character recognition) datasets.
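To make the parameter saving concrete, here is a minimal Python sketch of a single Kronecker product of two small matrices, with a comparison of parameter counts against the dense matrix it stands in for. It is an illustration of the general idea only: the abstract describes combinations of multiple Kronecker products, which are not reproduced here, and the helper names `kron` and `param_counts` are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    # Output row (r * rows_b + i), column (c * cols_b + j) holds a[r][c] * b[i][j].
    return [
        [a_val * b[i][j] for a_val in a_row for j in range(cols_b)]
        for a_row in a
        for i in range(rows_b)
    ]


def param_counts(shape_a, shape_b):
    """Return (dense, factored) parameter counts for W approximated by A (x) B."""
    (m1, n1), (m2, n2) = shape_a, shape_b
    dense = (m1 * m2) * (n1 * n2)      # entries of the full weight matrix
    factored = m1 * n1 + m2 * n2       # entries of the two small factors
    return dense, factored


small = kron([[1, 2], [3, 4]], [[0, 1], [1, 0]])
for row in small:
    print(row)

print(param_counts((16, 16), (16, 16)))
```

In this toy setting a 256 x 256 weight matrix is represented by two 16 x 16 factors. A sum of several such products trades some of that saving for more modeling capacity.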

preprint · 2016 · arXiv

Incidental Scene Text Understanding: Recent Progresses on ICDAR 2015 Robust Reading Competition Challenge 4

Unlike focused texts present in natural images, which are captured with the user's intention and intervention, incidental texts usually exhibit much more diversity, variability and complexity, thus posing significant difficulties and challenges for scene text detection and recognition algorithms. The ICDAR 2015 Robust Reading Competition Challenge 4 was launched to assess the performance of existing scene text detection and recognition methods on incidental texts as well as to stimulate novel ideas and solutions. This report briefly introduces our strategies for this challenging problem and compares them with prior art in this field.

preprint · 2016 · arXiv

Scene Text Detection via Holistic, Multi-Channel Prediction

Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, the vast majority of the existing methods detect text within local regions, typically through extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially excludes the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm directly runs on full images and produces global, pixel-wise prediction maps, in which detections are subsequently formed. To better make use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated, with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently-released, large-scale dataset COCO-Text.

preprint · 2016 · arXiv

Training Bit Fully Convolutional Network for Fast Semantic Segmentation

Fully convolutional neural networks give accurate, per-pixel prediction for input images and have applications like semantic segmentation. However, a typical FCN usually requires lots of floating point computation and large run-time memory, which effectively limits its usability. We propose a method to train Bit Fully Convolution Network (BFCN), a fully convolutional neural network that has low bit-width weights and activations. Because most of its computation-intensive convolutions are accomplished between low bit-width numbers, a BFCN can be accelerated by an efficient bit-convolution implementation. On CPU, the dot product operation between two bit vectors can be reduced to bitwise operations and popcounts, which can offer much higher throughput than 32-bit multiplications and additions. To validate the effectiveness of BFCN, we conduct experiments on the PASCAL VOC 2012 semantic segmentation task and Cityscapes. Our BFCN with 1-bit weights and 2-bit activations, which runs 7.8x faster on CPU or requires less than 1% resources on FPGA, can achieve performance comparable to the 32-bit counterpart.
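The reduction of a bit-vector dot product to bitwise operations and popcounts can be sketched in a few lines of Python. This is a minimal illustration assuming unsigned values stored as bit planes packed into integers; it is not the BFCN kernel itself, and the helper names (`pack_bits`, `to_planes`, `bit_dot`, `bit_dot_multibit`) are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one int; bit i holds bits[i]."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word


def to_planes(values, n_bits):
    """Split unsigned n_bits-wide values into bit planes; plane k holds bit k."""
    return [pack_bits([(v >> k) & 1 for v in values]) for k in range(n_bits)]


def bit_dot(a_word, b_word):
    """Dot product of two {0,1} vectors: bitwise AND followed by a popcount."""
    return bin(a_word & b_word).count("1")


def bit_dot_multibit(a_planes, b_planes):
    """Dot product of multi-bit vectors as a weighted sum of plane-wise popcounts."""
    total = 0
    for i, a in enumerate(a_planes):
        for j, b in enumerate(b_planes):
            # Plane i of a and plane j of b contribute with weight 2**(i + j).
            total += bit_dot(a, b) << (i + j)
    return total


activations = [3, 1, 2]   # 2-bit values
weights = [1, 0, 1]       # 1-bit values
result = bit_dot_multibit(to_planes(activations, 2), to_planes(weights, 1))
print(result)  # 3*1 + 1*0 + 2*1 = 5
```

On real hardware each plane would be a machine word and the popcount a single instruction, which is where the throughput gain over 32-bit multiply-adds comes from.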

preprint · 2015 · arXiv

ICDAR 2015 Text Reading in the Wild Competition

Recently, text detection and recognition in natural scenes have become increasingly popular in the computer vision community as well as the document analysis community. However, the majority of the existing ideas, algorithms and systems are specifically designed for English. This technical report presents the final results of the ICDAR 2015 Text Reading in the Wild (TRW 2015) competition, which aims at establishing a benchmark for assessing detection and recognition algorithms devised for both Chinese and English scripts and providing a playground for researchers from the community. In this article, we describe in detail the dataset, tasks, evaluation protocols and participants of this competition, and report the performance of the participating methods. Moreover, promising directions for future research are discussed.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint