Source author record

Xinyu Huang

Xinyu Huang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, eess.IV, Emerging Technologies, Machine Learning, Multimedia, Networking and Internet Architecture, Robotics, eess.SP, Information Theory, math.IT

Catalog footprint

What is connected

23 works

10 topics

4 close collaborators


preprint · 2026 · arXiv

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.

preprint · 2025 · arXiv

Trust in LLM-controlled Robotics: a Survey of Security Threats, Defenses and Challenges

The integration of Large Language Models (LLMs) into robotics has revolutionized the ability of robots to interpret complex human commands and execute sophisticated tasks. However, this paradigm shift introduces critical security vulnerabilities stemming from the "embodiment gap", a discord between the LLM's abstract reasoning and the physical, context-dependent nature of robotics. While security for text-based LLMs is an active area of research, existing solutions are often insufficient to address the unique threats to embodied robotic agents, where malicious outputs manifest not merely as harmful text but as dangerous physical actions. In this work, we present a systematic survey, summarizing the emerging threat landscape and corresponding defense strategies for LLM-controlled robotics. Specifically, we discuss a comprehensive taxonomy of attack vectors, covering topics such as jailbreaking, backdoor attacks, and multi-modal prompt injection. In response, we analyze and categorize a range of defense mechanisms, from formal safety specifications and runtime enforcement to multi-LLM oversight and prompt hardening. Furthermore, we review key datasets and benchmarks used to evaluate the robustness of these embodied systems. By synthesizing current research, this work highlights the urgent need for context-aware security solutions and provides a foundational roadmap for the development of safe, secure, and reliable LLM-controlled robotics.

preprint · 2022 · arXiv

A Comparative Study of Gastric Histopathology Sub-size Image Classification: from Linear Regression to Visual Transformer

Gastric cancer is the fifth most common cancer in the world and the fourth most deadly. Early detection guides the treatment of gastric cancer. Computer technology has advanced rapidly and now assists physicians in diagnosing pathological images of gastric cancer. Ensemble learning is a way to improve the accuracy of algorithms, and finding multiple learning models of complementary types is the basis of ensemble learning. This experimental platform explores the complementarity of sub-size pathology image classifiers when machine performance is insufficient. We choose seven classical machine learning classifiers and four deep learning classifiers for classification experiments on the GasHisSDB database. The classical machine learning algorithms use five different types of extracted image features, matched with multiple classifier algorithms. For deep learning, we choose three convolutional neural network classifiers and, in addition, a novel Transformer-based classifier. Running a large number of classical machine learning and deep learning methods on this platform shows that different classifiers perform differently on GasHisSDB. Among classical machine learning models, some classifiers classify the Abnormal category very well, while others excel at classifying the Normal category. Among deep learning models, several are likewise complementary. When machine performance is insufficient, suitable classifiers can be selected for ensemble learning. This experimental platform demonstrates that multiple classifiers are indeed complementary and can improve the efficiency of ensemble learning. This can better assist doctors in diagnosis, improve the detection of gastric cancer, and increase the cure rate.

preprint · 2022 · arXiv

A State-of-the-art Survey of Artificial Neural Networks for Whole-slide Image Analysis: from Popular Convolutional Neural Networks to Potential Visual Transformers

To increase the objectivity and accuracy of pathologists' work, artificial neural network (ANN) methods are widely applied to the segmentation, classification, and detection of histopathological whole-slide images (WSIs). In this paper, WSI analysis methods based on ANNs are reviewed. First, the development status of WSI and ANN methods is introduced. Second, we summarize the common ANN methods. Next, we discuss publicly available WSI datasets and evaluation metrics. The ANN architectures for WSI processing are then divided into classical neural networks and deep neural networks (DNNs) and analyzed. Finally, the application prospects of these analytical methods in this field are discussed, with Visual Transformers identified as an important potential method.

preprint · 2022 · arXiv

A State-of-the-art Survey of U-Net in Microscopic Image Analysis: from Simple Usage to Structure Mortification

Image analysis technology is used to overcome the shortcomings of traditional manual methods in disease analysis, wastewater treatment, and environmental change monitoring, and convolutional neural networks (CNNs) play an important role in microscopic image analysis. Image segmentation is an important step in detection, tracking, monitoring, feature extraction, modeling, and analysis, and U-Net has been increasingly applied to microscopic image segmentation. This paper comprehensively reviews the development history of U-Net and analyzes the research results of the segmentation methods and related papers that have appeared since U-Net emerged. It first summarizes the improved variants of U-Net and then describes the significance of existing image segmentation techniques and the improvements introduced to them over the years. Finally, focusing on the different strategies that papers use to improve U-Net, it reviews the related work for each application target by detailed technical category to facilitate future research. Researchers can thereby follow how the technology has developed and keep up with future trends in this interdisciplinary field.

preprint · 2022 · arXiv

Analysis of MC Systems Employing Receivers Covered by Heterogeneous Receptors

This paper investigates the channel impulse response (CIR), i.e., the molecule hitting rate, of a molecular communication (MC) system employing an absorbing receiver (RX) covered by multiple non-overlapping receptors. In this system, receptors are heterogeneous, i.e., they may have different sizes and arbitrary locations. Furthermore, we consider two types of transmitter (TX), namely a point TX and a membrane fusion (MF)-based spherical TX. We assume that the point TX, or the center of the MF-based TX, is at a fixed distance from the center of the RX. Given this fixed distance, the TX can be at different locations, and the CIR of the RX depends on the exact location of the TX. By averaging over all possible TX locations, we analyze the expected molecule hitting rate at the RX as a function of the sizes and locations of the receptors, assuming that molecule degradation may occur during the propagation of the signaling molecules. Notably, our analysis is valid for different numbers, a wide range of sizes, and arbitrary locations of the receptors, and its accuracy is confirmed via particle-based simulations. Our numerical results show that the expected number of absorbed molecules at the RX increases with the number of receptors when the total area on the RX surface covered by receptors is fixed. Based on the derived analytical expressions, we compare different geometric receptor distributions by examining the expected number of absorbed molecules at the RX, and show that evenly distributed receptors result in more absorbed molecules than other distributions. We further compare three models that combine different types of TXs and RXs.

preprint · 2022 · arXiv

Analysis of Receiver Covered by Heterogeneous Receptors in Molecular Communications

This paper analyzes the channel impulse response of an absorbing receiver (RX) covered by multiple non-overlapping heterogeneous receptors with different sizes and arbitrary locations in a molecular communication system. In this system, a point transmitter (TX) is assumed to be uniformly located on a virtual sphere at a fixed distance from the RX. Considering molecule degradation during the propagation from the TX to the RX, the expected molecule hitting rate at the RX over varying locations of the TX is analyzed as a function of the size and location of each receptor. Notably, this analytical result is applicable for different numbers, sizes, and locations of receptors, and its accuracy is demonstrated via particle-based simulations. Numerical results show that (i) the expected number of absorbed molecules at the RX increases with an increasing number of receptors, when the total area of receptors on the RX surface is fixed, and (ii) evenly distributed receptors lead to the largest expected number of absorbed molecules.
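As a point of reference for the receptor-covered RX, the fully-absorbing RX (the whole surface absorbs) has a well-known closed-form hitting rate. The sketch below evaluates that baseline and checks it numerically; it is not the heterogeneous-receptor expression derived in the paper, and the parameter values (r = 2, a = 1, D = 1, k = 0.5 in arbitrary units) are illustrative assumptions. Because a fully-absorbing sphere is rotationally symmetric, averaging over TX locations on the virtual sphere leaves this baseline unchanged.

```python
import math

def hitting_rate(t: float, r: float, a: float, D: float, k: float = 0.0) -> float:
    """Molecule hitting rate at a fully-absorbing sphere of radius a, for an impulse
    released by a point TX at distance r (> a) from the sphere's center, with
    diffusion coefficient D and first-order degradation rate k."""
    if t <= 0:
        return 0.0
    d = r - a  # TX-to-surface distance
    return ((a / r) * d / math.sqrt(4 * math.pi * D * t ** 3)
            * math.exp(-d ** 2 / (4 * D * t) - k * t))

def absorbed_fraction_limit(r: float, a: float, D: float, k: float) -> float:
    """Fraction of molecules eventually absorbed (t -> infinity)."""
    return (a / r) * math.exp(-(r - a) * math.sqrt(k / D))

def absorbed_fraction_numeric(T: float, r: float, a: float, D: float,
                              k: float = 0.0, steps: int = 200_000) -> float:
    """Midpoint-rule integral of hitting_rate over [0, T]."""
    dt = T / steps
    return dt * sum(hitting_rate((i + 0.5) * dt, r, a, D, k) for i in range(steps))
```

With those values, the numerical integral over a long horizon should come out close to (a/r)·exp(-(r - a)·sqrt(k/D)), approximately 0.2465.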

preprint · 2022 · arXiv

Application of Graph Based Features in Computer Aided Diagnosis for Histopathological Image Classification of Gastric Cancer

The gold standard for gastric cancer detection is gastric histopathological image analysis, but existing histopathological detection and diagnosis have certain drawbacks. In this paper, building on computer-aided diagnosis systems, graph-based features are applied to the analysis of gastric cancer histopathology microscopic images, and a classifier is used to distinguish gastric cancer cells from benign cells. First, image segmentation is performed to find the tissue region; cell nuclei are then extracted using the k-means method, a minimum spanning tree (MST) is constructed, and graph-based features of the MST are extracted. These graph-based features are then fed into the classifier. In this study, different segmentation methods are compared in the tissue segmentation stage, namely Level-Set, Otsu thresholding, watershed, SegNet, U-Net, and Trans-U-Net segmentation; graph-based features, Red, Green, Blue features, Grey-Level Co-occurrence Matrix features, Histograms of Oriented Gradient features, and Local Binary Patterns features are compared in the feature extraction stage; and Radial Basis Function (RBF) Support Vector Machine (SVM), Linear SVM, Artificial Neural Network, Random Forests, k-Nearest Neighbor, VGG16, and Inception-V3 are compared in the classifier stage. Using U-Net to segment tissue areas, then extracting graph-based features, and finally classifying with an RBF SVM gives the best result, 94.29%.
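The graph-feature step can be sketched in a few lines: treat each nucleus (for example, a centroid from the k-means segmentation) as a vertex, build the minimum spanning tree of the complete Euclidean graph with Prim's algorithm, and summarize the MST's edge lengths. The centroid representation and the four summary statistics below are illustrative assumptions; the paper's exact graph feature set is not listed in this abstract.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def mst_edges(points: List[Point]) -> List[Tuple[int, int, float]]:
    """Prim's algorithm on the complete Euclidean graph, O(n^2)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge weight connecting each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # add the cheapest vertex that is not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # relax edges from u to the remaining vertices
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges

def mst_features(points: List[Point]) -> Dict[str, float]:
    """Hypothetical feature vector: summary statistics of MST edge lengths."""
    lengths = [w for _, _, w in mst_edges(points)]
    if not lengths:
        raise ValueError("need at least two nuclei")
    mean = sum(lengths) / len(lengths)
    var = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    return {"mean_edge": mean, "std_edge": math.sqrt(var),
            "min_edge": min(lengths), "max_edge": max(lengths)}
```

The O(n²) Prim variant is chosen for clarity; for thousands of nuclei, computing a Delaunay triangulation first (the Euclidean MST is a subgraph of it) and then running Kruskal's algorithm would scale better.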

preprint · 2022 · arXiv

Application of Transfer Learning and Ensemble Learning in Image-level Classification for Breast Histopathology

Background: Breast cancer has the highest prevalence among women globally. The classification and diagnosis of breast cancer and its histopathological images have long been a focus of clinical concern. In Computer-Aided Diagnosis (CAD), traditional classification models mostly use a single network to extract features, which has significant limitations. In addition, many networks are trained and optimized on patient-level datasets, ignoring lower-level data labels. Method: This paper proposes a deep ensemble model based on image-level labels for the binary classification of benign and malignant lesions in breast histopathological images. First, the BreaKHis dataset is randomly divided into training, validation, and test sets. Then, data augmentation techniques are used to balance the number of benign and malignant samples. Third, considering the performance of transfer learning and the complementarity between the networks, VGG16, Xception, ResNet50, and DenseNet201 are selected as the base classifiers. Result: In the ensemble network model with accuracy as the weight, image-level binary classification achieves an accuracy of 98.90%. To verify the capabilities of the method, the latest Transformer and Multilayer Perceptron (MLP) models were compared experimentally on the same dataset. Our model outperforms them by 5% to 20%, emphasizing the far-reaching significance of ensemble models in classification tasks. Conclusion: This research focuses on improving the model's classification performance with an ensemble algorithm. Transfer learning plays an essential role for small datasets, improving training speed and accuracy. Our model outperforms many existing approaches in accuracy, providing a method for auxiliary medical diagnosis.
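The "accuracy as the weight" ensemble can be read as weighted soft voting: each base classifier's predicted class probabilities are weighted by its validation accuracy, and the weighted average decides the label. The sketch below assumes that reading and normalizes the weights to sum to one; the probabilities and accuracies used with it are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence

def weighted_soft_vote(probs: Sequence[Sequence[float]],
                       accuracies: Sequence[float]) -> List[float]:
    """Fuse per-model class probabilities, weighting each model by its
    validation accuracy (weights normalized to sum to 1)."""
    if not probs or len(probs) != len(accuracies):
        raise ValueError("need one accuracy per model")
    total = sum(accuracies)
    weights = [a / total for a in accuracies]
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]

def predict(probs: Sequence[Sequence[float]], accuracies: Sequence[float]) -> int:
    """Index of the class with the highest fused probability."""
    fused = weighted_soft_vote(probs, accuracies)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, four base models with outputs [benign, malignant] of [0.6, 0.4], [0.3, 0.7], [0.45, 0.55], [0.2, 0.8] and hypothetical accuracies 0.90, 0.95, 0.92, 0.96 fuse to a malignant prediction (class 1). Normalizing the weights keeps the fused output a valid probability distribution.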

preprint · 2022 · arXiv

EBHI: A New Enteroscope Biopsy Histopathological H&E Image Dataset for Image Classification Evaluation

Background and purpose: Colorectal cancer has become the third most common cancer worldwide, accounting for approximately 10% of cancer patients. Early detection of the disease is important for the treatment of colorectal cancer patients. Histopathological examination is the gold standard for screening colorectal cancer. However, the current lack of histopathological image datasets of colorectal cancer, especially of enteroscope biopsies, hinders the accurate evaluation of computer-aided diagnosis techniques. Methods: A new publicly available Enteroscope Biopsy Histopathological H&E Image Dataset (EBHI) is published in this paper. To demonstrate the effectiveness of the EBHI dataset, we use several machine learning, convolutional neural network, and novel transformer-based classifiers for experimentation and evaluation, using images at 200x magnification. Results: Experimental results show that deep learning methods perform well on the EBHI dataset. Traditional machine learning methods achieve a maximum accuracy of 76.02%, and deep learning methods achieve a maximum accuracy of 95.37%. Conclusion: To the best of our knowledge, EBHI is the first publicly available colorectal histopathology enteroscope biopsy dataset with four magnifications and five types of images of tumor differentiation stages, totaling 5532 images. We believe that EBHI could attract researchers to explore new classification algorithms for the automated diagnosis of colorectal cancer, which could help physicians and patients in clinical settings.

preprint · 2022 · arXiv

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, image-text pairs that co-occur on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to exploit additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting model capacity. Inspired by the observation that texts contain incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets at a small extra computational cost.
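One simple way to turn captions into multi-label targets, shown here only as an assumption about the general idea and not as IDEA's actual tag-extraction procedure, is to match caption words against a fixed tag vocabulary and train a tag-prediction head with binary cross-entropy. The vocabulary and example caption are hypothetical.

```python
import math
import re
from typing import List, Sequence

def tags_from_text(text: str, vocab: Sequence[str]) -> List[float]:
    """Multi-hot target: 1.0 for each single-word vocabulary tag that appears as a
    word in the text (exact match only, so plurals such as "dogs" do not count)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1.0 if tag in words else 0.0 for tag in vocab]

def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy computed stably from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # equals -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(targets)
```

For instance, with the hypothetical vocabulary ["dog", "cat", "beach"], the caption "A dog runs on the beach." yields the target [1.0, 0.0, 1.0]. The logit form of the loss avoids computing log(0) when the sigmoid saturates.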

preprint · 2022 · arXiv

IL-MCAM: An interactive learning and multi-channel attention mechanism-based weakly supervised colorectal histopathology image classification approach

In recent years, colorectal cancer has become one of the most significant diseases endangering human health. Deep learning methods are increasingly important for the classification of colorectal histopathology images. However, existing approaches focus on end-to-end automatic classification by computers rather than on human-computer interaction. In this paper, we propose the IL-MCAM framework, which is based on attention mechanisms and interactive learning. The proposed IL-MCAM framework includes two stages: automatic learning (AL) and interactive learning (IL). In the AL stage, a multi-channel attention mechanism model, containing three different attention mechanism channels and convolutional neural networks, is used to extract multi-channel features for classification. In the IL stage, the framework interactively and continuously adds misclassified images to the training set, which improves the classification ability of the MCAM model. We carried out a comparison experiment on our dataset and an extended experiment on the HE-NCT-CRC-100K dataset to verify the performance of the proposed IL-MCAM framework, achieving classification accuracies of 98.98% and 99.77%, respectively. In addition, we conducted an ablation experiment and an interchangeability experiment to verify the ability and interchangeability of the three channels. The experimental results show that the proposed IL-MCAM framework performs excellently on colorectal histopathological image classification tasks.

preprint · 2022 · arXiv

OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion

A well-known challenge in applying deep-learning methods to omnidirectional images is spherical distortion. In dense regression tasks such as depth estimation, where structural details are required, applying a vanilla CNN layer to the distorted 360 image results in undesired information loss. In this paper, we propose a 360 monocular depth estimation pipeline, OmniFusion, to tackle the spherical distortion issue. Our pipeline transforms a 360 image into less-distorted perspective patches (i.e., tangent images) to obtain patch-wise predictions via a CNN, and then merges the patch-wise results into the final output. To handle the discrepancy between patch-wise predictions, a major issue affecting merging quality, we propose a new framework with the following key components. First, a geometry-aware feature fusion mechanism combines 3D geometric features with 2D image features to compensate for the patch-wise discrepancy. Second, a self-attention-based transformer architecture performs a global aggregation of patch-wise information, which further improves consistency. Last, an iterative depth refinement mechanism further refines the estimated depth based on the more accurate geometric features. Experiments show that our method greatly mitigates the distortion issue and achieves state-of-the-art performance on several 360 monocular depth estimation benchmark datasets.
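Tangent images are produced by the gnomonic projection, which maps a region of the sphere onto a plane touching it at a chosen tangent point. The sketch below is the standard forward gnomonic formula for a single (latitude, longitude) point in radians; the function name is hypothetical, and it illustrates only the geometry, not OmniFusion's patch sampling or fusion.

```python
import math
from typing import Optional, Tuple

def gnomonic(lat: float, lon: float, lat0: float, lon0: float) -> Optional[Tuple[float, float]]:
    """Project (lat, lon), in radians, onto the plane tangent to the unit sphere at
    (lat0, lon0). Returns None for points 90 degrees or more from the tangent point,
    which the gnomonic projection cannot represent."""
    dlon = lon - lon0
    # cosine of the angular distance between the point and the tangent point
    cos_c = (math.sin(lat0) * math.sin(lat)
             + math.cos(lat0) * math.cos(lat) * math.cos(dlon))
    if cos_c <= 0:
        return None
    x = math.cos(lat) * math.sin(dlon) / cos_c
    y = (math.cos(lat0) * math.sin(lat)
         - math.sin(lat0) * math.cos(lat) * math.cos(dlon)) / cos_c
    return x, y
```

Distortion grows with distance from the tangent point, and points on the far hemisphere have no gnomonic image at all, which is one reason a 360 image is covered by many tangent patches rather than a single one.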

preprint · 2022 · arXiv

Personalized QoE Enhancement for Adaptive Video Streaming: A Digital Twin-Assisted Scheme

In this paper, we present a digital twin (DT)-assisted adaptive video streaming scheme to enhance personalized quality-of-experience (PQoE). Since PQoE models are user-specific and time-varying, existing schemes based on universal and time-invariant PQoE models may suffer from performance degradation. To address this issue, we first propose a DT-assisted PQoE model construction method to obtain accurate user-specific PQoE models. Specifically, a user DT (UDT) is constructed for each user, which acquires and utilizes the user's data to accurately tune PQoE model parameters in real time. Next, given the obtained PQoE models, we formulate a resource management problem to maximize the overall long-term PQoE by taking the dynamics of users' locations, video content requests, and buffer statuses into account. To solve this problem, a deep reinforcement learning algorithm is developed to jointly determine segment version selection and communication and computing resource allocation. Simulation results on a real-world dataset demonstrate that the proposed scheme can effectively enhance PQoE compared with benchmark schemes.

preprint · 2022 · arXiv

Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation

We propose a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for symmetric and asymmetric objects alike. To the best of our knowledge, our system is among the first to utilize the camera pose information from SLAM to provide prior knowledge for tracking keypoints on symmetric objects, ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict a Gaussian covariance for each keypoint that captures the true error of the prediction; this covariance is therefore useful not only as a weight for the residuals in the system's optimization problems but also as a means of detecting harmful statistical outliers without choosing a manual threshold. Experiments show that our method performs competitively with the state of the art in 6DoF object pose estimation, at real-time speed. Our code, pre-trained models, and keypoint labels are available at https://github.com/rpng/suo_slam.
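A predicted covariance lets a keypoint residual be tested statistically rather than against a hand-tuned pixel threshold. One common way to do that, shown here as an assumption rather than as the paper's exact rule, is a chi-square gate on the squared Mahalanobis distance of a 2D residual; for 2 degrees of freedom the chi-square quantile has the closed form -2 ln(1 - p). The helper names and the default confidence level p = 0.95 are hypothetical.

```python
import math
from typing import Tuple

Cov2 = Tuple[Tuple[float, float], Tuple[float, float]]

def mahalanobis_sq(residual: Tuple[float, float], cov: Cov2) -> float:
    """r^T * Sigma^{-1} * r for a 2D residual and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    rx, ry = residual
    # inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det
    return (rx * (d * rx - b * ry) + ry * (-c * rx + a * ry)) / det

def chi2_quantile_2dof(p: float) -> float:
    """Inverse CDF of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

def is_outlier(residual: Tuple[float, float], cov: Cov2, p: float = 0.95) -> bool:
    """Flag the residual if it is unlikely under the predicted Gaussian."""
    return mahalanobis_sq(residual, cov) > chi2_quantile_2dof(p)
```

The same residual of (3, 4) pixels is rejected under a unit covariance but accepted when the network predicts a variance of 9 on each axis, which is the sense in which the gate adapts to the predicted uncertainty instead of using one fixed pixel threshold.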

preprint · 2022 · arXiv

TOD-CNN: An Effective Convolutional Neural Network for Tiny Object Detection in Sperm Videos

The detection of tiny objects in microscopic videos is a difficult problem, especially in large-scale experiments. For tiny objects (such as sperm) in microscopic videos, current detection methods struggle with blurry and irregular objects and with precise object localization. To address this, we present a convolutional neural network for tiny object detection (TOD-CNN), together with an underlying dataset of high-quality sperm microscopic videos (111 videos, more than 278,000 annotated objects), and a graphical user interface (GUI) designed to deploy and test the proposed model effectively. TOD-CNN is highly accurate, achieving 85.60% AP50 on real-time sperm detection in microscopic videos. To demonstrate the importance of sperm detection technology in sperm quality analysis, we compute relevant sperm quality evaluation metrics and compare them with the diagnoses of medical doctors.
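AP50 counts a detection as a true positive when its intersection-over-union (IoU) with a not-yet-matched ground-truth box is at least 0.5. The sketch below shows only that matching criterion, using greedy score-ordered matching on one image; the (x1, y1, x2, y2) box format and the function names are assumptions, and computing the full AP additionally requires accumulating precision and recall over all detections.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x2 > x1 and y2 > y1

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_at_50(dets: List[Tuple[Box, float]], gts: List[Box]) -> List[bool]:
    """Label detections TP (True) or FP (False) at IoU >= 0.5.
    Detections are processed, and returned, in descending score order."""
    used = [False] * len(gts)
    flags = []
    for box, _score in sorted(dets, key=lambda d: d[1], reverse=True):
        best_j, best_iou = -1, 0.5
        for j, g in enumerate(gts):
            if not used[j]:
                v = iou(box, g)
                if v >= best_iou:
                    best_j, best_iou = j, v
        if best_j >= 0:
            used[best_j] = True  # each ground-truth box can be matched only once
        flags.append(best_j >= 0)
    return flags
```

For tiny objects such as sperm heads, a few pixels of localization error already move IoU substantially, which is why the 0.5 threshold is a meaningful bar in this setting.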

preprint · 2021 · arXiv

Membrane Fusion-Based Transmitter Design for Molecular Communication Systems

This paper proposes a novel imperfect spherical transmitter (TX) model, namely the membrane fusion (MF)-based TX, which uses MF between a vesicle and the TX membrane to release molecules encapsulated within the vesicle. For the MF-based TX, the molecule release probability and the fraction of molecules released from the TX membrane are derived. Incorporating molecular degradation and a fully-absorbing receiver (RX), the end-to-end molecule hitting probability at the RX is also derived. A simulation framework for the MF-based TX is proposed, in which the release point on the TX membrane and the release time of each molecule are determined. The derived analytical expressions are validated with the aid of this simulation framework. Simulation results verify that a low MF probability or low vesicle mobility slows the release of molecules from the TX, extends the time required to reach the peak release probability, and reduces the end-to-end molecule hitting probability at the RX.

preprint · 2020 · arXiv

Channel Characterization for 1D Molecular Communication with Two Absorbing Receivers

This letter develops a one-dimensional (1D) diffusion-based molecular communication system to analyze channel responses between a single transmitter (TX) and two fully-absorbing receivers (RXs). Incorporating molecular degradation in the environment, rigorous analytical formulas are derived for i) the fraction of molecules absorbed, ii) the corresponding hitting rate, and iii) the asymptotic fraction of absorbed molecules as time approaches infinity at each RX, when an impulse of molecules is released at the TX. The derived analytical expressions are validated using particle-based simulations. Simulations also identify the distance ranges over which the two RXs do not affect each other's molecular absorption, and demonstrate that the mutual influence of two active RXs decreases as the degradation rate increases.
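The asymptotic fraction in item iii) can be illustrated with a standard splitting-probability calculation, assuming the TX lies on the line between the two RXs. With diffusion coefficient D and degradation rate k, the probability u(x) of eventually reaching RX2 satisfies D·u'' = k·u with u = 0 at RX1 and u = 1 at RX2. This is an independent sketch, not copied from the letter, whose time-varying expressions are not reproduced here; the function names are hypothetical.

```python
import math

def fraction_at_rx2(l1: float, l2: float, D: float, k: float) -> float:
    """Eventual fraction of an impulse absorbed at RX2 (distance l2 on one side of
    the TX) when an absorbing RX1 sits at distance l1 on the other side.
    Solves D*u'' = k*u on [-l1, l2] with u(-l1) = 0, u(l2) = 1, evaluated at the TX."""
    if k == 0:
        return l1 / (l1 + l2)  # no degradation: gambler's-ruin splitting probability
    alpha = math.sqrt(k / D)
    return math.sinh(alpha * l1) / math.sinh(alpha * (l1 + l2))

def fraction_single_rx(l2: float, D: float, k: float) -> float:
    """Same quantity with RX1 removed (a lone absorbing point at distance l2)."""
    return math.exp(-l2 * math.sqrt(k / D))
```

As k grows, the ratio fraction_at_rx2(l1, l2, D, k) / fraction_single_rx(l2, D, k) approaches 1, because molecules that would have wandered toward RX1 mostly degrade first. This agrees with the letter's observation that the mutual influence of two active RXs weakens as the degradation rate increases.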

preprint · 2020 · arXiv

Parameter Estimation in a Noisy 1D Environment via Two Absorbing Receivers

This paper investigates the estimation of different parameters, e.g., propagation distance and flow velocity, using two fully-absorbing receivers (RXs) in a one-dimensional (1D) environment. The time-varying number of absorbed molecules at each RX and the number of molecules absorbed in a time interval as time approaches infinity are derived. Noise molecules in this environment, released by sources other than the transmitter, are also considered. A novel estimation method, namely difference estimation (DE), is proposed to eliminate the effect of noise by using the difference between the received signals at the two RXs. For DE, the Cramér-Rao lower bound (CRLB) on the estimation variance is derived. Independent maximum likelihood estimation at each RX is also considered as a benchmark to show the performance advantage of DE. The derived analytical results are verified with the aid of particle-based simulation. Furthermore, numerical results show that DE attains the CRLB and is less sensitive to changes in noise than independent estimation at each RX.

preprint · 2020 · arXiv

Playback experience driven cross layer optimisation of APP, transport and MAC layer for video clients over long-term evolution system

In traditional communication systems, information from the APP (Application) layer, transport layer, and MAC (Media Access Control) layer is not fully exchanged, which inevitably leads to inconsistencies among the TCP congestion state, clients' requirements, and resource allocation. To solve this problem, we propose a joint optimization framework spanning the APP, transport, and MAC layers to improve video clients' playback experience and system throughput. First, a client-requirement-aware autonomous packet drop strategy, based on packet importance, channel condition, and playback status, is developed to decrease the network load and the probability of rebuffering events. Further, TCP (Transmission Control Protocol) state-aware downlink and uplink resource allocation schemes are proposed to achieve smooth video transmission and steady ACK (Acknowledgement) feedback, respectively. For the downlink scheme, the maximum transmission capacity requirement of each client is calculated from the ACK feedback information of the transport layer, to avoid allocating excessive resources to a client whose ACK feedback is blocked by a bad uplink channel condition. For the uplink scheme, RTO (Retransmission Timeout) and TCP congestion window information are used to indicate ACK scheduling priority. The simulation results show that our algorithm can significantly improve system throughput and clients' playback continuity with acceptable video quality.

preprint · 2020 · arXiv

Towards 5G: Joint Optimization of Video Segment Cache, Transcoding and Resource Allocation for Adaptive Video Streaming in a Muti-access Edge Computing Network

The caching and transcoding of the multi-access edge computing (MEC) server and the wireless resource allocation in the eNodeB interact to determine the quality of experience (QoE) of dynamic adaptive streaming over HTTP (DASH) clients in MEC networks. However, the relationship among these three factors has not been explored, which has limited the improvement in clients' QoE. Therefore, we propose a joint optimization framework of video segment caching and transcoding in MEC servers and resource allocation to improve the QoE of DASH clients. Based on the established framework, we develop a MEC cache management mechanism that consists of MEC cache partitioning, video segment deletion, and MEC cache space transfer. Then, a joint optimization algorithm that combines video segment caching and transcoding in the MEC server with resource allocation is proposed. In the algorithm, clients' channel states, playback status, and cooperation among MEC servers are used to estimate each client's priority, video segment presentation switches, and continuous playback time. Considering the above four factors, we develop a utility function model of clients' QoE. Then, we formulate a mixed-integer nonlinear programming model to maximize the total utility of DASH clients, where the video segment caching and transcoding strategy and the resource allocation strategy are jointly optimized. To solve this problem, we propose a low-complexity heuristic algorithm that decomposes the original problem into multiple subproblems. The simulation results show that our proposed algorithms efficiently improve clients' throughput, received video quality, and the hit ratio of video segments while decreasing playback rebuffering time, video segment presentation switches, and system backhaul traffic.

preprint · 2020 · arXiv

Vehicle Speed Aware Computing Task Offloading and Resource Allocation Based on Multi-Agent Reinforcement Learning in a Vehicular Edge Computing Network

For in-vehicle applications, vehicles traveling at different speeds have different delay requirements. However, vehicle speed has not been extensively explored, which may cause a mismatch between a vehicle's speed and its allocated computation and wireless resources. In this paper, we propose a vehicle-speed-aware task offloading and resource allocation strategy to decrease the energy cost of executing tasks without exceeding the delay constraint. First, we establish a vehicle-speed-aware delay constraint model based on different speeds and task types. Then, the delay and energy cost of task execution at the vehicular edge computing (VEC) server and at the local terminal are calculated. Next, we formulate a joint optimization of task offloading and resource allocation to minimize vehicles' energy cost subject to delay constraints. The multi-agent deep deterministic policy gradient (MADDPG) method is employed to obtain the offloading and resource allocation strategy. Simulation results show that our algorithm achieves superior performance in energy cost and task completion delay.

preprint2016arXiv

Mask-off: Synthesizing Face Images in the Presence of Head-mounted Displays

A head-mounted display (HMD) could be an important component of an augmented reality system. However, because the device severely occludes the upper face region, the user experience can suffer in applications such as telecommunication and multi-player video games. In this paper, we first present a novel experimental setup that consists of two near-infrared (NIR) cameras pointed at the eye regions and one visible-light RGB camera that captures the visible face region. The main purpose of this paper is to synthesize realistic, occlusion-free face images from the images captured by these cameras. To this end, we propose a novel synthesis framework that contains four modules: 3D head reconstruction, face alignment and tracking, face synthesis, and eye synthesis. In face synthesis, we propose a novel algorithm that can robustly align and track a personalized 3D head model given a face that is severely occluded by the HMD. In eye synthesis, we propose another novel algorithm that colorizes the NIR eye images to generate accurate eye movements and dynamic wrinkle variations around the eye regions. The algorithm also removes the "red eye" effects caused by the colorization. Results show that both the hardware setup and the system framework robustly synthesize realistic face images in video sequences.

Xinyu Huang

23 published items

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Trust in LLM-controlled Robotics: a Survey of Security Threats, Defenses and Challenges

A Comparative Study of Gastric Histopathology Sub-size Image Classification: from Linear Regression to Visual Transformer

A State-of-the-art Survey of Artificial Neural Networks for Whole-slide Image Analysis:from Popular Convolutional Neural Networks to Potential Visual Transformers

A State-of-the-art Survey of U-Net in Microscopic Image Analysis: from Simple Usage to Structure Mortification

Analysis of MC Systems Employing Receivers Covered by Heterogeneous Receptors

Analysis of Receiver Covered by Heterogeneous Receptors in Molecular Communications

Application of Graph Based Features in Computer Aided Diagnosis for Histopathological Image Classification of Gastric Cancer

Application of Transfer Learning and Ensemble Learning in Image-level Classification for Breast Histopathology

EBHI:A New Enteroscope Biopsy Histopathological H&E Image Dataset for Image Classification Evaluation

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

IL-MCAM: An interactive learning and multi-channel attention mechanism-based weakly supervised colorectal histopathology image classification approach

OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion

Personalized QoE Enhancement for Adaptive Video Streaming: A Digital Twin-Assisted Scheme

Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation

TOD-CNN: An Effective Convolutional Neural Network for Tiny Object Detection in Sperm Videos

Membrane Fusion-Based Transmitter Design for Molecular Communication Systems

Channel Characterization for 1D Molecular Communication with Two Absorbing Receivers

Parameter Estimation in a Noisy 1D Environment via Two Absorbing Receivers

Playback experience driven cross layer optimisation of APP, transport and MAC layer for video clients over long-term evolution system

Towards 5G: Joint Optimization of Video Segment Cache, Transcoding and Resource Allocation for Adaptive Video Streaming in a Muti-access Edge Computing Network

Vehicle Speed Aware Computing Task Offloading and Resource Allocation Based on Multi-Agent Reinforcement Learning in a Vehicular Edge Computing Network

Mask-off: Synthesizing Face Images in the Presence of Head-mounted Displays