Source author record

Chang Wen Chen

Chang Wen Chen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision; Networking and Internet Architecture; Information Theory; math.IT; Multimedia; Systems and Control; Computation and Language; Cryptography and Security; Machine Learning; Neural and Evolutionary Computing

Catalog footprint

What is connected

15 works

10 topics

4 close collaborators

preprint · 2026 · arXiv

Beyond Fidelity: Semantic Similarity Assessment in Low-Level Image Processing

Low-level image processing has long been evaluated mainly from the perspective of visual fidelity. However, with the rise of deep learning and generative models, processed images may preserve perceptual quality while altering semantic content, making conventional Image Quality Assessment (IQA) insufficient for semantic-level assessment. In this paper, we formalize Semantic Similarity as a new evaluation task for low-level image processing, aimed at measuring whether semantic content is preserved after processing. We further present a structured formulation of image semantics based on semantic entities and their relations, and discuss the desired properties and constraints of a valid semantic similarity index. Based on this formulation, we propose Triplet-based Semantic Similarity Score (T3S), which models image semantics through foreground entities, background entities, and relations. T3S combines semantic entity extraction, foreground-background disentanglement, and open-world class/relation modeling. Experiments on COCO and SPA-Data show that T3S consistently outperforms existing fidelity-oriented metrics and representative semantic-level baselines, while better reflecting progressive semantic changes under diverse degradations. These results highlight the importance of semantic assessment in modern low-level vision.

preprint · 2022 · arXiv

ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection

We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images. Most existing works treat HOIs as individual interaction categories and thus cannot handle the long-tail distribution and polysemy of action labels. We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called a consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space and obtains detection results by measuring their similarities. We extensively evaluate our model on the challenging V-COCO and HICO-DET datasets, and the results validate that our approach outperforms state-of-the-art methods under both fully-supervised and zero-shot settings. Code is available at https://github.com/yeliudev/ConsNet.
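
The final matching step of this abstract (embeddings of candidate human-object pairs compared against embeddings of HOI labels in a joint space) can be illustrated with a minimal sketch. It assumes cosine similarity as the measure, which the abstract does not specify, and it uses made-up three-dimensional toy vectors; the names `cosine_similarity` and `rank_labels` are hypothetical and are not taken from the ConsNet code base.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_labels(pair_embedding, label_embeddings):
    """Score one human-object pair against every label, best match first."""
    scores = {
        label: cosine_similarity(pair_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy label embeddings (assumed values, for illustration only).
label_embeddings = {
    "ride bicycle": [0.9, 0.1, 0.0],
    "hold cup": [0.0, 0.2, 0.9],
    "ride horse": [0.7, 0.6, 0.1],
}

# Toy embedding of one candidate human-object pair (assumed values).
pair_embedding = [0.8, 0.2, 0.1]

ranking = rank_labels(pair_embedding, label_embeddings)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because a label is scored only through its embedding, a label with no training images can still be ranked in this sketch, which is the property that a zero-shot setting relies on.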

preprint · 2022 · arXiv

Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling

As a technically challenging topic, visual storytelling aims at generating an imaginary and coherent story with narrative multi-sentences from a group of relevant images. Existing methods often generate direct and rigid descriptions of apparent image-based contents, because they are not capable of exploring implicit information beyond images. Hence, these schemes cannot capture consistent dependencies from a holistic representation, impairing the generation of a reasonable and fluent story. To address these problems, a novel knowledge-enriched attention network with group-wise semantic model is proposed. Three main novel components are designed and supported by substantial experiments to reveal practical advantages. First, a knowledge-enriched attention network is designed to extract implicit concepts from an external knowledge system, and these concepts are followed by a cascade cross-modal attention mechanism to characterize imaginative and concrete representations. Second, a group-wise semantic module with second-order pooling is developed to explore the globally consistent guidance. Third, a unified one-stage story generation model with encoder-decoder structure is proposed to simultaneously train and infer the knowledge-enriched attention network, group-wise semantic module and multi-modal story generation decoder in an end-to-end fashion. Substantial experiments on the popular Visual Storytelling dataset with both objective and subjective evaluation metrics demonstrate the superior performance of the proposed scheme as compared with other state-of-the-art methods.

preprint · 2022 · arXiv

Taking an Emotional Look at Video Paragraph Captioning

Translating visual data into natural language is essential for machines to understand the world and interact with humans. In this work, a comprehensive study is conducted on video paragraph captioning, with the goal of generating paragraph-level descriptions for a given video. However, current research mainly focuses on detecting objective facts, ignoring the need to establish the logical associations between sentences and to discover more accurate emotions related to video contents. Such a problem impairs fluent and abundant expressions of predicted captions, which are far below human language standards. To solve this problem, we propose to construct a large-scale emotion and logic driven multilingual dataset for this task. This dataset is named EMVPC (standing for "Emotional Video Paragraph Captioning") and contains 53 widely-used emotions in daily life, 376 common scenes corresponding to these emotions, 10,291 high-quality videos and 20,582 elaborated paragraph captions with English and Chinese versions. Relevant emotion categories, scene labels, emotion word labels and logic word labels are also provided in this new dataset. The proposed EMVPC dataset intends to provide full-fledged video paragraph captioning in terms of rich emotions, coherent logic and elaborate expressions, which can also benefit other tasks in vision-language fields. Furthermore, a comprehensive study is conducted through experiments on existing benchmark video paragraph captioning datasets and the proposed EMVPC. The state-of-the-art schemes from different visual captioning tasks are compared in terms of 15 popular metrics, and their detailed objective as well as subjective results are summarized. Finally, remaining problems and future directions of video paragraph captioning are also discussed. The unique perspective of this work is expected to boost further development in video paragraph captioning research.

preprint · 2022 · arXiv

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization while also being easily reduced to solve the individual problems. As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task, and it tackles moment retrieval as a keypoint detection problem using a novel query generator and query decoder. Extensive comparisons with existing methods and ablation studies on QVHighlights, Charades-STA, YouTube Highlights, and TVSum datasets demonstrate the effectiveness, superiority, and flexibility of the proposed method under various settings. Source code and pre-trained models are available at https://github.com/TencentARC/UMT.

preprint · 2021 · arXiv

Beyond Fine-tuning: Classifying High Resolution Mammograms using Function-Preserving Transformations

The task of classifying mammograms is very challenging because the lesion is usually small in the high resolution image. The current state-of-the-art approaches for medical image classification rely on using the de-facto method for ConvNets - fine-tuning. However, there are fundamental differences between natural images and medical images, which, based on existing evidence from the literature, limit the overall performance gain when designed with algorithmic approaches. In this paper, we propose to go beyond fine-tuning by introducing a novel framework called MorphHR, in which we highlight a new transfer learning scheme. The idea behind the proposed framework is to integrate function-preserving transformations, for any continuous non-linear activation neurons, to internally regularise the network for improving mammogram classification. The proposed solution offers two major advantages over the existing techniques. Firstly and unlike fine-tuning, the proposed approach allows for modifying not only the last few layers but also several of the first ones on a deep ConvNet. By doing this, we can design the network front to be suitable for learning domain specific features. Secondly, the proposed scheme is scalable to hardware. Therefore, one can fit high resolution images on standard GPU memory. We show that by using high resolution images, one prevents losing relevant information. We demonstrate, through numerical and visual experiments, that the proposed approach yields a significant improvement in the classification performance over state-of-the-art techniques, and is indeed on a par with radiology experts. Moreover and for generalisation purposes, we show the effectiveness of the proposed learning scheme on another large dataset, the ChestX-ray14, surpassing current state-of-the-art techniques.

preprint · 2020 · arXiv

Association and Caching in Relay-Assisted mmWave Networks: From A Stochastic Geometry Perspective

Limited backhaul bandwidth and blockage effects are two main factors limiting the practical deployment of millimeter wave (mmWave) networks. To tackle these issues, we study the feasibility of relaying as well as caching in mmWave networks. A user association and relaying (UAR) criterion dependent on both caching status and maximum biased received power is proposed by considering the spatial correlation caused by the coexistence of base stations (BSs) and relay nodes (RNs). A joint UAR and caching placement problem is then formulated to maximize the backhaul offloading traffic. Using stochastic geometry tools, we decouple the joint UAR and caching placement problem by analyzing the relationship between UAR probabilities and caching placement probabilities. We then optimize the transformed caching placement problem based on polyblock outer approximation by exploiting the monotonic property in the general case and utilizing convex optimization in the noise-limited case. Accordingly, we propose a BS and RN selection algorithm where caching status at BSs and maximum biased received power are jointly considered. Experimental results demonstrate a significant enhancement of backhaul offloading using the proposed algorithms, and show that deploying more RNs and increasing cache size in mmWave networks is a more cost-effective alternative than increasing BS density to achieve similar backhaul offloading performance.

preprint · 2020 · arXiv

Fusing Motion Patterns and Key Visual Information for Semantic Event Recognition in Basketball Videos

Many semantic events in team sport activities (e.g., basketball) often involve both group activities and the outcome (score or not). Motion patterns can be an effective means to identify different activities. Global and local motions have their respective emphasis on different activities, which are difficult to capture from the optical flow due to the mixture of global and local motions. Hence, a more effective way to separate the global and local motions is needed. In the specific case of basketball game analysis, the successful score for each round can be reliably detected by the appearance variation around the basket. Based on these observations, we propose a scheme to fuse global and local motion patterns (MPs) and key visual information (KVI) for semantic event recognition in basketball videos. Firstly, an algorithm is proposed to estimate the global motions from the mixed motions based on the intrinsic property of camera adjustments. The local motions can then be obtained from the mixed and global motions. Secondly, a two-stream 3D CNN framework is utilized for group activity recognition over the separated global and local motion patterns. Thirdly, the basket is detected and its appearance features are extracted through a CNN structure. The features are utilized to predict the success or failure. Finally, the group activity recognition and success/failure prediction results are integrated using the Kronecker product for event recognition. Experiments on the NCAA dataset demonstrate that the proposed method obtains state-of-the-art performance.
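
The last fusion step named in this abstract, combining the group-activity prediction with the success/failure prediction through a Kronecker product, can be sketched as follows. This is only an illustration under assumptions: the two inputs are treated as probability vectors, the activity names and all numbers are invented, and `kron_vec` is a hypothetical helper rather than code from the paper.

```python
def kron_vec(a, b):
    """Kronecker product of two vectors: every a[i] times every b[j]."""
    return [x * y for x in a for y in b]


# Hypothetical class names and scores, for illustration only.
activities = ["three-point", "layup", "free-throw"]
outcomes = ["success", "failure"]

activity_probs = [0.7, 0.2, 0.1]   # output of the activity branch (assumed)
outcome_probs = [0.6, 0.4]         # output of the basket branch (assumed)

# Each fused entry scores one (activity, outcome) event.
event_probs = kron_vec(activity_probs, outcome_probs)
event_labels = [(a, o) for a in activities for o in outcomes]

best_index = max(range(len(event_probs)), key=lambda i: event_probs[i])
best_event = event_labels[best_index]
print(best_event, round(event_probs[best_index], 2))
```

With three activities and two outcomes the fused vector has six entries, one per combined event, and it still sums to one when both inputs do.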

preprint · 2016 · arXiv

Distortion Bounds for Transmitting Correlated Sources with Common Part over MAC

This paper investigates the joint source-channel coding problem of sending two correlated memoryless sources with a common part over a memoryless multiple access channel (MAC). An inner bound and two outer bounds on the achievable distortion region are derived. In particular, they respectively recover the existing bounds for several special cases, such as communication without a common part, lossless communication, and noiseless communication. When specialized to the quadratic Gaussian communication case, transmitting Gaussian sources with a Gaussian common part over a Gaussian MAC, the inner bound and outer bound are used to generate two new bounds. Numerical results show that the common part improves the distortion of such a distributed source-channel coding problem.

preprint · 2016 · arXiv

Load Coupling Power Optimization in Cloud Radio Access Networks

Recently, Cloud-based Radio Access Network (C-RAN) has been proposed as a potential solution to reduce energy cost in cellular networks. C-RAN centralizes the baseband processing capabilities of Base Stations (BSs) in a cloud computing platform in the form of a BaseBand Unit (BBU) pool. In C-RAN, power consumed by the traditional BS system is distributed as wireless transmission power of the Remote Radio Heads (RRHs) and baseband processing power of the BBU pool. Different from previous work where wireless transmission power and baseband processing power are optimized individually and independently, this paper focuses on joint optimization of allocation for these two kinds of power and attempts to minimize the total power consumption subject to Quality of Service (QoS) requirements from users in terms of data rates. First, we exploit the load coupling model to express the coupling relations among power, load and user data rates. Based on the load coupling model, we formulate the joint power optimization problem in C-RAN over both wireless transmission power and baseband processing power. Second, we prove that operating at full load may not be optimal in minimizing the total power consumption in C-RAN. Finally, we propose an efficient iterative algorithm to solve the target problem. Simulations have been performed to validate our theoretical and algorithmic work. The results show that the proposed algorithm outperforms existing schemes (without joint power optimization) in terms of power consumption.

preprint · 2016 · arXiv

Network Morphism

We present in this paper a systematic study on how to morph a well-trained neural network to a new one so that its network function can be completely preserved. We define this as network morphism in this research. After morphing a parent network, the child network is expected to inherit the knowledge from its parent network and also has the potential to continue growing into a more powerful one with much shortened training time. The first requirement for this network morphism is its ability to handle diverse morphing types of networks, including changes of depth, width, kernel size, and even subnet. To meet this requirement, we first introduce the network morphism equations, and then develop novel morphing algorithms for all these morphing types for both classic and convolutional neural networks. The second requirement for this network morphism is its ability to deal with non-linearity in a network. We propose a family of parametric-activation functions to facilitate the morphing of any continuous non-linear activation neurons. Experimental results on benchmark datasets and typical neural networks demonstrate the effectiveness of the proposed network morphism scheme.
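
The idea of a function-preserving depth change can be shown with a very small sketch. It is not the paper's morphing algorithm: it assumes the simplest special case, in which the inserted layer is initialized to the identity matrix and the activation is ReLU, so the child reproduces the parent because applying ReLU twice gives the same result as applying it once. All function names and weights below are hypothetical.

```python
def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def identity(n):
    """n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def parent_network(x, w1, w2):
    """Two-layer parent: linear, ReLU, linear."""
    return matvec(w2, relu(matvec(w1, x)))


def child_network(x, w1, w_new, w2):
    """Deeper child: one extra linear + ReLU layer after the first layer."""
    hidden = relu(matvec(w1, x))
    hidden = relu(matvec(w_new, hidden))
    return matvec(w2, hidden)


# Toy weights and input (assumed values, for illustration only).
w1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.25]]
w2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]
x = [1.0, 2.0]

# Identity initialization of the inserted layer keeps the function unchanged.
w_new = identity(3)

parent_out = parent_network(x, w1, w2)
child_out = child_network(x, w1, w_new, w2)
print(parent_out, child_out)
```

The child has more parameters than the parent but starts from the same outputs, so further training can begin from the parent's behaviour. The abstract's parametric-activation functions address activations for which this identity shortcut does not hold.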

preprint · 2016 · arXiv

Resource Allocation in Dynamic TDD Heterogeneous Networks under Mixed Traffic

Recently, Dynamic Time Division Duplex (TDD) has been proposed to handle the asymmetry of traffic demand between DownLink (DL) and UpLink (UL) in Heterogeneous Networks (HetNets). However, for mixed traffic consisting of best effort traffic and soft Quality of Service (QoS) traffic, the resource allocation problem has not been adequately studied in Dynamic TDD HetNets. In this paper, we focus on this problem in a two-tier HetNet with co-channel deployment of one Macro cell Base Station (MBS) and multiple Small cell Base Stations (SBSs) in hotspots. Different from existing work, we introduce low power almost blank subframes to alleviate MBS-to-SBS interference, which is inherent in TDD operation. To tackle the resource allocation problem, we propose a two-step strategy. First, from the viewpoint of base stations, we propose a transmission protocol and perform time resource allocation by formulating and solving a network capacity maximization problem under DL/UL traffic demands. Second, from the viewpoint of User Equipments (UEs), we formulate their resource allocation as a Network Utility Maximization (NUM) problem. An efficient iterative algorithm is proposed to solve the NUM problem. Simulations show the advantage of the proposed algorithm in terms of network throughput and UE QoS satisfaction level.

preprint · 2016 · arXiv

Storytelling of Photo Stream with Bidirectional Multi-thread Recurrent Neural Network

Visual storytelling aims to generate human-level narrative language (i.e., a natural paragraph with multiple sentences) from a photo stream. A typical photo story consists of a global timeline with multi-thread local storylines, where each storyline occurs in a different scene. Such a complex structure leads to large content gaps at scene transitions between consecutive photos. Most existing image/video captioning methods can only achieve limited performance, because the units in traditional recurrent neural networks (RNN) tend to "forget" the previous state when the visual sequence is inconsistent. In this paper, we propose a novel visual storytelling approach with Bidirectional Multi-thread Recurrent Neural Network (BMRNN). First, based on the mined local storylines, a skip gated recurrent unit (sGRU) with delay control is proposed to maintain longer range visual information. Second, by using sGRU as basic units, the BMRNN is trained to align the local storylines into the global sequential timeline. Third, a new training scheme with a storyline-constrained objective function is proposed by jointly considering both global and local matches. Experiments on three standard storytelling datasets show that the BMRNN model outperforms the state-of-the-art methods.

preprint · 2015 · arXiv

Attribute-Based Multi-Dimensional Scalable Access Control For Social Media Sharing

Media sharing is an extremely popular paradigm of social interaction in online social networks (OSNs) nowadays. Scalable media access control is essential for information sharing among users with various access privileges. In this paper, we present a multi-dimensional scalable media access control (MD-SMAC) system based on the proposed scalable ciphertext policy attribute-based encryption (SCP-ABE) algorithm. In the proposed MD-SMAC system, fine-grained access control can be performed on the media contents encoded in a multi-dimensional scalable manner based on data consumers' diverse attributes. Through security analysis, we show that the proposed MD-SMAC system is able to resist collusion attacks. Additionally, we conduct experiments to evaluate the efficiency of the proposed system, especially on mobile devices.

preprint · 2015 · arXiv

Service Provisioning and Profit Maximization in Network-assisted Adaptive HTTP Streaming

Adaptive HTTP streaming with centralized consideration of multiple streams has gained increasing interest. It poses a special challenge that the interests of both content provider and network operator need to be deliberately balanced. More importantly, the adaptation strategy is required to be flexible enough to be ported to various systems that work under different network environments, QoE levels, and economic objectives. To address these challenges, we propose a Markov Decision Process (MDP) based network-assisted adaptation framework, wherein cost of buffering, significant playback variation, bandwidth management and income of playback are jointly investigated. We then demonstrate its promising service provisioning and maximal profit for a mobile network in which fair or differentiated service is required.
