Source author record

Hongzhi Wang

Hongzhi Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Databases, Machine Learning, Artificial Intelligence, Computation, Computation and Language, Computer Vision, eess.IV, Methodology, Quantitative Methods, Social and Information Networks

Catalog footprint

What is connected

25 works

10 topics

4 close collaborators

preprint 2026 arXiv

Explainable Multimodal Aspect-Based Sentiment Analysis with Dependency-guided Large Language Model

Multimodal aspect-based sentiment analysis (MABSA) aims to identify aspect-level sentiments by jointly modeling textual and visual information, which is essential for fine-grained opinion understanding in social media. Existing approaches mainly rely on discriminative classification with complex multimodal fusion, yet lack explicit sentiment explainability. In this paper, we reformulate MABSA as a generative and explainable task, proposing a unified framework that simultaneously predicts aspect-level sentiment and generates natural language explanations. Based on multimodal large language models (MLLMs), our approach employs a prompt-based generative paradigm that jointly produces sentiment and explanation. To further enhance aspect-oriented reasoning, we propose a dependency-syntax-guided sentiment cue strategy. This strategy prunes and textualizes the aspect-centered dependency syntax tree, guiding the model to distinguish different sentiment aspects and enhancing its explainability. To enable explainability, we use MLLMs to construct new datasets with sentiment explanations for fine-tuning. Experiments show that our approach not only achieves consistent gains in sentiment classification accuracy, but also produces faithful, aspect-grounded explanations.

preprint 2022 arXiv

AAE: An Active Auto-Estimator for Improving Graph Storage

Graphs are an increasingly popular model in many real applications, and the efficiency of graph storage is crucial for them. In general, tuning graph storage relies on database administrators (DBAs) to find the best graph storage. However, DBAs make tuning decisions mainly from experience and intuition. Because that experience is limited, tuning can have uncertain performance and may even reduce efficiency. In this paper, we observe that an estimator of graph workload has the potential to guarantee the performance of tuning operations. Unfortunately, because of the complex characteristics of the graph evaluation task, no mature estimator for graph workloads exists. We formulate the evaluation of a graph workload as a classification task and carefully design the feature engineering process, including graph data features, graph workload features and graph storage features. Given the complex features of graphs and the large time cost of executing graph workloads, it is difficult for a graph workload estimator to obtain a sufficient training set. We therefore propose an active auto-estimator (AAE) for graph workload evaluation that combines active learning and deep learning. AAE can achieve good evaluation efficiency with a limited training set. We test the time efficiency and evaluation accuracy of AAE on two open-source graph datasets, LDBC and Freebase. Experimental results show that our estimator can complete the graph workload evaluation in milliseconds.

preprint 2022 arXiv

AutoTS: Automatic Time Series Forecasting Model Design Based on Two-Stage Pruning

Automatic time series forecasting (TSF) model design, which aims to help users efficiently design suitable forecasting models for given time series data scenarios, is a novel research topic in urgent need of solutions. In this paper, we propose the AutoTS algorithm, which uses existing design skills and efficient search methods to address this problem. In AutoTS, we extract effective design experience from existing TSF works. We allow design experience from different sources to be combined, creating an effective search space that contains a variety of TSF models to support different TSF tasks. Given the huge search space, we propose a two-stage pruning strategy to reduce the search difficulty and improve the search efficiency. In addition, we introduce a knowledge graph to reveal associations between module options. We make full use of this relational information to learn higher-level features of each module option, so as to further improve the search quality. Extensive experimental results show that AutoTS is well suited to the TSF area. It is more efficient than existing neural architecture search algorithms and can quickly design powerful TSF models that are better than manually designed ones.

preprint 2022 arXiv

EEML: Ensemble Embedded Meta-learning

To accelerate learning with few samples, meta-learning resorts to prior knowledge from previous tasks. However, inconsistent task distributions and task heterogeneity are hard to handle with a single globally shared model initialization. In this paper, building on gradient-based meta-learning, we propose an ensemble embedded meta-learning algorithm (EEML) that explicitly uses a multi-model ensemble to organize prior knowledge into diverse, specialized experts. We rely on a task embedding cluster mechanism to deliver diverse tasks to matching experts during training and to instruct how experts collaborate in the test phase. As a result, the experts can focus on their own areas of expertise and cooperate on upcoming tasks to address task heterogeneity. The experimental results show that the proposed method outperforms recent state-of-the-art methods on few-shot learning problems, which validates the importance of differentiation and cooperation.

preprint 2022 arXiv

TPAD: Identifying Effective Trajectory Predictions Under the Guidance of Trajectory Anomaly Detection Model

Trajectory prediction (TP) is an important research topic in computer vision and robotics. Recently, many stochastic TP models have been proposed and have achieved better performance than traditional models with deterministic trajectory outputs. However, these stochastic models can generate a number of future trajectories of differing quality. They lack the ability to self-evaluate, that is, to examine the rationality of their prediction results, and thus fail to guide users in identifying high-quality trajectories among the candidate results. This keeps them from performing at their best in real applications. In this paper, we address this shortcoming and propose TPAD, a novel TP evaluation method based on trajectory anomaly detection (AD). In TPAD, we first combine automated machine learning (AutoML) with experience from the AD and TP fields to automatically design an effective trajectory AD model. We then use the learned trajectory AD model to examine the rationality of the predicted trajectories and select good TP results for users. Extensive experimental results demonstrate that TPAD can effectively identify near-optimal prediction results, improving the practical effectiveness of stochastic TP models.

preprint 2021 arXiv

Approximate Query Processing for Group-By Queries based on Conditional Generative Models

The group-by query is an important kind of query that is widely used in data warehouses, data analytics, and data visualization. Approximate query processing is an effective way to increase querying efficiency on big data. The answer to a group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimates for all groups. Stratified sampling improves accuracy compared with uniform sampling, but samples chosen for some queries may not work for others. Online sampling chooses samples for the given query at query time, but it incurs long latency. It is therefore a challenge to achieve both accuracy and efficiency. To address this challenge, we propose a sample generation framework based on a conditional generative model. The framework can generate any number of samples for the given query without accessing the data. The proposed framework, which is based on a lightweight model, can be combined with stratified sampling and online aggregation to improve the estimation accuracy for group-by queries. The experimental results show that our proposed methods are both efficient and accurate.
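
To make the accuracy problem for small groups concrete, the following is a minimal sketch of sampling-based estimation for a group-by SUM under uniform sampling. It is not the paper's generative framework; the function name `approx_group_sum`, the row layout and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def approx_group_sum(rows, sample_rate, rng):
    """Estimate SUM(value) GROUP BY key from a uniform Bernoulli sample.

    rows: iterable of (key, value) pairs (hypothetical layout).
    Each row is kept with probability sample_rate, and every kept value
    is scaled by 1 / sample_rate so the estimate is unbiased.
    """
    estimates = defaultdict(float)
    for key, value in rows:
        if rng.random() < sample_rate:
            estimates[key] += value / sample_rate
    return dict(estimates)


# Toy data: one large group and one very small group.
rows = [("large", 1.0)] * 10_000 + [("small", 1.0)] * 20
estimate = approx_group_sum(rows, sample_rate=0.01, rng=random.Random(0))
# A group with few rows may be missing from the estimate entirely,
# which is the weakness of uniform sampling that the abstract describes.
print(estimate)
```

At a 1% sampling rate a 20-row group often contributes no sampled rows at all, so it can drop out of the answer, while the large group is estimated reasonably well.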

preprint 2021 arXiv

ConsciousControlFlow(CCF): A Demonstration for conscious Artificial Intelligence

In this demo, we present ConsciousControlFlow(CCF), a prototype system to demonstrate conscious Artificial Intelligence (AI). The system is based on the computational model for consciousness and the hierarchy of needs. CCF supports typical scenarios to show the behaviors and the mental activities of conscious AI. We demonstrate that CCF provides a useful tool for effective machine consciousness demonstration and human behavior study assistance.

preprint 2021 arXiv

Exploring Data and Knowledge combined Anomaly Explanation of Multivariate Industrial Data

The demand for high-performance anomaly detection techniques for IoT data is becoming urgent, especially in industry. Anomaly identification and explanation in time series data is an essential task in IoT data mining. Because existing anomaly detection techniques focus on identifying anomalies, the explanation of anomalies is not well solved. We address the anomaly explanation problem for multivariate IoT data and propose a three-step self-contained method. We formalize and use domain knowledge in our method, and identify anomalies by the violation of constraints. We propose set-cover-based anomaly explanation algorithms to discover the anomaly events reflected by violation features, and further develop knowledge update algorithms to improve the original knowledge set. Experimental results on real datasets from large-scale IoT systems verify that our method computes high-quality explanations of anomalies. Our work provides a guide to explainable anomaly detection in both IoT fault diagnosis and temporal data cleaning.
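
The abstract does not spell out its set-cover-based algorithms, so the sketch below shows only the textbook greedy set cover heuristic, under the assumption that each candidate anomaly event explains a set of violated features. The names `greedy_set_cover`, `event_1` and the feature labels are hypothetical.

```python
def greedy_set_cover(universe, candidates):
    """Greedy set cover: repeatedly pick the candidate covering most uncovered items.

    universe: set of items to explain (e.g. violated features).
    candidates: dict mapping a candidate name to the set of items it covers.
    Returns (chosen candidate names in pick order, items left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Ties are broken by dictionary insertion order.
        best = max(
            candidates,
            key=lambda name: len(candidates[name] & uncovered),
            default=None,
        )
        if best is None or not (candidates[best] & uncovered):
            break  # nothing left can explain the remaining items
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered


# Hypothetical violated features and the events that could explain them.
violations = {"f1", "f2", "f3", "f4", "f5"}
events = {
    "event_1": {"f1", "f2", "f3"},
    "event_2": {"f3", "f4"},
    "event_3": {"f4", "f5"},
}
explanation, unexplained = greedy_set_cover(violations, events)
```

The greedy rule is a standard approximation for set cover; it does not guarantee the smallest possible explanation.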

preprint 2021 arXiv

Misplaced Subsequences Repairing with Application to Multivariate Industrial Time Series Data

Both the volume and the collection velocity of time series generated by monitoring sensors are increasing in the Internet of Things (IoT). Data management and analysis require IoT data of high quality and applicability. However, errors are prevalent in raw time series data. Inconsistency in time series is a serious data quality problem that exists widely in IoT, and it can hardly be solved by existing techniques. Motivated by this, we define an inconsistent subsequences problem in multivariate time series and propose an integrity data repair approach to solve it. Our repairing method consists of two parts: (1) we design an effective anomaly detection method to discover latent inconsistent subsequences in IoT time series; and (2) we develop repair algorithms to precisely locate the start and finish times of inconsistent intervals and provide reliable repairing strategies. A thorough experiment on two real-life datasets verifies the superiority of our method compared to other practical approaches. Experimental results also show that our method captures and repairs inconsistency problems effectively in industrial time series in complex IIoT scenarios.

preprint 2021 arXiv

Musings about Constructions of Efficient Latin Hypercube Designs with Flexible Run-sizes

Efficient Latin hypercube designs (LHDs), including maximin distance LHDs, maximum projection LHDs and orthogonal LHDs, are widely used in computer experiments. It is challenging to construct such designs with flexible sizes, especially for large ones. In the current literature, various algebraic methods and search algorithms have been proposed for identifying efficient LHDs, each having its own pros and cons. In this paper, we review, summarize and compare some currently popular methods aiming to provide guidance for experimenters on what method should be used in practice. Using the R package we developed which integrates and improves various algebraic and searching methods, many of the designs found in this paper are better than the existing ones. They are easy to use for practitioners and can serve as benchmarks for the future developments on LHDs.
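
As a rough illustration of what a maximin distance LHD optimizes, the sketch below draws random Latin hypercube designs and keeps the one with the largest minimum pairwise distance. It is written in Python rather than R, it is a naive random search rather than any method from the paper or its package, and all function names are assumptions.

```python
import itertools
import math
import random


def random_lhd(n_runs, n_factors, rng):
    """Random Latin hypercube design: each column is a permutation of 1..n_runs."""
    columns = []
    for _ in range(n_factors):
        levels = list(range(1, n_runs + 1))
        rng.shuffle(levels)
        columns.append(levels)
    # Transpose the columns into a list of runs (rows).
    return [list(row) for row in zip(*columns)]


def min_pairwise_distance(design):
    """Maximin criterion: smallest Euclidean distance between any two runs.

    Requires at least two runs.
    """
    return min(math.dist(a, b) for a, b in itertools.combinations(design, 2))


def best_of_random(n_runs, n_factors, n_tries, seed=0):
    """Naive search: keep the random LHD with the largest minimum distance."""
    rng = random.Random(seed)
    candidates = (random_lhd(n_runs, n_factors, rng) for _ in range(n_tries))
    return max(candidates, key=min_pairwise_distance)


design = best_of_random(n_runs=8, n_factors=3, n_tries=200)
```

Random search like this scales poorly with run size, which is the kind of limitation that motivates the algebraic constructions and search algorithms the abstract compares.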

preprint 2020 arXiv

Auto-CASH: Autonomous Classification Algorithm Selection with Deep Q-Network

The large number of datasets generated by various data sources poses a challenge for machine learning algorithm selection and hyperparameter configuration. For a specific machine learning task, it usually takes domain experts considerable time to select an appropriate algorithm and configure its hyperparameters. If algorithm selection and hyperparameter optimization can be solved automatically, the task can be executed more efficiently with a performance guarantee. This problem is known as the CASH problem. Early work either requires a large amount of human labor or suffers from high time or space complexity. In our work, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem more efficiently. Auto-CASH is the first approach that uses a Deep Q-Network to automatically select the meta-features for each dataset, thus greatly reducing the time cost without introducing much human labor. To demonstrate the effectiveness of our model, we conduct extensive experiments on 120 real-world classification datasets. Compared with classical and state-of-the-art CASH approaches, experimental results show that Auto-CASH achieves better performance in less time.

preprint 2020 arXiv

Auto-Model: Utilizing Research Papers and HPO Techniques to Deal with the CASH problem

In many fields, a large number of algorithms with completely different hyperparameters have been developed to address the same type of problem. Choosing the algorithm and hyperparameter setting correctly can greatly improve overall performance, but users often fail to do so for lack of knowledge. How to help users effectively and quickly select a suitable algorithm and hyperparameter settings for a given task instance is an important research topic, known as the CASH problem. In this paper, we design the Auto-Model approach, which makes full use of known information in related research papers and introduces hyperparameter optimization techniques, to solve the CASH problem effectively. Auto-Model greatly reduces the cost of algorithm implementation and the hyperparameter configuration space, and is thus capable of dealing with the CASH problem efficiently and easily. To demonstrate the benefit of Auto-Model, we compare it with the classical Auto-Weka approach. The experimental results show that our proposed approach provides superior results and achieves better performance in a short time.

preprint 2020 arXiv

Automatic Hyper-Parameter Optimization Based on Mapping Discovery from Data to Hyper-Parameters

Machine learning algorithms have made remarkable achievements in the field of artificial intelligence. However, most machine learning algorithms are sensitive to their hyper-parameters. Manual optimization is a common method of hyper-parameter tuning, but it is costly and depends on experience. Automatic hyper-parameter optimization (autoHPO) is favored due to its effectiveness. However, current autoHPO methods are usually effective only for a certain type of problem, and their time cost is high. In this paper, we propose an efficient automatic parameter optimization approach based on the mapping from data to the corresponding hyper-parameters. To describe this mapping, we propose a sophisticated network structure. To obtain it, we develop effective network construction algorithms. We also design a strategy to further optimize the result when the mapping is applied. Extensive experimental results demonstrate that the proposed approaches significantly outperform state-of-the-art approaches.

preprint 2020 arXiv

Automatic Storage Structure Selection for hybrid Workload

In database systems, the design of the storage engine and data model directly affects query performance. Users of the database therefore need to select the storage engine and design the data model according to the workload encountered. However, in a hybrid workload, the query set of the database changes dynamically, and so does the optimal storage structure. Motivated by this, we propose an automatic storage structure selection system based on learned cost, which dynamically selects the optimal storage structure of the database under hybrid workloads. In the system, we introduce a machine learning method to build a cost model for the storage engine, and a column-oriented data layout generation algorithm. Experimental results show that the proposed system can choose the optimal combination of storage engine and data model according to the current workload, which greatly improves performance over the default storage structure. The system is also designed to be compatible with different storage engines for easy use in practical applications.

preprint 2020 arXiv

ExperienceThinking: Constrained Hyperparameter Optimization based on Knowledge and Pruning

Machine learning algorithms are very sensitive to their hyperparameters, and evaluating them is generally expensive. Users need intelligent methods that quickly optimize hyperparameter settings according to known evaluation information, thereby reducing computational cost and improving optimization efficiency. Motivated by this, we propose the ExperienceThinking algorithm to quickly find the best possible hyperparameter configuration of machine learning algorithms within a few configuration evaluations. ExperienceThinking designs two novel methods, which intelligently infer optimal configurations from two aspects: search space pruning and knowledge utilization, respectively. The two methods complement each other and solve constrained hyperparameter optimization problems effectively. To demonstrate the benefit of ExperienceThinking, we compare it with three classical hyperparameter optimization algorithms under a small number of configuration evaluations. The experimental results show that our proposed algorithm provides superior results and achieves better performance.

preprint 2020 arXiv

Index Selection for NoSQL Database with Deep Reinforcement Learning

We propose a new approach to index selection for NoSQL databases. For different workloads, we select different indexes and their parameters to optimize database performance. The approach builds a deep reinforcement learning model to select an optimal index for a given fixed workload and adapts to a changing workload. Experimental results show that the Deep Reinforcement Learning Index Selection Approach (DRLISA) improves performance to varying degrees compared with traditional single index structures.

preprint 2020 arXiv

LAQP: Learning-based Approximate Query Processing

Querying big data is a challenging task due to the rapid growth of data volume. Approximate query processing (AQP) is a way to meet the requirement of fast response. In this paper, we propose a learning-based AQP method called LAQP. LAQP builds an error model learned from historical queries to predict the sampling-based estimation error of each new query. It combines sampling-based AQP, pre-computed aggregations and the learned error model to provide highly accurate query estimates with a small offline sample. The experimental results indicate that LAQP outperforms sampling-based AQP, pre-aggregation-based AQP and the most recent learning-based AQP method.

preprint 2020 arXiv

Multi-Objective Neural Architecture Search Based on Diverse Structures and Adaptive Recommendation

The search space of neural architecture search (NAS) for convolutional neural networks (CNNs) is huge. To reduce search cost, most NAS algorithms use a fixed outer network-level structure and search only the repeatable cell structure. Such fixed architectures perform well when enough cells and channels are used. However, when the architecture becomes more lightweight, performance decreases significantly. To obtain better lightweight architectures, more flexible and diversified neural architectures are needed, and more efficient methods should be designed for the larger search space. Motivated by this, we propose the MoARR algorithm, which uses existing research results and historical information to quickly find architectures that are both lightweight and accurate. We use the discovered high-performance cells to construct network architectures. This increases network architecture diversity while also reducing the search space of cell structure design. In addition, we design a novel multi-objective method to effectively analyze the historical evaluation information, so as to efficiently search for Pareto-optimal architectures with high accuracy and a small number of parameters. Experimental results show that MoARR can achieve a powerful and lightweight model (with 1.9% error rate and 2.3M parameters) on CIFAR-10 in 6 GPU hours, which is better than the state of the art. The explored architecture is transferable to ImageNet and achieves 76.0% top-1 accuracy with 4.9M parameters.
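
The notion of Pareto-optimal architectures can be illustrated with a small filter over (error rate, parameter count) pairs, where both objectives are minimized. This is a generic sketch, not MoARR's search procedure; the function name and the candidate values are invented for illustration.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    points: list of (error, params) tuples; smaller is better for both.
    q dominates p when q is no worse in both objectives and differs from p,
    which means q is strictly better in at least one objective.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Made-up (error %, parameters in millions) for four candidate architectures.
candidates = [(3.0, 2.0), (4.0, 1.0), (5.0, 3.0), (3.0, 4.0)]
front = pareto_front(candidates)
```

Here the first two candidates trade error against size and neither dominates the other, while the last two are each beaten on both objectives by the first.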

preprint 2020 arXiv

Neural Network Segmentation of Cell Ultrastructure Using Incomplete Annotation

The pancreatic beta cell is an important target in diabetes research. For scalable modeling of beta cell ultrastructure, we investigate automatic segmentation of whole-cell imaging data acquired through soft X-ray tomography. During the course of the study, both complete and partial ultrastructure annotations were produced manually for different subsets of the data. To use existing annotations more effectively, we propose a method that enables the use of partially labeled data for full label segmentation. For experimental validation, we apply our method to train a convolutional neural network with a set of 12 fully annotated and 12 partially annotated data items and show promising improvement over standard training that uses fully annotated data alone.

preprint 2020 arXiv

Reachability Queries with Label and Substructure Constraints on Knowledge Graphs

Since knowledge graphs (KGs) describe and model the relationships between entities and concepts in the real world, reasoning on KGs often corresponds to reachability queries with label and substructure constraints (LSCR). Specifically, for a search path p, an LSCR query requires not only that the labels of the edges traversed by p belong to a given label set, but also that a vertex on p satisfies a given substructure constraint. LSCR queries are much more complex than label-constraint reachability (LCR) queries, and to the best of our knowledge there is no efficient solution for LSCR queries on KGs. Motivated by this, we introduce two solutions for such queries on KGs, UIS and INS. The former can also be used for general edge-labeled graphs and is relatively easy to implement in practice. The latter is an efficient local-index-based informed search strategy. An extensive experimental evaluation, on both synthetic and real KGs, illustrates that our solutions can efficiently process LSCR queries on KGs.
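
The query semantics described above can be sketched as a breadth-first search over (vertex, constraint-satisfied) states. This is a plain baseline written for illustration, not the paper's UIS or INS; the substructure constraint is simplified to a hypothetical per-vertex predicate `satisfies`, and paths are allowed to revisit vertices.

```python
from collections import deque


def lscr_reachable(edges, source, target, allowed_labels, satisfies):
    """Check whether target is reachable from source under two constraints.

    edges: dict mapping a vertex to a list of (label, neighbor) pairs.
    allowed_labels: set of edge labels the path may use.
    satisfies: predicate on a vertex, standing in for a substructure check;
               at least one vertex on the path must satisfy it.
    """
    start = (source, bool(satisfies(source)))
    seen = {start}
    queue = deque([start])
    while queue:
        vertex, ok = queue.popleft()
        if vertex == target and ok:
            return True
        for label, neighbor in edges.get(vertex, ()):
            if label not in allowed_labels:
                continue  # edge label violates the label constraint
            state = (neighbor, ok or bool(satisfies(neighbor)))
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


# Toy edge-labeled graph with hypothetical vertices and labels.
graph = {
    "a": [("knows", "b"), ("likes", "c")],
    "b": [("knows", "d")],
    "c": [("knows", "d")],
}
found = lscr_reachable(graph, "a", "d", {"knows", "likes"}, lambda v: v == "c")
```

Tracking the flag alongside the vertex at most doubles the state space, so the search stays linear in the size of the graph apart from the cost of the predicate itself.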

preprint 2016 arXiv

Data Source Selection for Information Integration in Big Data Era

In the big data era, information integration often requires abundant data extracted from massive data sources. Because the number of data sources is large, data source selection plays a crucial role in information integration, since it is costly and even impossible to access all data sources. Data source selection should consider both efficiency and effectiveness. For efficiency, the approach should achieve high performance and scale to a large number of data sources. For effectiveness, data quality and the overlap between sources must be considered: data quality varies greatly across sources, with significant differences in the accuracy and coverage of the data provided, and overlap between sources can even lower the quality of the data integrated from the selected sources. In this paper, we study the source selection problem in the big data era and propose methods that scale to datasets with up to millions of data sources and guarantee the quality of results. Motivated by this, we propose a new objective function that takes the expected number of true values a source can provide as the criterion for evaluating the contribution of a data source. Based on our proposed index, we present a scalable algorithm and two pruning strategies to improve efficiency without sacrificing precision. Experimental results on both real-world and synthetic datasets show that our methods can efficiently select sources providing a large proportion of true values and can scale to massive data sources.

preprint 2016 arXiv

Efficient Entity Resolution on Heterogeneous Records

Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. Specifically, the schemas of records may differ from each other. To make better use of such records, most existing work assumes that schema matching and data exchange have been done to convert records under different schemas to a predefined schema. However, we observe that schema matching can lose information that may be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper we address several challenges of ER on heterogeneous records and show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we design a similarity function and propose a novel framework to iteratively find records that refer to the same entity. For efficiency, we build an index to generate candidates and accelerate similarity computation. Evaluations on real-world datasets show the effectiveness and efficiency of our methods.

preprint 2016 arXiv

Efficient Web-based Data Imputation with Graph Model

A challenge for data imputation is the lack of knowledge. In this paper, we attempt to address this challenge by incorporating extra knowledge from the web. To achieve high-performance web-based imputation, we use dependencies, i.e., functional dependencies (FDs) and conditional functional dependencies (CFDs), to impute as many values as possible automatically, and fill in the remaining missing values with minimal access to the web, whose cost is relatively large. To make sufficient use of dependencies, we model the dependency set on the data as a graph and perform automatic imputation and keyword generation for web-based imputation based on this graph model. With the generated keywords, we design two algorithms to extract values for imputation from the search results. Extensive experimental results on real-world data collections show that the proposed approach imputes missing values efficiently and effectively compared with existing approaches.
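
The dependency-based step can be illustrated with a single functional dependency: if `lhs -> rhs` holds, a missing `rhs` value can be copied from any other row with the same `lhs` value. The sketch below covers only that idea, not the paper's graph model or its web-based step; the function name and the `zip`/`city` columns are hypothetical.

```python
def impute_with_fd(rows, lhs, rhs):
    """Fill missing rhs values using the functional dependency lhs -> rhs.

    rows: list of dicts; None marks a missing value.
    Returns new rows; the input is left unchanged. Values that no other
    row determines stay missing (these would need an external source).
    """
    # Learn the lhs -> rhs mapping from rows where both values are present.
    known = {}
    for row in rows:
        if row[lhs] is not None and row[rhs] is not None:
            known.setdefault(row[lhs], row[rhs])

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[rhs] is None and new_row[lhs] in known:
            new_row[rhs] = known[new_row[lhs]]
        filled.append(new_row)
    return filled


# Hypothetical table with the dependency zip -> city.
table = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": None},
    {"zip": "94105", "city": None},
]
result = impute_with_fd(table, lhs="zip", rhs="city")
```

The second row is filled from the first, while the third stays missing because no row determines its value; that residue is what the abstract proposes to fetch from the web.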

preprint 2015 arXiv

Efficient Influence Maximization in Weighted Independent Cascade Model

The influence maximization (IM) problem is to find a seed set in a social network that achieves the maximal influence spread. This problem plays an important role in viral marketing. Numerous models have been proposed to solve this problem. However, none of them considers the attributes of nodes. Focusing only on the structure of the network makes it difficult to apply these models to real-world applications. Motivated by this, we present the weighted independent cascade (WIC) model, a novel cascade model that extends the applicability of the independent cascade (IC) model by attaching attributes to the nodes. The IM problem in the WIC model is to maximize the value of the influenced nodes. This problem is NP-hard. To solve it, we present a basic greedy algorithm and the Weight Reset (WR) algorithm. Moreover, we propose the Bounded Weight Reset (BWR) algorithm to further improve efficiency by bounding the diffusion node influence. We prove that BWR is a fully polynomial-time approximation scheme (FPTAS). Experimentally, we show that with the additional node attribute, the solution achieved by the WIC model outperforms that of the IC model in nearly 90%. The experimental results show that BWR achieves excellent approximation and is more than three orders of magnitude faster than the greedy algorithm with little sacrifice of accuracy. In particular, BWR can handle large networks with millions of nodes in several tens of seconds while keeping rather high accuracy. These results demonstrate that BWR can solve the IM problem effectively and efficiently.
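
A basic greedy approach of the kind mentioned above can be sketched with Monte Carlo simulation of an independent cascade in which each node carries a weight. This is a generic illustration under the assumption that the objective is the total weight of activated nodes; it is not the paper's WR or BWR algorithm, and the graph, weights and function names are hypothetical.

```python
import random


def simulate_value(graph, weights, seeds, rng):
    """Run one independent-cascade simulation; return total weight activated.

    graph: dict mapping a node to a list of (neighbor, activation_probability).
    weights: dict mapping every node to its value.
    Each newly activated node gets one chance to activate each inactive neighbor.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        for neighbor, prob in graph.get(node, ()):
            if neighbor not in active and rng.random() < prob:
                active.add(neighbor)
                frontier.append(neighbor)
    return sum(weights[n] for n in active)


def greedy_seeds(graph, weights, k, n_sims=200, seed=0):
    """Greedily add the node with the best estimated value, k times."""
    rng = random.Random(seed)

    def expected_value(seeds):
        total = sum(simulate_value(graph, weights, seeds, rng) for _ in range(n_sims))
        return total / n_sims

    chosen = []
    for _ in range(min(k, len(weights))):
        remaining = sorted(set(weights) - set(chosen))
        best = max(remaining, key=lambda n: expected_value(chosen + [n]))
        chosen.append(best)
    return chosen


# Toy network: a chain a -> b -> c plus an isolated but valuable node d.
graph = {"a": [("b", 1.0)], "b": [("c", 1.0)], "d": []}
weights = {"a": 1, "b": 1, "c": 1, "d": 5}
seeds = greedy_seeds(graph, weights, k=2, n_sims=5)
```

In this toy network the isolated node is picked first because its weight exceeds the whole chain, a choice that a purely structural model would not make. Re-simulating every candidate at every step is what makes plain greedy slow on large networks.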

preprint 2012 arXiv

Efficient Subgraph Matching on Billion Node Graphs

The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.
