Source author record

Chengfei Liu

Chengfei Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Databases, Social and Information Networks, Computer Vision, Data Structures and Algorithms, Information Retrieval, Machine Learning

Catalog footprint

What is connected

10 works

6 topics

4 close collaborators


preprint, 2023, arXiv

Multi-stage feature decorrelation constraints for improving CNN classification performance

For convolutional neural networks (CNNs) used for pattern classification, the training loss function is usually applied only to the final output of the network, apart from some regularization constraints on the network parameters. However, as the number of network layers increases, the influence of the loss function on the front layers of the network gradually decreases, and the network parameters tend to fall into local optima. At the same time, the trained network is found to have significant information redundancy in the features at all stages, which reduces the effectiveness of feature mapping at every stage and hinders the subsequent parameters from moving toward the optimum. It is therefore possible to obtain a better solution and further improve classification accuracy by designing a loss function that constrains the front-stage features and eliminates their information redundancy. For CNNs, this article proposes a multi-stage feature decorrelation loss (MFD Loss), which refines effective features and eliminates information redundancy by constraining the correlation of features at all stages. Because a CNN has many layers, MFD Loss is applied, following experimental comparison and analysis, to multiple front layers of the CNN, where it constrains the output features of each layer and each channel and is trained jointly with the classification loss function. Experiments on several commonly used datasets with several typical CNNs show that Softmax Loss + MFD Loss classifies significantly better than supervised learning with Softmax Loss alone. Comparison experiments before and after combining MFD Loss with other typical loss functions also verify its generality.
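The abstract does not give the MFD Loss formula, so the following is only a minimal sketch of one common kind of decorrelation penalty: the mean squared Pearson correlation between distinct feature channels over a batch, added to a classification loss with a weight. The function names, the `weight` parameter, and the choice of Pearson correlation are assumptions for illustration, not the authors' definition.

```python
import math


def decorrelation_penalty(features):
    """Mean squared off-diagonal Pearson correlation between channels.

    features: list of samples, each a list of per-channel activations.
    This is an illustrative penalty, not the paper's MFD Loss.
    """
    n = len(features)
    c = len(features[0])
    # Center each channel over the batch.
    means = [sum(row[j] for row in features) / n for j in range(c)]
    centered = [[row[j] - means[j] for j in range(c)] for row in features]
    # Per-channel norms; eps avoids division by zero for constant channels.
    eps = 1e-12
    norms = [math.sqrt(sum(r[j] ** 2 for r in centered)) + eps for j in range(c)]
    total, pairs = 0.0, 0
    for a in range(c):
        for b in range(a + 1, c):
            dot = sum(r[a] * r[b] for r in centered)
            corr = dot / (norms[a] * norms[b])
            total += corr ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


def combined_loss(classification_loss, stage_features, weight=0.1):
    """Classification loss plus a weighted penalty summed over several stages."""
    penalty = sum(decorrelation_penalty(f) for f in stage_features)
    return classification_loss + weight * penalty
```

In this sketch a stage whose channels are exact multiples of one another receives a penalty close to 1, while a stage with uncorrelated channels receives a penalty of 0, so the extra term pushes training toward less redundant front-stage features.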

preprint, 2022, arXiv

CHIEF: Clustering with Higher-order Motifs in Big Networks

Clustering a group of vertices in networks facilitates applications across different domains, such as social computing and the Internet of Things. However, challenges arise when clustering networks of increasing scale. This paper proposes a solution consisting of two motif clustering techniques: standard acceleration (CHIEF-ST) and approximate acceleration (CHIEF-AP). Both algorithms first find the maximal k-edge-connected subgraphs within the target networks to lower the network scale, then employ higher-order motifs in clustering. In the first procedure, we propose to lower the network scale by optimizing the network structure with maximal k-edge-connected subgraphs. For CHIEF-ST, we show that all target motifs are kept after this procedure when the minimum node degree of the target motif is equal to or greater than k. For CHIEF-AP, we prove that the eigenvalues of the adjacency matrix and the Laplacian matrix are relatively stable after this step. That is, CHIEF-ST has no influence on motif clustering, whereas CHIEF-AP introduces a limited yet acceptable impact. In the second procedure, we employ higher-order motifs, i.e., heterogeneous four-node motifs, to cluster higher-order dense networks. The contributions of CHIEF are two-fold: (1) improved efficiency of motif clustering for big networks; (2) verification of the significance of higher-order motifs. Experiments on real and synthetic networks show that the proposed solutions outperform baseline approaches, which demonstrates CHIEF's strength in large network analysis. Higher-order motifs are also shown to perform better than traditional triangle motifs in clustering.

preprint, 2020, arXiv

Efficient Exact Algorithms for Maximum Balanced Biclique Search in Bipartite Graphs

Given a bipartite graph, the maximum balanced biclique (\textsf{MBB}) problem, which seeks two mutually connected, equal-sized disjoint vertex sets of maximum cardinality, plays a significant role in mining bipartite graphs and has numerous applications. Despite the NP-hardness of the \textsf{MBB} problem, in this paper we show that an exact \textsf{MBB} can be discovered extremely fast in bipartite graphs from real applications. We propose two exact algorithms, dedicated to dense and sparse bipartite graphs respectively. For dense bipartite graphs, an $\mathcal{O}^{*}(1.3803^{n})$ algorithm is proposed. This algorithm can in fact find an \textsf{MBB} in near-polynomial time for dense bipartite graphs, which are common in applications such as VLSI design. This is because, using our proposed novel techniques, the search quickly converges to sufficiently dense bipartite graphs, which we prove to be polynomially solvable. For large sparse bipartite graphs, typical of applications such as biological data analysis, an $\mathcal{O}^{*}(1.3803^{\ddot{\delta}})$ algorithm is proposed, where $\ddot{\delta}$ is only a few hundred for large sparse bipartite graphs with millions of vertices. The indispensable optimization behind this time complexity is to transform a large sparse bipartite graph into a limited number of dense subgraphs of size up to $\ddot{\delta}$ and then apply our proposed algorithm for dense bipartite graphs to each of the subgraphs. To further speed up this algorithm, tighter upper bounds, faster heuristics and effective reductions are proposed, allowing an \textsf{MBB} to be discovered within a few seconds for bipartite graphs with millions of vertices. Extensive experiments are conducted on synthetic and real large bipartite graphs to demonstrate the efficiency and effectiveness of our proposed algorithms and techniques.
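To make the problem statement concrete, here is a minimal brute-force sketch of balanced biclique search on a tiny bipartite graph. It is not either of the paper's algorithms: it simply enumerates subsets of the left side, so it runs in exponential time and is only suitable for toy inputs. The function name and the input layout (edges given as `(left_vertex, right_vertex)` pairs) are assumptions for illustration.

```python
from itertools import combinations


def max_balanced_biclique(left, right, edges):
    """Exhaustively find a largest balanced biclique (exponential time).

    left, right: lists of vertex labels for the two sides.
    edges: iterable of (left_vertex, right_vertex) pairs.
    Returns (A, B) with len(A) == len(B) and every a in A adjacent to every b in B.
    """
    adj = {u: set() for u in left}
    for u, v in edges:
        adj[u].add(v)
    # Try the largest possible side size first and shrink until one fits.
    for size in range(min(len(left), len(right)), 0, -1):
        for a in combinations(left, size):
            # Right vertices adjacent to every vertex chosen on the left.
            common = set(right)
            for u in a:
                common &= adj[u]
            if len(common) >= size:
                return list(a), sorted(common)[:size]
    return [], []
```

For example, with left side `a, b, c`, right side `x, y, z`, and edges `a-x, a-y, b-x, b-y, c-z`, the sketch returns the sides `['a', 'b']` and `['x', 'y']`.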

preprint, 2020, arXiv

Index-based Solutions for Efficient Density Peak Clustering

Density Peak Clustering (DPC), a popular density-based clustering approach, has received considerable attention from the research community, primarily because of its simplicity and the small number of parameters it requires. However, the clusters obtained using DPC are influenced by the sensitive parameter $d_c$, which depends on the data distribution and the requirements of different users. Moreover, the original DPC algorithm requires visiting a large number of objects, which makes it slow. To this end, this paper investigates index-based solutions for DPC. Specifically, we propose two list-based index methods: (i) a simple List Index, and (ii) an advanced Cumulative Histogram Index. Efficient query algorithms are proposed for these indices, which avoid many irrelevant comparisons at the cost of space. For memory-constrained systems, we further introduce an approximate solution for these indices that allows a substantial reduction in space cost, provided that slight inaccuracies are admissible. Furthermore, because existing tree-based index structures have considerably lower memory requirements, we also present effective pruning techniques and efficient query algorithms to support DPC using the popular Quadtree Index and R-tree Index. Finally, we evaluate all of the above indices in practice and present the findings and results obtained from a set of extensive experiments on six synthetic and real datasets. The experimental insights obtained can help guide the selection of a suitable index.
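For orientation, the two quantities at the heart of standard DPC can be sketched as follows: the local density of a point (here, the number of other points closer than the cutoff $d_c$) and its dependent distance (the distance to the nearest point of strictly higher density). This is the brute-force, all-pairs computation that the paper's indices aim to speed up, not the index-based method itself; the function name and the strict `<` cutoff are assumptions.

```python
import math


def density_peak_scores(points, d_c):
    """Brute-force DPC scores: local density rho and dependent distance delta.

    points: list of equal-length coordinate tuples.
    d_c: cutoff distance.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points strictly closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points with strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Points with both a high `rho` and a high `delta` are the candidate cluster centres. Every call compares all pairs of points, so the cost grows quadratically, and changing `d_c` forces a full recomputation.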

preprint, 2014, arXiv

Efficient Truss Maintenance in Evolving Networks

The truss was proposed to study social network data represented by graphs. A k-truss of a graph is a cohesive subgraph in which each edge is contained in at least k-2 triangles within the subgraph. While the truss has been shown to be a superior model of close relationships in social networks, and efficient algorithms for finding trusses have been extensively studied, very little attention has been paid to truss maintenance. However, most social networks are evolving networks, and it may be infeasible to recompute trusses from scratch from time to time in order to find the up-to-date $k$-trusses in such networks. In this paper, we discuss how to maintain trusses in a graph with dynamic updates. We first discuss a set of properties for maintaining trusses, then propose algorithms for maintaining trusses under edge deletions and insertions, and finally discuss truss index maintenance. We test the proposed techniques on real datasets. The experimental results show the promise of our work.
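The k-truss definition can be illustrated with a minimal peeling sketch: repeatedly delete any edge that lies in fewer than k-2 triangles until no such edge remains. This is the from-scratch computation that truss maintenance tries to avoid repeating after every update, not the paper's maintenance algorithm; the function name and the edge-list input are assumptions.

```python
def k_truss(edges, k):
    """Return the edges of the k-truss of an undirected simple graph.

    edges: iterable of (u, v) pairs with comparable vertex labels.
    Every returned edge lies in at least k - 2 triangles of the result.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Support of (u, v) = number of common neighbours.
                if len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
```

Removing one edge can lower the support of neighbouring edges, which is why the loop repeats until a full pass makes no change. The same cascading effect is what makes incremental maintenance under insertions and deletions non-trivial.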

preprint, 2013, arXiv

Context-based Diversification for Keyword Queries over XML Data

While keyword queries empower ordinary users to search vast amounts of data, their ambiguity makes them difficult to answer effectively, especially when they are short and vague. To address this challenging problem, in this paper we propose an approach that automatically diversifies XML keyword search based on the different contexts of the keywords in the XML data. Given a short and vague keyword query and the XML data to be searched, we first derive keyword search candidates for the query using a classical feature selection model. We then design an effective XML keyword search diversification model to measure the quality of each candidate. After that, three efficient algorithms are proposed to evaluate the generated query candidates representing the diversified search intentions, from which we can find and return the top-$k$ qualified query candidates that are most relevant to the given keyword query while covering the maximal number of distinct results. Finally, a comprehensive evaluation on real and synthetic datasets demonstrates the effectiveness of our proposed diversification model and the efficiency of our algorithms.

preprint, 2013, arXiv

Quasi-SLCA based Keyword Query Processing over Probabilistic XML Data

The probabilistic threshold query is one of the most common queries in uncertain databases: a result satisfying the query must also have a probability that meets the threshold requirement. In this paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, which have not been studied before. We first introduce the notion of quasi-SLCA and use it to represent the results of a PrTKQ under possible world semantics. Then we design a probabilistic inverted (PI) index that can be used to quickly return the qualified answers and filter out the unqualified ones based on our proposed lower and upper bounds. After that, we propose two efficient and comparable algorithms: the Baseline Algorithm and the PI index-based Algorithm. To accelerate the algorithms, we also make use of the probability density function. An empirical study using real and synthetic data sets verifies the effectiveness and efficiency of our approaches.

preprint, 2013, arXiv

Query-driven Frequent Co-occurring Term Extraction over Relational Data using MapReduce

In this paper we study how to efficiently compute \textit{frequent co-occurring terms} (FCT) in the results of a keyword query in parallel using the popular MapReduce framework. Taking as input a keyword query q and an integer k, an FCT query reports the k terms that are not in q but appear most frequently in the results of the keyword query q over multiple joined relations. The terms returned by an FCT search can be used for query expansion and query refinement in traditional keyword search. Unlike FCT search methods designed for a single machine, our proposed approach runs on a parallel platform and can efficiently answer an FCT query using the MapReduce paradigm without pre-computing the results of the original keyword query. In this work, we output the final FCT search results using two MapReduce jobs: the first extracts the statistical information of the data, and the second calculates the total frequency of each term based on the output of the first job. In both MapReduce jobs, we keep the load of the mappers and the computation of the reducers as balanced as possible. Analytical and experimental evaluations on TPC-H benchmark datasets of different sizes demonstrate the efficiency and scalability of our proposed approach.
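The FCT query itself can be illustrated with a small in-memory sketch written in a map-then-reduce style. Unlike the approach described in the abstract, this sketch assumes the joined result rows are already materialized as text, and it runs in a single process; it also counts every occurrence of a term and breaks ties alphabetically. The function names and these choices are assumptions for illustration only.

```python
from collections import Counter


def map_row(row, query_terms):
    """Map step: emit (term, 1) for non-query terms of rows matching the query."""
    terms = row.lower().split()
    # A row is a result only if it contains every query keyword.
    if query_terms.issubset(terms):
        for term in terms:
            if term not in query_terms:
                yield term, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per term."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return totals


def fct_query(rows, query, k):
    """Return the k most frequent co-occurring terms that are not in the query."""
    query_terms = set(query.lower().split())
    pairs = (pair for row in rows for pair in map_row(row, query_terms))
    totals = reduce_counts(pairs)
    # Highest count first; ties broken alphabetically for a stable answer.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For the rows `"xml keyword search"`, `"xml query processing"`, `"keyword search xml index"` and `"graph search"`, the query `"xml"` with k = 2 yields `keyword` and `search`, each with a count of 2.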

preprint, 2013, arXiv

Update XML Views

View update is the problem of translating an update on a view into updates on the source data of the view. In this paper, we identify the factors that determine XML view update translation, propose a translation procedure, and give the translated updates to the source document for different types of views. We further show that the translated updates are precise. The proposed solution makes it possible for users who do not have access privileges to the source data to update the source data via a view.

preprint, 2011, arXiv

ELCA Evaluation for Keyword Search on Probabilistic XML Data

As probabilistic data management becomes one of the main research focuses and keyword search becomes an increasingly popular way to query, it is natural to ask how to support keyword queries on probabilistic XML data. For keyword queries on deterministic XML documents, ELCA (Exclusive Lowest Common Ancestor) semantics allows more relevant fragments rooted at the ELCAs to appear as results and is more popular than other keyword query result semantics (such as SLCA). In this paper, we investigate how to evaluate ELCA results for keyword queries on probabilistic XML documents. After defining probabilistic ELCA semantics in terms of possible world semantics, we propose an approach to compute ELCA probabilities without generating possible worlds. Then we develop an efficient stack-based algorithm that can find all probabilistic ELCA results and their ELCA probabilities for a given keyword query on a probabilistic XML document. Finally, we experimentally evaluate the proposed ELCA algorithm and compare it with its SLCA counterpart in terms of result effectiveness, time and space efficiency, and scalability.
