Researcher profile

Fan Min

Fan Min contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
11 works
0 followers
4 topics
4 close collaborators

Published work

11 published item(s)

preprint, 2013, arXiv

Characteristic matrix of covering and its application to boolean matrix decomposition and axiomatization

Covering is an important type of data structure, and covering-based rough sets provide an efficient and systematic theory for dealing with covering data. In this paper, we use boolean matrices to represent and axiomatize three types of covering approximation operators. First, we define two types of characteristic matrices of a covering, which are essentially square boolean matrices, and study their properties. Through the characteristic matrices, three important types of covering approximation operators are represented concisely and equivalently. Second, matrix representations of covering approximation operators are used in boolean matrix decomposition. We provide a necessary and sufficient condition for a square boolean matrix to decompose into the boolean product of another matrix and its transpose, and we develop an algorithm for this boolean matrix decomposition. Finally, based on the above results, these three types of covering approximation operators are axiomatized using boolean matrices. In short, this work borrows extensively from boolean matrices and presents a new view for studying covering-based rough sets.
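
As a rough illustration of the boolean product of a matrix and its transpose mentioned in this abstract, the sketch below builds an element-by-block membership matrix for a small, made-up covering and multiplies it by its transpose. The function names, the toy covering and the choice of the membership matrix are assumptions for illustration only; this is not the paper's definition of the characteristic matrices, nor its decomposition algorithm.

```python
def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is OR over k of (a[i][k] AND b[k][j])."""
    return [
        [int(any(a[i][k] and b[k][j] for k in range(len(b))))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of lists."""
    return [list(row) for row in zip(*m)]


def membership_matrix(universe, covering):
    """Row i, column j is 1 when element i belongs to block j of the covering."""
    return [[int(x in block) for block in covering] for x in universe]


# Hypothetical toy covering of a four-element universe.
universe = [1, 2, 3, 4]
covering = [{1, 2}, {2, 3}, {4}]

m = membership_matrix(universe, covering)
# Square boolean matrix: entry (i, j) is 1 when elements i and j share a block.
g = boolean_product(m, transpose(m))
```

In this toy case `g` is a square boolean matrix that is, by construction, the boolean product of `m` and its transpose, which is the shape of decomposition the abstract describes.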

preprint, 2013, arXiv

Cold-start recommendation through granular association rules

Recommender systems are popular in e-commerce because they suggest items of interest to users. Researchers have addressed the cold-start problem where either the user or the item is new. However, the situation where both the user and the item are new has seldom been considered. In this paper, we propose a cold-start recommendation approach to this situation based on granular association rules. Specifically, we provide a means for describing users and items through information granules, a means for generating association rules between users and items, and a means for recommending items to users using these rules. Experiments are undertaken on the publicly available MovieLens dataset. Results indicate that rule sets perform similarly on the training and testing sets, and that an appropriate granule setting is essential to the application of granular association rules.

preprint, 2013, arXiv

Cost-Sensitive Feature Selection of Data with Errors

In data mining applications, feature selection is an essential process since it reduces a model's complexity. The cost of obtaining the feature values must be taken into consideration in many domains. In this paper, we study the cost-sensitive feature selection problem on numerical data with measurement errors, test costs and misclassification costs. The major contributions of this paper are four-fold. First, a new data model is built to address test costs and misclassification costs as well as error boundaries. Second, a covering-based rough set with measurement errors is constructed. Given a confidence interval, the neighborhood is an ellipse in a two-dimensional space, an ellipsoid in a three-dimensional space, and so on. Third, a new cost-sensitive feature selection problem is defined on this covering-based rough set. Fourth, both backtracking and heuristic algorithms are proposed to deal with this new problem. The algorithms are tested on six UCI (University of California - Irvine) data sets. Experimental results show that (1) the pruning techniques of the backtracking algorithm help reduce the number of operations significantly, and (2) the heuristic algorithm usually obtains optimal results. This study is a step toward realistic applications of cost-sensitive learning.

preprint, 2013, arXiv

Granular association rule mining through parametric rough sets for cold start recommendation

Granular association rules reveal patterns hidden in many-to-many relationships, which are common in relational databases. In recommender systems, these rules are appropriate for cold-start recommendation, where a customer or a product has just entered the system. An example of such a rule might be "40% of men like at least 30% of the kinds of alcohol; 45% of customers are men and 6% of products are alcohol." Mining such rules is a challenging problem due to pattern explosion. In this paper, we propose a new type of parametric rough sets on two universes to study this problem. The model is deliberately defined such that the parameter corresponds to one threshold of the rules. With the lower approximation operator in the new parametric rough sets, a backward algorithm is designed for the rule mining problem. Experiments on two real-world data sets show that the new algorithm is significantly faster than the existing sandwich algorithm. This study indicates a new application area, namely recommender systems, for relational data mining, granular computing and rough sets.
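
The four percentages in the example rule can be read as measures computed over a user table, an item table and a many-to-many "likes" relation. The sketch below computes them on made-up data. The function name, the dictionary keys and the toy tables are hypothetical, and this is a plain counting illustration, not the backward or sandwich algorithm from the paper.

```python
def rule_measures(users, items, likes, source, target, min_target_fraction):
    """Measures for a rule of the form 'source users like target items'.

    users / items map an identifier to a single attribute value,
    likes is a set of (user, item) pairs,
    min_target_fraction is the share of target items a user must like.
    """
    src = {u for u, attr in users.items() if attr == source}
    tgt = {i for i, attr in items.items() if attr == target}

    # Share of all users (items) that fall in the source (target) granule.
    source_coverage = len(src) / len(users)
    target_coverage = len(tgt) / len(items)

    # Share of source users who like at least the required fraction of target items.
    if src and tgt:
        needed = min_target_fraction * len(tgt)
        qualifying = [u for u in src
                      if sum((u, i) in likes for i in tgt) >= needed]
        source_confidence = len(qualifying) / len(src)
    else:
        source_confidence = 0.0

    return {
        "source_coverage": source_coverage,
        "target_coverage": target_coverage,
        "source_confidence": source_confidence,
    }


# Hypothetical toy data.
users = {"u1": "male", "u2": "male", "u3": "female", "u4": "female"}
items = {"beer": "alcohol", "wine": "alcohol",
         "bread": "food", "milk": "food", "tea": "drink"}
likes = {("u1", "beer"), ("u1", "wine"), ("u2", "bread"), ("u3", "wine")}

measures = rule_measures(users, items, likes, "male", "alcohol", 0.5)
```

On this toy data the result reads as "50% of men like at least 50% of the kinds of alcohol; 50% of customers are men and 40% of products are alcohol."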

preprint, 2013, arXiv

Granular association rules for multi-valued data

Granular association rule mining is a new approach to revealing patterns hidden in many-to-many relationships of relational databases. Different types of data, such as nominal, numeric and multi-valued data, should be dealt with in the process of rule mining. In this paper, we study multi-valued data and develop techniques to filter out strong but uninteresting rules. An example of such a rule might be "male students rate movies released in 1990s that are NOT thriller." Rules of this kind, called negative granular association rules, often overwhelm positive ones, which are more useful. To address this issue, we filter out negative granules such as "NOT thriller" in the process of granule generation. In this way, only positive granular association rules are generated and strong ones are mined. Experimental results on the MovieLens data set indicate that most rules are negative, and that our technique is effective in filtering them out.

preprint, 2013, arXiv

Minimal cost feature selection of data with normal distribution measurement errors

Minimal cost feature selection is devoted to obtaining a trade-off between test costs and misclassification costs. This issue has been addressed recently on nominal data. In this paper, we consider numerical data with measurement errors and study minimal cost feature selection in this model. First, we build a data model with normal distribution measurement errors. Second, the neighborhood of each data item is constructed through the confidence interval. Compared with discretized intervals, neighborhoods are a more reasonable way to maintain the information of the data. Third, we define a new minimal total cost feature selection problem by considering the trade-off between test costs and misclassification costs. Fourth, we propose a backtracking algorithm with three effective pruning techniques to deal with this problem. The algorithm is tested on four UCI data sets. Experimental results indicate that the pruning techniques are effective, and that the algorithm is efficient for data sets with nearly one thousand objects.

preprint, 2013, arXiv

Mining top-k granular association rules for recommendation

Recommender systems are important for e-commerce companies as well as researchers. Recently, granular association rules have been proposed for cold-start recommendation. However, existing approaches retain only globally strong rules; therefore some users may receive no recommendation at all. In this paper, we propose to mine the top-k granular association rules for each user. First, we define three measures of granular association rules. These are the source coverage, which measures the user granule size; the target coverage, which measures the item granule size; and the confidence, which measures the strength of the association. With the confidence measure, rules can be ranked according to their strength. Then we propose algorithms for training the recommender and suggesting items to each user. Experiments are undertaken on the publicly available MovieLens data set. Results indicate that an appropriate granule setting can avoid over-fitting and, at the same time, help obtain high recommendation accuracy.
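
The idea of keeping the k strongest rules per user, rather than only globally strong rules, can be sketched as follows. The rule records, the granule labels and the function name are hypothetical, and the confidence values are made up rather than mined; the abstract does not give the paper's training or recommendation algorithms, so this shows only the ranking step.

```python
import heapq


def top_k_rules(user_granules, rules, k):
    """Return the k highest-confidence rules whose source granule matches the user."""
    matching = [r for r in rules if r["source"] in user_granules]
    return heapq.nlargest(k, matching, key=lambda r: r["confidence"])


# Hypothetical rules with made-up confidence values.
rules = [
    {"source": "male", "target": "action", "confidence": 0.6},
    {"source": "student", "target": "comedy", "confidence": 0.8},
    {"source": "female", "target": "drama", "confidence": 0.9},
    {"source": "male", "target": "sci-fi", "confidence": 0.7},
]

# A user described by two granules receives their own top-2 rules,
# even though the globally strongest rule does not apply to them.
selected = top_k_rules({"male", "student"}, rules, k=2)
recommended_targets = [r["target"] for r in selected]
```

Because ranking is done per user, every user who matches at least one rule receives a recommendation, which is the gap in global thresholding that the abstract points out.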

preprint, 2013, arXiv

Test-cost-sensitive attribute reduction of data with normal distribution measurement errors

Normally distributed measurement error is ubiquitous in applications. Generally, a smaller measurement error requires a better instrument and a higher test cost. In decision making based on attribute values of objects, we should select an attribute subset with appropriate measurement error to minimize the total test cost. Recently, an error-range-based covering rough set with uniform distribution error was proposed to investigate this issue. However, measurement errors follow a normal distribution rather than a uniform distribution, which is too simple for most applications. In this paper, we introduce normal distribution measurement errors to the covering-based rough set model, and deal with the test-cost-sensitive attribute reduction problem in this new model. The major contributions of this paper are four-fold. First, we build a new data model based on normal distribution measurement errors. With the new data model, the error range is an ellipse in a two-dimensional space. Second, the covering-based rough set with normal distribution measurement errors is constructed through the "3-sigma" rule. Third, the test-cost-sensitive attribute reduction problem is redefined on this covering-based rough set. Fourth, a heuristic algorithm is proposed to deal with this problem. The algorithm is tested on ten UCI (University of California - Irvine) datasets. The experimental results show that the algorithm is more effective and efficient than the existing one. This study is a step toward realistic applications of cost-sensitive learning.
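
One way to picture an elliptical error range built from the "3-sigma" rule is to treat two objects as neighbors when their scaled squared differences sum to at most one. The sketch below does exactly that. The formula, the function names, the sample data and the per-attribute standard deviations are all assumptions made for illustration; the abstract does not state the paper's exact neighborhood definition.

```python
def in_neighborhood(x, y, sigmas, k=3.0):
    """True when y lies inside the ellipse (ellipsoid) centred at x.

    Assumed form: sum over attributes of ((x_a - y_a) / (k * sigma_a))**2 <= 1,
    with k = 3 standing in for the "3-sigma" rule.
    """
    return sum(((xa - ya) / (k * s)) ** 2
               for xa, ya, s in zip(x, y, sigmas)) <= 1.0


def neighborhood(index, data, sigmas, k=3.0):
    """Indices of all objects inside the error range of data[index]."""
    return [j for j, y in enumerate(data)
            if in_neighborhood(data[index], y, sigmas, k)]


# Hypothetical two-attribute data with assumed error standard deviations.
data = [(1.0, 10.0), (1.2, 10.5), (3.0, 10.0), (1.0, 14.0)]
sigmas = (0.1, 0.5)

near_first = neighborhood(0, data, sigmas)
near_third = neighborhood(2, data, sigmas)
```

The neighborhoods of all objects together cover the data set, which is what makes a covering-based rough set a natural model here. A smaller sigma, meaning a better and costlier instrument, shrinks each neighborhood.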

preprint, 2012, arXiv

A Comparative Study of Discretization Approaches for Granular Association Rule Mining

Granular association rule mining is a new relational data mining approach to revealing patterns hidden in multiple tables. Current research on granular association rule mining considers only nominal data. In this paper, we study the impact of discretization approaches on mining semantically richer and stronger rules from numeric data. Specifically, the Equal Width approach and the Equal Frequency approach are adopted and compared. The setting of interval numbers is a key issue in discretization approaches, so we compare different settings through experiments on a well-known real-life data set. Experimental results show that: 1) discretization is an effective preprocessing technique in mining stronger rules; 2) the Equal Frequency approach helps generate more rules than the Equal Width approach; 3) with certain settings of interval numbers, we can obtain many more rules than with others.
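
The two discretization approaches compared in this abstract differ in how they place interval boundaries, which a small sketch makes concrete. The function names and the sample values are hypothetical, and the equal-frequency version is a simplification that splits by rank and may therefore place tied values in different intervals.

```python
def equal_width(values, bins):
    """Assign each value to one of `bins` intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in interval `bins`, so clamp it to the last one.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def equal_frequency(values, bins):
    """Assign each value to one of `bins` intervals holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


# Hypothetical skewed attribute: one outlier dominates the value range.
values = [1, 2, 3, 4, 5, 100]

width_labels = equal_width(values, 3)
frequency_labels = equal_frequency(values, 3)
```

On skewed data like this, Equal Width puts almost every value in the first interval, while Equal Frequency spreads them evenly. The number of intervals is the setting the abstract identifies as a key issue.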

preprint, 2012, arXiv

Cost-sensitive C4.5 with post-pruning and competition

Decision trees are an effective classification approach in data mining and machine learning. In applications, test costs and misclassification costs should be considered while inducing decision trees. Recently, some cost-sensitive learning algorithms based on ID3, such as CS-ID3, IDX and λ-ID3, have been proposed to deal with the issue. These algorithms deal with only symbolic data. In this paper, we develop a decision tree algorithm inspired by C4.5 for numeric data. There are two major issues for our algorithm. First, we develop the test cost weighted information gain ratio as the heuristic information. According to this heuristic information, our algorithm picks the attribute that provides more gain ratio and costs less at each selection. Second, we design a post-pruning strategy by considering the trade-off between test costs and misclassification costs of the generated decision tree. In this way, the total cost is reduced. Experimental results indicate that (1) our algorithm is stable and effective; (2) the post-pruning technique reduces the total cost significantly; (3) the competition strategy is effective in obtaining a cost-sensitive decision tree with low cost.

preprint, 2012, arXiv

Feature selection with test cost constraint

Feature selection is an important preprocessing step in machine learning and data mining. In real-world applications, costs, including money, time and other resources, are required to acquire the features. In some cases, there is a test cost constraint due to limited resources. We should deliberately select an informative and cheap feature subset for classification. This paper proposes the feature selection with test cost constraint problem for this issue. The new problem has a simple form when described as a constraint satisfaction problem (CSP). Backtracking is a general algorithm for CSP, and it is efficient in solving the new problem on medium-sized data. As the backtracking algorithm is not scalable to large datasets, a heuristic algorithm is also developed. Experimental results show that the heuristic algorithm can find the optimal solution in most cases. We also redefine some existing feature selection problems in rough sets, especially in decision-theoretic rough sets, from the viewpoint of CSP. These new definitions provide insight into some new research directions.
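
A backtracking search under a test cost budget can be sketched in a few lines. Everything named here is hypothetical: the function `select_features`, the toy rows and costs, and in particular the `quality` function, which simply counts how many distinct value combinations a feature subset produces. The paper's own informativeness measure and pruning rules are not given in the abstract, so this is a generic illustration of the constraint-satisfaction view rather than the authors' algorithm.

```python
def select_features(costs, budget, quality):
    """Backtracking search for the best feature subset with total cost <= budget.

    costs maps feature name to test cost, quality scores a tuple of features.
    Ties in quality are broken in favour of the cheaper subset.
    """
    features = sorted(costs)
    best = {"subset": (), "quality": quality(()), "cost": 0}

    def backtrack(start, chosen, cost):
        q = quality(tuple(chosen))
        if q > best["quality"] or (q == best["quality"] and cost < best["cost"]):
            best.update(subset=tuple(chosen), quality=q, cost=cost)
        for idx in range(start, len(features)):
            f = features[idx]
            if cost + costs[f] > budget:
                continue  # prune: adding this feature would break the constraint
            chosen.append(f)
            backtrack(idx + 1, chosen, cost + costs[f])
            chosen.pop()

    backtrack(0, [], 0)
    return best


# Hypothetical data: feature "c" separates all rows but is too expensive.
rows = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 2},
    {"a": 1, "b": 1, "c": 3},
]
costs = {"a": 2, "b": 3, "c": 6}


def quality(subset):
    """Assumed proxy: number of distinct value combinations on the subset."""
    return len({tuple(row[f] for f in subset) for row in rows})


result = select_features(costs, budget=5, quality=quality)
```

With a budget of 5 the search settles on features `a` and `b`, which together distinguish all four rows at a cost the single expensive feature `c` cannot match. The search is exponential in the number of features in the worst case, consistent with the abstract's remark that backtracking does not scale to large datasets.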