Source author record

Fan Min

Fan Min appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Information Retrieval, Databases, Machine Learning

Catalog footprint

What is connected

12 works

4 topics

4 close collaborators

preprint · 2016 · arXiv

Granular association rules on two universes with four measures

Relational association rules reveal patterns hidden in multiple tables. Existing rules are usually evaluated through two measures, namely support and confidence. However, these two measures may not be enough to describe the strength of a rule. In this paper, we introduce granular association rules with four measures to reveal connections between granules in two universes, and propose three algorithms for rule mining. An example of such a rule might be "40% men like at least 30% kinds of alcohol; 45% customers are men and 6% products are alcohol." Here 45%, 6%, 40%, and 30% are the source coverage, the target coverage, the source confidence, and the target confidence, respectively. With these measures, our rules are semantically richer than existing ones. Three subtypes of rules are obtained by considering special requirements on the source/target confidence. We then define a rule mining problem, and design a sandwich algorithm with different rule checking approaches for different subtypes. Experiments on a real-world dataset show that the approaches dedicated to the three subtypes are 2-3 orders of magnitude faster than the one for the general case. A forward algorithm and a backward algorithm for one particular subtype can speed up the mining process further. This work opens a new research trend concerning relational association rule mining, granular computing and rough sets.
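The four measures in the example rule can be illustrated with a small sketch. It assumes that the source confidence is the fraction of the source granule whose members like at least the target-confidence fraction of the target granule, which is one reading of the abstract. The function name, argument names and data layout are hypothetical and are not taken from the paper.

```python
def rule_measures(users, items, likes, source, target, target_conf):
    """Return (source coverage, target coverage, source confidence) for one rule.

    users, items : sets, the two universes (assumed non-empty)
    likes        : dict mapping each user to the set of items they like
    source       : non-empty subset of users, e.g. "men"
    target       : non-empty subset of items, e.g. "alcohol"
    target_conf  : minimum fraction of the target granule a user must like
    """
    source_coverage = len(source) / len(users)
    target_coverage = len(target) / len(items)
    # Users in the source granule who like enough of the target granule.
    qualifying = [
        u for u in source
        if len(likes.get(u, set()) & target) / len(target) >= target_conf
    ]
    source_confidence = len(qualifying) / len(source)
    return source_coverage, target_coverage, source_confidence
```

Fixing the target confidence and reading off the source confidence is only one way to evaluate a rule. The abstract notes that different subtypes place different requirements on these two values.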

preprint · 2013 · arXiv

Characteristic matrix of covering and its application to boolean matrix decomposition and axiomatization

Covering is an important type of data structure, and covering-based rough sets provide an efficient and systematic theory for dealing with covering data. In this paper, we use boolean matrices to represent and axiomatize three types of covering approximation operators. First, we define two types of characteristic matrices of a covering, which are essentially square boolean matrices, and study their properties. Through the characteristic matrices, three important types of covering approximation operators are concisely and equivalently represented. Second, matrix representations of covering approximation operators are used in boolean matrix decomposition. We provide a necessary and sufficient condition for a square boolean matrix to decompose into the boolean product of another matrix and its transpose. We also develop an algorithm for this boolean matrix decomposition. Finally, based on the above results, these three types of covering approximation operators are axiomatized using boolean matrices. In short, this work borrows extensively from boolean matrices and presents a new view for studying covering-based rough sets.
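The decomposition target named in the abstract, a square boolean matrix equal to the boolean product of another matrix and its transpose, can be sketched as follows. The example assumes that the "other" matrix is an element-by-block membership matrix of a covering. That interpretation and all function names are illustrative assumptions rather than the paper's definitions.

```python
def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is 1 if a[i][k] and b[k][j] for some k."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [int(any(a[i][k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for i in range(rows)
    ]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def product_with_transpose(m):
    """Square boolean matrix obtained as the boolean product of m and its transpose."""
    return boolean_product(m, transpose(m))
```

With rows as elements and columns as covering blocks, entry (i, j) of the result is 1 exactly when elements i and j share at least one block.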

preprint · 2013 · arXiv

Cold-start recommendation through granular association rules

Recommender systems are popular in e-commerce as they suggest items of interest to users. Researchers have addressed the cold-start problem where either the user or the item is new. However, the situation where both the user and the item are new has seldom been considered. In this paper, we propose a cold-start recommendation approach to this situation based on granular association rules. Specifically, we provide a means for describing users and items through information granules, a means for generating association rules between users and items, and a means for recommending items to users using these rules. Experiments are undertaken on the publicly available MovieLens dataset. Results indicate that rule sets perform similarly on the training and the testing sets, and that an appropriate granule setting is essential to the application of granular association rules.

preprint · 2013 · arXiv

Cost-Sensitive Feature Selection of Data with Errors

In data mining applications, feature selection is an essential process since it reduces a model's complexity. The cost of obtaining the feature values must be taken into consideration in many domains. In this paper, we study the cost-sensitive feature selection problem on numerical data with measurement errors, test costs and misclassification costs. The major contributions of this paper are four-fold. First, a new data model is built to address test costs and misclassification costs as well as error boundaries. Second, a covering-based rough set with measurement errors is constructed. Given a confidence interval, the neighborhood is an ellipse in a two-dimensional space, an ellipsoid in a three-dimensional space, and so on. Third, a new cost-sensitive feature selection problem is defined on this covering-based rough set. Fourth, both backtracking and heuristic algorithms are proposed to deal with this new problem. The algorithms are tested on six UCI (University of California - Irvine) data sets. Experimental results show that (1) the pruning techniques of the backtracking algorithm help reduce the number of operations significantly, and (2) the heuristic algorithm usually obtains optimal results. This study is a step toward realistic applications of cost-sensitive learning.

preprint · 2013 · arXiv

Granular association rule mining through parametric rough sets for cold start recommendation

Granular association rules reveal patterns hidden in many-to-many relationships, which are common in relational databases. In recommender systems, these rules are appropriate for cold start recommendation, where a customer or a product has just entered the system. An example of such rules might be "40% men like at least 30% kinds of alcohol; 45% customers are men and 6% products are alcohol." Mining such rules is a challenging problem due to pattern explosion. In this paper, we propose a new type of parametric rough sets on two universes to study this problem. The model is deliberately defined such that the parameter corresponds to one threshold of rules. With the lower approximation operator in the new parametric rough sets, a backward algorithm is designed for the rule mining problem. Experiments on two real-world data sets show that the new algorithm is significantly faster than the existing sandwich algorithm. This study indicates a new application area, namely recommender systems, of relational data mining, granular computing and rough sets.

preprint · 2013 · arXiv

Granular association rules for multi-valued data

Granular association rules are a new approach to revealing patterns hidden in many-to-many relationships of relational databases. Different types of data, such as nominal, numeric and multi-valued data, should be dealt with in the process of rule mining. In this paper, we study multi-valued data and develop techniques to filter out strong but uninteresting rules. An example of such a rule might be "male students rate movies released in 1990s that are NOT thriller." Rules of this kind, called negative granular association rules, often overwhelm positive ones, which are more useful. To address this issue, we filter out negative granules such as "NOT thriller" in the process of granule generation. In this way, only positive granular association rules are generated and strong ones are mined. Experimental results on the MovieLens data set indicate that most rules are negative, and that our technique is effective at filtering them out.

preprint · 2013 · arXiv

Minimal cost feature selection of data with normal distribution measurement errors

Minimal cost feature selection aims to obtain a trade-off between test costs and misclassification costs. This issue has been addressed recently on nominal data. In this paper, we consider numerical data with measurement errors and study minimal cost feature selection in this model. First, we build a data model with normal distribution measurement errors. Second, the neighborhood of each data item is constructed through the confidence interval. Compared with discretized intervals, neighborhoods are a more reasonable way to maintain the information of the data. Third, we define a new minimal total cost feature selection problem by considering the trade-off between test costs and misclassification costs. Fourth, we propose a backtracking algorithm with three effective pruning techniques to deal with this problem. The algorithm is tested on four UCI data sets. Experimental results indicate that the pruning techniques are effective, and the algorithm is efficient for data sets with nearly one thousand objects.

preprint · 2013 · arXiv

Mining top-k granular association rules for recommendation

Recommender systems are important for e-commerce companies as well as researchers. Recently, granular association rules have been proposed for cold-start recommendation. However, existing approaches retain only globally strong rules; therefore some users may receive no recommendation at all. In this paper, we propose to mine the top-k granular association rules for each user. First we define three measures of granular association rules. These are the source coverage, which measures the user granule size; the target coverage, which measures the item granule size; and the confidence, which measures the strength of the association. With the confidence measure, rules can be ranked according to their strength. Then we propose algorithms for training the recommender and suggesting items to each user. Experiments are undertaken on the publicly available MovieLens data set. Results indicate that an appropriate granule setting can avoid over-fitting and, at the same time, help obtain high recommendation accuracy.
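Ranking rules by confidence for each user can be sketched in a few lines. The rule representation, a tuple of user granule label, item granule label and confidence, and the function name are hypothetical. The abstract does not describe the paper's training and recommendation algorithms in enough detail to reproduce them here.

```python
import heapq


def top_k_rules(user_granules, rules, k):
    """Return the k highest-confidence rules whose source granule contains the user.

    user_granules : set of granule labels the user belongs to (hypothetical layout)
    rules         : iterable of (user_granule, item_granule, confidence) tuples
    k             : number of rules to keep for this user
    """
    matching = [rule for rule in rules if rule[0] in user_granules]
    # nlargest returns the selected rules sorted by descending confidence.
    return heapq.nlargest(k, matching, key=lambda rule: rule[2])
```

Because the cut-off is applied per user rather than through one global threshold, a user still receives rules whenever any rule matches one of their granules.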

preprint · 2013 · arXiv

Test-cost-sensitive attribute reduction of data with normal distribution measurement errors

Measurement errors with a normal distribution are ubiquitous in applications. Generally, a smaller measurement error requires a better instrument and a higher test cost. In decision making based on attribute values of objects, we shall select an attribute subset with appropriate measurement error to minimize the total test cost. Recently, an error-range-based covering rough set with uniform distribution error was proposed to investigate this issue. However, measurement errors follow a normal distribution rather than a uniform distribution, which is too simple for most applications. In this paper, we introduce normal distribution measurement errors to the covering-based rough set model, and deal with the test-cost-sensitive attribute reduction problem in this new model. The major contributions of this paper are four-fold. First, we build a new data model based on normal distribution measurement errors. With the new data model, the error range is an ellipse in a two-dimensional space. Second, the covering-based rough set with normal distribution measurement errors is constructed through the "3-sigma" rule. Third, the test-cost-sensitive attribute reduction problem is redefined on this covering-based rough set. Fourth, a heuristic algorithm is proposed to deal with this problem. The algorithm is tested on ten UCI (University of California - Irvine) datasets. The experimental results show that the algorithm is more effective and efficient than the existing one. This study is a step toward realistic applications of cost-sensitive learning.
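The elliptical error range built from the "3-sigma" rule can be illustrated with a short sketch. It assumes that an object y falls in the neighborhood of x when the sum over attributes of ((x_a - y_a) / (3 * sigma_a))^2 is at most 1, which gives an ellipse with semi-axes 3 * sigma_a in two dimensions. This exact formula and the function names are assumptions, since the abstract does not state the definition.

```python
def in_neighborhood(x, y, sigmas):
    """True if y lies inside the assumed 3-sigma ellipsoid centred at x.

    x, y   : sequences of numeric attribute values
    sigmas : per-attribute standard deviations of the measurement error (all > 0)
    """
    return sum(((xi - yi) / (3 * s)) ** 2 for xi, yi, s in zip(x, y, sigmas)) <= 1.0


def neighborhood(index, data, sigmas):
    """Indices of all objects inside the neighborhood of data[index]."""
    return [j for j, row in enumerate(data) if in_neighborhood(data[index], row, sigmas)]
```

An object can be within 3 sigma of the centre on every single attribute and still fall outside the ellipse, which is what separates this neighborhood from a rectangular error range.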

preprint · 2012 · arXiv

A Comparative Study of Discretization Approaches for Granular Association Rule Mining

Granular association rule mining is a new relational data mining approach to reveal patterns hidden in multiple tables. Current research on granular association rule mining considers only nominal data. In this paper, we study the impact of discretization approaches on mining semantically richer and stronger rules from numeric data. Specifically, the Equal Width approach and the Equal Frequency approach are adopted and compared. The setting of interval numbers is a key issue in discretization approaches, so we compare different settings through experiments on a well-known real-life data set. Experimental results show that: 1) discretization is an effective preprocessing technique in mining stronger rules; 2) the Equal Frequency approach helps generate more rules than the Equal Width approach; 3) with certain settings of interval numbers, we can obtain many more rules than with others.
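The two discretization approaches compared here are standard, and a minimal sketch shows how they differ. The function names and the handling of ties and constant columns are illustrative choices, not details from the paper.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # constant column: everything in one bin
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly the same number of values.

    Ties are broken by position, so equal values may end up in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins
```

For the values 1, 2, 3, 4, 100 with k = 2, Equal Width puts the first four values in one bin because the outlier stretches the range, while Equal Frequency splits them three and two.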

preprint · 2012 · arXiv

Cost-sensitive C4.5 with post-pruning and competition

Decision trees are an effective classification approach in data mining and machine learning. In applications, test costs and misclassification costs should be considered while inducing decision trees. Recently, some cost-sensitive learning algorithms based on ID3, such as CS-ID3, IDX and λ-ID3, have been proposed to deal with the issue. These algorithms deal with only symbolic data. In this paper, we develop a decision tree algorithm inspired by C4.5 for numeric data. There are two major issues for our algorithm. First, we develop the test cost weighted information gain ratio as the heuristic information. According to this heuristic information, our algorithm picks the attribute that provides a higher gain ratio and costs less at each selection. Second, we design a post-pruning strategy by considering the tradeoff between test costs and misclassification costs of the generated decision tree. In this way, the total cost is reduced. Experimental results indicate that (1) our algorithm is stable and effective; (2) the post-pruning technique reduces the total cost significantly; (3) the competition strategy is effective in obtaining a cost-sensitive decision tree with low cost.

preprint · 2012 · arXiv

Feature selection with test cost constraint

Feature selection is an important preprocessing step in machine learning and data mining. In real-world applications, costs, including money, time and other resources, are required to acquire the features. In some cases, there is a test cost constraint due to limited resources. We shall deliberately select an informative and cheap feature subset for classification. This paper proposes the feature selection with test cost constraint problem for this issue. The new problem has a simple form when described as a constraint satisfaction problem (CSP). Backtracking is a general algorithm for CSP, and it is efficient in solving the new problem on medium-sized data. As the backtracking algorithm is not scalable to large datasets, a heuristic algorithm is also developed. Experimental results show that the heuristic algorithm can find the optimal solution in most cases. We also redefine some existing feature selection problems in rough sets, especially in decision-theoretic rough sets, from the viewpoint of CSP. These new definitions provide insight into some new research directions.
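Backtracking over feature subsets under a test cost budget can be sketched as follows. The sketch assumes non-negative test costs and takes the informativeness measure as a caller-supplied `score` function. That function, the argument names and the single pruning rule are hypothetical, since the abstract does not specify the paper's measure or pruning techniques.

```python
def best_subset(costs, budget, score):
    """Find the highest-scoring feature subset whose total test cost fits the budget.

    costs  : list of non-negative per-feature test costs
    budget : upper bound on the total test cost
    score  : callable taking a tuple of feature indices; higher is better
             (hypothetical stand-in for an informativeness measure)
    Returns (subset, score) where subset is a tuple of feature indices.
    """
    best = ((), score(()))

    def extend(start, chosen, spent):
        nonlocal best
        for f in range(start, len(costs)):
            new_spent = spent + costs[f]
            if new_spent > budget:
                continue  # prune: every superset of this choice also breaks the budget
            candidate = chosen + (f,)
            value = score(candidate)
            if value > best[1]:
                best = (candidate, value)
            extend(f + 1, candidate, new_spent)

    extend(0, (), 0)
    return best
```

The search is exponential in the number of features in the worst case, which is consistent with the abstract's remark that backtracking does not scale to large datasets.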
