Source author record

Engelbert Mephu Nguifo

Engelbert Mephu Nguifo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Databases Machine Learning Artificial Intelligence Computational Engineering, Finance, and Science Information Theory math.IT Social and Information Networks Data Structures and Algorithms Distributed, Parallel, and Cluster Computing eess.IV Genomics Information Retrieval Neural and Evolutionary Computing Quantitative Methods

Catalog footprint

What is connected

14works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Expert Opinion Elicitation for Assisting Deep Learning based Lyme Disease Classifier with Patient Data

Diagnosing erythema migrans (EM) skin lesion, the most common early symptom of Lyme disease using deep learning techniques can be effective to prevent long-term complications. Existing works on deep learning based EM recognition only utilizes lesion image due to the lack of a dataset of Lyme disease related images with associated patient data. Physicians rely on patient information about the background of the skin lesion to confirm their diagnosis. In order to assist the deep learning model with a probability score calculated from patient data, this study elicited opinion from fifteen doctors. For the elicitation process, a questionnaire with questions and possible answers related to EM was prepared. Doctors provided relative weights to different answers to the questions. We converted doctors evaluations to probability scores using Gaussian mixture based density estimation. For elicited probability model validation, we exploited formal concept analysis and decision tree. The elicited probability scores can be utilized to make image based deep learning Lyme disease pre-scanners robust.

preprint2020arXiv

Discovering Frequent Gradual Itemsets with Imprecise Data

The gradual patterns that model the complex co-variations of attributes of the form "The more/less X, The more/less Y" play a crucial role in many real world applications where the amount of numerical data to manage is important, this is the biological data. Recently, these types of patterns have caught the attention of the data mining community, where several methods have been defined to automatically extract and manage these patterns from different data models. However, these methods are often faced the problem of managing the quantity of mined patterns, and in many practical applications, the calculation of all these patterns can prove to be intractable for the user-defined frequency threshold and the lack of focus leads to generating huge collections of patterns. Moreover another problem with the traditional approaches is that the concept of gradualness is defined just as an increase or a decrease. Indeed, a gradualness is considered as soon as the values of the attribute on both objects are different. As a result, numerous quantities of patterns extracted by traditional algorithms can be presented to the user although their gradualness is only a noise effect in the data. To address this issue, this paper suggests to introduce the gradualness thresholds from which to consider an increase or a decrease. In contrast to literature approaches, the proposed approach takes into account the distribution of attribute values, as well as the user's preferences on the gradualness threshold and makes it possible to extract gradual patterns on certain databases where literature approaches fail due to too large search space. Moreover, results from an experimental evaluation on real databases show that the proposed algorithm is scalable, efficient, and can eliminate numerous patterns that do not verify specific gradualness requirements to show a small set of patterns to the user.

preprint2020arXiv

Multiple instance learning for sequence data with across bag dependencies

In Multiple Instance Learning (MIL) problem for sequence data, the instances inside the bags are sequences. In some real world applications such as bioinformatics, comparing a random couple of sequences makes no sense. In fact, each instance may have structural and/or functional relations with instances of other bags. Thus, the classification task should take into account this across bag relation. In this work, we present two novel MIL approaches for sequence data classification named ABClass and ABSim. ABClass extracts motifs from related instances and use them to encode sequences. A discriminative classifier is then applied to compute a partial classification result for each set of related sequences. ABSim uses a similarity measure to discriminate the related instances and to compute a scores matrix. For both approaches, an aggregation method is applied in order to generate the final classification result. We applied both approaches to solve the problem of bacterial Ionizing Radiation Resistance prediction. The experimental results of the presented approaches are satisfactory.

preprint2014arXiv

A perceptual hash function to store and retrieve large scale DNA sequences

This paper proposes a novel approach for storing and retrieving massive DNA sequences.. The method is based on a perceptual hash function, commonly used to determine the similarity between digital images, that we adapted for DNA sequences. Perceptual hash function presented here is based on a Discrete Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray level intensity pixel and the hash is calculated from its significant frequency characteristics. This results to a drastic data reduction between the sequence and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes are not affected by "avalanche effect" and thus can be compared. The similarity distance between two hashes is estimated with the Hamming Distance, which is used to retrieve DNA sequences. Experiments that we conducted show that our approach is relevant for storing massive DNA sequences, and retrieving them.

preprint2014arXiv

Towards a constructive multilayer perceptron for regression task using non-parametric clustering. A case study of Photo-Z redshift reconstruction

The choice of architecture of artificial neuron network (ANN) is still a challenging task that users face every time. It greatly affects the accuracy of the built network. In fact there is no optimal method that is applicable to various implementations at the same time. In this paper we propose a method to construct ANN based on clustering, that resolves the problems of random and ad hoc approaches for multilayer ANN architecture. Our method can be applied to regression problems. Experimental results obtained with different datasets, reveals the efficiency of our method.

preprint2013arXiv

Efficient construction of the lattice of frequent closed patterns and simultaneous extraction of generic bases of rules

In the last few years, the amount of collected data, in various computer science applications, has grown considerably. These large volumes of data need to be analyzed in order to extract useful hidden knowledge. This work focuses on association rule extraction. This technique is one of the most popular in data mining. Nevertheless, the number of extracted association rules is often very high, and many of them are redundant. In this paper, we propose a new algorithm, called PRINCE. Its main feature is the construction of a partially ordered structure for extracting subsets of association rules, called generic bases. Without loss of information these subsets form representation of the whole association rule set. To reduce the cost of such a construction, the partially ordered structure is built thanks to the minimal generators associated to frequent closed patterns. The closed ones are simultaneously derived with generic bases thanks to a simple bottom-up traversal of the obtained structure. The experimentations we carried out in benchmark and "worst case" contexts showed the efficiency of the proposed algorithm, compared to algorithms like CLOSE, A-CLOSE and TITANIC.

preprint2013arXiv

Nouvelle approche de recommandation personnalisee dans les folksonomies basee sur le profil des utilisateurs

In folksonomies, users use to share objects (movies, books, bookmarks, etc.) by annotating them with a set of tags of their own choice. With the rise of the Web 2.0 age, users become the core of the system since they are both the contributors and the creators of the information. Yet, each user has its own profile and its own ideas making thereby the strength as well as the weakness of folksonomies. Indeed, it would be helpful to take account of users' profile when suggesting a list of tags and resources or even a list of friends, in order to make a personal recommandation, instead of suggesting the more used tags and resources in the folksonomy. In this paper, we consider users' profile as a new dimension of a folksonomy classically composed of three dimensions <users, tags, ressources> and we propose an approach to group users with equivalent profiles and equivalent interests as quadratic concepts. Then, we use such structures to propose our personalized recommendation system of users, tags and resources according to each user's profile. Carried out experiments on two real-world datasets, i.e., MovieLens and BookCrossing highlight encouraging results in terms of precision as well as a good social evaluation.

preprint2013arXiv

Towards a semantic and statistical selection of association rules

The increasing growth of databases raises an urgent need for more accurate methods to better understand the stored data. In this scope, association rules were extensively used for the analysis and the comprehension of huge amounts of data. However, the number of generated rules is too large to be efficiently analyzed and explored in any further process. Association rules selection is a classical topic to address this issue, yet, new innovated approaches are required in order to provide help to decision makers. Hence, many interesting- ness measures have been defined to statistically evaluate and filter the association rules. However, these measures present two major problems. On the one hand, they do not allow eliminating irrelevant rules, on the other hand, their abun- dance leads to the heterogeneity of the evaluation results which leads to confusion in decision making. In this paper, we propose a two-winged approach to select statistically in- teresting and semantically incomparable rules. Our statis- tical selection helps discovering interesting association rules without favoring or excluding any measure. The semantic comparability helps to decide if the considered association rules are semantically related i.e comparable. The outcomes of our experiments on real datasets show promising results in terms of reduction in the number of rules.

preprint2013arXiv

Towards an Efficient Discovery of the Topological Representative Subgraphs

With the emergence of graph databases, the task of frequent subgraph discovery has been extensively addressed. Although the proposed approaches in the literature have made this task feasible, the number of discovered frequent subgraphs is still very high to be efficiently used in any further exploration. Feature selection for graph data is a way to reduce the high number of frequent subgraphs based on exact or approximate structural similarity. However, current structural similarity strategies are not efficient enough in many real-world applications, besides, the combinatorial nature of graphs makes it computationally very costly. In order to select a smaller yet structurally irredundant set of subgraphs, we propose a novel approach that mines the top-k topological representative subgraphs among the frequent ones. Our approach allows detecting hidden structural similarities that existing approaches are unable to detect such as the density or the diameter of the subgraph. In addition, it can be easily extended using any user defined structural or topological attributes depending on the sought properties. Empirical studies on real and synthetic graph datasets show that our approach is fast and scalable.

preprint2012arXiv

A large-scale and fault-tolerant approach of subgraph mining using density-based partitioning

Recently, graph mining approaches have become very popular, especially in domains such as bioinformatics, chemoinformatics and social networks. In this scope, one of the most challenging tasks is frequent subgraph discovery. This task has been motivated by the tremendously increasing size of existing graph databases. Since then, an important problem of designing efficient and scaling approaches for frequent subgraph discovery in large clusters, has taken place. However, failures are a norm rather than being an exception in large clusters. In this context, the MapReduce framework was designed so that node failures are automatically handled by the framework. In this paper, we propose a large-scale and fault-tolerant approach of subgraph mining by means of a density-based partitioning technique, using MapReduce. Our partitioning aims to balance computation load on a collection of machines. We experimentally show that our approach decreases significantly the execution time and scales the subgraph discovery process to large graph databases.

preprint2012arXiv

A scalable mining of frequent quadratic concepts in d-folksonomies

Folksonomy mining is grasping the interest of web 2.0 community since it represents the core data of social resource sharing systems. However, a scrutiny of the related works interested in mining folksonomies unveils that the time stamp dimension has not been considered. For example, the wealthy number of works dedicated to mining tri-concepts from folksonomies did not take into account time dimension. In this paper, we will consider a folksonomy commonly composed of triples <users, tags, resources> and we shall consider the time as a new dimension. We motivate our approach by highlighting the battery of potential applications. Then, we present the foundations for mining quadri-concepts, provide a formal definition of the problem and introduce a new efficient algorithm, called QUADRICONS for its solution to allow for mining folksonomies in time, i.e., d-folksonomies. We also introduce a new closure operator that splits the induced search space into equivalence classes whose smallest elements are the quadri-minimal generators. Carried out experiments on large-scale real-world datasets highlight good performances of our algorithm.

preprint2012arXiv

Categorization of interestingness measures for knowledge extraction

Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules most of which are redundant and irrelevant. It is therefore necessary to use further measures which filter uninteresting rules. Many synthesis studies were then realized on the interestingness measures according to several points of view. Different reported studies have been carried out to identify "good" properties of rule extraction measures and these properties have been assessed on 61 measures. The purpose of this paper is twofold. First to extend the number of the measures and properties to be studied, in addition to the formalization of the properties proposed in the literature. Second, in the light of this formal study, to categorize the studied measures. This paper leads then to identify categories of measures in order to help the users to efficiently select an appropriate measure by choosing one or more measure(s) during the knowledge extraction process. The properties evaluation on the 61 measures has enabled us to identify 7 classes of measures, classes that we obtained using two different clustering techniques.

preprint2012arXiv

Feature extraction in protein sequences classification : a new stability measure

Feature extraction is an unavoidable task, especially in the critical step of preprocessing biological sequences. This step consists for example in transforming the biological sequences into vectors of motifs where each motif is a subsequence that can be seen as a property (or attribute) characterizing the sequence. Hence, we obtain an object-property table where objects are sequences and properties are motifs extracted from sequences. This output can be used to apply standard machine learning tools to perform data mining tasks such as classification. Several previous works have described feature extraction methods for bio-sequence classification, but none of them discussed the robustness of these methods when perturbing the input data. In this work, we introduce the notion of stability of the generated motifs in order to study the robustness of motif extraction methods. We express this robustness in terms of the ability of the method to reveal any change occurring in the input data and also its ability to target the interesting motifs. We use these criteria to evaluate and experimentally compare four existing extraction methods for biological sequences.

preprint2010arXiv

Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures

Formal Concept Analysis "FCA" is a data analysis method which enables to discover hidden knowledge existing in data. A kind of hidden knowledge extracted from data is association rules. Different quality measures were reported in the literature to extract only relevant association rules. Given a dataset, the choice of a good quality measure remains a challenging task for a user. Given a quality measures evaluation matrix according to semantic properties, this paper describes how FCA can highlight quality measures with similar behavior in order to help the user during his choice. The aim of this article is the discovery of Interestingness Measures "IM" clusters, able to validate those found due to the hierarchical and partitioning clustering methods "AHC" and "k-means". Then, based on the theoretical study of sixty one interestingness measures according to nineteen properties, proposed in a recent study, "FCA" describes several groups of measures.

Engelbert Mephu Nguifo

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Expert Opinion Elicitation for Assisting Deep Learning based Lyme Disease Classifier with Patient Data

Discovering Frequent Gradual Itemsets with Imprecise Data

Multiple instance learning for sequence data with across bag dependencies

A perceptual hash function to store and retrieve large scale DNA sequences

Towards a constructive multilayer perceptron for regression task using non-parametric clustering. A case study of Photo-Z redshift reconstruction

Efficient construction of the lattice of frequent closed patterns and simultaneous extraction of generic bases of rules

Nouvelle approche de recommandation personnalisee dans les folksonomies basee sur le profil des utilisateurs

Towards a semantic and statistical selection of association rules

Towards an Efficient Discovery of the Topological Representative Subgraphs

A large-scale and fault-tolerant approach of subgraph mining using density-based partitioning

A scalable mining of frequent quadratic concepts in d-folksonomies

Categorization of interestingness measures for knowledge extraction

Feature extraction in protein sequences classification : a new stability measure

Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures