Source author record

Jérôme Darmont

Jérôme Darmont appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher. Unclaimed source record

Databases; Information Retrieval; Computer Vision; Distributed, Parallel, and Cluster Computing; Machine Learning; Performance; Social and Information Networks

Catalog footprint

What is connected

20 works

7 topics

4 close collaborators

preprint, 2021, arXiv

The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes

In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases, by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially from IoT sensors and actuators, use either XML or JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized querying languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Databases Systems. Currently, the majority of these solutions have been replaced with the more modern JSON based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately nowadays, research lacks a clear comparison of the scalability and performance for database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison for selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.

preprint, 2020, arXiv

Automatic Integration Issues of Tabular Data for On-Line Analysis Processing

Companies and individuals produce numerous tabular data. The objective of this position paper is to draw up the challenges posed by the automatic integration of data in the form of tables so that they can be cross-analyzed. We provide a first automatic solution for the integration of such tabular data to allow On-Line Analysis Processing. To fulfil this task, features of tabular data should be analyzed and the challenge of automatic multidimensional schema generation should be addressed. Hence, we propose a typology of tabular data and discuss our idea of an automatic solution.

preprint, 2020, arXiv

Including Images into Message Veracity Assessment in Social Media

The extensive use of social media in the diffusion of information has also laid a fertile ground for the spread of rumors, which could significantly affect the credibility of social media. An ever-increasing number of users post news including, in addition to text, multimedia data such as images and videos. Yet, such multimedia content is easily editable due to the broad availability of simple and effective image and video processing tools. The problem of assessing the veracity of social network posts has attracted a lot of attention from researchers in recent years. However, almost all previous works have focused on analyzing textual contents to determine veracity, while visual contents, and more particularly images, remain ignored or little exploited in the literature. In this position paper, we propose a framework that explores two novel ways to assess the veracity of messages published on social networks by analyzing the credibility of both their textual and visual contents.

preprint, 2017, arXiv

Benchmarking data warehouses

Data warehouse architectural choices and optimization techniques are critical to decision support query performance. To facilitate these choices, the performance of the designed data warehouse must be assessed, usually with benchmarks. These tools can either help system users compare the performances of different systems, or help system engineers test the effect of various design choices. While the Transaction Processing Performance Council's standard benchmarks address the first point, they are not tunable enough to address the second one and fail to model different data warehouse schemas. By contrast, our Data Warehouse Engineering Benchmark (DWEB) allows generating various ad-hoc synthetic data warehouses and workloads. DWEB is implemented as free Java software that can be interfaced with most existing relational database management systems. The full specifications of DWEB, as well as experiments we performed to illustrate how our benchmark may be used, are provided in this paper.

preprint, 2017, arXiv

Evaluating the Dynamic Behavior of Database Applications

This paper explores the effect that changing access patterns has on the performance of database management systems. Changes in access patterns play an important role in determining the efficiency of key performance optimization techniques, such as dynamic clustering, prefetching, and buffer replacement. However, all existing benchmarks or evaluation frameworks produce static access patterns in which objects are always accessed in the same order repeatedly. Hence, we have proposed the Dynamic Evaluation Framework (DEF) that simulates access pattern changes using configurable styles of change. DEF has been designed to be open and fully extensible (e.g., new access pattern change models can be added easily). In this paper, we instantiate DEF into the Dynamic object Evaluation Framework (DoEF) which is designed for object databases, i.e., object-oriented or object-relational databases such as multi-media databases or most XML databases. The capabilities of DoEF have been evaluated by simulating the execution of four different dynamic clustering algorithms. The results confirm our analysis that flexible conservative re-clustering is the key in determining a clustering algorithm's ability to adapt to changes in access pattern. These results show the effectiveness of DoEF at determining the adaptability of each dynamic clustering algorithm to changes in access pattern in a simulation environment. In a second set of experiments, we have used DoEF to compare the performance of two real-life object stores: Platypus and SHORE. DoEF has helped to reveal the poor swapping performance of Platypus.

preprint, 2016, arXiv

A comparison study of object-oriented database clustering techniques

It is widely acknowledged that a good object clustering is critical to the performance of OODBs. Clustering means storing related objects close together on secondary storage so that when one object is accessed from disk, all its related objects are also brought into memory. Then access to these related objects is a main memory access that is much faster than a disk access. The aim of this paper is to compare the performance of three clustering algorithms: Cactis, CK and ORION. Simulation experiments we performed showed that the Cactis algorithm is better than the ORION algorithm and that the CK algorithm totally out-performs both other algorithms in terms of response time and clustering overhead.
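The clustering idea described in this abstract can be illustrated with a minimal sketch. It is not Cactis, CK or ORION, and it does not reproduce the paper's simulation model: it only counts how many disk page loads a fixed access trace causes under two object placements. The function name, the one-page buffer, the workload and both placements are hypothetical.

```python
def count_page_loads(placement, trace, buffer_pages=1):
    """Count disk page loads for an access trace, using a small LRU page buffer.

    placement maps an object id to the page that stores it (assumed layout).
    """
    buffer = []  # pages currently in memory, most recently used last
    loads = 0
    for obj in trace:
        page = placement[obj]
        if page in buffer:
            buffer.remove(page)      # hit: the page is only moved to the MRU slot
        else:
            loads += 1               # miss: the page must be read from disk
            if len(buffer) >= buffer_pages:
                buffer.pop(0)        # evict the least recently used page
        buffer.append(page)
    return loads


# Hypothetical workload: a, b, c are always used together, and so are x, y, z.
trace = ["a", "b", "c", "x", "y", "z"] * 3

# Related objects spread over three pages versus stored on the same page.
scattered = {"a": 0, "b": 1, "c": 2, "x": 0, "y": 1, "z": 2}
clustered = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}

print(count_page_loads(scattered, trace), count_page_loads(clustered, trace))
```

With the clustered placement, the first access to a group loads its page and the following accesses to related objects are served from memory, which is the effect the abstract describes. The sketch ignores the clustering overhead that the compared algorithms also incur.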

preprint, 2016, arXiv

A Scalable Document-based Architecture for Text Analysis

Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps (e.g., stem or lemma extraction, part-of-speech tagging, named entities recognition...), and performance and scaling issues. Existing text analysis architectures partly solve these issues, providing restrictive data schemas, addressing only one aspect of text preprocessing and focusing on one single task when dealing with performance optimization. As a result, no definite solution is currently available. Thus, we propose in this paper a new generic text analysis architecture, where document structure is flexible, many preprocessing techniques are integrated and textual datasets are indexed for efficient access. We implement our conceptual architecture using both a relational and a document-oriented database. Our experiments demonstrate the feasibility of our approach and the superiority of the document-oriented logical and physical implementation.

preprint, 2016, arXiv

Benchmarking OODBs with a Generic Tool

We present in this paper a generic object-oriented benchmark (OCB: the Object Clustering Benchmark) that has been designed to evaluate the performances of Object-Oriented Databases (OODBs), and more specifically the performances of clustering policies within OODBs. OCB is generic because its sample database may be customized to fit any of the databases introduced by the main existing benchmarks, e.g., OO1 (Object Operation 1) or OO7. The first version of OCB was purposely clustering-oriented due to a clustering-oriented workload, but OCB has been thoroughly extended to be able to suit other purposes. Finally, OCB's code is compact and easily portable. OCB has been validated through two implementations: one within the O2 OODB and another one within the Texas persistent object store. The performances of a specific clustering policy called DSTC (Dynamic, Statistical, Tunable Clustering) have also been evaluated with OCB.

preprint, 2016, arXiv

DESP-C++: A Discrete-Event Simulation Package for C++

DESP-C++ is a C++ discrete-event random simulation engine that has been designed to be fast, very easy to use and expand, and valid. DESP-C++ is based on the resource view. Its complete architecture is presented in detail, as well as a short "user manual". The validity of DESP-C++ is demonstrated by the simulation of three significant models. In each case, the simulation results obtained with DESP-C++ match those obtained with a validated simulation software: QNAP2. The versatility of DESP-C++ is also illustrated this way, since the modelled systems are very different from each other: a simple production system, the dining philosopher classical deadlock problem, and a complex object-oriented database management system.

preprint, 2016, arXiv

Simulation of clustering algorithms in OODBs in order to evaluate their performances

A good object clustering is critical to the performance of object-oriented databases. However, it always involves some kind of overhead for the system. The aim of this paper is to propose a modelling methodology in order to evaluate the performances of different clustering policies. This methodology has been used to compare the performances of three clustering algorithms found in the literature (Cactis, CK and ORION) that we considered representative of the current research in the field of object clustering. The actual performance evaluation was performed using simulation. Simulation experiments showed that the Cactis algorithm is better than the ORION algorithm and that the CK algorithm totally outperforms both other algorithms in terms of response time and clustering overhead.

preprint, 2016, arXiv

Warehousing Complex Archaeological Objects

Data organization is a difficult and essential component in cultural heritage applications. Over the years, a great amount of archaeological ceramic data have been created and processed by various methods and devices. Such ceramic data are stored in databases that concur to increase the amount of available information rapidly. However, such databases typically focus on one type of ceramic descriptors, e.g., qualitative textual descriptions, petrographic or chemical analysis results, and do not interoperate. Thus, research involving archaeological ceramics cannot easily take advantage of combining all these types of information. In this application paper, we introduce an evolution of the Ceramom database that includes text descriptors of archaeological features, chemical analysis results, and various images, including petrographic and fabric images. To illustrate what new analyses are permitted by such a database, we source it to a data warehouse and present a sample on-line analysis processing (OLAP) scenario to gain deep understanding of ceramic context.

preprint, 2014, arXiv

fVSS: A New Secure and Cost-Efficient Scheme for Cloud Data Warehouses

Cloud business intelligence is an increasingly popular choice to deliver decision support capabilities via elastic, pay-per-use resources. However, data security issues are one of the top concerns when dealing with sensitive data. In this paper, we propose a novel approach for securing cloud data warehouses by flexible verifiable secret sharing, fVSS. Secret sharing encrypts and distributes data over several cloud service providers, thus enforcing data privacy and availability. fVSS addresses four shortcomings in existing secret sharing-based approaches. First, it allows refreshing the data warehouse when some service providers fail. Second, it allows on-line analysis processing. Third, it enforces data integrity with the help of both inner and outer signatures. Fourth, it helps users control the cost of cloud warehousing by balancing the load among service providers with respect to their pricing policies. To illustrate fVSS' efficiency, we thoroughly compare it with existing secret sharing-based approaches with respect to security features, querying power and data storage and computing costs.
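To make the general notion of secret sharing concrete, here is a minimal sketch of the classic Shamir threshold scheme. It is not fVSS: it has no verification step, no inner or outer signatures, and no cost-aware load balancing. The field modulus, the function names and the sample value are assumptions chosen for illustration.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime used as the field modulus (assumed choice)


def split(secret, n_shares, threshold):
    """Split an integer secret into n_shares points of a random polynomial
    of degree threshold - 1 whose constant term is the secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):   # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+)
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# Hypothetical usage: one share per provider, any 3 of 5 recover the value.
shares = split(123456, n_shares=5, threshold=3)
print(reconstruct(shares[:3]))
```

Each share could be stored at a different provider. Any subset of `threshold` shares recovers the value, so the data stays available when some providers fail, while fewer shares reveal nothing about it.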

preprint, 2013, arXiv

A Novel Query-Based Approach for Addressing Summarizability Issues in XOLAP

The business intelligence and decision-support systems used in many application domains casually rely on data warehouses, which are decision-oriented data repositories modeled as multidimensional (MD) structures. MD structures help navigate data through hierarchical levels of detail. In many real-world situations, hierarchies in MD models are complex, which causes data aggregation issues, collectively known as the summarizability problem. This problem leads to incorrect analyses and critically affects decision making. To enforce summarizability, existing approaches alter either MD models or data, and must be applied a priori, on a case-by-case basis, by an expert. To alter neither models nor data, a few query-time approaches have been proposed recently, but they only detect summarizability issues without solving them. Thus, we propose in this paper a novel approach that automatically detects and processes summarizability issues at query time, without requiring any particular expertise from the user. Moreover, while most existing approaches are based on the relational model, our approach focuses on an XML MD model, since XML data is customarily used to represent business data and its format better copes with complex hierarchies than the relational model. Finally, our experiments show that our method is likely to scale better than a reference approach for addressing the summarizability problem in the MD context.
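One common form of the summarizability problem can be shown with a small sketch: in a non-strict hierarchy, where a member has several parents, a naive rollup counts that member's measure more than once. The code below only detects and exhibits the issue; it is not the query-time approach proposed in the paper, and the store names, region names and figures are hypothetical.

```python
from collections import defaultdict


def rollup(facts, child_to_parents):
    """Naively aggregate a measure from the child level to the parent level."""
    totals = defaultdict(int)
    for child, value in facts.items():
        for parent in child_to_parents[child]:
            totals[parent] += value
    return dict(totals)


def is_strict(child_to_parents):
    """A hierarchy is strict when every child has exactly one parent."""
    return all(len(parents) == 1 for parents in child_to_parents.values())


# Hypothetical data: store s3 is attached to two regions (non-strict hierarchy).
sales = {"s1": 100, "s2": 200, "s3": 50}
store_to_regions = {"s1": ["r1"], "s2": ["r2"], "s3": ["r1", "r2"]}

by_region = rollup(sales, store_to_regions)
print(by_region)                                      # per-region totals
print(sum(by_region.values()), sum(sales.values()))   # the two totals differ
print(is_strict(store_to_regions))
```

Summing the per-region totals gives 400 while the actual sales amount to 350, because the 50 units of `s3` are counted in both regions. This is the kind of incorrect aggregate the abstract refers to.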

preprint, 2013, arXiv

Benchmarking Summarizability Processing in XML Warehouses with Complex Hierarchies

Business Intelligence plays an important role in decision making. Based on data warehouses and Online Analytical Processing, a business intelligence tool can be used to analyze complex data. Still, summarizability issues in data warehouses cause ineffective analyses that may become critical problems to businesses. To settle this issue, many researchers have studied and proposed various solutions, both in relational and XML data warehouses. However, they find difficulty in evaluating the performance of their proposals since the available benchmarks lack complex hierarchies. In order to contribute to summarizability analysis, this paper proposes an extension to the XML warehouse benchmark (XWeB) with complex hierarchies. The benchmark enables us to generate XML data warehouses with scalable complex hierarchies as well as summarizability processing. We experimentally demonstrated that complex hierarchies can definitely be included into a benchmark dataset, and that our benchmark is able to compare two alternative approaches dealing with summarizability issues.

preprint, 2013, arXiv

PRIMEBALL: a Parallel Processing Framework Benchmark for Big Data Applications in the Cloud

In this paper, we draw the specifications of a novel benchmark for comparing parallel processing frameworks in the context of big data applications hosted in the cloud. We aim at filling several gaps in already existing cloud data processing benchmarks, which lack a real-life context for their processes, thus losing relevance when trying to assess performance for real applications. Hence, we propose a fictitious news site hosted in the cloud that is to be managed by the framework under analysis, together with several objective use case scenarios and measures for evaluating system performance. The main strengths of our benchmark are parallelization capabilities supporting cloud features and big data properties.

preprint, 2011, arXiv

An Efficient Fuzzy Clustering-Based Approach for Intrusion Detection

The need to increase accuracy in detecting sophisticated cyber attacks poses a great challenge not only to the research community but also to corporations. So far, many approaches have been proposed to cope with this threat. Among them, data mining has brought on remarkable contributions to the intrusion detection problem. However, the generalization ability of data mining-based methods remains limited, and hence detecting sophisticated attacks remains a tough task. In this thread, we present a novel method based on both clustering and classification for developing an efficient intrusion detection system (IDS). The key idea is to take useful information exploited from fuzzy clustering into account for the process of building an IDS. To this aim, we first present cornerstones to construct additional cluster features for a training set. Then, we come up with an algorithm to generate an IDS based on such cluster features and the original input features. Finally, we experimentally prove that our method outperforms several well-known methods.

preprint, 2011, arXiv

Business Intelligence for Small and Middle-Sized Entreprises

Data warehouses are the core of decision support systems, which nowadays are used by all kinds of enterprises in the entire world. Although many studies have been conducted on the need of decision support systems (DSSs) for small businesses, most of them adopt existing solutions and approaches, which are appropriate for large-scaled enterprises, but are inadequate for small and middle-sized enterprises. Small enterprises require cheap, lightweight architectures and tools (hardware and software) providing on-line data analysis. In order to ensure these features, we review web-based business intelligence approaches. For real-time analysis, the traditional OLAP architecture is cumbersome and storage-costly; therefore, we also review in-memory processing. Consequently, this paper discusses the existing approaches and tools working in main memory and/or with web interfaces (including freeware tools), relevant for small and middle-sized enterprises in decision making.

preprint, 2011, arXiv

Efficient Incremental Breadth-Depth XML Event Mining

Many applications log a large amount of events continuously. Extracting interesting knowledge from logged events is an emerging active research area in data mining. In this context, we propose an approach for mining frequent events and association rules from logged events in XML format. This approach is composed of two main phases: I) constructing a novel tree structure called Frequency XML-based Tree (FXT), which contains the frequency of events to be mined; II) querying the constructed FXT using XQuery to discover frequent itemsets and association rules. The FXT is constructed with a single pass over logged data. We implement the proposed algorithm and study various performance issues. The performance study shows that the algorithm is efficient, for both constructing the FXT and discovering association rules.
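The two notions the abstract relies on, frequent itemsets and association rules, can be sketched as follows. This is not the FXT structure and it does not use XQuery: it is a plain in-memory count of single events and event pairs in one pass over the log, followed by a confidence computation. The function names, thresholds and sample log are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Count itemsets of size 1 and 2 in a single pass over the log and keep
    those that occur in at least min_support transactions."""
    counts = Counter()
    for events in transactions:
        unique = sorted(set(events))          # canonical order, no duplicates
        for size in (1, 2):
            counts.update(combinations(unique, size))
    return {items: n for items, n in counts.items() if n >= min_support}


def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs from frequent pairs, where
    confidence = support(lhs, rhs) / support(lhs)."""
    rules = []
    for items, n in frequent.items():
        if len(items) == 2:
            for lhs, rhs in (items, items[::-1]):
                confidence = n / frequent[(lhs,)]
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, confidence))
    return rules


# Hypothetical event log: one list of events per session.
log = [
    ["login", "view", "buy"],
    ["login", "view"],
    ["login", "logout"],
    ["view", "buy"],
]
frequent = frequent_itemsets(log, min_support=2)
print(association_rules(frequent, min_confidence=0.9))
```

Every frequent pair has frequent members, since an event occurs at least as often as any pair containing it, so the lookup of `frequent[(lhs,)]` is safe.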

preprint, 2011, arXiv

Pattern tree-based XOLAP rollup operator for XML complex hierarchies

With the rise of XML as a standard for representing business data, XML data warehousing appears as a suitable solution for decision-support applications. In this context, it is necessary to allow OLAP analyses on XML data cubes. Thus, XQuery extensions are needed. To define a formal framework and allow much-needed performance optimizations on analytical queries expressed in XQuery, defining an algebra is desirable. However, XML-OLAP (XOLAP) algebras from the literature still largely rely on the relational model. Hence, we propose in this paper a rollup operator based on a pattern tree in order to handle multidimensional XML data expressed within complex hierarchies.

preprint, 2011, arXiv

XWeB: the XML Warehouse Benchmark

With the emergence of XML as a standard for representing business data, new decision support applications are being developed. These XML data warehouses aim at supporting On-Line Analytical Processing (OLAP) operations that manipulate irregular XML data. To ensure feasibility of these new tools, important performance issues must be addressed. Performance is customarily assessed with the help of benchmarks. However, decision support benchmarks do not currently support XML features. In this paper, we introduce the XML Warehouse Benchmark (XWeB), which aims at filling this gap. XWeB derives from the relational decision support benchmark TPC-H. It is mainly composed of a test data warehouse that is based on a unified reference model for XML warehouses and that features XML-specific structures, and its associate XQuery decision support workload. XWeB's usage is illustrated by experiments on several XML database management systems.
