Source author record

Erhard Rahm

Erhard Rahm appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Databases; Distributed, Parallel, and Cluster Computing; cs.CY; Information Retrieval; Machine Learning

Catalog footprint

What is connected

12 works

5 topics

4 close collaborators

preprint 2026 arXiv

Combining Time-Series and Graph Data: A Survey of Existing Systems and Approaches

We provide a comprehensive overview of current approaches and systems for combining graph and time-series data. We categorize existing systems into four architectural categories and analyze how these systems meet different requirements and exhibit distinct implementation characteristics to support both data types in a unified manner. Our overview aims to help readers understand and evaluate current options and trade-offs, such as the degree of cross-model integration, maturity, and openness.

preprint 2021 arXiv

EAGER: Embedding-Assisted Entity Resolution for Knowledge Graphs

Entity Resolution (ER) is a constituent part of integrating different knowledge graphs in order to identify entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computation for such embeddings translates to calculating the distance between them in the embedding space, which is comparatively simple. However, previous work has shown that the use of graph embeddings alone is not sufficient to achieve high ER quality. We therefore propose a more comprehensive ER approach for knowledge graphs called EAGER (Embedding-Assisted Knowledge Graph Entity Resolution) to flexibly utilize both the similarity of graph embeddings and attribute values within a supervised machine learning approach. We evaluate our approach on 23 benchmark datasets with differently sized and structured knowledge graphs and use hypothesis tests to ensure statistical significance of our results. Furthermore, we compare our approach with state-of-the-art ER solutions, where our approach yields competitive results for table-oriented ER problems and shallow knowledge graphs but much better results for deeper knowledge graphs.

preprint 2015 arXiv

GRADOOP: Scalable Graph Data Management and Analytics with Hadoop

Many Big Data applications in business and science require the management and analysis of huge amounts of graph data. Previous approaches for graph analytics such as graph databases and parallel graph processing systems (e.g., Pregel) either lack sufficient scalability or flexibility and expressiveness. We are therefore developing a new end-to-end approach for graph data management and analysis based on the Hadoop ecosystem, called Gradoop (Graph analytics on Hadoop). Gradoop is designed around the so-called Extended Property Graph Data Model (EPGM) supporting semantically rich, schema-free graph data within many distinct graphs. A set of high-level operators is provided for analyzing both single graphs and collections of graphs. Based on these operators, we propose a domain-specific language to define analytical workflows. The Gradoop graph store is currently utilizing HBase for distributed storage of graph data in Hadoop clusters. An initial version of Gradoop has been used to analyze graph data for business intelligence and social network analysis.

preprint 2015 arXiv

Semi-automatic identification of counterfeit offers in online shopping platforms

Product counterfeiting is a serious problem causing the industry estimated losses of billions of dollars every year. With the increasing spread of e-commerce, the number of counterfeit products sold online has increased substantially. We propose the adoption of a semi-automatic workflow to identify likely counterfeit offers in online platforms and to present these offers to a domain expert for manual verification. The workflow includes steps to generate search queries for relevant product offers, to match and cluster similar product offers, and to assess the counterfeit suspiciousness based on different criteria. The goal is to support the periodic identification of many counterfeit offers with a limited amount of manual effort. We explain how the proposed approach can be realized. We also present a preliminary evaluation of its most important steps on a case study using the eBay platform.

preprint 2012 arXiv

How do Ontology Mappings Change in the Life Sciences?

Mappings between related ontologies are increasingly used to support data integration and analysis tasks. Changes in the ontologies also require the adaptation of ontology mappings. So far, the evolution of ontology mappings has received little attention, although ontologies change continuously, especially in the life sciences. We therefore analyze how mappings between popular life science ontologies evolve for different match algorithms. We also evaluate which semantic ontology changes primarily affect the mappings. We further investigate alternatives to predict or estimate the degree of future mapping changes based on previous ontology and mapping transitions.

preprint 2011 arXiv

Load Balancing for MapReduce-based Entity Resolution

The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.
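The skew-handling idea in this abstract can be illustrated with a small single-machine sketch. It is a simplification under stated assumptions, not the algorithm from the paper: the function names, the `num_splits` parameter, the "larger than the average reducer load" split rule and the greedy assignment are all hypothetical choices made for illustration. A counting pass stands in for the preprocessing job that analyzes the data distribution, and oversized blocks are cut into sub-blocks so that their comparisons can be spread over several reduce tasks.

```python
import heapq
from collections import Counter
from itertools import combinations


def pair_count(n):
    """Number of pairwise comparisons inside a block of n entities."""
    return n * (n - 1) // 2


def plan_match_tasks(block_keys, num_reducers, num_splits):
    """Turn a list of blocking keys (one per entity) into match tasks.

    Hypothetical rule: a block is split when its comparisons exceed the
    average load per reducer.
    """
    sizes = Counter(block_keys)  # stands in for the preprocessing job
    total = sum(pair_count(n) for n in sizes.values())
    average = total / num_reducers
    tasks = []  # (label, number of comparisons)
    for block, n in sizes.items():
        if pair_count(n) <= average:
            tasks.append(((block,), pair_count(n)))
            continue
        # Cut the large block into sub-blocks of nearly equal size.
        base, extra = divmod(n, num_splits)
        parts = [base + (1 if i < extra else 0) for i in range(num_splits)]
        # One task per sub-block ...
        for i, size in enumerate(parts):
            tasks.append(((block, i), pair_count(size)))
        # ... and one task per pair of sub-blocks, so no comparison is lost.
        for i, j in combinations(range(num_splits), 2):
            tasks.append(((block, i, j), parts[i] * parts[j]))
    return tasks


def assign_greedily(tasks, num_reducers):
    """Assign the largest task first to the currently least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    loads = [0] * num_reducers
    assignment = {}
    for label, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[label] = reducer
        loads[reducer] = load + cost
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads
```

With six entities in block "a" and two each in blocks "b" and "c", the unsplit plan would put 15 of the 17 comparisons on one reducer. Splitting "a" into two sub-blocks keeps all 17 comparisons but lets two reducers share them.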

preprint 2011 arXiv

Rule-based Construction of Matching Processes

Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment or model management. To speed up that process, automatic matching systems were developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires high manual effort by matching experts as well as correct mappings to evaluate generated mappings. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology and model management domains. The evaluation shows that our system is able to robustly return good quality mappings across different mapping problems and domains.

preprint 2010 arXiv

Data Partitioning for Parallel Entity Matching

Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching, we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both blocking to reduce the search space for matching and parallel matching to improve efficiency. Special attention is given to the number and size of data partitions as they impact the overall communication overhead and memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies for matching real-world product data of different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
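As a rough illustration of how partitioning yields independent match tasks, the sketch below cuts the input into partitions of bounded size and creates one task per partition and one per pair of partitions. This is a minimal sketch assuming a simple size-based split; the function names and the `max_size` parameter are hypothetical, and the abstract does not confirm that any of the paper's strategies works exactly this way.

```python
from itertools import combinations_with_replacement


def size_based_partitions(entities, max_size):
    """Cut the input into consecutive partitions of at most max_size entities."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [entities[i:i + max_size] for i in range(0, len(entities), max_size)]


def match_tasks(partitions):
    """One task per partition (i, i) and one per pair of partitions (i, j), i < j.

    Each task only needs its one or two partitions in memory, so tasks can
    run independently of each other.
    """
    return list(combinations_with_replacement(range(len(partitions)), 2))


def comparisons_in_task(partitions, task):
    """Number of entity pairs a single match task has to compare."""
    i, j = task
    if i == j:
        n = len(partitions[i])
        return n * (n - 1) // 2
    return len(partitions[i]) * len(partitions[j])
```

The trade-off mentioned in the abstract shows up directly: a smaller `max_size` lowers the memory needed per task but increases the number of tasks, and with it the number of times each partition has to be shipped to a worker.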

preprint 2010 arXiv

Evaluation of Query Generators for Entity Search Engines

Dynamic web applications such as mashups need efficient access to web data that is only accessible via entity search engines (e.g. product or publication search engines). However, most current mashup systems and applications only support simple keyword searches for retrieving data from search engines. We propose the use of more powerful search strategies building on so-called query generators. For a given set of entities, query generators are able to automatically determine a set of search queries to retrieve these entities from an entity search engine. We demonstrate the usefulness of query generators for on-demand web data integration and evaluate the effectiveness and efficiency of query generators for a challenging real-world integration scenario.

preprint 2010 arXiv

Parallel Sorted Neighborhood Blocking with MapReduce

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply a tailored data replication.
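For readers unfamiliar with Sorted Neighborhood blocking, the following sketch shows the standard sequential form of the technique: sort the records by a blocking key, slide a fixed-size window over the sorted order, and compare only records that fall into the same window. It is not one of the two MapReduce implementations from the paper; the function name and its parameters are chosen for illustration only.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Return candidate pairs for matching.

    records: list of records
    key:     function that derives the blocking (sort) key from a record
    window:  number of consecutive sorted records that are compared
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        # Pair each record with the next (window - 1) records in sort order.
        for right in ordered[i + 1:i + window]:
            pairs.append((left, right))
    return pairs
```

The sequential version makes the parallelization challenge visible: when the sorted list is cut into ranges for different reduce tasks, records near a range boundary still need to be compared with their neighbors in the adjacent range.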

preprint 2010 arXiv

Rule-based Generation of Diff Evolution Mappings between Ontology Versions

Ontologies such as taxonomies, product catalogs or web directories are heavily used and hence evolve frequently to meet new requirements or to better reflect the current instance data of a domain. To effectively manage the evolution of ontologies it is essential to identify the difference (Diff) between two ontology versions. We propose a novel approach to determine an expressive and invertible diff evolution mapping between given versions of an ontology. Our approach utilizes the result of a match operation to determine an evolution mapping consisting of a set of basic change operations (insert/update/delete). To semantically enrich the evolution mapping we adopt a rule-based approach to transform the basic change operations into a smaller set of more complex change operations, such as merge, split, or changes of entire subgraphs. The proposed algorithm is customizable in different ways to meet the requirements of diverse ontologies and application scenarios. We evaluate the proposed approach by determining and analyzing evolution mappings for real-world life science ontologies and web directories.
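The basic change operations and the invertibility property can be illustrated with a small sketch. It makes strong simplifying assumptions that are not from the paper: an ontology version is modeled as a dictionary from concept ID to label, and two concepts count as matched when they share an ID, which stands in for the result of a real match operation. The rule-based step that aggregates basic operations into complex ones such as merge or split is not shown. All names are hypothetical.

```python
def basic_diff(old, new):
    """Derive basic change operations between two versions.

    Each operation is (kind, concept_id, before, after). Keeping both the
    old and the new value is what makes the mapping invertible.
    """
    ops = []
    for cid in sorted(old.keys() - new.keys()):
        ops.append(("delete", cid, old[cid], None))
    for cid in sorted(new.keys() - old.keys()):
        ops.append(("insert", cid, None, new[cid]))
    for cid in sorted(old.keys() & new.keys()):
        if old[cid] != new[cid]:
            ops.append(("update", cid, old[cid], new[cid]))
    return ops


def invert(ops):
    """Build the inverse mapping: swap insert/delete and before/after."""
    flipped = {"insert": "delete", "delete": "insert", "update": "update"}
    return [(flipped[kind], cid, after, before) for kind, cid, before, after in ops]


def apply_ops(version, ops):
    """Apply change operations to a copy of the given version."""
    result = dict(version)
    for kind, cid, _before, after in ops:
        if kind == "delete":
            del result[cid]
        else:  # insert or update
            result[cid] = after
    return result
```

Applying the diff to the old version yields the new version, and applying the inverted diff to the new version restores the old one.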

preprint 2010 arXiv

Target-driven merging of Taxonomies

The proliferation of ontologies and taxonomies in many domains increasingly demands the integration of multiple such ontologies. The goal of ontology integration is to merge two or more given ontologies in order to provide a unified view on the input ontologies while maintaining all information coming from them. We propose a new taxonomy merging algorithm that, given as input two taxonomies and an equivalence matching between them, can generate an integrated taxonomy in a fully automatic manner. The approach is target-driven, i.e. we merge a source taxonomy into the target taxonomy and preserve the structure of the target ontology as much as possible. We also discuss how to extend the merge algorithm providing auxiliary information, like additional relationships between source and target concepts, in order to semantically improve the final result. The algorithm was implemented in a working prototype and evaluated using synthetic and real-world scenarios.
