Source author record

Wolfgang Lehner

Wolfgang Lehner appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Databases, Computation and Language, Machine Learning

Catalog footprint

What is connected

6 works

3 topics

4 close collaborators


preprint · 2020 · arXiv

Machine Learning-based Cardinality Estimation in DBMS on Pre-Aggregated Data

Cardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches can deliver more accurate cardinality estimations than traditional approaches. However, a lot of example queries have to be executed during the model training phase to learn a data-dependent ML model leading to a very time-consuming training phase. Many of those example queries use the same base data, have the same query structure, and only differ in their predicates. Thus, index structures appear to be an ideal optimization technique at first glance. However, their benefit is limited. To speed up this model training phase, our core idea is to determine a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a specific aggregate-enabled training phase for ML-based cardinality estimation approaches in this paper. As we are going to show with different workloads in our evaluation, we are able to achieve an average speedup of 63 with our aggregate-enabled training phase.
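The pre-aggregation idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the example queries are simple COUNT(*) queries with predicates over tuple attributes, and the helper names (`pre_aggregate`, `cardinality_base`, `cardinality_aggregated`) and the toy data are hypothetical.

```python
from collections import Counter

def pre_aggregate(rows):
    """Collapse identical tuples into tuple -> count, independent of any predicate."""
    return Counter(rows)

def cardinality_base(rows, predicate):
    """Ground-truth cardinality by scanning every base row."""
    return sum(1 for row in rows if predicate(row))

def cardinality_aggregated(aggregate, predicate):
    """Same cardinality, computed by scanning only the distinct tuples."""
    return sum(count for row, count in aggregate.items() if predicate(row))

# Toy base data (hypothetical): two integer attributes per row.
rows = [(1, 10), (1, 10), (2, 20), (2, 20), (2, 20), (3, 10)]
aggregate = pre_aggregate(rows)

# One example query; others would differ only in the predicate.
predicate = lambda row: row[0] >= 2 and row[1] <= 20

base_result = cardinality_base(rows, predicate)
aggregated_result = cardinality_aggregated(aggregate, predicate)
```

Both functions return the same label for a training query, but the aggregated variant evaluates the predicate once per distinct tuple instead of once per row, which is where a speedup would come from when many example queries share the same base data.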

preprint · 2020 · arXiv

MorphStore: Analytical Query Engine with a Holistic Compression-Enabled Processing Model

In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Basically, compression using lightweight integer compression algorithms already plays an important role in existing in-memory column-store database systems, but mainly for base data. In particular, during query processing, these systems only keep the data compressed until an operator cannot process the compressed data directly, whereupon the data is decompressed, but not recompressed. Thus, the full potential of compression during query processing is not exploited. To overcome that, we developed a novel compression-enabled processing model as presented in this paper. As we are going to show, the continuous usage of compression for all base data and all intermediates is very beneficial to reduce the overall memory footprint as well as to improve the query performance.
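To make the notion of processing compressed intermediates concrete, here is a small sketch. It does not use MorphStore's API or its compression formats: run-length encoding stands in as one example of a lightweight integer compression scheme, and all function names are hypothetical.

```python
def rle_encode(values):
    """Run-length encode a list of integers into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, length) for value, length in runs]

def rle_select(runs, predicate):
    """Filter operator: consumes and produces run-length encoded data.

    The output is valid RLE, though adjacent runs are not merged.
    """
    return [(value, length) for value, length in runs if predicate(value)]

def rle_sum(runs):
    """Aggregate directly on the compressed form, without decompressing."""
    return sum(value * length for value, length in runs)

column = [5, 5, 5, 7, 7, 2, 2, 2, 2]
runs = rle_encode(column)                        # compressed base data
selected = rle_select(runs, lambda v: v >= 5)    # compressed intermediate
total = rle_sum(selected)
```

The intermediate result of the selection never exists in uncompressed form, which mirrors the abstract's point that both base data and intermediates can stay compressed throughout a query.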

preprint · 2020 · arXiv

RETRO: Relation Retrofitting For In-Database Machine Learning on Textual Data

There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naive one-to-one mapping of each word in a database to a word embedding vector is not sufficient and would lead to poor accuracies in ML tasks. Thus, we argue to additionally incorporate the information given by the database schema into the embedding, e.g. which words appear in the same column or are related to each other. In this paper, we propose RETRO (RElational reTROfitting), a novel approach to learn numerical representations of text values in databases, capturing the best of both worlds, the rich information encoded by word embeddings and the relational information encoded by database tables. We formulate relation retrofitting as a learning problem and present an efficient algorithm solving it. We investigate the impact of various hyperparameters on the learning problem and derive good settings for all of them. Our evaluation shows that the proposed embeddings are ready-to-use for many ML tasks such as classification and regression and even outperform state-of-the-art techniques in integration tasks such as null value imputation and link prediction.

preprint · 2015 · arXiv

GraphVista: Interactive Exploration Of Large Graphs

The potential to gain business insights from graph-structured data through graph analytics is increasingly attracting companies from a variety of industries, ranging from web companies to traditional enterprise businesses. To analyze a graph, a user often executes isolated graph queries using a dedicated interface, either a procedural graph programming interface or a declarative graph query language. The results are then returned and displayed using a specific visualization technique. This follows the classical ad-hoc Query→Result interaction paradigm and often requires multiple query iterations until an interesting aspect in the graph data is identified. This is caused on the one hand by the schema flexibility of graph data and on the other hand by the intricacies of declarative graph query languages. To lower the burden for the user to explore an unknown graph without prior knowledge of a graph query language, visual graph exploration provides an effective and intuitive query interface to navigate through the graph interactively. We demonstrate GRAPHVISTA, a graph visualization and exploration tool that can seamlessly combine ad-hoc querying and interactive graph exploration within the same query session. In our demonstration, conference attendees will see GRAPHVISTA running against a large real-world graph data set. They will start by identifying entry points of interest with the help of ad-hoc queries and will then discover the graph interactively through visual graph exploration.

preprint · 2014 · arXiv

GRAPHITE: An Extensible Graph Traversal Framework for Relational Database Management Systems

Graph traversals are a basic but fundamental ingredient for a variety of graph algorithms and graph-oriented queries. To achieve the best possible query performance, they need to be implemented at the core of a database management system that aims at storing, manipulating, and querying graph data. Increasingly, modern business applications demand native graph query and processing capabilities for enterprise-critical operations on data stored in relational database management systems. In this paper we propose an extensible graph traversal framework (GRAPHITE) as a central graph processing component on a common storage engine inside a relational database management system. We study the influence of the graph topology on the execution time of graph traversals and derive two traversal algorithm implementations specialized for different graph topologies and traversal queries. We conduct extensive experiments on GRAPHITE for a large variety of real-world graph data sets and input configurations. Our experiments show that the proposed traversal algorithms differ by up to two orders of magnitude for different input configurations and therefore demonstrate the need for a versatile framework to efficiently process graph traversals on a wide range of different graph topologies and types of queries. Finally, we highlight that the query performance of our traversal implementations is competitive with those of two native graph database management systems.
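As a rough illustration of a traversal over relationally stored graph data, the sketch below runs a level-synchronous breadth-first traversal over an edge table of (source, target) rows. It is a generic textbook technique, not one of the two GRAPHITE traversal implementations, which the abstract does not specify; the function names and the sample edge table are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edge_table):
    """Index an edge table of (source, target) rows by source vertex."""
    adjacency = defaultdict(list)
    for source, target in edge_table:
        adjacency[source].append(target)
    return adjacency

def traverse(edge_table, start, max_depth):
    """Return all vertices reachable from start within max_depth hops."""
    adjacency = build_adjacency(edge_table)
    visited = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        if not next_frontier:
            break  # no new vertices discovered at this level
        frontier = next_frontier
    return visited

edges = [(1, 2), (1, 3), (2, 4), (4, 5)]
two_hops = traverse(edges, start=1, max_depth=2)
all_reachable = traverse(edges, start=1, max_depth=10)
```

How well such a frontier-based strategy performs depends on the shape of the graph, which is consistent with the abstract's observation that traversal algorithms can differ widely across topologies and query types.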

preprint · 2012 · arXiv

Identifying And Weighting Integration Hypotheses On Open Data Platforms

Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as a new way of dealing with these problems. However, these methods still require input in the form of specific questions or tasks that can be passed to the crowd. This paper discusses integration problems on Open Data Platforms, and proposes a method for identifying and ranking integration hypotheses in this context. We will evaluate our findings by conducting a comprehensive evaluation on one of the largest Open Data platforms.
