Source author record

Ansgar Scherp

Ansgar Scherp appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Databases Information Retrieval Computation and Language Machine Learning Digital Libraries

Catalog footprint

What is connected

6works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Bag-of-Words vs. Graph vs. Sequence in Text Classification: Questioning the Necessity of Text-Graphs and the Surprising Strength of a Wide MLP

Graph neural networks have triggered a resurgence of graph-based text classification methods, defining today's state of the art. We show that a wide multi-layer perceptron (MLP) using a Bag-of-Words (BoW) outperforms the recent graph-based models TextGCN and HeteGCN in an inductive text classification setting and is comparable with HyperGAT. Moreover, we fine-tune a sequence-based BERT and a lightweight DistilBERT model, which both outperform all state-of-the-art models. These results question the importance of synthetic graphs used in modern text classifiers. In terms of efficiency, DistilBERT is still twice as large as our BoW-based wide MLP, while graph-based models like TextGCN require setting up an $\mathcal{O}(N^2)$ graph, where $N$ is the vocabulary plus corpus size. Finally, since Transformers need to compute $\mathcal{O}(L^2)$ attention weights with sequence length $L$, the MLP models show higher training and inference speeds on datasets with long sequences.

preprint2022arXiv

General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings

Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive to DistilBERT on question similarity and recognizing textual entailment, but uses only half of the number of parameters and is three times faster in terms of inference speed. We match or exceed the scores of ELMo for all tasks of the GLUE benchmark except for the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. However, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLM into competitive models and motivates further research in this direction.

preprint2021arXiv

FLUID: A Common Model for Semantic Structural Graph Summaries Based on Equivalence Relations

Summarization is a widespread method for handling very large graphs. The task of structural graph summarization is to compute a concise but meaningful synopsis of the key structural information of a graph. As summaries may be used for many different purposes, there is no single concept or model of graph summaries. We have studied existing structural graph summaries for large-scale (semantic) graphs. Despite their different concepts and purposes, we found commonalities in the graph structures they capture. We use these commonalities to provide for the first time a formally defined common model, FLUID (FLexible graph sUmmarIes for Data graphs), that allows us to flexibly define structural graph summaries. FLUID allows graph summaries to be quickly defined, adapted, and compared for different purposes and datasets. To this end, FLUID provides features of structural summarization based on equivalence relations such as distinction of types and properties, direction of edges, bisimulation, and inference. We conduct a detailed complexity analysis of the features provided by FLUID. We show that graph summaries defined with FLUID can be computed in the worst case in time $\mathcal{O}(n^2)$ w.r.t. $n$, the number of edges in the data graph. An empirical analysis of large-scale web graphs with billions of edges indicates a typical running time of $Θ(n)$. Based on the formal FLUID model, one can quickly define and modify various structural graph summaries from the literature and beyond.

preprint2020arXiv

Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search -- Extended Technical Report

Indexing the Web of Data offers many opportunities, in particular, to find and explore data sources. One major design decision when indexing the Web of Data is to find a suitable index model, i.e., how to index and summarize data. Various efforts have been conducted to develop specific index models for a given task. With each index model designed, implemented, and evaluated independently, it remains difficult to judge whether an approach generalizes well to another task, set of queries, or dataset. In this work, we empirically evaluate six representative index models with unique feature combinations. Among them is a new index model incorporating inferencing over RDFS and owl:sameAs. We implement all index models for the first time into a single, stream-based framework. We evaluate variations of the index models considering sub-graphs of size 0, 1, and 2 hops on two large, real-world datasets. We evaluate the quality of the indices regarding the compression ratio, summarization ratio, and F1-score denoting the approximation quality of the stream-based index computation. The experiments reveal huge variations in compression ratio, summarization ratio, and approximation quality for different index models, queries, and datasets. However, we observe meaningful correlations in the results that help to determine the right index model for a given task, type of query, and dataset.

preprint2016arXiv

Profiling vs. Time vs. Content: What does Matter for Top-k Publication Recommendation based on Twitter Profiles? - An Extended Technical Report

So far it is unclear how different factors of a scientific publication recommender system based on users' tweets have an influence on the recommendation performance. We examine three different factors, namely profiling method, temporal decay, and richness of content. Regarding profiling, we compare CF-IDF that replaces terms in TF-IDF by semantic concepts, HCF-IDF as novel hierarchical variant of CF-IDF, and topic modeling. As temporal decay functions, we apply sliding window and exponential decay. In terms of the richness of content, we compare recommendations using both full-texts and titles of publications and using only titles. Overall, the three factors make twelve recommendation strategies. We have conducted an online experiment with 123 participants and compared the strategies in a within-group design. The best recommendations are achieved by the strategy combining CF-IDF, sliding window, and with full-texts. However, the strategies using the novel HCF-IDF profiling method achieve similar results with just using the titles of the publications. Therefore, HCF-IDF can make recommendations when only short and sparse data is available.

preprint2016arXiv

TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud - An Extended Technical Report

Deciding which vocabulary terms to use when modeling data as Linked Open Data (LOD) is far from trivial. Choosing too general vocabulary terms, or terms from vocabularies that are not used by other LOD datasets, is likely to lead to a data representation, which will be harder to understand by humans and to be consumed by Linked data applications. In this technical report, we propose TermPicker: a novel approach for vocabulary reuse by recommending RDF types and properties based on exploiting the information on how other data providers on the LOD cloud use RDF types and properties to describe their data. To this end, we introduce the notion of so-called schema-level patterns (SLPs). They capture how sets of RDF types are connected via sets of properties within some data collection, e.g., within a dataset on the LOD cloud. TermPicker uses such SLPs and generates a ranked list of vocabulary terms for reuse. The lists of recommended terms are ordered by a ranking model which is computed using the machine learning approach Learning To Rank (L2R). TermPicker is evaluated based on the recommendation quality that is measured using the Mean Average Precision (MAP) and the Mean Reciprocal Rank at the first five positions (MRR@5). Our results illustrate an improvement of the recommendation quality by 29% - 36% when using SLPs compared to the beforehand investigated baselines of recommending solely popular vocabulary terms or terms from the same vocabulary. The overall best results are achieved using SLPs in conjunction with the Learning To Rank algorithm Random Forests.

Ansgar Scherp

What is connected

Connect this record

See the researcher in context

Building this map preview

6 published item(s)

Bag-of-Words vs. Graph vs. Sequence in Text Classification: Questioning the Necessity of Text-Graphs and the Surprising Strength of a Wide MLP

General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings

FLUID: A Common Model for Semantic Structural Graph Summaries Based on Equivalence Relations

Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search -- Extended Technical Report

Profiling vs. Time vs. Content: What does Matter for Top-k Publication Recommendation based on Twitter Profiles? - An Extended Technical Report

TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud - An Extended Technical Report