Source author record

Hafsteinn Einarsson

Hafsteinn Einarsson appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computation and Language math.CO math.PR

Catalog footprint

What is connected

5 works

3 topics

4 close collaborators


preprint, 2022, arXiv

A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models

We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain (TLD). Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we translate and adapt the WinoGrande dataset for co-reference resolution. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.

preprint, 2022, arXiv

Building an Icelandic Entity Linking Corpus

In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach of using a multilingual entity linking model (mGENRE) in combination with Wikipedia API Search (WAPIS) to label our data and compare it to an approach using WAPIS only. We find that our combined method reaches 53.9% coverage on our corpus, compared to 30.9% using only WAPIS. We analyze our results and explain the value of using a multilingual system when working with Icelandic. Additionally, we analyze the data that remain unlabeled, identify patterns and discuss why they may be more difficult to annotate.

preprint, 2022, arXiv

Cross-Lingual QA as a Stepping Stone for Monolingual Open QA in Icelandic

It can be challenging to build effective open question answering (open QA) systems for languages other than English, mainly due to a lack of labeled data for training. We present a data efficient method to bootstrap such a system for languages other than English. Our approach requires only limited QA resources in the given language, along with machine-translated data, and at least a bilingual language model. To evaluate our approach, we build such a system for the Icelandic language and evaluate performance over trivia style datasets. The corpora used for training are English in origin but machine translated into Icelandic. We train a bilingual Icelandic/English language model to embed English context and Icelandic questions following methodology introduced with DensePhrases (Lee et al., 2021). The resulting system is an open domain cross-lingual QA system between Icelandic and English. Finally, the system is adapted for Icelandic only open QA, demonstrating how it is possible to efficiently create an open QA system with limited access to curated datasets in the language of interest.

preprint, 2015, arXiv

Bootstrap percolation with inhibition

Bootstrap percolation is a prominent framework for studying the spreading of activity on a graph. We begin with an initial set of active vertices. The process then proceeds in rounds, and further vertices become active as soon as they have a certain number of active neighbors. A recurring feature in bootstrap percolation theory is an 'all-or-nothing' phenomenon: either the size of the starting set is so small that the process stops very soon, or it percolates (almost) completely. Motivated by several important phenomena observed in various types of real-world networks, we propose in this work a variant of bootstrap percolation that exhibits a vastly different behavior. Our graphs have two types of vertices: some of them obstruct the diffusion, while the others facilitate it. We study the effect of this setting by analyzing the process on Erdős-Rényi random graphs. Our main findings are two-fold. First, we show that the presence of vertices hindering the diffusion does not result in a stable behavior: tiny changes in the size of the starting set can dramatically influence the size of the final active set. In particular, the process is non-monotone: a larger starting set can result in a smaller final set. In the second part of the paper we show that this phenomenon arises from the round-based approach: if we move to a continuous time model in which every edge draws its transmission time randomly, then we gain stability, and the process stops with an active set that contains a non-trivial constant fraction of all vertices. Moreover, we show that in the continuous time model percolation occurs significantly faster compared to the classical round-based model. Our findings are in line with empirical observations and demonstrate the importance of introducing various types of vertex behaviors in the mathematical model.
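The round-based process described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's model: the abstract does not state the exact activation rule, so the rule used here (active facilitating neighbors minus active obstructing neighbors must reach a threshold `k`) is an assumption, and the names `erdos_renyi`, `bootstrap_percolation`, `k` and `inhibitory` are hypothetical. With an empty `inhibitory` set the sketch reduces to classical bootstrap percolation.

```python
import random


def erdos_renyi(n, p, rng):
    """Return adjacency lists of a G(n, p) random graph."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def bootstrap_percolation(adj, start, k, inhibitory=frozenset()):
    """Run the round-based process; return (final active set, rounds).

    Assumed rule (not taken from the paper): an inactive vertex becomes
    active when its active non-inhibitory neighbors minus its active
    inhibitory neighbors is at least k. Active vertices stay active.
    """
    active = set(start)
    rounds = 0
    while True:
        newly = set()
        for v in range(len(adj)):
            if v in active:
                continue
            score = 0
            for u in adj[v]:
                if u in active:
                    score += -1 if u in inhibitory else 1
            if score >= k:
                newly.add(v)
        if not newly:
            return active, rounds
        # All activations of a round are applied simultaneously.
        active |= newly
        rounds += 1


# Complete graph on 4 vertices, vertex 2 obstructs the diffusion.
k4 = [[v for v in range(4) if v != u] for u in range(4)]
small, _ = bootstrap_percolation(k4, {0, 1}, k=2, inhibitory={2})
large, _ = bootstrap_percolation(k4, {0, 1, 2}, k=2, inhibitory={2})
print(len(small), len(large))  # prints "4 3"

# The same routine on a random graph; parameters are arbitrary.
g = erdos_renyi(200, 0.05, random.Random(0))
final, rounds = bootstrap_percolation(g, range(20), k=2, inhibitory=set(range(150, 200)))
```

Under this assumed rule the four-vertex example already shows the non-monotone effect the abstract mentions: adding the obstructing vertex to the starting set yields a smaller final set.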

preprint, 2014, arXiv

Connectivity Thresholds for Bounded Size Rules

In an Achlioptas process, starting with a graph that has n vertices and no edge, in each round $d \geq 1$ edges are drawn uniformly at random, and using some rule exactly one of them is chosen and added to the evolving graph. For the class of Achlioptas processes we investigate how much impact the rule has on one of the most basic properties of a graph: connectivity. Our main results are twofold. First, we study the prominent class of bounded size rules, which select the edge to add according to the component sizes of its vertices, treating all sizes larger than some constant equally. For such rules we provide a fine analysis that exposes the limiting distribution of the number of rounds until the graph gets connected, and we give a detailed picture of the dynamics of the formation of the single component from smaller components. Second, our results allow us to study the connectivity transition of all Achlioptas processes, in the sense that we identify a process that accelerates it as much as possible.
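As a rough illustration of the process in this abstract, the Python sketch below simulates an Achlioptas process with one possible bounded size rule and counts the rounds until the graph is connected. The specific rule (pick the candidate edge whose endpoint component sizes, each capped at `cap + 1`, have the smallest sum) is a hypothetical choice, not one analyzed in the paper. Drawing each candidate as a uniformly random pair of distinct vertices, with repeats allowed, is also an assumption.

```python
import random


class DisjointSet:
    """Union-find structure tracking component sizes and count."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.components = n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.components -= 1


def capped_size_sum(ds, edge, cap):
    """Sum of endpoint component sizes, all sizes above cap treated equally."""
    u, v = edge
    return min(ds.size[ds.find(u)], cap + 1) + min(ds.size[ds.find(v)], cap + 1)


def rounds_until_connected(n, d, cap, rng):
    """Simulate the process and return the number of rounds until connectivity."""
    ds = DisjointSet(n)
    rounds = 0
    while ds.components > 1:
        # Draw d candidate edges, then add exactly one of them.
        candidates = [tuple(rng.sample(range(n), 2)) for _ in range(d)]
        chosen = min(candidates, key=lambda e: capped_size_sum(ds, e, cap))
        ds.union(*chosen)
        rounds += 1
    return rounds


print(rounds_until_connected(n=500, d=2, cap=3, rng=random.Random(42)))
```

With `d=1` there is no choice to make and the sketch behaves like the plain random graph process, which gives a baseline for comparing rules.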
