Source author record

Simon Hengchen

Simon Hengchen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computation and Language · Machine Learning · cs.CY

Catalog footprint

What is connected

7 works

3 topics

4 close collaborators

preprint · 2021 · arXiv

Challenges for Computational Lexical Semantic Change

The computational study of lexical semantic change (LSC) has taken off in the past few years and we are seeing increasing interest in the field, from both computational sciences and linguistics. Most of the research so far has focused on methods for modelling and detecting semantic change using large diachronic textual data, with the majority of the approaches employing neural embeddings. While methods that offer easy modelling of diachronic text are one of the main reasons for the spiking interest in LSC, neural models leave many aspects of the problem unsolved. The field has several open and complex challenges. In this chapter, we aim to describe the most important of these challenges and outline future directions.

preprint · 2021 · arXiv

Lexical semantic change for Ancient Greek and Latin

Change and its precondition, variation, are inherent in languages. Over time, new words enter the lexicon, others become obsolete, and existing words acquire new senses. Associating a word's correct meaning in its historical context is a central challenge in diachronic research. Historical corpora of classical languages, such as Ancient Greek and Latin, typically come with rich metadata, yet existing models are limited by their inability to exploit contextual information beyond the document timestamp. While embedding-based methods feature among the current state-of-the-art systems, they lack interpretative power. In contrast, Bayesian models provide explicit and interpretable representations of semantic change phenomena. In this chapter we build on GASC, a recent computational approach to semantic change based on a dynamic Bayesian mixture model. In this model, the evolution of word senses over time is based not only on distributional information of a lexical nature, but also on text genres. We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models. On top of providing a full description of meaning change over time, we show that Bayesian mixture models are highly competitive approaches for detecting binary semantic change in both Ancient Greek and Latin.

preprint · 2020 · arXiv

SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.

preprint · 2019 · arXiv

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is time-consuming, and a large share of automatic approaches rely on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.

preprint · 2019 · arXiv

GASC: Genre-Aware Semantic Change for Ancient Greek

Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word's correct meaning in its historical context is a central challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document time-stamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy (the fact that a word has several senses) and semantic change (the process of acquiring, losing, or changing senses), and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts' genre to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, our model achieves improved predictive performance compared to the state of the art.

preprint · 2019 · arXiv

Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change

State-of-the-art models of lexical semantic change detection suffer from noise stemming from vector space alignment. We empirically test the Temporal Referencing method for lexical semantic change and show that, by avoiding alignment, it is less affected by this noise. We show that, trained on a diachronic corpus, the skip-gram with negative sampling architecture with temporal referencing outperforms alignment models on a synthetic task as well as a manual test set. We introduce a principled way to simulate lexical semantic change and systematically control for possible biases.

preprint · 2016 · arXiv

How hot is .brussels? Impact of the uptake of the .brussels top-level domain name extension

The opening up of the top-level domain name market in 2012 has offered new perspectives for companies, administrations and individuals to include a geographic component within the domain name of their website. Little to no research has been carried out since then to analyse the uptake of the new top-level domain names (TLDN). Based on the specific case of the TLDN .brussels, this article proposes an empirical study of how the opening up of the top-level domain name market actually impacts registration practices. By making use of freely available software tools such as OpenRefine and Natural Language Processing (NLP) methods, the entire corpus of the .brussels domain names (6300) was analysed from a quantitative perspective. Based on a statistically representative sample, a qualitative interpretation allowed for a more fine-grained analysis of how the new TLDN is being used in practice. By doing so, the article gives a detailed insight into the impact of the recent changes of the rules concerning domain name registration. Researchers, policy makers, investors and anyone who cares about the Brussels identity in the digital realm can gain through this analysis a better understanding of the state of play of the .brussels TLDN.
