Source author record

Casper Petersen

Casper Petersen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

5 works
3 topics
4 close collaborators

Published work

5 published items

Preprint · 2016 · arXiv

Adaptive Distributional Extensions to DFR Ranking

Divergence From Randomness (DFR) ranking models assume that informative terms are distributed in a corpus differently than non-informative terms. Different statistical models (e.g. Poisson, geometric) are used to model the distribution of non-informative terms, producing different DFR models. An informative term is then detected by measuring the divergence of its distribution from the distribution of non-informative terms. However, there is little empirical evidence that the distributions of non-informative terms used in DFR actually fit current datasets. Practically this risks providing a poor separation between informative and non-informative terms, thus compromising the discriminative power of the ranking model. We present a novel extension to DFR, which first detects the best-fitting distribution of non-informative terms in a collection, and then adapts the ranking computation to this best-fitting distribution. We call this model Adaptive Distributional Ranking (ADR) because it adapts the ranking to the statistics of the specific dataset being processed each time. Experiments on TREC data show ADR to outperform DFR models (and their extensions) and be comparable in performance to a query likelihood language model (LM).
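The step of detecting a best-fitting distribution can be illustrated with a minimal sketch. It assumes the candidates are a Poisson and a geometric distribution (the two named in the abstract), fits each by maximum likelihood to per-document term counts, and picks the one with the higher log-likelihood. The function names, the toy counts, and the selection criterion are illustrative assumptions; the abstract does not say how ADR performs the fit.

```python
import math

def poisson_loglik(counts):
    """Log-likelihood of the counts under a Poisson fitted by maximum likelihood."""
    lam = sum(counts) / len(counts)  # MLE of the Poisson rate is the sample mean
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def geometric_loglik(counts):
    """Log-likelihood under a geometric on {0, 1, 2, ...} fitted by maximum likelihood."""
    mean = sum(counts) / len(counts)
    p = 1.0 / (1.0 + mean)  # MLE of the success probability
    return sum(k * math.log(1.0 - p) + math.log(p) for k in counts)

def best_fitting(counts):
    """Name of the candidate with the highest log-likelihood (assumed criterion)."""
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-zero count")
    scores = {
        "poisson": poisson_loglik(counts),
        "geometric": geometric_loglik(counts),
    }
    return max(scores, key=scores.get)

# Hypothetical per-document frequencies of two terms (not data from the paper).
geometric_like = [0] * 50 + [1] * 25 + [2] * 12 + [3] * 6 + [4] * 3 + [5] * 2 + [6, 7]
poisson_like = ([0] * 5 + [1] * 15 + [2] * 22 + [3] * 22 + [4] * 17
                + [5] * 10 + [6] * 5 + [7] * 2 + [8, 9])

print(best_fitting(geometric_like))
print(best_fitting(poisson_like))
```

A real system would likely compare more candidate distributions and use a goodness-of-fit test rather than raw likelihood; this sketch only shows the selection idea.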

Preprint · 2016 · arXiv

Deep Learning Relevance: Creating Relevant Information (as Opposed to Retrieving it)

What if Information Retrieval (IR) systems did not just retrieve relevant information that is stored in their indices, but could also "understand" it and synthesise it into a single document? We present a preliminary study that makes a first step towards answering this question. Given a query, we train a Recurrent Neural Network (RNN) on existing relevant information to that query. We then use the RNN to "deep learn" a single, synthetic, and we assume, relevant document for that query. We design a crowdsourcing experiment to assess how relevant the "deep learned" document is, compared to existing relevant documents. Users are shown a query and four wordclouds (of three existing relevant documents and our deep learned synthetic document). The synthetic document is ranked on average most relevant of all.

Preprint · 2016 · arXiv

Exploiting the Bipartite Structure of Entity Grids for Document Coherence and Retrieval

Document coherence describes how much sense text makes in terms of its logical organisation and discourse flow. Even though coherence is a relatively difficult notion to quantify precisely, it can be approximated automatically. This type of coherence modelling is not only interesting in itself, but also useful for a number of other text processing tasks, including Information Retrieval (IR), where adjusting the ranking of documents according to both their relevance and their coherence has been shown to increase retrieval effectiveness [34,37]. The state of the art in unsupervised coherence modelling represents documents as bipartite graphs of sentences and discourse entities, and then projects these bipartite graphs into one-mode undirected graphs. However, one-mode projections may incur significant loss of the information present in the original bipartite structure. To address this we present three novel graph metrics that compute document coherence on the original bipartite graph of sentences and entities. Evaluation on standard settings shows that: (i) one of our coherence metrics beats the state of the art in terms of coherence accuracy; and (ii) all three of our coherence metrics improve retrieval effectiveness because, as closer analysis reveals, they capture aspects of document quality that go undetected by both keyword-based standard ranking and by spam filtering. This work contributes document coherence metrics that are theoretically principled, parameter-free, and useful to IR.
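The information loss mentioned in this abstract can be seen in a small sketch. It assumes an unweighted one-mode projection in which two sentences are linked whenever they share at least one entity; the helper names and the toy documents are hypothetical, and the paper's three bipartite metrics are not reproduced here.

```python
from itertools import combinations

def bipartite_graph(sentence_entities):
    """Edges (sentence index, entity) of the sentence-entity bipartite graph."""
    return {(i, e) for i, ents in enumerate(sentence_entities) for e in ents}

def one_mode_projection(sentence_entities):
    """Undirected sentence graph: link two sentences that share an entity."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sentence_entities), 2)
        if set(a) & set(b)
    }

# Two toy documents, each a list of per-sentence entity sets (hypothetical).
doc_a = [{"court", "judge"}, {"court", "judge"}]
doc_b = [{"court", "judge"}, {"court", "verdict"}]

# Both documents collapse to the same projected graph ...
print(one_mode_projection(doc_a) == one_mode_projection(doc_b))
# ... although their bipartite graphs differ.
print(bipartite_graph(doc_a) == bipartite_graph(doc_b))
```

After projection, the two documents are indistinguishable even though the second introduces a new entity, which is the kind of detail a metric defined on the bipartite graph can still see.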

Preprint · 2015 · arXiv

Entropy and Graph Based Modelling of Document Coherence using Discourse Entities: An Application

We present two novel models of document coherence and their application to information retrieval (IR). Both models approximate document coherence using discourse entities, e.g. the subject or object of a sentence. Our first model views text as a Markov process generating sequences of discourse entities (entity n-grams); we use the entropy of these entity n-grams to approximate the rate at which new information appears in text, reasoning that as more new words appear, the topic increasingly drifts and text coherence decreases. Our second model extends the work of Guinaudeau & Strube [28] that represents text as a graph of discourse entities, linked by different relations, such as their distance or adjacency in text. We use several graph topology metrics to approximate different aspects of the discourse flow that can indicate coherence, such as the average clustering or betweenness of discourse entities in text. Experiments with several instantiations of these models show that: (i) our models perform on a par with two other well-known models of text coherence even without any parameter tuning, and (ii) reranking retrieval results according to their coherence scores gives notable performance gains, confirming a relation between document coherence and relevance. This work contributes two novel models of document coherence, the application of which to IR complements recent work in the integration of document cohesiveness or comprehensibility to ranking [5, 56].
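The first model's idea, entropy over entity n-grams, can be sketched as follows. This assumes the plain Shannon entropy of the empirical n-gram distribution; the function name and the toy entity sequences are hypothetical, and the paper's exact formulation may differ.

```python
import math
from collections import Counter

def entity_ngram_entropy(entities, n=2):
    """Shannon entropy (in bits) of the empirical entity n-gram distribution."""
    ngrams = [tuple(entities[i:i + n]) for i in range(len(entities) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: no n-grams to measure
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical entity sequences extracted from two short texts.
focused = ["court", "judge", "court", "judge", "court", "judge"]
drifting = ["court", "judge", "weather", "football", "recipe", "stock"]

print(entity_ngram_entropy(focused))
print(entity_ngram_entropy(drifting))
```

Under the abstract's reasoning, the text that keeps introducing new entities yields the higher entropy and would be scored as less coherent.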

Preprint · 2015 · arXiv

Near-optimal adjacency labeling scheme for power-law graphs

An adjacency labeling scheme is a method that assigns labels to the vertices of a graph such that adjacency between vertices can be inferred directly from the assigned labels, without using a centralized data structure. We devise adjacency labeling schemes for the family of power-law graphs. This family has been used to model many types of networks, e.g. the Internet AS-level graph. Furthermore, we prove an almost matching lower bound for this family. We also provide an asymptotically near-optimal labeling scheme for sparse graphs. Finally, we validate the efficiency of our labeling scheme by an experimental evaluation using both synthetic data and real-world networks of up to hundreds of thousands of vertices.
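To make the notion of an adjacency labeling scheme concrete, here is a minimal sketch of one simple degree-threshold scheme. It is an illustration under stated assumptions, not necessarily the construction analysed in the paper: low-degree ("thin") vertices store their neighbour list, and high-degree ("fat") vertices store one bit per fat vertex. The function names, the threshold value, and the toy graph are hypothetical.

```python
def build_labels(n, edges, threshold):
    """Assign each vertex a label from which adjacency can be decoded."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    fat = sorted(v for v in adj if len(adj[v]) >= threshold)
    fat_index = {v: i for i, v in enumerate(fat)}
    labels = {}
    for v in range(n):
        if v in fat_index:
            # Fat vertex: one bit per fat vertex, set if the two are adjacent.
            bits = tuple(f in adj[v] for f in fat)
            labels[v] = ("fat", v, fat_index[v], bits)
        else:
            # Thin vertex: explicit list of all neighbours (fewer than threshold).
            labels[v] = ("thin", v, frozenset(adj[v]))
    return labels

def adjacent(label_a, label_b):
    """Decide adjacency from the two labels alone, with no global structure."""
    if label_a[0] == "thin":
        return label_b[1] in label_a[2]
    if label_b[0] == "thin":
        return label_a[1] in label_b[2]
    return label_a[3][label_b[2]]  # both fat: look up b's bit in a's vector

# Toy graph with two hubs (vertices 0 and 5); not data from the paper.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (4, 5), (5, 1), (5, 2)]
labels = build_labels(6, edges, threshold=4)

print(adjacent(labels[0], labels[5]))
print(adjacent(labels[3], labels[4]))
```

In a graph with a power-law degree distribution only a few vertices are fat, so the bit vectors stay short while thin vertices keep short neighbour lists; choosing the threshold balances the two label sizes.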