Source author record

Damien Lefortier

Damien Lefortier appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Information Retrieval · physics.soc-ph · Social and Information Networks

Catalog footprint

What is connected

2 works

3 topics

3 close collaborators


Preprint · 2013 · arXiv

Evolution of the Media Web

We present a detailed study of the part of the Web related to media content, i.e., the Media Web. Using publicly available data, we analyze the evolution of incoming and outgoing links to and from media pages. Based on our observations, we propose a new class of models for the appearance of new media content on the Web, in which different attractiveness functions of nodes are possible, including ones taken from well-known preferential attachment and fitness models. We analyze these models theoretically and empirically and show which ones realistically predict both the incoming degree distribution and the so-called recency property of the Media Web, something that existing models did not do well. Finally, we compare these models by estimating the likelihood of the real-world link graph from our data set given each model, and find that the models we introduce are significantly more likely than previously proposed ones. One of the most surprising results is that, in the Media Web, the probability of a post being cited is most likely determined by its quality rather than by its current popularity.
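To make the idea of an attractiveness function concrete, here is a minimal simulation sketch. It is not the authors' model: the abstract does not give the functional form, so the product of a quality factor, an optional in-degree factor and an exponential recency decay is an assumption, as are the names `attractiveness`, `grow`, `tau` and `links_per_page`.

```python
import math
import random


def attractiveness(quality, in_degree, age, tau=10.0,
                   use_quality=True, use_degree=False):
    """Hypothetical attractiveness of an existing page.

    Assumed form: recency decay exp(-age / tau), optionally multiplied by
    the page's quality (fitness-style) and/or its in-degree
    (preferential-attachment-style).
    """
    score = math.exp(-age / tau)
    if use_quality:
        score *= quality
    if use_degree:
        score *= in_degree + 1  # +1 so uncited pages can still be chosen
    return score


def grow(n_pages, links_per_page=3, seed=0, **kwargs):
    """Add pages one at a time; each new page cites existing pages
    with probability proportional to their attractiveness."""
    rng = random.Random(seed)
    quality, in_degree = [], []
    for t in range(n_pages):
        if t > 0:
            weights = [
                attractiveness(quality[i], in_degree[i], t - i, **kwargs)
                for i in range(t)
            ]
            for target in rng.choices(range(t), weights=weights,
                                      k=links_per_page):
                in_degree[target] += 1
        quality.append(rng.random() + 0.01)  # strictly positive quality
        in_degree.append(0)
    return quality, in_degree


if __name__ == "__main__":
    q, d = grow(200)                      # quality-driven variant
    print("max in-degree:", max(d))
```

Switching `use_quality` and `use_degree` gives a rough way to contrast quality-driven and popularity-driven citation within the same toy setup.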

Preprint · 2013 · arXiv

Timely crawling of high-quality ephemeral new content

Nowadays, more and more people use the Web as their primary source of up-to-date information. In this context, fast crawling and indexing of newly created Web pages has become crucial for search engines, especially because user traffic to a significant fraction of these new pages (such as news, blog and forum posts) grows very quickly right after they appear, but lasts only for several days. In this paper, we study the problem of finding and crawling such ephemeral (in terms of user interest) new pages in a timely manner. Traditional crawling policies do not give any particular priority to such pages and may therefore crawl them too slowly, or even crawl content that is already obsolete. We thus propose a new metric, designed for this task, which takes into account the decrease of user interest in ephemeral pages over time. We show that most ephemeral new pages can be found at a relatively small set of content sources and present a procedure for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties here is to divide resources between these two activities efficiently. We find the adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler insight into current user demand. Finally, we demonstrate the efficiency of our approach experimentally on real-world data.
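The notion of a metric that accounts for decaying user interest can be illustrated with a small sketch. The abstract does not define the metric, so the exponential decay used here is an assumption, and `remaining_profit`, `pick_next` and the page fields `interest`, `decay` and `created` are hypothetical names. The sketch covers only the choice among already discovered pages, not the balance between crawls and recrawls.

```python
import math


def remaining_profit(total_interest, decay_rate, delay):
    """Interest still to be captured if a page is crawled `delay` time
    units after its creation, assuming exponential decay of interest."""
    return total_interest * math.exp(-decay_rate * delay)


def pick_next(pages, now):
    """Greedy choice: the discovered page with the largest remaining profit."""
    return max(
        pages,
        key=lambda p: remaining_profit(p["interest"], p["decay"],
                                       now - p["created"]),
    )


if __name__ == "__main__":
    pages = [
        # Old page: high total interest, but most of it has already passed.
        {"url": "a", "interest": 100.0, "decay": 1.0, "created": 0.0},
        # Fresh page: lower total interest, most of it still to come.
        {"url": "b", "interest": 10.0, "decay": 1.0, "created": 9.0},
    ]
    print(pick_next(pages, now=10.0)["url"])
```

Under this assumed decay, a fresh page with modest interest can outrank an older page with much higher total interest, which is the kind of prioritization the abstract argues traditional policies lack.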
