Source author record

Josep Silva

Josep Silva appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Retrieval Software Engineering Programming Languages Formal Languages and Automata Theory Logic in Computer Science Networking and Internet Architecture

Catalog footprint

What is connected

8works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Benchmark Suite for Template Detection and Content Extraction

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their objective is different. While template detection identifies the template of a webpage (usually comparing with other webpages of the same website), content extraction identifies the main content of the webpage discarding the other part. Therefore, they are somehow complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks because templates usually contain irrelevant information such as advertisements, menus and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitable (and automatically) compared.

preprint2015arXiv

Web Template Extraction Based on Hyperlink Analysis

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugin content into already formatted and prepared pagelets. For the final user templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using menus information. Our implementation and experiments demonstrate the usefulness of the technique.

preprint2014arXiv

Automatic Detection of Webpages that Share the Same Web Template

Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpages development, content extraction, block detection, and webpages indexing. One of the main goals of template extraction is identifying a set of webpages with the same template without having to load and analyze too many webpages prior to identifying the template. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. This set is computed with an hyperlink analysis that computes a very small set with a high level of confidence.

preprint2014arXiv

Transforming while/do/for/foreach-Loops into Recursive Methods

In software engineering, taking a good election between recursion and iteration is essential because their efficiency and maintenance are different. In fact, developers often need to transform iteration into recursion (e.g., in debugging, to decompose the call graph into iterations); thus, it is quite surprising that there does not exist a public transformation from loops to recursion that handles all kinds of loops. This article describes a transformation able to transform iterative loops into equivalent recursive methods. The transformation is described for the programming language Java, but it is general enough as to be adapted to many other languages that allow iteration and recursion. We describe the changes needed to transform loops of types while/do/for/foreach into recursion. Each kind of loop requires a particular treatment that is described and exemplified.

preprint2013arXiv

Proceedings 9th International Workshop on Automated Specification and Verification of Web Systems

This volume contains the accepted papers of the 9th International Workshop on Automated Specification and Verification of Web Systems (WWV'13), which took place in Florence, Italy, on June 6, as a satellite event of the 8th International Federated Conferences on Distributed Computing Techniques (DisCoTec 2013).

preprint2012arXiv

Proceedings 8th International Workshop on Automated Specification and Verification of Web Systems

This volume contains the final and revised versions of the papers presented at the 8th International Workshop on Automated Specification and Verification of Web Systems (WWV 2012). The workshop was held in Stockholm, Sweden, on June 16, 2012, as part of DisCoTec 2012. WWV is a yearly workshop that aims at providing an interdisciplinary forum to facilitate the cross-fertilization and the advancement of hybrid methods that exploit concepts and tools drawn from Rule-based programming, Software engineering, Formal methods and Web-oriented research. WWV has a reputation for being a lively, friendly forum for presenting and discussing work in progress. The proceedings have been produced after the symposium to allow the authors to incorporate the feedback gathered during the event in the published papers. All papers submitted to the workshop were reviewed by at least three Program Committee members or external referees. The Program Committee held an electronic discussion leading to the acceptance of all papers for presentation at the workshop. In addition to the presentation of the contributed papers, the scientific programme included the invited talks by two outstanding speakers: Rocco De Nicola (IMT, Institute for Advanced Studies Lucca, Italy) and Josè Luiz Fiadeiro (Royal Holloway, United Kingdom).

preprint2012arXiv

Using the DOM Tree for Content Extraction

The main information of a webpage is usually mixed between menus, advertisements, panels, and other not necessarily related information; and it is often difficult to automatically isolate this information. This is precisely the objective of content extraction, a research area of widely interest due to its many applications. Content extraction is useful not only for the final human user, but it is also frequently used as a preprocessing stage of different systems that need to extract the main content in a web document to avoid the treatment and processing of other useless information. Other interesting application where content extraction is particularly used is displaying webpages in small screens such as mobile phones or PDAs. In this work we present a new technique for content extraction that uses the DOM tree of the webpage to analyze the hierarchical relations of the elements in the webpage. Thanks to this information, the technique achieves a considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us a very precise information regarding the related components in a block, thus, producing very cohesive blocks.

preprint2011arXiv

Optimal Divide and Query (extended version)

Algorithmic debugging is a semi-automatic debugging technique that allows the programmer to precisely identify the location of bugs without the need to inspect the source code. The technique has been successfully adapted to all paradigms and mature implementations have been released for languages such as Haskell, Prolog or Java. During three decades, the algorithm introduced by Shapiro and later improved by Hirunkitti has been thought optimal. In this paper we first show that this algorithm is not optimal, and moreover, in some situations it is unable to find all possible solutions, thus it is incomplete. Then, we present a new version of the algorithm that is proven optimal, and we introduce some equations that allow the algorithm to identify all optimal solutions.

Josep Silva

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

A Benchmark Suite for Template Detection and Content Extraction

Web Template Extraction Based on Hyperlink Analysis

Automatic Detection of Webpages that Share the Same Web Template

Transforming while/do/for/foreach-Loops into Recursive Methods

Proceedings 9th International Workshop on Automated Specification and Verification of Web Systems

Proceedings 8th International Workshop on Automated Specification and Verification of Web Systems

Using the DOM Tree for Content Extraction

Optimal Divide and Query (extended version)