Source author record

Raffaele Giancarlo

Raffaele Giancarlo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning; Computational Engineering, Finance, and Science; Data Structures and Algorithms; Databases; Distributed, Parallel, and Cluster Computing; Genomics; Information Retrieval; Neural and Evolutionary Computing

Catalog footprint

What is connected

4 works

8 topics

4 close collaborators


Preprint, 2022, arXiv

On the Suitability of Neural Networks as Building Blocks for The Design of Efficient Learned Indexes

With the aim of obtaining time/space improvements in classic Data Structures, an emerging trend is to combine Machine Learning techniques with those proper to Data Structures. This new area goes under the name of Learned Data Structures. The motivation for its study is a perceived change of paradigm in Computer Architectures that would favour the use of Graphics Processing Units and Tensor Processing Units over conventional Central Processing Units. In turn, that would favour the use of Neural Networks as building blocks of Classic Data Structures. Indeed, Learned Bloom Filters, which are one of the main pillars of Learned Data Structures, make extensive use of Neural Networks to improve the performance of classic Filters. However, no use of Neural Networks is reported in the realm of Learned Indexes, which is another main pillar of that new area. In this contribution, we provide the first, and much needed, comparative experimental analysis regarding the use of Neural Networks as building blocks of Learned Indexes. The results reported here highlight the need for the design of very specialized Neural Networks tailored to Learned Indexes, and they establish a solid ground for those developments. Our findings, which are methodologically important, are of interest to both Scientists and Engineers working in Neural Networks Design and Implementation, also in view of the importance of the application areas involved, e.g., Computer Networks and Databases.

Preprint, 2020, arXiv

FASTA/Q Data Compressors for MapReduce-Hadoop Genomics: Space and Time Savings Made Easy -- Version 1

Motivation: Storage of genomic data is a major cost for the Life Sciences, effectively addressed mostly via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic. Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in major cost savings, i.e., on large plant genomes, 30% fewer HDFS data blocks (one block = 128 MB), a speed-up of at least 1.5x in I/O time, and comparable or reduced network communication time with respect to the use of generic compressors available in Hadoop. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.
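The block savings quoted in this abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the two file sizes are hypothetical, `hdfs_blocks` is a helper name introduced here, and one block is assumed to be 128 x 2^20 bytes.

```python
import math

# Assumption: "128MB" is read as 128 MiB, expressed in bytes.
BLOCK_SIZE = 128 * 2**20

def hdfs_blocks(size_bytes, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks needed to hold a file of size_bytes."""
    return math.ceil(size_bytes / block_size)

# Hypothetical compressed sizes of the same genome file (not from the paper).
generic_size = 10 * 2**30        # output of a generic compressor: 10 GiB
specialized_size = 7168 * 2**20  # output of a specialized compressor: 7 GiB

generic_blocks = hdfs_blocks(generic_size)
specialized_blocks = hdfs_blocks(specialized_size)

# Fraction of blocks saved by the specialized compressor.
saving = 1 - specialized_blocks / generic_blocks
print(generic_blocks, specialized_blocks, round(saving, 2))
```

With these made-up sizes, the specialized output occupies 56 blocks instead of 80, which corresponds to the 30% reduction mentioned in the abstract.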

Preprint, 2020, arXiv

Learning from Data to Speed-up Sorted Table Search Procedures: Methodology and Practical Guidelines

Sorted Table Search Procedures are the quintessential query-answering tool, with widespread usage that now also includes Web Applications, e.g., Search Engines (Google Chrome) and ad Bidding Systems (AppNexus). Speeding them up, at very little cost in space, is still a quite significant achievement. Here we study to what extent Machine Learning Techniques can contribute to obtaining such a speed-up via a systematic experimental comparison of known efficient implementations of Sorted Table Search procedures, with different Data Layouts, and their Learned counterparts developed here. We characterize the scenarios in which the latter can be profitably used with respect to the former, accounting for both CPU and GPU computing. Our approach also contributes to the study of Learned Data Structures, a recent proposal to improve the time/space performance of fundamental Data Structures, e.g., B-trees, Hash Tables, Bloom Filters. Indeed, we also formalize an Algorithmic Paradigm of Learned Dichotomic Sorted Table Search procedures that naturally complements the Learned one proposed here and that characterizes most of the known Sorted Table Search Procedures as having a "learning phase" that approximates Simple Linear Regression.
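To illustrate the general idea of a learned sorted table search, here is a minimal sketch, not the authors' implementation. It assumes a simple linear regression from key to position, records the largest prediction error seen on the table, and finishes with a binary search restricted to that error window. The names `fit_linear` and `learned_search` are introduced here for illustration.

```python
from bisect import bisect_left

def fit_linear(keys):
    """Least-squares fit of position ~ slope * key + intercept on a sorted table.

    Returns (slope, intercept, err), where err bounds the distance between
    the truncated prediction and the true position for every key in the table.
    """
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    max_err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, int(max_err) + 1  # +1 absorbs truncation of the guess

def learned_search(keys, model, q):
    """Return an index of q in keys, or -1 if q is absent."""
    slope, intercept, err = model
    n = len(keys)
    guess = int(slope * q + intercept)
    # Clamp the search window [lo, hi) to the table.
    lo = max(0, min(n, guess - err))
    hi = max(lo, min(n, guess + err + 1))
    i = bisect_left(keys, q, lo, hi)
    return i if i < n and keys[i] == q else -1

table = [3, 8, 15, 16, 23, 42, 57, 91]
model = fit_linear(table)
print(learned_search(table, model, 23), learned_search(table, model, 20))
```

The better the model fits the key distribution, the smaller the error window and the fewer comparisons the final binary search needs. On poorly fitting data the window can grow to the whole table, in which case the procedure degrades to plain binary search.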

Preprint, 2012, arXiv

The Chromatin Organization of an Eukaryotic Genome: Sequence Specific + Statistical = Combinatorial (Extended Abstract)

Nucleosome organization in eukaryotic genomes has a deep impact on gene function. Although progress has been recently made in the identification of various concurring factors influencing nucleosome positioning, it is still unclear whether nucleosome positions are sequence dictated or determined by a random process. It has been postulated for a long time that, in the proximity of TSS, a barrier determines the position of the +1 nucleosome and then geometric constraints alter the random positioning process, determining nucleosomal phasing. Such a pattern fades out as one moves away from the barrier to become again a random positioning process. Although this statistical model is widely accepted, the molecular nature of the barrier is still unknown. Moreover, we are far from the identification of a set of sequence rules able: to account for the genome-wide nucleosome organization; to explain the nature of the barriers on which the statistical mechanism hinges; to allow for a smooth transition from sequence-dictated to statistical positioning and back. We show that sequence complexity, quantified via various methods, can be the rule able to at least partially account for all the above. In particular, we have conducted our analyses on four high-resolution nucleosomal maps of the model eukaryotes and found that nucleosome depleted regions can be well distinguished from nucleosome enriched regions by sequence complexity measures. Specifically, (a) the depleted regions are less complex than the enriched ones, and (b) around TSS, complexity measures alone are in striking agreement with in vivo nucleosome occupancy, precisely indicating the positions of the +1 and -1 nucleosomes. Those findings indicate that the intrinsic richness of subsequences within sequences plays a role in nucleosomal formation in genomes, and that sequence complexity constitutes the molecular nature of the nucleosome barrier.
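The abstract does not say which complexity measures were used. As a rough illustration of what "richness of subsequences" can mean, the sketch below computes one simple and commonly used proxy: the fraction of distinct k-mers in a sequence relative to the maximum possible. The function name `distinct_kmer_ratio`, the four-letter DNA alphabet, and the toy sequences are assumptions made here, not taken from the paper.

```python
def distinct_kmer_ratio(seq, k, alphabet_size=4):
    """Distinct k-mers observed in seq, divided by the most that could occur.

    The maximum is limited both by the number of k-length windows in seq
    and by the number of possible k-mers over the alphabet.
    """
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    observed = len({seq[i:i + k] for i in range(windows)})
    possible = min(alphabet_size ** k, windows)
    return observed / possible

# Toy sequences (hypothetical): a homopolymer run versus a more varied string.
low = distinct_kmer_ratio("AAAAAAAA", 2)
high = distinct_kmer_ratio("ACGTTGCA", 2)
print(low, high)
```

A repetitive sequence such as a homopolymer run scores close to 0, while a sequence in which every window is a different k-mer scores 1. Scores of this kind could then be compared between regions, in the spirit of the depleted versus enriched comparison the abstract describes.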