Source author record

Dan Graur

Dan Graur appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Populations and Evolution; Distributed, Parallel, and Cluster Computing; Genomics; Machine Learning; Quantitative Methods

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

preprint · 2024 · arXiv

tf.data service: A Case for Disaggregating ML Input Data Processing

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale-out to right-size CPU/RAM host resources for data processing in each job, saving 32x training time and 26x cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.

preprint · 2016 · arXiv

Rubbish DNA: The functionless fraction of the human genome

Because genomes are products of natural processes rather than intelligent design, all genomes contain functional and nonfunctional parts. The fraction of the genome that has no biological function is called rubbish DNA. Rubbish DNA consists of junk DNA, i.e., the fraction of the genome on which selection does not operate, and garbage DNA, i.e., sequences that lower the fitness of the organism, but exist in the genome because purifying selection is neither omnipotent nor instantaneous. In this chapter, I (1) review the concepts of genomic function and functionlessness from an evolutionary perspective, (2) present a precise nomenclature of genomic function, (3) discuss the evidence for the existence of vast quantities of junk DNA within the human genome, (4) discuss the mutational mechanisms responsible for generating junk DNA, (5) spell out the necessary evolutionary conditions for maintaining junk DNA, (6) outline various methodologies for estimating the functional fraction within the genome, and (7) present a recent estimate for the functional fraction of our genome.

preprint · 2015 · arXiv

A scale-free method for testing the proportionality of branch lengths between two phylogenetic trees

We introduce a scale-free method for testing the proportionality of branch lengths between two phylogenetic trees that have the same topology and contain the same set of taxa. This method scales both trees to a total length of 1 and sums up the differences for each branch. Compared to previous methods, ours yields a fully symmetrical score that measures proportionality without being affected by scale. We call this score the normalized tree distance (NTD). Based on real data, we demonstrate that NTD scores are distributed unimodally, in a manner similar to a lognormal distribution. The NTD score can be used to, for example, detect co-evolutionary processes and measure the accuracy of branch length estimates.
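The procedure described in this abstract (scale each tree to a total length of 1, then sum the per-branch differences) can be illustrated with a minimal Python sketch. It assumes that both trees are given as lists of branch lengths in the same branch order, and that "differences" means absolute differences. The function name `normalized_tree_distance` is hypothetical, and the published NTD may apply an additional constant factor that the abstract does not state.

```python
from typing import Sequence


def normalized_tree_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute per-branch differences after scaling each tree to length 1.

    Assumption: a[i] and b[i] are the lengths of the same branch in two trees
    with identical topology and taxa. This is a sketch of the idea in the
    abstract, not necessarily the exact published formula.
    """
    if len(a) != len(b):
        raise ValueError("both trees must have the same number of branches")
    total_a, total_b = sum(a), sum(b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("total branch length must be positive")
    # Dividing by the total length removes any dependence on overall scale.
    return sum(abs(x / total_a - y / total_b) for x, y in zip(a, b))


# Proportional trees differ only by scale, so the score is zero.
proportional = normalized_tree_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Trees with a different distribution of branch lengths give a positive score.
different = normalized_tree_distance([1.0, 1.0, 2.0], [2.0, 1.0, 1.0])
```

Because each tree is normalized by its own total length, swapping the two arguments leaves the score unchanged, which matches the symmetry the abstract describes.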

preprint · 2014 · arXiv

An extended reply to Mendez et al.: The 'extremely ancient' chromosome that still isn't

Earlier this year, we published a scathing critique of a paper by Mendez et al. (2013) in which the claim was made that a Y chromosome was 237,000-581,000 years old. Elhaik et al. (2014) also attacked a popular article in Scientific American by the senior author of Mendez et al. (2013), whose title was "Sex with other human species might have been the secret of Homo sapiens's [sic] success" (Hammer 2013). Five of the 11 authors of Mendez et al. (2013) have now written a "rebuttal," and we were allowed to reply. Unfortunately, our reply was censored for being "too sarcastic and inflamed." References were removed, meanings were castrated, and a dedication in the Acknowledgments was deleted. Now that the so-called rebuttal by 45% of the authors of Mendez et al. (2013) has been published together with our vasectomized reply, we decided to make public our entire reply to the so-called "rebuttal." In fact, we go one step further, and publish a version of the reply that has not even been self-censored.