Topic overview

Genomics

360 works · 1652 researchers


preprint · 2016 · arXiv

Discovery of cancer common and specific driver gene sets

Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step to understand molecular mechanisms of carcinogenesis. However, systematically investigating commonalities and specificities of driver gene sets among multiple cancer types is still a great challenge, but this investigation will undoubtedly benefit deciphering cancers and will be helpful for personalized therapy and precision medicine in cancer treatment. In this study, we propose two optimization models to \emph{de novo} discover common driver gene sets among multiple cancer types (ComMDP) and specific driver gene sets of one certain or multiple cancer types to other cancers (SpeMDP), respectively. We first apply ComMDP and SpeMDP to simulated data to validate their efficiency. Then, we further apply these methods to 12 cancer types from The Cancer Genome Atlas (TCGA) and obtain several biologically meaningful driver pathways. As examples, we construct a common cancer pathway model for BRCA and OV, infer a complex driver pathway model for BRCA carcinogenesis based on common driver gene sets of BRCA with eight cancer types, and investigate s

preprint · 2017 · arXiv

A Computational Approach to Finding RNA Tertiary Motifs in Genomic Sequences

Motif finding in DNA, RNA and proteins plays an important role in life science research. Recent patents concerning motif finding in the biomolecular data are recorded in the DNA Patent Database which serves as a resource for policy makers and members of the general public interested in fields like genomics, genetics and biotechnology. In this paper we present a computational approach to mining for RNA tertiary motifs in genomic sequences. Specifically we describe a method, named CSminer, for finding RNA coaxial helical stackings in genomes. A coaxial helical stacking occurs in an RNA tertiary structure where two separate helical elements form a pseudocontiguous helix and provides thermodynamic stability to the molecule as a whole. Experimental results demonstrate the effectiveness of our approach.

preprint · 2017 · arXiv

HSEARCH: fast and accurate protein sequence motif search and clustering

Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching and clustering protein sequence motifs is computationally intensive. Most existing methods are either not fast enough to analyze large data sets for motif finding or achieve low accuracy for motif clustering. We present a new protein sequence motif finding and clustering algorithm, called HSEARCH. It converts fixed-length protein sequences to data points in a high-dimensional space and applies locality-sensitive hashing to quickly search for protein sequences homologous to a motif. HSEARCH is significantly faster than the brute-force algorithm for protein motif finding and achieves high accuracy for protein motif clustering.
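As a rough illustration of the locality-sensitive hashing idea in this abstract, the sketch below uses bit-sampling LSH for Hamming distance on fixed-length sequences. This is a swapped-in textbook scheme, not HSEARCH's own encoding or hash family; the function names, table count, sample size, and mismatch threshold are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_tables(seqs, num_tables=8, sample_size=4, seed=0):
    """Index equal-length sequences with bit-sampling LSH (hypothetical parameters)."""
    rng = random.Random(seed)
    length = len(seqs[0])
    tables = []
    for _ in range(num_tables):
        # Each table hashes a sequence by the residues at a few random positions.
        positions = sorted(rng.sample(range(length), sample_size))
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            buckets[tuple(s[p] for p in positions)].append(idx)
        tables.append((positions, buckets))
    return tables


def query(tables, seqs, q, max_mismatches=2):
    """Return indices of indexed sequences within max_mismatches of q."""
    candidates = set()
    for positions, buckets in tables:
        key = tuple(q[p] for p in positions)
        candidates.update(buckets.get(key, []))
    # Verify candidates exactly; LSH only narrows the search.
    return sorted(
        i for i in candidates
        if sum(a != b for a, b in zip(seqs[i], q)) <= max_mismatches
    )


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
tables = build_tables(seqs)
hits = query(tables, seqs, "ACDEFGHIKL")
```

Near-identical sequences collide in at least one table with high probability, so only a small candidate set needs exact comparison; an exact match always collides in every table.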

preprint · 2005 · arXiv

Impact of Tandem Repeats on the Scaling of Nucleotide Sequences

Techniques such as detrended fluctuation analysis (DFA) and its extensions have been widely used to determine the nature of scaling in nucleotide sequences. In this brief communication we show that tandem repeats which are ubiquitous in nucleotide sequences can prevent reliable estimation of possible long-range correlations. Therefore, it is important to investigate the presence of tandem repeats prior to scaling exponent estimation.
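For readers unfamiliar with DFA, here is a minimal sketch of the fluctuation function it computes, assuming the common purine/pyrimidine mapping (A, G to +1; C, T to -1) and linear detrending in non-overlapping windows. The mapping, the function name, and the window handling are assumptions for illustration and are not taken from the paper.

```python
import math


def dfa_fluctuation(seq, window):
    """RMS fluctuation F(window) of the detrended DNA walk (illustrative sketch)."""
    # Assumed mapping: purines step up, pyrimidines step down.
    steps = [1.0 if base in "AG" else -1.0 for base in seq]
    mean = sum(steps) / len(steps)

    # Integrated, mean-subtracted profile (the "DNA walk").
    profile, total = [], 0.0
    for s in steps:
        total += s - mean
        profile.append(total)

    n_windows = len(profile) // window
    if window < 2 or n_windows == 0:
        raise ValueError("window must be >= 2 and no longer than the sequence")

    xs = range(window)
    x_mean = (window - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in xs)

    squared = 0.0
    for w in range(n_windows):
        ys = profile[w * window:(w + 1) * window]
        y_mean = sum(ys) / window
        # Least-squares line through this window, then sum squared residuals.
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        squared += sum(
            (y - (y_mean + slope * (x - x_mean))) ** 2 for x, y in zip(xs, ys)
        )
    return math.sqrt(squared / (n_windows * window))
```

The scaling exponent is usually estimated as the slope of log F(n) against log n over a range of window sizes n, which is the estimate the abstract warns can be unreliable when tandem repeats are present.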

preprint · 2008 · arXiv

Towards Understanding the Origin of Genetic Languages

Molecular biology is a nanotechnology that works--it has worked for billions of years and in an amazing variety of circumstances. At its core is a system for acquiring, processing and communicating information that is universal, from viruses and bacteria to human beings. Advances in genetics and experience in designing computers have taken us to a stage where we can understand the optimisation principles at the root of this system, from the availability of basic building blocks to the execution of tasks. The languages of DNA and proteins are argued to be the optimal solutions to the information processing tasks they carry out. The analysis also suggests simpler predecessors to these languages, and provides fascinating clues about their origin. Obviously, a comprehensive unraveling of the puzzle of life would have a lot to say about what we may design or convert ourselves into.

preprint · 2016 · arXiv

Info-Clustering: A Mathematical Theory for Data Clustering

We formulate an info-clustering paradigm based on a multivariate information measure, called multivariate mutual information, that naturally extends Shannon's mutual information between two random variables to the multivariate case involving more than two random variables. With proper model reductions, we show that the paradigm can be applied to study the human genome and connectome in a more meaningful way than the conventional algorithmic approach. Not only can info-clustering provide justifications and refinements to some existing techniques, but it also inspires new computationally feasible solutions.

preprint · 2016 · arXiv

DeepCancer: Detecting Cancer through Gene Expressions via Deep Generative Learning

Transcriptional profiling on microarrays to obtain gene expressions has been used to facilitate cancer diagnosis. We propose a deep generative machine learning architecture (called DeepCancer) that learns features from unlabeled microarray data. These models have been used in conjunction with conventional classifiers that classify tissue samples as either cancerous or non-cancerous. The proposed model has been tested on two different clinical datasets. The evaluation demonstrates that the DeepCancer model achieves a very high precision score, while significantly controlling the false positive and false negative scores.

preprint · 2016 · arXiv

Large scale modeling of antimicrobial resistance with interpretable classifiers

Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of treatment plans tailored for specific individuals, likely resulting in better clinical outcomes for patients with bacterial infections. In this work, we present the recent work of Drouin et al. (2016) on using Set Covering Machines to learn highly interpretable models of antibiotic resistance and complement it by providing a large scale application of their method to the entire PATRIC database. We report prediction results for 36 new datasets and present the Kover AMR platform, a new web-based tool allowing the visualization and interpretation of the generated models.

preprint · 2016 · arXiv

A Noise-Filtering Approach for Cancer Drug Sensitivity Prediction

Accurately predicting drug responses in cancer is an important open problem that hinders oncologists' efforts to find the most effective drugs to treat cancer, which is a core goal in precision medicine. The scientific community has focused on improving this prediction based on genomic, epigenomic, and proteomic datasets measured in human cancer cell lines. Real-world cancer cell lines contain noise, which degrades the performance of machine learning algorithms. This problem is rarely addressed in the existing approaches. In this paper, we present a noise-filtering approach that integrates techniques from numerical linear algebra and information retrieval targeted at filtering out noisy cancer cell lines. By filtering out noisy cancer cell lines, we can train machine learning algorithms on better quality cancer cell lines. We evaluate the performance of our approach and compare it with an existing approach using the Area Under the ROC Curve (AUC) on clinical trial data. The experimental results show that our proposed approach is stable and also yields the highest AUC at a statistically significant level.

preprint · 2016 · arXiv

Gene Ontology: Pitfalls, Biases, Remedies

The Gene Ontology (GO) is a formidable resource but there are several considerations about it that are essential to understand the data and interpret it correctly. The GO is sufficiently simple that it can be used without deep understanding of its structure or how it is developed, which is both a strength and a weakness. In this chapter, we discuss some common misinterpretations of the ontology and the annotations. A better understanding of the pitfalls and the biases in the GO should help users make the most of this very rich resource. We also review some of the misconceptions and misleading assumptions commonly made about GO, including the effect of data incompleteness, the importance of annotation qualifiers, and the transitivity or lack thereof associated with different ontology relations. We also discuss several biases that can confound aggregate analyses such as gene enrichment analyses. For each of these pitfalls and biases, we suggest remedies and best practices.

preprint · 2016 · arXiv

Primer on the Gene Ontology

The Gene Ontology (GO) project is the largest resource for cataloguing gene function. The combination of solid conceptual underpinnings and a practical set of features have made the GO a widely adopted resource in the research community and an essential resource for data analysis. In this chapter, we provide a concise primer for all users of the GO. We briefly introduce the structure of the ontology and explain how to interpret annotations associated with the GO.

preprint · 2016 · arXiv

Multi-stage Clustering of Breast Cancer for Precision Medicine

Cancer has become one of the most widespread diseases in the world. Specifically, breast cancer is diagnosed more often than any other type of cancer. However, breast cancer patients and their individual tumors are often unique. Identifying the underlying genetic phenotype can lead to precision (personalized) medicine. Tailoring medical treatment strategies to best fit the needs of individual patients can dramatically improve their health. Such an approach requires sufficient knowledge of the patients and the diseases, which is currently unavailable to practitioners. This study focuses on breast cancer and proposes a novel two-stage clustering method to partition patients into hierarchical groups. The first stage is broad grouping, which is based on phenotypes such as demographic information and clinical features. The second stage is fine grouping based on genomic characteristics, such as copy number variation and somatic mutation, of patients in a subgroup resulting from the first stage. Generally, this framework offers a mechanism to mix multiple forms of data, both phenotypic and genomic, to most effectively define individual patients for personalized predictions. This method pr

preprint · 2016 · arXiv

Critical dynamics of gene networks is a mechanism behind ageing and Gompertz law

Although accumulation of molecular damage is suggested to be an important molecular mechanism of aging, a quantitative link between the dynamics of damage accumulation and mortality of species has so far remained elusive. To address this question, we examine stability properties of a generic gene regulatory network (GRN) and demonstrate that many characteristics of aging and the associated population mortality rate emerge as inherent properties of the critical dynamics of gene regulation and metabolic levels. Based on the analysis of age-dependent changes in gene-expression and metabolic profiles in Drosophila melanogaster, we explicitly show that the underlying GRNs are nearly critical and inherently unstable. This instability manifests itself as aging in the form of distortion of gene expression and metabolic profiles with age, and causes the characteristic increase in mortality rate with age as described by a form of the Gompertz law. In addition, we explain late-life mortality deceleration observed at very late ages for large populations. We show that aging contains a stochastic component, related to accumulation of regulatory errors in transcription/translation/metabolic pathw

preprint · 2015 · arXiv

Capacity and Expressiveness of Genomic Tandem Duplication

The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat, that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibility of generating a large number of sequences from a \textit{seed}, i.e.\ a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include \textit{exact capacity} values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the \textit{expressiveness} of a tandem duplication system as the capability of expressing arbitrary substrings. We then \textit{completely} characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, presenting
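The tandem duplication operation in this abstract can be made concrete with a short sketch. The example reproduces the abstract's own case (AGTCTGC to AGTCTGTGC by a duplication of length 2); the function names and the breadth-first enumeration of reachable strings are illustrative assumptions, not the paper's capacity computation.

```python
def tandem_duplicate(seq, start, length):
    """Return seq with the block seq[start:start+length] repeated right after itself."""
    end = start + length
    if length <= 0 or start < 0 or end > len(seq):
        raise ValueError("duplicated block must lie inside the sequence")
    return seq[:end] + seq[start:end] + seq[end:]


def reachable(seed, max_dup_len, steps):
    """Strings obtainable from seed in at most `steps` duplications of bounded length."""
    seen = {seed}
    frontier = {seed}
    for _ in range(steps):
        nxt = set()
        for s in frontier:
            for k in range(1, max_dup_len + 1):
                for i in range(len(s) - k + 1):
                    nxt.add(tandem_duplicate(s, i, k))
        frontier = nxt - seen
        seen |= frontier
    return seen


example = tandem_duplicate("AGTCTGC", 4, 2)  # duplicates the block "TG"
```

Counting how fast `reachable` grows with the number of steps gives an informal feel for the generating power that the capacity notion quantifies.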

preprint · 2014 · arXiv

Genetic Sequence Matching Using D4M Big Data Approaches

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Accumulo database to accelerate computations one-hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.

preprint · 2016 · arXiv

Fast low-level pattern matching algorithm

This paper focuses on pattern matching in DNA sequences. It was inspired by a previously reported method that proposes encoding both pattern and sequence using prime numbers. Although fast, that method is limited to rather small pattern lengths because of a computing precision problem. Our approach successfully deals with large patterns because our implementation uses modular arithmetic. To obtain results quickly, the code was adapted for multithreading and parallel implementations. The method is reduced to assembly-language-level instructions, so the final result shows significant time and memory savings compared to the reference algorithm.
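To show why modular arithmetic removes the precision limit on pattern length, the sketch below uses the classic Rabin-Karp rolling hash. This is a stand-in technique, not the paper's prime-number encoding or its assembly-level implementation; the base-4 nucleotide code and the modulus are assumptions chosen for the example.

```python
def find_pattern(text, pattern, base=4, mod=(1 << 61) - 1):
    """Return all start indices of pattern in text using a rolling hash mod `mod`."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed nucleotide encoding
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, mod)  # weight of the leading symbol in a window
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + code[pattern[i]]) % mod
        ht = (ht * base + code[text[i]]) % mod

    hits = []
    for i in range(n - m + 1):
        # Hash equality is only a filter; confirm with a direct comparison.
        if hp == ht and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Slide the window: drop text[i], append text[i + m].
            ht = ((ht - code[text[i]] * high) * base + code[text[i + m]]) % mod
    return hits
```

Because every intermediate value is reduced modulo a fixed number, the hash stays bounded regardless of how long the pattern is.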

preprint · 2011 · arXiv

Dependency detection with similarity constraints

Unsupervised two-view learning, or detection of dependencies between two paired data sets, is typically done by some variant of canonical correlation analysis (CCA). CCA searches for a linear projection for each view, such that the correlations between the projections are maximized. The solution is invariant to any linear transformation of either or both of the views; for tasks with small sample size such flexibility implies overfitting, which is even worse for more flexible nonparametric or kernel-based dependency discovery methods. We develop variants which reduce the degrees of freedom by assuming constraints on similarity of the projections in the two views. A particular example is provided by a cancer gene discovery application where chromosomal distance affects the dependencies between gene copy number and activity levels. Similarity constraints are shown to improve detection performance of known cancer genes.

preprint · 2016 · arXiv

Duplication Distance to the Root for Binary Sequences

We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=Θ(n)$. For the case of approximate duplication, where a $β$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $β=1/2$. We also study the duplication distance to the root for sequences with a given root and f
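The notion of a root can be illustrated by repeatedly undoing tandem duplications until no repeated block remains. The sketch below is a greedy procedure written for illustration: the step count it returns is only an upper bound on the duplication distance, not the minimum the paper studies, and the function names are assumptions.

```python
def dedup_once(seq):
    """Collapse the first tandem repeat xx (shortest block first) into x, or return None."""
    n = len(seq)
    for k in range(1, n // 2 + 1):
        for i in range(n - 2 * k + 1):
            if seq[i:i + k] == seq[i + k:i + 2 * k]:
                return seq[:i + k] + seq[i + 2 * k:]
    return None


def root(seq):
    """Return (square-free root, number of greedy deduplication steps used)."""
    steps = 0
    while True:
        shorter = dedup_once(seq)
        if shorter is None:
            return seq, steps
        seq, steps = shorter, steps + 1


ROOTS = {"0", "1", "01", "10", "010", "101"}
```

For example, `root("0110")` collapses the block "11" once and returns `("010", 1)`. The greedy count for `"0000"` is 3, while two duplications (0 to 00 to 0000) suffice, which is why the count is only an upper bound.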

preprint · 2010 · arXiv

On Critical Relative Distance of DNA Codes for Additive Stem Similarity

We consider DNA codes based on the nearest-neighbor (stem) similarity model which adequately reflects the "hybridization potential" of two DNA sequences. Our aim is to present a survey of bounds on the rate of DNA codes with respect to a thermodynamically motivated similarity measure called an additive stem similarity. These results yield a method to analyze and compare known samples of the nearest neighbor "thermodynamic weights" associated to stacked pairs that occurred in DNA secondary structures.

preprint · 2016 · arXiv

Higher order methylation features for clustering and prediction in epigenomic studies

Motivation: DNA methylation is an intensely studied epigenetic mark, yet its functional role is incompletely understood. Attempts to quantitatively associate average DNA methylation to gene expression yield poor correlations outside of the well-understood methylation-switch at CpG islands. Results: Here we use probabilistic machine learning to extract higher order features associated with the methylation profile across a defined region. These features quantitate precisely notions of shape of a methylation profile, capturing spatial correlations in DNA methylation across genomic regions. Using these higher order features across promoter-proximal regions, we are able to construct a powerful machine learning predictor of gene expression, significantly improving upon the predictive power of average DNA methylation levels. Furthermore, we can use higher order features to cluster promoter-proximal regions, showing that five major patterns of methylation occur at promoters across different cell lines, and we provide evidence that methylation beyond CpG islands may be related to regulation of gene expression. Our results support previous reports of a functional role of spatial correlations

preprint · 2009 · arXiv

Compressed Genotyping

Significant volumes of knowledge have been accumulated in recent years linking subtle genetic variations to a wide variety of medical disorders from Cystic Fibrosis to mental retardation. Nevertheless, there are still great challenges in applying this knowledge routinely in the clinic, largely due to the relatively tedious and expensive process of DNA sequencing. Since the genetic polymorphisms that underlie these disorders are relatively rare in the human population, the presence or absence of a disease-linked polymorphism can be thought of as a sparse signal. Using methods and ideas from compressed sensing and group testing, we have developed a cost-effective genotyping protocol. In particular, we have adapted our scheme to a recently developed class of high throughput DNA sequencing technologies, and assembled a mathematical framework that has some important distinctions from 'traditional' compressed sensing ideas in order to address different biological and technical constraints.

preprint · 2016 · arXiv

Stochastic predator-prey dynamics of transposons in the human genome

Transposable elements, or transposons, are DNA sequences that can jump from site to site in the genome during the life cycle of a cell, usually encoding the very enzymes which perform their excision. However, some transposons are parasitic, relying on the enzymes produced by the regular transposons. In this case, we show that a stochastic model, which takes into account the small copy numbers of the transposons in a cell, predicts noise-induced predator-prey oscillations with a characteristic time scale that is much longer than the cell replication time, indicating that the state of the predator-prey oscillator is stored in the genome and transmitted to successive generations. Our work demonstrates the important role of number fluctuations in the expression of mobile genetic elements, and shows explicitly how ecological concepts can be applied to the dynamics and fluctuations of living genomes.

preprint · 2005 · arXiv

On the Complexity of the Single Individual SNP Haplotyping Problem

We present several new results pertaining to haplotyping. These results concern the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype fragments. We consider the complexity of the problems Minimum Error Correction (MEC) and Longest Haplotype Reconstruction (LHR) for different restrictions on the input data. Specifically, we look at the gapless case, where every row of the input corresponds to a gapless haplotype-fragment, and the 1-gap case, where at most one gap per fragment is allowed. We prove that MEC is APX-hard in the 1-gap case and still NP-hard in the gapless case. In addition, we question earlier claims that MEC is NP-hard even when the input matrix is restricted to being completely binary. Concerning LHR, we show that this problem is NP-hard and APX-hard in the 1-gap case (and thus also in the general case), but is polynomial time solvable in the gapless case.
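As a small illustration of the Minimum Error Correction objective, the sketch below scores a candidate haplotype pair against a set of fragments and finds the optimum by exhaustive search. It assumes fragments are strings over "0", "1" and "-" (gap), and that all sites are heterozygous so the second haplotype is the complement of the first. Both assumptions, and the function names, are choices made for the example; the exhaustive search is exponential and only practical for tiny inputs.

```python
from itertools import product


def mismatches(fragment, hap):
    """Count positions where a fragment disagrees with a haplotype, ignoring gaps."""
    return sum(1 for f, h in zip(fragment, hap) if f != "-" and f != h)


def mec_cost(fragments, h1, h2):
    """Total corrections needed when each fragment is assigned to its closer haplotype."""
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


def brute_force_mec(fragments):
    """Try every haplotype h1 (with h2 its complement) and return (cost, h1, h2)."""
    n = len(fragments[0])
    best = None
    for bits in product("01", repeat=n):
        h1 = "".join(bits)
        h2 = "".join("1" if b == "0" else "0" for b in bits)
        cost = mec_cost(fragments, h1, h2)
        if best is None or cost < best[0]:
            best = (cost, h1, h2)
    return best


clean = ["01-", "-10", "10-", "-01"]
noisy = clean + ["011"]
```

Here `clean` is consistent with the pair 010/101 at zero cost, while the extra fragment in `noisy` forces at least one correction.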

preprint · 2016 · arXiv

Genomic Region Detection via Spatial Convex Clustering

Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning paramete

602 works