Researcher profile

Knut Reinert

Knut Reinert contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 17 - Unverified
Verification L1
Unclaimed author
4 works
0 followers
4 topics
4 close collaborators

Published work

4 published items

Preprint, 2020, arXiv

Portability of Scientific Workflows in NGS Data Analysis: A Case Study

The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtime. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or to another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and workflow reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system Snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. In the project, we ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary sizes using Hadoop. The purpose of this port was to enable owners of low-cost hardware infrastructures, for which Hadoop was made, to use the workflow as well. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces. These differences resulted in various problems, some expected and more unexpected, that had to be resolved before the workflow could be run with equal semantics. As a side effect, we also report cost/runtime ratios for a state-of-the-art NGS workflow on very different hardware platforms: a comparatively cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster (552 threads), and a high-end HPC system (3784 threads).

Preprint, 2012, arXiv

Fast and sensitive read mapping with approximate seeds and multiple backtracking

We present Masai, a read mapper representing the state of the art in terms of speed and sensitivity. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, and 2-3 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared to exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined, these two methods significantly speed up approximate search on genomic datasets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license, and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai.
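
The idea of filtering with approximate seeds can be illustrated with a small sketch. It is not Masai's implementation: it assumes mismatches only (Hamming distance), scans the reference by brute force instead of using an index, and the names `seed_plan` and `map_read` are hypothetical. It relies on the pigeonhole argument that a read matching with at most k errors, when split into s non-overlapping seeds, must contain at least one seed with at most floor(k/s) errors.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_plan(read_length, max_errors, num_seeds):
    """Split a read into non-overlapping seeds.

    Returns (start, end, seed_errors) tuples. By the pigeonhole argument,
    at least one seed of a true match has at most max_errors // num_seeds
    errors, so searching every seed with that threshold loses no match.
    """
    if not 1 <= num_seeds <= read_length:
        raise ValueError("num_seeds must be between 1 and read_length")
    seed_errors = max_errors // num_seeds
    base, extra = divmod(read_length, num_seeds)
    plan, start = [], 0
    for i in range(num_seeds):
        length = base + (1 if i < extra else 0)
        plan.append((start, start + length, seed_errors))
        start += length
    return plan


def map_read(reference, read, max_errors, num_seeds):
    """Return sorted reference positions where the read matches with
    at most max_errors mismatches (hypothetical helper, brute-force scan)."""
    candidates = set()
    for start, end, seed_errors in seed_plan(len(read), max_errors, num_seeds):
        seed = read[start:end]
        for pos in range(len(reference) - len(seed) + 1):
            # Filtration: keep positions where the seed matches approximately.
            if hamming(seed, reference[pos:pos + len(seed)]) <= seed_errors:
                candidate = pos - start
                if 0 <= candidate <= len(reference) - len(read):
                    candidates.add(candidate)
    # Verification: check the whole read at every surviving candidate.
    return sorted(
        p for p in candidates
        if hamming(read, reference[p:p + len(read)]) <= max_errors
    )


if __name__ == "__main__":
    print(map_read("GGGACGTTTGACCAGG", "ACGTATGA", max_errors=1, num_seeds=2))
```

With fewer seeds than `max_errors + 1`, each seed is longer and is searched with a nonzero error threshold, which is the sense in which approximate seeds trade a more expensive seed search for fewer spurious candidates.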

Preprint, 2012, arXiv

Hidden breakpoints in genome alignments

During the course of evolution, an organism's genome can undergo changes that affect the large-scale structure of the genome. These changes include gene gain, loss, duplication, chromosome fusion, fission, and rearrangement. When gene gain and loss occur in addition to other types of rearrangement, breakpoints of rearrangement can exist that are only detectable by comparison of three or more genomes. An arbitrarily large number of these "hidden" breakpoints can exist among genomes that exhibit no rearrangements in pairwise comparisons. We present an extension of the multichromosomal breakpoint median problem to genomes that have undergone gene gain and loss. We then demonstrate that the median distance among three genomes can be used to calculate a lower bound on the number of hidden breakpoints present. We provide an implementation of this calculation including the median distance, along with some practical improvements on the time complexity of the underlying algorithm. We apply our approach to measure the abundance of hidden breakpoints in simulated data sets under a wide range of evolutionary scenarios. We demonstrate that in simulations the hidden breakpoint counts depend strongly on relative rates of inversion and gene gain/loss. Finally, we apply current multiple genome aligners to the simulated genomes and show that all aligners introduce a high degree of error in hidden breakpoint counts, and that this error grows with evolutionary distance in the simulation. Our results suggest that hidden breakpoint error may be pervasive in genome alignments.

Preprint, 2011, arXiv

Antilope - A Lagrangian Relaxation Approach to the de novo Peptide Sequencing Problem

Peptide sequencing from mass spectrometry data is a key step in proteome research. De novo sequencing in particular, the identification of a peptide from its spectrum alone, is still a challenge even for state-of-the-art algorithmic approaches. In this paper we present Antilope, a new fast and flexible approach based on mathematical programming. It builds on the spectrum graph model and works with a variety of scoring schemes. Antilope combines Lagrangian relaxation for solving an integer linear programming formulation with an adaptation of Yen's k shortest paths algorithm. It shows a significant improvement in running time compared to mixed integer optimization and performs at the same speed as other state-of-the-art tools. We also implemented a generic probabilistic scoring scheme that can be trained automatically on a dataset of annotated spectra and is independent of the mass spectrometer type. Evaluations on benchmark data show that Antilope is competitive with the popular state-of-the-art programs PepNovo and NovoHMM in terms of both run time and accuracy. Furthermore, it offers increased flexibility in the number of considered ion types. Antilope will be freely available as part of the open source proteomics library OpenMS.