Source author record

Olga Vitek

Olga Vitek appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Quantitative Methods Artificial Intelligence Human-Computer Interaction Machine Learning Software Engineering

Catalog footprint

What is connected

4works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

Leveraging Structured Biological Knowledge for Counterfactual Inference: a Case Study of Viral Pathogenesis

Counterfactual inference is a useful tool for comparing outcomes of interventions on complex systems. It requires us to represent the system in form of a structural causal model, complete with a causal diagram, probabilistic assumptions on exogenous variables, and functional assignments. Specifying such models can be extremely difficult in practice. The process requires substantial domain expertise, and does not scale easily to large systems, multiple systems, or novel system modifications. At the same time, many application domains, such as molecular biology, are rich in structured causal knowledge that is qualitative in nature. This manuscript proposes a general approach for querying a causal biological knowledge graph, and converting the qualitative result into a quantitative structural causal model that can learn from data to answer the question. We demonstrate the feasibility, accuracy and versatility of this approach using two case studies in systems biology. The first demonstrates the appropriateness of the underlying assumptions and the accuracy of the results. The second demonstrates the versatility of the approach by querying a knowledge base for the molecular determinants of a severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-induced cytokine storm, and performing counterfactual inference to estimate the causal effect of medical countermeasures for severely ill patients.

preprint2020arXiv

Investigating usability of MSstatsQC software

MSstatsQC [3] is an open-source software that provides longitudinal system suitability monitoring tools in the form of control charts for proteomic experiments. It includes simultaneous tools for the mean and dispersion of suitability metrics and presents alternative methods of monitoring through different tabs that are designed in the interface. This research focuses on investigating the usability of MSstatsQC software and the interpretability of the designed plots. In this study, we ask 4 test users, from the proteomics field, to complete a series of tasks and questionnaires. The tasks are designed to test the usability of the software in terms of importing data files, selecting appropriate metrics, guide set, and peptides, and finally creating decision rules (tasks 1 and 3 in appendix). The questionnaires ask about interpretability of the plots including control charts, box plots, heat maps, river plots, and radar plots (tasks 1 and 4 in appendix). The goal of the questions is to determine if the test users understand the plots and can interpret them. Results show limitations in usability and plot interpretability, especially in the data import section. We suggest the following modifications. I) providing conspicuous guides close to the window related to up-loading a datafile as well as providing error messages that pop-up when the data set has a wrong format II) providing plot descriptions, hints to interpret plots, plot titles and appropriate axis labels, and, III) Numbering tabs to show the flow of procedures in the software.

preprint2020arXiv

New mixture models for decoy-free false discovery rate estimation in mass-spectrometry proteomics

Motivation: Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target decoy approaches (TDAs) and decoy-free approaches (DFAs), have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. Results: We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs, and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms.

preprint2019arXiv

On the Impact of Programming Languages on Code Quality

This paper is a reproduction of work by Ray et al. which claimed to have uncovered a statistically significant association between eleven programming languages and software defects in projects hosted on GitHub. First we conduct an experimental repetition, repetition is only partially successful, but it does validate one of the key claims of the original work about the association of ten programming languages with defects. Next, we conduct a complete, independent reanalysis of the data and statistical modeling steps of the original study. We uncover a number of flaws that undermine the conclusions of the original study as only four languages are found to have a statistically significant association with defects, and even for those the effect size is exceedingly small. We conclude with some additional sources of bias that should be investigated in follow up work and a few best practice recommendations for similar efforts.