Topic overview

Applications

2178 works, 7258 researchers, 0 institutions

Papers in this area

24 featured works

preprint, 2014, arXiv

Efficient Modeling and Forecasting of the Electricity Spot Price

The increasing importance of renewable energy, especially solar and wind power, has led to new forces in the formation of electricity prices. Hence, this paper introduces an econometric model for the hourly time series of electricity prices of the European Power Exchange (EPEX) which incorporates specific features like renewable energy. The model consists of several sophisticated and established approaches and can be regarded as a periodic VAR-TARCH with wind power, solar power, and load as influences on the time series. It is able to map the distinct and well-known features of electricity prices in Germany. An efficient iteratively reweighted lasso approach is used for the estimation. Moreover, it is shown that several existing models are outperformed by the procedure developed in this paper.

preprint, 2015, arXiv

The non-linear health consequences of living in larger cities

Urbanization promotes economy, mobility, access and availability of resources, but on the other hand, generates higher levels of pollution, violence, crime, and mental distress. The health consequences of the agglomeration of people living close together are not fully understood. Particularly, it remains unclear how variations in the population size across cities impact the health of the population. We analyze the deviations from linearity of the scaling of several health-related quantities, such as the incidence and mortality of diseases, external causes of death, wellbeing, and health-care availability, with respect to the population size of cities in Brazil, Sweden and the USA. We find that deaths by non-communicable diseases tend to be relatively less common in larger cities, whereas the per-capita incidence of infectious diseases is relatively larger for increasing population size. Healthier lifestyles and availability of medical support are disproportionately higher in larger cities. The results are connected with the optimization of human and physical resources, and with the non-linear effects of social networks in larger populations. An urban advantage in terms of health is not evident and using rates as indicators to compare cities with different population sizes may be insufficient.
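
A deviation from linear scaling is often summarized by the exponent beta in Y = Y0 * N^beta, where beta = 1 means the quantity grows in proportion to population. The sketch below estimates beta by ordinary least squares on log-transformed values. The fitting procedure, the function name and the data are assumptions for illustration; the abstract does not state how the authors fit their scaling relations.

```python
import math

def scaling_exponent(populations, counts):
    """Fit log(count) = log(y0) + beta * log(population) by least squares."""
    xs = [math.log(n) for n in populations]
    ys = [math.log(y) for y in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    y0 = math.exp(mean_y - beta * mean_x)
    return beta, y0

# Hypothetical cities: counts generated with a sublinear exponent of 0.9.
populations = [1e4, 1e5, 1e6, 1e7]
counts = [2.0 * n ** 0.9 for n in populations]
beta, y0 = scaling_exponent(populations, counts)
```

An exponent below 1 means the per-capita rate falls as cities grow, which is why comparing raw rates across cities of different sizes can mislead.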

preprint, 2015, arXiv

Respondent-driven sampling bias induced by clustering and community structure in social networks

Sampling hidden populations is particularly challenging using standard sampling methods mainly because of the lack of a sampling frame. Respondent-driven sampling (RDS) is an alternative methodology that exploits the social contacts between peers to reach and weight individuals in these hard-to-reach populations. It is a snowball sampling procedure where the weight of the respondents is adjusted for the likelihood of being sampled due to differences in the number of contacts. In RDS, the structure of the social contacts thus defines the sampling process and affects its coverage, for instance by constraining the sampling within a sub-region of the network. In this paper we study the bias induced by network structures such as social triangles, community structure, and heterogeneities in the number of contacts, in the recruitment trees and in the RDS estimator. We simulate different scenarios of network structures and response-rates to study the potential biases one may expect in real settings. We find that the prevalence of the estimated variable is associated with the size of the network community to which the individual belongs. Furthermore, we observe that low-degree nodes may be under-sampled in certain situations if the sample and the network are of similar size. Finally, we also show that low response-rates lead to reasonably accurate average estimates of the prevalence but generate relatively large biases.
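
The degree-based reweighting described above can be sketched with the inverse-degree (Volz-Heckathorn, or RDS-II) estimator, in which each respondent is weighted by one over their reported number of contacts. Whether the paper uses exactly this estimator is an assumption, and the sample below is hypothetical.

```python
def rds_ii_prevalence(has_trait, degrees):
    """Inverse-degree weighted prevalence: sum(y_i / d_i) / sum(1 / d_i)."""
    weights = [1.0 / d for d in degrees]
    weighted_cases = sum(w for w, y in zip(weights, has_trait) if y)
    return weighted_cases / sum(weights)

def naive_prevalence(has_trait):
    """Unweighted sample proportion, for comparison."""
    return sum(has_trait) / len(has_trait)

# Hypothetical sample: the two respondents with the trait are high-degree,
# so they are over-represented in a contact-driven sample.
has_trait = [1, 1, 0, 0]
degrees = [10, 10, 2, 2]
weighted = rds_ii_prevalence(has_trait, degrees)
unweighted = naive_prevalence(has_trait)
```

Here the weighted estimate is lower than the raw proportion because well-connected respondents are down-weighted.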

preprint, 2015, arXiv

Forecasting day ahead electricity spot prices: The impact of the EXAA to other European electricity markets

In our paper we analyze the relationship between the day-ahead electricity price of the Energy Exchange Austria (EXAA) and other day-ahead electricity prices in Europe. We focus on markets that settle their prices after the EXAA, which enables traders to include the EXAA price in their calculations. For each market we employ econometric models that incorporate the EXAA price and compare them with their counterparts without the price of the Austrian exchange. Through a forecasting study, we find that electricity price models can be improved when EXAA prices are considered.

preprint, 2018, arXiv

Spatio-temporal Patterns of Indian Monsoon Rainfall

The primary objective of this paper is to analyze a set of canonical spatial patterns that approximate the daily rainfall across the Indian region, as identified in the companion paper where we developed a discrete representation of the Indian summer monsoon rainfall using state variables with spatio-temporal coherence maintained using a Markov Random Field prior. In particular, we use these spatio-temporal patterns to study the variation of rainfall during the monsoon season. Firstly, the ten patterns are divided into three families of patterns distinguished by their total rainfall amount and geographic spread. These families are then used to establish 'active' and 'break' spells of the Indian monsoon at the all-India level. Subsequently, we characterize the behavior of these patterns in time by estimating probabilities of transition from one pattern to another across days in a season. Patterns tend to be 'sticky': the self-transition is the most common. We also identify the most commonly occurring sequences of patterns. This leads to a simple seasonal evolution model for the summer monsoon rainfall. The discrete representation introduced in the companion paper also identifies typical temporal rainfall patterns for individual locations. This enables us to determine wet and dry spells at local and regional scales. Lastly, we specify sets of locations that tend to have such spells simultaneously, and thus come up with a new regionalization of the landmass.
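
Estimating transition probabilities between daily patterns amounts to counting consecutive-day pairs and normalizing by how often each pattern occurs as a starting day. The following is a minimal sketch with made-up pattern labels for one season; the function name and data are illustrative and not taken from the paper.

```python
from collections import Counter

def transition_probabilities(sequence):
    """Empirical P(next = b | current = a) from one season of daily labels."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    start_counts = Counter(sequence[:-1])
    return {(a, b): c / start_counts[a] for (a, b), c in pair_counts.items()}

# Hypothetical pattern labels for ten consecutive days of a single season.
# With several seasons, pairs should be counted within each season only,
# so that the last day of one season is not linked to the first of the next.
days = [1, 1, 1, 2, 2, 1, 1, 3, 3, 3]
probs = transition_probabilities(days)
```

In this toy sequence the self-transition of pattern 1 has probability 0.6, the kind of 'sticky' behavior the abstract describes.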

preprint, 2017, arXiv

Dissimilar Symmetric Word Pairs in the Human Genome

In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. We focus our study on the complete human genome and its repeat-masked version.
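
To make the comparison concrete, the sketch below computes the inter-word distance distribution of a word and of its reverse complement in a short sequence. The toy sequence and helper names are hypothetical, and the paper's own dissimilarity measure is not reproduced here.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Complement each base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]

def distance_distribution(sequence, word):
    """Relative frequency of distances between consecutive occurrences."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)  # allows overlapping matches
    gaps = Counter(b - a for a, b in zip(positions, positions[1:]))
    total = sum(gaps.values())
    return {d: c / total for d, c in gaps.items()} if total else {}

# Toy sequence for illustration only.
sequence = "ACGACGTTCGTACG"
word = "ACG"
partner = reverse_complement(word)
dist_word = distance_distribution(sequence, word)
dist_partner = distance_distribution(sequence, partner)
```

Any dissimilarity measure between the two resulting distributions can then be applied to rank symmetric pairs.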

preprint, 2016, arXiv

Global Estimation of Neonatal Mortality using a Bayesian Hierarchical Splines Regression Model

In recent years, much of the focus in monitoring child mortality has been on assessing changes in the under-five mortality rate (U5MR). However, as the U5MR decreases, the share of neonatal deaths (within the first month) tends to increase, warranting increased efforts in monitoring this indicator in addition to the U5MR. A Bayesian splines regression model is presented for estimating neonatal mortality rates (NMR) for all countries. In the model, the relationship between NMR and U5MR is assessed and used to inform estimates, and spline regression models are used to capture country-specific trends. As such, the resulting NMR estimates incorporate trends in overall child mortality while also capturing data-driven trends. The model is fitted to 195 countries using the database from the United Nations Interagency Group for Child Mortality Estimation, producing estimates from 1990, or earlier if data are available, until 2015. The results suggest that, above a U5MR of 34 deaths per 1000 live births, at the global level, a 1 per cent increase in the U5MR leads to a 0.6 per cent decrease in the ratio of NMR to U5MR. Below a U5MR of 34 deaths per 1000 live births, the proportion of deaths under-five that are neonatal is constant at around 54 per cent. However, the relationship between U5MR and NMR varies across countries. The model has now been adopted by the United Nations Inter-agency Group for Child Mortality Estimation.

preprint, 2018, arXiv

Assessing student's achievement gap between ethnic groups in Brazil

Achievement gaps refer to the difference in the performance on examinations of students belonging to different social groups. Achievement gaps between ethnic groups have been observed in several countries with heterogeneous populations. In this paper, we analyze achievement gaps between ethnic populations in Brazil by studying the performance of a large cohort of senior high-school students in a standardized national exam. We separate ethnic groups into the Brazilian states to remove potential biases associated with infrastructure and financial resources, cultural background and ethnic clustering. We focus on the disciplines of mathematics and writing that involve different cognitive functions. We estimate the gaps and their statistical significance through Welch's t-test and study key socio-economic variables that may explain the existence or absence of gaps. We identify that gaps between ethnic groups are either statistically insignificant (p<.01) or small (2%-6%) if statistically significant, for students living in households with low income. Increasing gaps however may be observed for higher income. On the other hand, while higher parental education is associated with higher performance, it may either increase, decrease or maintain the gaps between White and Black, and between White and Pardo students. Our results support that socio-economic variables have major impact on students' performance in both mathematics and writing examinations irrespective of ethnic backgrounds, giving evidence that genetic factors have little or no effect on ethnic group performance when students are exposed to similar cultural and financial contexts.
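
Welch's t-test compares two group means without assuming equal variances. A minimal sketch of the statistic and its Welch-Satterthwaite degrees of freedom follows; the scores are hypothetical, and turning t into a p-value additionally requires the Student t distribution, which is omitted here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical exam scores for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
```

The degrees of freedom are generally non-integer and fall below the pooled value when the two variances differ.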

preprint, 2018, arXiv

Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data

Correct classification of breast cancer sub-types is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer (TNBC) which has the worst prognosis among breast cancer types. Using cutting edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma (BRCA) transcriptomic data publicly available from The Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail in the presence of these outliers, prompting the need for robust statistics. Using robust sparse logistic regression we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to TNBC, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for TNBC. Out of these, JAM3, SFT2D2 and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between TNBC and non-TNBC data. The individual role of FOXA1 in TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC stand out. Not only will our results contribute to the breast cancer/TNBC understanding and ultimately its management, they also show that robust regression and outlier detection constitute key strategies to cope with high-dimensional clinical data such as omics data.

preprint, 2018, arXiv

Pollution State Modeling for Mexico City

Ground-level ozone and particulate matter pollutants are associated with a variety of health issues and increased mortality. For this reason, Mexican environmental agencies regulate pollutant levels. In addition, Mexico City defines pollution emergencies using thresholds that rely on regional maxima for ozone and particulate matter with diameter less than 10 micrometers ($\text{PM}_{10}$). To predict local pollution emergencies and to assess compliance to Mexican ambient air quality standards, we analyze hourly ozone and $\text{PM}_{10}$ measurements from 24 stations across Mexico City from 2017 using a bivariate spatiotemporal model. Using this model, we predict future pollutant levels using current weather conditions and recent pollutant concentrations. Using hourly pollutant projections, we predict regional maxima needed to estimate the probability of future pollution emergencies. We discuss how predicted compliance to legislated pollution limits varies across regions within Mexico City in 2017. We find that predicted probability of pollution emergencies is limited to a few time periods. In contrast, we show that predicted exceedance of Mexican ambient air quality standards is a common, nearly daily occurrence.

preprint, 2018, arXiv

A Discrete View of the Indian Monsoon to Identify Spatial Patterns of Rainfall

We propose a representation of the Indian summer monsoon rainfall in terms of a probabilistic model based on a Markov Random Field, consisting of discrete state variables representing low and high rainfall at grid-scale and daily rainfall patterns across space and in time. These discrete states are conditioned on observed daily gridded rainfall data from the period 2000-2007. The model gives us a set of 10 spatial patterns of daily monsoon rainfall over India, which are robust over a range of user-chosen parameters as well as coherent in space and time. Each day in the monsoon season is assigned precisely one of the spatial patterns, that approximates the spatial distribution of rainfall on that day. Such approximations are quite accurate for nearly 95% of the days. Remarkably, these patterns are representative (with similar accuracy) of the monsoon seasons from 1901 to 2000 as well. Finally, we compare the proposed model with alternative approaches to extract spatial patterns of rainfall, using empirical orthogonal functions as well as clustering algorithms such as K-means and spectral clustering.

preprint, 2017, arXiv

Comparing reverse complementary genomic words based on their distance distributions and frequencies

In this work we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.

preprint, 2017, arXiv

Modeling Efficiency of Foreign Aid Allocation in Malawi

The Open Aid Malawi initiative has collected an unprecedented database that identifies as much location-specific information as possible for each of over 2500 individual foreign aid donations to Malawi since 2003. Ensuring efficient use and distribution of that aid is important to donors and to Malawi citizens. However, because of individual donor goals and difficulty in tracking donor coordination, determining presence or absence of efficient aid allocation is difficult. We compare several Bayesian spatial generalized linear mixed models to relate aid allocation to various economic indicators within seven donation sectors. We find that the spatial gamma regression model best predicts current aid allocation. Using this model, first we use inferences on coefficients to examine whether or not there is evidence of efficient aid allocation within each sector. Second, we use this model to determine a more efficient aid allocation scenario and compare this scenario to the current allocation to provide insight for future aid donations.

preprint, 2018, arXiv

Clustering genomic words in human DNA using peaks and trends of distributions

In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the 'trend'), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.

preprint, 2016, arXiv

Correlations and forecast of death tolls in the Syrian conflict

The Syrian civil war has been ongoing since 2011 and has already caused thousands of deaths. The analysis of death tolls helps to understand the dynamics of the conflict and to better allocate resources to the affected areas. In this article, we use information on the daily number of deaths to study temporal and spatial correlations in the data, and exploit this information to forecast events of deaths. We find that the number of deaths per day follows a log-normal distribution during the conflict. We have also identified strong correlations between cities and on consecutive days, implying that major deaths in one location are typically followed by major deaths in both the same location and in other areas. We find that war-related deaths are not random events and observing death tolls in some cities helps to better predict these numbers across the system.

preprint, 2017, arXiv

Bias and high-dimensional adjustment in observational studies of peer effects

Peer effects, in which the behavior of an individual is affected by the behavior of their peers, are posited by multiple theories in the social sciences. Other processes can also produce behaviors that are correlated in networks and groups, thereby generating debate about the credibility of observational (i.e. nonexperimental) studies of peer effects. Randomized field experiments that identify peer effects, however, are often expensive or infeasible. Thus, many studies of peer effects use observational data, and prior evaluations of causal inference methods for adjusting observational data to estimate peer effects have lacked an experimental "gold standard" for comparison. Here we show, in the context of information and media diffusion on Facebook, that high-dimensional adjustment of a nonexperimental control group (677 million observations) using propensity score models produces estimates of peer effects statistically indistinguishable from those from using a large randomized experiment (220 million observations). Naive observational estimators overstate peer effects by 320% and commonly used variables (e.g., demographics) offer little bias reduction, but adjusting for a measure of prior behaviors closely related to the focal behavior reduces bias by 91%. High-dimensional models adjusting for over 3,700 past behaviors provide additional bias reduction, such that the full model reduces bias by over 97%. This experimental evaluation demonstrates that detailed records of individuals' past behavior can improve studies of social influence, information diffusion, and imitation; these results are encouraging for the credibility of some studies but also cautionary for studies of rare or new behaviors. More generally, these results show how large, high-dimensional data sets and statistical learning techniques can be used to improve causal inference in the behavioral sciences.

preprint, 2019, arXiv

Centered Partition Process: Informative Priors for Clustering

There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given on how to include concrete prior knowledge on the partition. For example, we are motivated by an epidemiological application, in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.

preprint, 2019, arXiv

Hybrid Density- and Partition-based Clustering Algorithm for Data with Mixed-type Variables

Clustering is an essential technique for discovering patterns in data. The steady increase in amount and complexity of data over the years led to improvements and development of new clustering algorithms. However, algorithms that can cluster data with mixed variable types (continuous and categorical) remain limited, despite the abundance of data with mixed types, particularly in the medical field. Among existing methods for mixed data, some posit unverifiable distributional assumptions, and in others the contributions of different variable types are not well balanced. We propose a two-step hybrid density- and partition-based algorithm (HyDaP) that can detect clusters after variable selection. The first step involves both density-based and partition-based algorithms to identify the data structure formed by continuous variables and recognize the important variables for clustering; the second step involves a partition-based algorithm together with a novel dissimilarity measure we designed for mixed data to obtain clustering results. Simulations across various scenarios and data structures were conducted to examine the performance of the HyDaP algorithm compared to commonly used methods. We also applied the HyDaP algorithm to electronic health records to identify sepsis phenotypes.

preprint, 2018, arXiv

Robust Estimation of Data-Dependent Causal Effects based on Observing a Single Time-Series

Consider the case that one observes a single time-series, where at each time t one observes a data record O(t) involving treatment nodes A(t), possible covariates L(t) and an outcome node Y(t). The data record at time t carries information for a (potentially causal) effect of the treatment A(t) on the outcome Y(t), in the context defined by a fixed dimensional summary measure Co(t). We are concerned with defining causal effects that can be consistently estimated, with valid inference, for sequentially randomized experiments without further assumptions. More generally, we consider the case when the (possibly causal) effects can be estimated in a double robust manner, analogous to double robust estimation of effects in the i.i.d. causal inference literature. We propose a general class of averages of conditional (context-specific) causal parameters that can be estimated in a double robust manner, therefore fully utilizing the sequential randomization. We propose a targeted maximum likelihood estimator (TMLE) of these causal parameters, and present a general theorem establishing the asymptotic consistency and normality of the TMLE. We extend our general framework to a number of typically studied causal target parameters, including a sequentially adaptive design within a single unit that learns the optimal treatment rule for the unit over time. Our work opens up robust statistical inference for causal questions based on observing a single time-series on a particular unit.

preprint, 2019, arXiv

Double-Robust Estimation in Difference-in-Differences with an Application to Traffic Safety Evaluation

Difference-in-differences (DID) is a widely used approach for drawing causal inference from observational panel data. Two common estimation strategies for DID are outcome regression and propensity score weighting. In this paper, motivated by a real application in traffic safety research, we propose a new double-robust DID estimator that hybridizes regression and propensity score weighting. We particularly focus on the case of discrete outcomes. We show that the proposed double-robust estimator possesses the desirable large-sample robustness property. We conduct a simulation study to examine its finite-sample performance and compare it with alternative methods. Our empirical results from Pennsylvania Department of Transportation data suggest that rumble strips are marginally effective in reducing vehicle crashes.
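
For orientation, the basic two-group, two-period DID contrast can be sketched as below. This is the plain unadjusted estimator, not the double-robust estimator the paper proposes, and the crash counts are hypothetical.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """(Change in treated mean) minus (change in control mean)."""
    def avg(values):
        return sum(values) / len(values)
    treated_change = avg(treated_post) - avg(treated_pre)
    control_change = avg(control_post) - avg(control_pre)
    return treated_change - control_change

# Hypothetical yearly crash counts on road segments before and after
# an intervention; control segments receive no intervention.
effect = did_estimate(
    treated_pre=[10, 12, 14],
    treated_post=[8, 9, 10],
    control_pre=[11, 13, 15],
    control_post=[10, 12, 14],
)
```

Subtracting the control-group change removes any trend common to both groups, under the usual parallel-trends assumption.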

preprint, 2019, arXiv

Fair Regression for Health Care Spending

The distribution of health care payments to insurance plans has substantial consequences for social policy. Risk adjustment formulas predict spending in health insurance markets in order to provide fair benefits and health care coverage for all enrollees, regardless of their health status. Unfortunately, current risk adjustment formulas are known to underpredict spending for specific groups of enrollees leading to undercompensated payments to health insurers. This incentivizes insurers to design their plans such that individuals in undercompensated groups will be less likely to enroll, impacting access to health care for these groups. To improve risk adjustment formulas for undercompensated groups, we expand on concepts from the statistics, computer science, and health economics literature to develop new fair regression methods for continuous outcomes by building fairness considerations directly into the objective function. We additionally propose a novel measure of fairness while asserting that a suite of metrics is necessary in order to evaluate risk adjustment formulas more fully. Our data application using the IBM MarketScan Research Databases and simulation studies demonstrate that these new fair regression methods may lead to massive improvements in group fairness (e.g., 98%) with only small reductions in overall fit (e.g., 4%).

preprint, 2019, arXiv

Exploiting new forms of data to study the private rented sector: strengths and limitations of a database of rental listings

Reviews of official statistics for UK housing have noted that developments have not kept pace with real-world change, particularly the rapid growth of private renting. This paper examines the potential value of big data in this context. We report on the construction of a dataset from the on-line adverts of one national lettings agency, describing the content of the dataset and efforts to validate it against external sources. Focussing on one urban area, we illustrate how the dataset can shed new light on local changes. Lastly, we discuss the issues involved in making more routine use of this kind of data.

preprint, 2019, arXiv

A Bayesian hierarchical model for bridging across patient subgroups in phase I clinical trials with animal data

Incorporating preclinical animal data, which can be regarded as a special kind of historical data, into phase I clinical trials can improve decision making when very little about human toxicity is known. In this paper, we develop a robust hierarchical modelling approach to leverage animal data in new phase I clinical trials, where we bridge across non-overlapping, potentially heterogeneous patient subgroups. Translation parameters are used to bring both historical and contemporary data onto a common dosing scale. This leads to feasible exchangeability assumptions that the parameter vectors, which underpin the dose-toxicity relationship per study, are drawn from a common distribution. Moreover, human dose-toxicity parameter vectors are assumed to be exchangeable either with the standardised, animal study-specific parameter vectors, or between themselves. The possibility of non-exchangeability for each parameter vector is considered to avoid inferences for extreme subgroups being overly influenced by the others. We illustrate the proposed approach with several trial data examples, and evaluate the operating characteristics of our model compared with several alternatives in a simulation study. Numerical results show that our approach yields robust inferences in circumstances where data from multiple sources are inconsistent and/or the bridging assumptions are incorrect.

preprint, 2019, arXiv

Nonparametric Bayesian Instrumental Variable Analysis: Evaluating Heterogeneous Effects of Coronary Arterial Access Site Strategies

Percutaneous coronary interventions (PCIs) are nonsurgical procedures to open blocked blood vessels to the heart, frequently using a catheter to place a stent. The catheter can be inserted into the blood vessels using an artery in the groin or an artery in the wrist. Because clinical trials have indicated that access via the wrist may result in fewer post procedure complications, shortening the length of stay, and ultimately cost less than groin access, adoption of access via the wrist has been encouraged. However, patients treated in usual care are likely to differ from those participating in clinical trials, and there is reason to believe that the effectiveness of wrist access may differ between males and females. Moreover, the choice of artery access strategy is likely to be influenced by patient or physician unmeasured factors. To study the effectiveness of the two artery access site strategies on hospitalization charges, we use data from a state-mandated clinical registry including 7,963 patients undergoing PCI. A hierarchical Bayesian likelihood-based instrumental variable analysis under a latent index modeling framework is introduced to jointly model outcomes and treatment status. Our approach accounts for unobserved heterogeneity via a latent factor structure, and permits nonparametric error distributions with Dirichlet process mixture models. Our results demonstrate that artery access in the wrist reduces hospitalization charges compared to access in the groin, with higher mean reduction for male patients.

People in this topic

12 visible researchers