Source author record

Carsten Jentsch

Carsten Jentsch appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

math.ST, Statistics Theory, Machine Learning, Artificial Intelligence, Computation, Computation and Language, Methodology

Catalog footprint

What is connected

5 works

7 topics

4 close collaborators


preprint · 2022 · arXiv

Testing exogeneity in the functional linear regression model

We propose a novel test statistic for testing exogeneity in the functional linear regression model. In contrast to Hausman-type tests in finite dimensional linear regression setups, a direct extension to the functional linear regression model is not possible. Instead, we propose a test statistic based on the sum of the squared difference of projections of the two estimators for testing the null hypothesis of exogeneity in the functional linear regression model. We derive asymptotic normality under the null and consistency under general alternatives. Moreover, we prove bootstrap consistency results for residual-based bootstraps. In simulations, we investigate the finite sample performance of the proposed testing approach and illustrate the superiority of bootstrap-based approaches. In particular, the bootstrap approaches turn out to be much more robust with respect to the choice of the regularization parameter.

preprint · 2020 · arXiv

Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs

For organizing large text corpora, topic modeling provides useful tools. A widely used method is Latent Dirichlet Allocation (LDA), a generative probabilistic model which models single texts in a collection of texts as mixtures of latent topics. The assignments of words to topics rely on initial values, so the outcome of LDA is generally not fully reproducible. In addition, the reassignment via Gibbs sampling is based on conditional distributions, leading to different results in replicated runs on the same text data. This fact is often neglected in everyday practice. We aim to improve the reliability of LDA results. Therefore, we study the stability of LDA by comparing assignments from replicated runs. We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient. Using such similarities, topics can be clustered. We propose a new pruning algorithm for hierarchical clustering results, based on the idea that two LDA runs create pairs of similar topics. This approach leads to the new measure S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning) for quantifying the stability of LDA models. We discuss some characteristics of this measure and illustrate it with an application to real data consisting of newspaper articles from USA Today. Our results show that the measure S-CLOP is useful for assessing the stability of LDA models or any other topic modeling procedure that characterizes its topics by word distributions. Based on the newly proposed measure for LDA stability, we propose a method to increase the reliability and hence to improve the reproducibility of empirical findings based on topic modeling. This increase in reliability is obtained by running the LDA several times and taking as prototype the most representative run, that is, the LDA run with the highest average similarity to all other runs.
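
The prototype-selection idea at the end of this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: it uses the plain Jaccard coefficient on top-word sets rather than the paper's modified coefficient, it does not implement S-CLOP, and the helper names `jaccard`, `run_similarity` and `select_prototype` are hypothetical.

```python
def jaccard(a, b):
    """Plain Jaccard coefficient of two word collections (assumption: the
    paper uses a modified variant that is not reproduced here)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def run_similarity(run_a, run_b):
    """Hypothetical run-level similarity: for each topic take its best match
    in the other run, average, and symmetrise over both directions."""
    def one_way(src, dst):
        return sum(max(jaccard(t, u) for u in dst) for t in src) / len(src)
    return 0.5 * (one_way(run_a, run_b) + one_way(run_b, run_a))


def select_prototype(runs):
    """Return the index of the run with the highest average similarity to
    all other runs, together with the list of average similarities."""
    n = len(runs)
    averages = []
    for i in range(n):
        sims = [run_similarity(runs[i], runs[j]) for j in range(n) if j != i]
        averages.append(sum(sims) / len(sims))
    best = max(range(n), key=averages.__getitem__)
    return best, averages


# Three toy "runs", each a list of topics given by their top words.
runs = [
    [["tax", "budget", "senate"], ["game", "team", "coach"]],
    [["tax", "budget", "vote"], ["game", "team", "season"]],
    [["movie", "film", "actor"], ["game", "team", "coach"]],
]
prototype_index, average_similarities = select_prototype(runs)
```

In this toy input the first run shares topics with both of the others, so it is chosen as the prototype.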

preprint · 2020 · arXiv

Random boosting and random^2 forests -- A random tree depth injection approach

The induction of additional randomness in parallel and sequential ensemble methods has proven to be worthwhile in many aspects. In this manuscript, we propose and examine a novel random tree depth injection approach suitable for sequential and parallel tree-based approaches including Boosting and Random Forests. The resulting methods are called Random Boost and Random^2 Forest. Both approaches serve as valuable extensions to the existing literature on the gradient boosting framework and random forests. A Monte Carlo simulation, in which tree-shaped data sets with different numbers of final partitions are built, suggests that there are several scenarios where Random Boost and Random^2 Forest can improve the prediction performance of conventional hierarchical boosting and random forest approaches. The new algorithms appear to be especially successful in cases where there are merely a few high-order interactions in the generated data. In addition, our simulations suggest that our random tree depth injection approach can improve computation time by up to 40%, while at the same time the performance losses in terms of prediction accuracy turn out to be minor or even negligible in most cases.
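
To make the idea of random tree depth injection concrete, here is a minimal sketch for the boosting case. It assumes squared-error loss, a single numeric feature, and a depth drawn uniformly from 1 to `max_depth` for each tree; the abstract does not state the depth distribution, so that choice and the names `fit_tree` and `random_depth_boost` are assumptions made for illustration only.

```python
import random


def fit_tree(xs, targets, depth):
    """Fit a small 1-D regression tree with at most `depth` levels of splits
    and return it as a prediction function."""
    mean = sum(targets) / len(targets)
    values = sorted(set(xs))
    if depth == 0 or len(values) < 2:
        return lambda x: mean
    best = None
    for threshold in values[1:]:
        left = [r for x, r in zip(xs, targets) if x < threshold]
        right = [r for x, r in zip(xs, targets) if x >= threshold]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold)
    t = best[1]
    left_idx = [i for i, x in enumerate(xs) if x < t]
    right_idx = [i for i, x in enumerate(xs) if x >= t]
    left_tree = fit_tree([xs[i] for i in left_idx], [targets[i] for i in left_idx], depth - 1)
    right_tree = fit_tree([xs[i] for i in right_idx], [targets[i] for i in right_idx], depth - 1)
    return lambda x: left_tree(x) if x < t else right_tree(x)


def random_depth_boost(xs, ys, n_trees=50, max_depth=4, learning_rate=0.1, seed=0):
    """Gradient boosting for squared error in which each tree's maximum depth
    is drawn at random (assumption: uniform on 1..max_depth)."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees, depths = [], []
    for _ in range(n_trees):
        depth = rng.randint(1, max_depth)  # the injected randomness
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_tree(xs, residuals, depth)
        trees.append(tree)
        depths.append(depth)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, depths
```

Because many of the drawn depths fall below `max_depth`, the average tree is shallower than in a fixed-depth ensemble, which is one plausible reading of where computation-time savings could come from.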

preprint · 2015 · arXiv

Covariance matrix estimation and linear process bootstrap for multivariate time series of possibly increasing dimension

Multivariate time series present many challenges, especially when they are high dimensional. The paper's focus is twofold. First, we address the subject of consistently estimating the autocovariance sequence; this is a sequence of matrices that we conveniently stack into one huge matrix. We are then able to show consistency of an estimator based on the so-called flat-top tapers; most importantly, the consistency holds true even when the time series dimension is allowed to increase with the sample size. Second, we revisit the linear process bootstrap (LPB) procedure proposed by McMurry and Politis [J. Time Series Anal. 31 (2010) 471-482] for univariate time series. Based on the aforementioned stacked autocovariance matrix estimator, we are able to define a version of the LPB that is valid for multivariate time series. Under rather general assumptions, we show that our multivariate linear process bootstrap (MLPB) has asymptotic validity for the sample mean in two important cases: (a) when the time series dimension is fixed and (b) when it is allowed to increase with sample size. As an aside, in case (a) we show that the MLPB works also for spectral density estimators which is a novel result even in the univariate case. We conclude with a simulation study that demonstrates the superiority of the MLPB in some important cases.
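
As a rough illustration of flat-top tapering, the sketch below builds a tapered autocovariance matrix for a univariate series using the trapezoidal flat-top taper (equal to 1 on [-1, 1], decaying linearly to 0 at ±2). It is a simplification, not the paper's estimator: the multivariate stacking is not reproduced, the bandwidth is supplied by hand, and the function names are hypothetical.

```python
def trapezoid_taper(x):
    """Trapezoidal flat-top taper: 1 for |x| <= 1, 2 - |x| for 1 < |x| <= 2,
    and 0 beyond."""
    ax = abs(x)
    if ax <= 1.0:
        return 1.0
    if ax <= 2.0:
        return 2.0 - ax
    return 0.0


def sample_autocov(x, lag):
    """Sample autocovariance at a non-negative lag (divisor n)."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n


def tapered_autocov_matrix(x, bandwidth):
    """n-by-n Toeplitz matrix whose (i, j) entry is the tapered sample
    autocovariance at lag |i - j|; `bandwidth` is a user-chosen assumption."""
    n = len(x)
    gammas = [trapezoid_taper(h / bandwidth) * sample_autocov(x, h) for h in range(n)]
    return [[gammas[abs(i - j)] for j in range(n)] for i in range(n)]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
matrix = tapered_autocov_matrix(series, bandwidth=1.5)
```

Entries at lags up to the bandwidth are left untouched, and entries at lags of at least twice the bandwidth are set to zero. A matrix tapered this way is not guaranteed to be positive definite, so a practical implementation would need an additional correction step.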

preprint · 2015 · arXiv

Testing equality of spectral densities using randomization techniques

In this paper, we investigate the testing problem that the spectral density matrices of several, not necessarily independent, stationary processes are equal. Based on an $L_2$-type test statistic, we propose a new nonparametric approach, where the critical values of the tests are calculated with the help of randomization methods. We analyze asymptotic exactness and consistency of these randomization tests and show in simulation studies that the new procedures possess very good size and power characteristics.
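
The general shape of a randomization test with an L2-type statistic can be sketched for two univariate series. Everything specific here is an assumption for illustration: the moving-average smoother, the rule of swapping periodogram ordinates between the two series at each frequency, and all function names. The sketch makes no attempt to reproduce the paper's statistic, its multivariate setting, or its treatment of dependence between the processes.

```python
import cmath
import math
import random


def periodogram(x):
    """Periodogram at the Fourier frequencies 2*pi*k/n, k = 1..(n-1)//2."""
    n = len(x)
    mean = sum(x) / n
    values = []
    for k in range(1, (n - 1) // 2 + 1):
        dft = sum((x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        values.append(abs(dft) ** 2 / (2 * math.pi * n))
    return values


def smooth(values, half_width):
    """Moving-average smoother across neighbouring frequencies."""
    out = []
    for j in range(len(values)):
        window = values[max(0, j - half_width): j + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def l2_statistic(pa, pb, half_width):
    """Mean squared difference of the two smoothed periodograms."""
    sa, sb = smooth(pa, half_width), smooth(pb, half_width)
    return sum((a - b) ** 2 for a, b in zip(sa, sb)) / len(sa)


def randomization_pvalue(x, y, half_width=2, n_rand=499, seed=0):
    """Randomization p-value: at each frequency the two periodogram ordinates
    are swapped with probability 1/2 and the statistic is recomputed."""
    rng = random.Random(seed)
    px, py = periodogram(x), periodogram(y)
    observed = l2_statistic(px, py, half_width)
    exceed = 0
    for _ in range(n_rand):
        ra, rb = [], []
        for a, b in zip(px, py):
            if rng.random() < 0.5:
                a, b = b, a
            ra.append(a)
            rb.append(b)
        if l2_statistic(ra, rb, half_width) >= observed:
            exceed += 1
    return (exceed + 1) / (n_rand + 1)
```

Smoothing matters in this sketch: without it, swapping ordinates at a single frequency leaves the squared difference unchanged, so the randomization distribution would be degenerate.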
