Source author record

Ingo Scholtes

Ingo Scholtes appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

Social and Information Networks, physics.soc-ph, Machine Learning, Software Engineering, cond-mat.stat-mech, nlin.AO, Methodology, Networking and Internet Architecture, physics.data-an, astro-ph.HE, cond-mat.dis-nn, cs.CY, Data Structures and Algorithms, Digital Libraries, Graphics, math.CO

Catalog footprint

What is connected

20 works

16 topics

4 close collaborators


preprint 2026 arXiv

Learning Neural Operator Surrogates for the Black Hole Accretion Code

General-relativistic magnetohydrodynamic (GR-MHD) simulations are essential for studying black hole accretion, relativistic jets, and magnetic reconnection, yet their computational cost severely limits systematic parameter exploration. We investigate neural operator surrogates for two astrophysically relevant simulation scenarios produced by the Black Hole Accretion Code (BHAC). First, a Physics Informed Fourier Neural Operator (PINO) is trained on the special-relativistic resistive MHD (SRRMHD) evolution of the Orszag-Tang vortex over a range of resistivities spanning the Sweet-Parker and fast reconnection regimes. By embedding the governing equations as an additional loss term evaluated at finer temporal resolution than the available data supervision, the model learns dynamics at time steps where no simulation data is provided, enabling recovery of plasmoid formation that a data-only baseline trained on the same sparse snapshots fails to reproduce. To our knowledge, the present work is the first application of a physics informed neural operator to special relativistic resistive MHD, and the first to investigate the capability of such models to resolve plasmoid formation in SRRMHD. In a second line of investigation, an OFormer-style Transformer Neural Operator is trained on the evolution of spine-sheath relativistic jets created with BHAC, in special-relativistic MHD (SRMHD). The model is directly applied on the adaptive mesh, highlighting the need for linear attention due to long sequences. The neural surrogate model is capable of capturing most of the major details, especially in early predictions. To our knowledge, this constitutes the first application of a neural operator directly on a high resolution adaptive mesh refinement grid in the context of MHD simulations.

preprint 2026 arXiv

The Role of Node Features in Graph Pooling

Graph pooling is commonly applied in graph classification, yet its empirical gains over standard WL-1 expressive GNNs are often marginal or inconsistent. We study this gap by analysing the interaction between node features and graph topology and their effect on pooling objectives. Our analysis reveals that pooling operators require node features that are well-aligned with the graph's topology -- a condition often overlooked and not guaranteed in empirical networks. We formalise fundamental requirements for node features to enable effective pooling, and introduce a quantitative measure of feature quality. Our empirical evaluation shows that, when these requirements are satisfied, pooling can be beneficial and improve performance on appropriate datasets.

preprint 2022 arXiv

Big Data = Big Insights? Operationalising Brooks' Law in a Massive GitHub Data Set

Massive data from software repositories and collaboration tools are widely used to study social aspects in software development. One question that several recent works have addressed is how a software project's size and structure influence team productivity, a question famously considered in Brooks' law. Recent studies using massive repository data suggest that developers in larger teams tend to be less productive than smaller teams. Despite using similar methods and data, other studies argue for a positive linear or even super-linear relationship between team size and productivity, thus contesting the view of software economics that software projects are diseconomies of scale. In our work, we study challenges that can explain the disagreement between recent studies of developer productivity in massive repository data. We further provide, to the best of our knowledge, the largest, curated corpus of GitHub projects tailored to investigate the influence of team size and collaboration patterns on individual and collective productivity. Our work contributes to the ongoing discussion on the choice of productivity metrics in the operationalisation of hypotheses about determinants of successful software projects. It further highlights general pitfalls in big data analysis and shows that the use of bigger data sets does not automatically lead to more reliable insights.

preprint 2022 arXiv

Sequential Motifs in Observed Walks

The structure of complex networks can be characterized by counting and analyzing network motifs. Motifs are small subgraphs that occur repeatedly in a network, such as triangles or chains. Recent work has generalized motifs to temporal and dynamic network data. However, existing techniques do not generalize to sequential or trajectory data, which represents entities moving through the nodes of a network, such as passengers moving through transportation networks. The unit of observation in these data is fundamentally different, since we analyze full observations of trajectories (e.g., a trip from airport A to airport C through airport B), rather than independent observations of edges or snapshots of graphs over time. In this work, we define sequential motifs in trajectory data, which are small, directed, and edge-weighted subgraphs corresponding to patterns in observed sequences. We draw a connection between counting and analysis of sequential motifs and Higher-Order Network (HON) models. We show that by mapping edges of a HON, specifically a kth-order DeBruijn graph, to sequential motifs, we can count and evaluate their importance in observed data. We test our methodology with two datasets: (1) passengers navigating an airport network and (2) people navigating the Wikipedia article network. We find that the most prevalent and important sequential motifs correspond to intuitive patterns of traversal in the real systems, and show empirically that the heterogeneity of edge weights in an observed higher-order DeBruijn graph has implications for the distributions of sequential motifs we expect to see across our null models.
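The mapping from observed walks to weighted edges of a higher-order De Bruijn graph can be illustrated with a small sketch. This is not the authors' implementation: the function name `higher_order_edge_counts` and the toy walks are hypothetical, and the sketch only counts how often each higher-order edge is observed. It does not assess motif importance against null models.

```python
from collections import Counter


def higher_order_edge_counts(walks, k):
    """Count edges of a k-th order De Bruijn graph built from observed walks.

    Each higher-order node is a tuple of k consecutive nodes of a walk.
    An edge links (v1, ..., vk) to (v2, ..., vk+1), so every edge stands
    for one observed sub-walk that visits k + 1 nodes.
    """
    counts = Counter()
    for walk in walks:
        # Walks with fewer than k + 1 nodes yield an empty range.
        for i in range(len(walk) - k):
            src = tuple(walk[i:i + k])
            dst = tuple(walk[i + 1:i + k + 1])
            counts[(src, dst)] += 1
    return counts


# Hypothetical toy data: three trips through a small airport network.
walks = [["A", "B", "C"], ["A", "B", "C"], ["D", "B", "E"]]
weights = higher_order_edge_counts(walks, k=2)
```

In this toy example the second-order edge from (A, B) to (B, C) receives weight 2, which records that travellers arriving at B from A continued to C twice. A first-order graph would lose that information.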

preprint 2020 arXiv

HOTVis: Higher-Order Time-Aware Visualisation of Dynamic Graphs

Network visualisation techniques are important tools for the exploratory analysis of complex systems. While these methods are regularly applied to visualise data on complex networks, we increasingly have access to time series data that can be modelled as temporal networks or dynamic graphs. In dynamic graphs, the temporal ordering of time-stamped edges determines the causal topology of a system, i.e., which nodes can, directly and indirectly, influence each other via a so-called causal path. This causal topology is crucial to understand dynamical processes, assess the role of nodes, or detect clusters. However, we lack graph drawing techniques that incorporate this information into static visualisations. Addressing this gap, we present a novel dynamic graph visualisation algorithm that utilises higher-order graphical models of causal paths in time series data to compute time-aware static graph visualisations. These visualisations combine the simplicity and interpretability of static graphs with a time-aware layout algorithm that highlights patterns in the causal topology that result from the temporal dynamics of edges.

preprint 2020 arXiv

HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks

The unsupervised detection of anomalies in time series data has important applications in user behavioral modeling, fraud detection, and cybersecurity. Anomaly detection has, in fact, been extensively studied in categorical sequences. However, we often have access to time series data that represent paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce HYPA, a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph. HYPA provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order.

preprint 2020 arXiv

Learning the Markov order of paths in a network

We study the problem of learning the Markov order in categorical sequences that represent paths in a network, i.e. sequences of variable lengths where transitions between states are constrained to a known graph. Such data pose challenges for standard Markov order detection methods and demand modelling techniques that explicitly account for the graph constraint. Adopting a multi-order modelling framework for paths, we develop a Bayesian learning technique that (i) more reliably detects the correct Markov order compared to a competing method based on the likelihood ratio test, (ii) requires considerably less data compared to methods using AIC or BIC, and (iii) is robust against partial knowledge of the underlying constraints. We further show that a recently published method that uses a likelihood ratio test has a tendency to overfit the true Markov order of paths, which is not the case for our Bayesian technique. Our method is important for data scientists analyzing patterns in categorical sequence data that are subject to (partially) known constraints, e.g. sequences with forbidden words, mobility trajectories and click stream data, or sequence data in bioinformatics. Addressing the key challenge of model selection, our work is further relevant for the growing body of research that emphasizes the need for higher-order models in network analysis.

preprint 2019 arXiv

Counting Causal Paths in Big Times Series Data on Networks

Graph or network representations are an important foundation for data mining and machine learning tasks in relational data. Many tools of network analysis, like centrality measures, information ranking, or cluster detection rest on the assumption that links capture direct influence, and that paths represent possible indirect influence. This assumption is invalidated in time-stamped network data capturing, e.g., dynamic social networks, biological sequences or financial transactions. In such data, for two time-stamped links (A,B) and (B,C) the chronological ordering and timing determine whether a causal path from node A via B to C exists. A number of works have shown that, for that reason, network analysis cannot be directly applied to time-stamped network data. Existing methods to address this issue require statistics on causal paths, which are computationally challenging to obtain for big data sets. Addressing this problem, we develop an efficient algorithm to count causal paths in time-stamped network data. Applying it to empirical data, we show that our method is more efficient than a baseline method implemented in an OpenSource data analytics package. Our method works efficiently for different values of the maximum time difference between consecutive links of a causal path and supports streaming scenarios. With it, we are closing a gap that hinders an efficient analysis of big time series data on complex networks.
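The causal path condition described in this abstract can be made concrete with a brute-force sketch. It is not the efficient algorithm the paper develops. It assumes that a link (B,C) at time t2 continues a link (A,B) at time t1 when 0 < t2 - t1 <= delta, where the strict lower bound is an assumption of this example. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def count_two_step_causal_paths(edges, delta):
    """Naively count causal paths A -> B -> C in time-stamped links.

    edges: list of (source, target, time) tuples.
    delta: maximum time difference between consecutive links.
    """
    # Index links by source node so continuations can be looked up quickly.
    by_source = defaultdict(list)
    for u, v, t in edges:
        by_source[u].append((v, t))

    counts = defaultdict(int)
    for a, b, t1 in edges:
        for c, t2 in by_source[b]:
            # The second link must occur after the first, within delta.
            if 0 < t2 - t1 <= delta:
                counts[(a, b, c)] += 1
    return dict(counts)


# Hypothetical toy data: four time-stamped links.
edges = [("A", "B", 1), ("B", "C", 2), ("B", "C", 5), ("C", "A", 3)]
paths = count_two_step_causal_paths(edges, delta=2)
```

With delta=2 the link (B,C) at time 5 does not extend (A,B) at time 1 because the gap is too long, and (C,A) at time 3 cannot extend (B,C) at time 5 because it happened earlier. A static graph built from the same links would contain both paths.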

preprint 2017 arXiv

From Relational Data to Graphs: Inferring Significant Links using Generalized Hypergeometric Ensembles

The inference of network topologies from relational data is an important problem in data analysis. Exemplary applications include the reconstruction of social ties from data on human interactions, the inference of gene co-expression networks from DNA microarray data, or the learning of semantic relationships based on co-occurrences of words in documents. Solving these problems requires techniques to infer significant links in noisy relational data. In this short paper, we propose a new statistical modeling framework to address this challenge. It builds on generalized hypergeometric ensembles, a class of generative stochastic models that give rise to analytically tractable probability spaces of directed, multi-edge graphs. We show how this framework can be used to assess the significance of links in noisy relational data. We illustrate our method in two data sets capturing spatio-temporal proximity relations between actors in a social system. The results show that our analytical framework provides a new approach to infer significant links from relational data, with interesting perspectives for the mining of data on social systems.

preprint 2016 arXiv

Generalized Hypergeometric Ensembles: Statistical Hypothesis Testing in Complex Networks

Statistical ensembles of networks, i.e., probability spaces of all networks that are consistent with given aggregate statistics, have become instrumental in the analysis of complex networks. Their numerical and analytical study provides the foundation for the inference of topological patterns, the definition of network-analytic measures, as well as for model selection and statistical hypothesis testing. Contributing to the foundation of these data analysis techniques, in this Letter we introduce generalized hypergeometric ensembles, a broad class of analytically tractable statistical ensembles of finite, directed and weighted networks. This framework can be interpreted as a generalization of the classical configuration model, which is commonly used to randomly generate networks with a given degree sequence or distribution. Our generalization rests on the introduction of dyadic link propensities, which capture the degree-corrected tendencies of pairs of nodes to form edges between each other. Studying empirical and synthetic data, we show that our approach provides broad perspectives for model selection and statistical hypothesis testing in data on complex networks.

preprint 2015 arXiv

Causality-Driven Slow-Down and Speed-Up of Diffusion in Non-Markovian Temporal Networks

Recent research has highlighted limitations of studying complex systems with time-varying topologies from the perspective of static, time-aggregated networks. Non-Markovian characteristics resulting from the ordering of interactions in temporal networks were identified as one important mechanism that alters causality and affects dynamical processes. So far, an analytical explanation for this phenomenon and for the significant variations observed across different systems is missing. Here we introduce a methodology that allows us to analytically predict causality-driven changes of diffusion speed in non-Markovian temporal networks. Validating our predictions in six data sets, we show that, compared to the time-aggregated network, non-Markovian characteristics can lead to either a slow-down or a speed-up of diffusion, which can even outweigh the decelerating effect of community structures in the static topology. Thus, non-Markovian properties of temporal networks constitute an important additional dimension of complexity in time-varying complex systems.

preprint 2014 arXiv

Predicting Scientific Success Based on Coauthorship Networks

We address the question to what extent the success of scientific articles is due to social influence. Analyzing a data set of over 100000 publications from the field of Computer Science, we study how centrality in the coauthorship network differs between authors who have highly cited papers and those who do not. We further show that a machine learning classifier, based only on coauthorship network centrality measures at time of publication, is able to predict with high precision whether an article will be highly cited five years after publication. By this we provide quantitative insight into the social dimension of scientific publishing - challenging the perception of citations as an objective, socially unbiased measure of scientific success.

preprint 2013 arXiv

A Quantitative Study of Social Organisation in Open Source Software Communities

The success of open source projects crucially depends on the voluntary contributions of a sufficiently large community of users. Apart from the mere size of the community, interesting questions arise when looking at the evolution of structural features of collaborations between community members. In this article, we discuss several network analytic proxies that can be used to quantify different aspects of the social organisation in social collaboration networks. We particularly focus on measures that can be related to the cohesiveness of the communities, the distribution of responsibilities and the resilience against turnover of community members. We present a comparative analysis on a large-scale dataset that covers the full history of collaborations between users of 14 major open source software communities. Our analysis covers both aggregate and time-evolving measures and highlights differences in the social organisation across communities. We argue that our results are a promising step towards the definition of suitable, potentially multi-dimensional, resilience and risk indicators for open source software communities.

preprint 2013 arXiv

Betweenness Preference: Quantifying Correlations in the Topological Dynamics of Temporal Networks

We study correlations in temporal networks and introduce the notion of betweenness preference. It allows us to quantify to what extent paths that exist in time-aggregated representations of temporal networks are actually realizable based on the sequence of interactions. We show that betweenness preference is present in empirical temporal network data and that it influences the length of shortest time-respecting paths. Using four different data sets, we further argue that neglecting betweenness preference leads to wrong conclusions about dynamical processes on temporal networks.

preprint 2013 arXiv

Categorizing Bugs with Social Networks: A Case Study on Four Open Source Software Communities

Efficient bug triaging procedures are an important precondition for successful collaborative software engineering projects. Triaging bugs can become a laborious task particularly in open source software (OSS) projects with a large base of comparably inexperienced part-time contributors. In this paper, we propose an efficient and practical method to identify valid bug reports which a) refer to an actual software bug, b) are not duplicates and c) contain enough information to be processed right away. Our classification is based on nine measures to quantify the social embeddedness of bug reporters in the collaboration network. We demonstrate its applicability in a case study, using a comprehensive data set of more than 700,000 bug reports obtained from the Bugzilla installation of four major OSS communities, for a period of more than ten years. For those projects that exhibit the lowest fraction of valid bug reports, we find that the bug reporters' position in the collaboration network is a strong indicator for the quality of bug reports. Based on this finding, we develop an automated classification scheme that can easily be integrated into bug tracking platforms and analyze its performance in the considered OSS communities. A support vector machine (SVM) to identify valid bug reports based on the nine measures yields a precision of up to 90.3% with an associated recall of 38.9%. With this, we significantly improve the results obtained in previous case studies for an automated early identification of bugs that are eventually fixed. Furthermore, our study highlights the potential of using quantitative measures of social organization in collaborative software engineering. It also opens a broad perspective for the integration of social awareness in the design of support infrastructures.

preprint 2013 arXiv

The Rise and Fall of a Central Contributor: Dynamics of Social Organization and Performance in the Gentoo Community

Social organization and division of labor crucially influence the performance of collaborative software engineering efforts. In this paper, we provide a quantitative analysis of the relation between social organization and performance in Gentoo, an Open Source community developing a Linux distribution. We study the structure and dynamics of collaborations as recorded in the project's bug tracking system over a period of ten years. We identify a period of increasing centralization after which most interactions in the community were mediated by a single central contributor. In this period of maximum centralization, the central contributor unexpectedly left the project, thus posing a significant challenge for the community. We quantify how the rise, the activity as well as the subsequent sudden dropout of this central contributor affected both the social organization and the bug handling performance of the Gentoo community. We analyze social organization from the perspective of network theory and augment our quantitative findings by interviews with prominent members of the Gentoo community who shared their personal insights.

preprint 2012 arXiv

A Tunable Mechanism for Identifying Trusted Nodes in Large Scale Distributed Networks

In this paper, we propose a simple randomized protocol for identifying trusted nodes based on personalized trust in large scale distributed networks. The problem of identifying trusted nodes, based on personalized trust, in a large network setting stems from the huge computation and message overhead involved in exhaustively calculating and propagating the trust estimates by the remote nodes. However, in any practical scenario, nodes generally communicate with a small subset of nodes and thus exhaustively estimating the trust of all the nodes can lead to huge resource consumption. In contrast, our mechanism can be tuned to locate a desired subset of trusted nodes, based on the allowable overhead, with respect to a particular user. The mechanism is based on a simple exchange of random walk messages and nodes counting the number of times they are being hit by random walkers of nodes in their neighborhood. Simulation results to analyze the effectiveness of the algorithm show that using the proposed algorithm, nodes identify the top trusted nodes in the network with a very high probability by exploring only around 45% of the total nodes, and in turn generate nearly 90% less overhead as compared to an exhaustive trust estimation mechanism, named TrustWebRank. Finally, we provide a measure of the global trustworthiness of a node; simulation results indicate that the measures generated using our mechanism differ by only around 0.6% as compared to TrustWebRank.

preprint 2012 arXiv

Hierarchical Consensus Formation Reduces the Influence of Opinion Bias

We study the role of hierarchical structures in a simple model of collective consensus formation based on the bounded confidence model with continuous individual opinions. For the particular variation of this model considered in this paper, we assume that a bias towards an extreme opinion is introduced whenever two individuals interact and form a common decision. As a simple proxy for hierarchical social structures, we introduce a two-step decision making process in which in the second step groups of like-minded individuals are replaced by representatives once they have reached local consensus, and the representatives in turn form a collective decision in a downstream process. We find that the introduction of such a hierarchical decision making structure can improve consensus formation, in the sense that the eventual collective opinion is closer to the true average of individual opinions than without it. In particular, we numerically study how the size of groups of like-minded individuals being represented by delegate individuals affects the impact of the bias on the final population-wide consensus. These results are of interest for the design of organisational policies and the optimisation of hierarchical structures in the context of group decision making.

preprint 2012 arXiv

Organic Design of Massively Distributed Systems: A Complex Networks Perspective

The vision of Organic Computing addresses challenges that arise in the design of future information systems that are comprised of numerous, heterogeneous, resource-constrained and error-prone components or devices. Here, the notion organic particularly highlights the idea that, in order to be manageable, such systems should exhibit self-organization, self-adaptation and self-healing characteristics similar to those of biological systems. In recent years, the principles underlying many of the interesting characteristics of natural systems have been investigated from the perspective of complex systems science, particularly using the conceptual framework of statistical physics and statistical mechanics. In this article, we review some of the interesting relations between statistical physics and networked systems and discuss applications in the engineering of organic networked computing systems with predictable, quantifiable and controllable self-* properties.

preprint 2010 arXiv

Distributed Creation and Adaptation of Random Scale-Free Overlay Networks

Random scale-free overlay topologies provide a number of properties, such as high resilience against failures of random nodes, small (average) diameter as well as good expansion and congestion characteristics, that make them interesting for use in large-scale distributed systems. A number of these properties have been shown to be influenced by the exponent γ of their degree distribution P(k) ~ k^{-γ}. In this article, we present a distributed rewiring scheme that is suitable to effectuate scale-free overlay topologies with an adjustable exponent. The scheme uses a biased random walk strategy to sample new endpoints of edges being rewired and relies on a simple equilibrium model for scale-free networks. The bias of the random walk strategy can be tuned to produce random scale-free networks with arbitrary degree distribution exponents greater than two. We argue that the rewiring strategy can be implemented in a distributed fashion based on a node's information about its immediate neighbors. We present both analytical arguments as well as results that have been obtained using an implementation of the proposed protocol.
