Researcher profile

Dimitrios S. Nikolopoulos

Dimitrios S. Nikolopoulos contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2026arXiv

APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs

Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution by offloading KV cache management and parts of attention computation to the CPU. However, a key bottleneck remains: existing schedulers fail to effectively overlap CPU-offloaded tasks with GPU execution during the latency-critical, bandwidth-bound decode phase. This particularly penalizes real-time, decode-heavy applications (e.g., chat, Chain-of-Thought reasoning) which are currently underserved by existing systems, especially under memory pressure typical of edge or low-cost deployments. We present APEX, a novel, profiling-informed scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference. Unlike systems relying on static rules or purely heuristic approaches, APEX dynamically dispatches compute across heterogeneous resources by predicting execution times of CPU and GPU subtasks to maximize overlap while avoiding scheduling overheads. We evaluate APEX on diverse workloads and GPU architectures (NVIDIA T4, A10), using LLaMa-2-7B and LLaMa-3.1-8B models. Compared to GPU-only schedulers like vLLM, APEX improves throughput by 84% - 96% on T4 and 11% - 89% on A10 GPUs, while preserving latency. Against the best existing hybrid schedulers, it delivers up to 72% (T4) and 37% (A10) higher throughput in long-output settings. APEX significantly advances hybrid LLM inference efficiency on such memory-constrained hardware and provides a blueprint for scheduling in heterogeneous AI systems, filling a critical gap for efficient real-time LLM applications.

preprint2026arXiv

Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61$\times$-1.81$\times$ on the consumer-grade RTX 4090 and 1.60$\times$-1.74$\times$ on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4$\times$ under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware. The code is available at https://github.com/chosen-ox/dLLM-Serve.

preprint2020arXiv

Cross Architectural Power Modelling

Existing power modelling research focuses on the model rather than the process for developing models. An automated power modelling process that can be deployed on different processors for developing power models with high accuracy is developed. For this, (i) an automated hardware performance counter selection method that selects counters best correlated to power on both ARM and Intel processors, (ii) a noise filter based on clustering that can reduce the mean error in power models, and (iii) a two stage power model that surmounts challenges in using existing power models across multiple architectures are proposed and developed. The key results are: (i) the automated hardware performance counter selection method achieves comparable selection to the manual method reported in the literature, (ii) the noise filter reduces the mean error in power models by up to 55%, and (iii) the two stage power model can predict dynamic power with less than 8% error on both ARM and Intel processors, which is an improvement over classic models.

preprint2020arXiv

DYVERSE: DYnamic VERtical Scaling in Multi-tenant Edge Environments

Multi-tenancy in resource-constrained environments is a key challenge in Edge computing. In this paper, we develop 'DYVERSE: DYnamic VERtical Scaling in Edge' environments, which is the first light-weight and dynamic vertical scaling mechanism for managing resources allocated to applications for facilitating multi-tenancy in Edge environments. To enable dynamic vertical scaling, one static and three dynamic priority management approaches that are workload-aware, community-aware and system-aware, respectively are proposed. This research advocates that dynamic vertical scaling and priority management approaches reduce Service Level Objective (SLO) violation rates. An online-game and a face detection workload in a Cloud-Edge test-bed are used to validate the research. The merits of DYVERSE is that there is only a sub-second overhead per Edge server when 32 Edge servers are deployed on a single Edge node. When compared to executing applications on the Edge servers without dynamic vertical scaling, static priorities and dynamic priorities reduce SLO violation rates of requests by up to 4% and 12% for the online game, respectively, and in both cases 6% for the face detection workload. Moreover, for both workloads, the system-aware dynamic vertical scaling method effectively reduces the latency of non-violated requests, when compared to other methods.

preprint2020arXiv

Workload-Aware DRAM Error Prediction using Machine Learning

The aggressive scaling of technology may have helped to meet the growing demand for higher memory capacity and density, but has also made DRAM cells more prone to errors. Such a reality triggered a lot of interest in modeling DRAM behavior for either predicting the errors in advance or for adjusting DRAM circuit parameters to achieve a better trade-off between energy efficiency and reliability. Existing modeling efforts may have studied the impact of few operating parameters and temperature on DRAM reliability using custom FPGAs setups, however they neglected the combined effect of workload-specific features that can be systematically investigated only on a real system. In this paper, we present the results of our study on workload-dependent DRAM error behavior within a real server considering various operating parameters, such as the refresh rate, voltage and temperature. We show that the rate of single- and multi-bit errors may vary across workloads by 8x, indicating that program inherent features can affect DRAM reliability significantly. Based on this observation, we extract 249 features, such as the memory access rate, the rate of cache misses, the memory reuse time and data entropy, from various compute-intensive, caching and analytics benchmarks. We apply several supervised learning methods to construct the DRAM error behavior model for 72 server-grade DRAM chips using the memory operating parameters and extracted program inherent features. Our results show that, with an appropriate choice of program features and supervised learning method, the rate of single- and multi-bit errors can be predicted for a specific DRAM module with an average error of less than 10.5 %, as opposed to the 2.9x estimation error obtained for a conventional workload-unaware error model.