Source author record

Mahmut Taylan Kandemir

Mahmut Taylan Kandemir appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Artificial Intelligence; eess.SP; Emerging Technologies; Machine Learning; Networking and Internet Architecture; quant-ph

Catalog footprint

What is connected

5 works

7 topics

4 close collaborators

preprint, 2026, arXiv

Pretraining large language models with MXFP4 on Native FP4 Hardware

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
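The "structured micro-scaling errors" mentioned in the abstract can be illustrated with a small sketch of block-scaled FP4 quantization. This is not the authors' code. It assumes the usual MXFP4 layout as described in the OCP Microscaling format (blocks of 32 values sharing one power-of-two scale, each element stored as FP4 E2M1), and the names `quantize_block`, `E2M1_GRID`, and `E2M1_EMAX` are hypothetical. Rounding here is plain nearest-value with ties going to the smaller magnitude, which may differ from what hardware does.

```python
import math

# Magnitudes representable by one FP4 E2M1 element (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (4.0 == 2**2)


def quantize_block(block):
    """Quantize one block with a shared power-of-two scale and return the
    dequantized values (illustrative sketch, deterministic rounding only)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # floor(log2(max_abs)) computed via frexp to avoid log rounding issues.
    shared_exp = (math.frexp(max_abs)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        # Scale into the element range and saturate at the largest magnitude.
        mag = min(abs(x) / scale, E2M1_GRID[-1])
        # Pick the closest representable magnitude.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, x))
    return out


# One large value in the block sets the shared scale for all 32 elements,
# so the small values collapse to zero.
with_outlier = quantize_block([0.1] * 31 + [48.0])
without_outlier = quantize_block([0.1] * 32)
print(with_outlier[0], without_outlier[0])
```

In this toy case the error is not random noise: it depends on which values share a block, which is one way to read the abstract's distinction between structured error and insufficient stochasticity.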

preprint, 2022, arXiv

Quantum Circuit Resizing

Existing quantum systems provide very limited physical qubit counts, and trying to execute a quantum algorithm/circuit that has more logical qubits than are physically available leads to a compile-time error. Given that it is unrealistic to expect existing quantum systems to provide, in the near future, a sufficient number of qubits to accommodate large circuits, there is a pressing need to explore strategies that can execute large circuits on small systems. In this paper, first, we perform an analysis to identify the qubits that are most suitable for circuit resizing. Our results reveal that, in most quantum programs, there exist qubits that can be reused mid-program to serially/sequentially execute the circuit employing fewer qubits. Motivated by this observation, we design, implement and evaluate a compiler-based approach that (i) identifies the qubits that can be most beneficial for serial circuit execution; (ii) selects those qubits to reuse at each step of execution for size minimization of the circuit; and (iii) minimizes Middle Measurement (MM) delays due to impractical implementation of shots to improve the circuit reliability. Furthermore, since our approach intends to execute the circuits sequentially, the crosstalk errors can also be lowered as a result of the reduced number of concurrent gates. The experimental results indicate that our proposed approach can (i) execute, on small quantum hardware, large circuits that initially do not fit, and (ii) significantly improve the PST of the results by 2.1× when both the original and our serialized programs can fit into the target quantum hardware.
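The idea of reusing a qubit mid-program can be sketched as a toy allocation problem. The code below is not the compiler described in the abstract. It assumes each logical qubit has a known first and last time step in an already fixed schedule, and that a physical qubit can be measured, reset, and handed to another logical qubit one step after its last use. The function name `assign_physical_qubits` and the input format are hypothetical.

```python
import heapq


def assign_physical_qubits(lifetimes):
    """Greedy reuse of physical qubits.

    lifetimes maps a logical qubit name to (first_step, last_step), inclusive.
    Returns (mapping from logical name to physical index, physical qubits used).
    """
    free_at = []      # min-heap of (step when the physical qubit is free, index)
    mapping = {}
    n_physical = 0
    # Visit logical qubits in order of their first operation.
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        if free_at and free_at[0][0] <= start:
            # A physical qubit has finished its previous job: reuse it.
            _, phys = heapq.heappop(free_at)
        else:
            # Nothing is free yet: allocate a fresh physical qubit.
            phys = n_physical
            n_physical += 1
        mapping[name] = phys
        # Assumption: measure and reset make the qubit available at end + 1.
        heapq.heappush(free_at, (end + 1, phys))
    return mapping, n_physical


# Four logical qubits whose lifetimes only partly overlap.
lifetimes = {"q0": (0, 2), "q1": (1, 4), "q2": (3, 6), "q3": (5, 7)}
mapping, used = assign_physical_qubits(lifetimes)
print(mapping, used)
```

In this example four logical qubits fit on two physical ones. A real compiler also has to decide which qubits are worth reusing and account for measurement delays, which this sketch ignores.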

preprint, 2022, arXiv

Seeker: Synergizing Mobile and Energy Harvesting Wearable Sensors for Human Activity Recognition

There is an increasing demand for intelligent processing on emerging ultra-low-power internet of things (IoT) devices, and recent works have shown substantial efficiency boosts by executing inference tasks directly on the IoT device (node) rather than merely transmitting sensor data. However, the computation and power demands of Deep Neural Network (DNN)-based inference pose significant challenges for nodes in an energy-harvesting wireless sensor network (EH-WSN). Moreover, these tasks often require responses from multiple physically distributed EH sensor nodes, which imposes crucial system optimization challenges in addition to per-node constraints. To address these challenges, we propose Seeker, a novel approach to efficiently execute DNN inferences for Human Activity Recognition (HAR) tasks, using both an EH-WSN and a host mobile device. Seeker minimizes communication overheads and maximizes computation at each sensor without violating the quality of service. Seeker uses a store-and-execute approach to complete a subset of inferences on the EH sensor node, reducing communication with the mobile host. Further, for those inferences unfinished because of harvested energy constraints, it leverages an activity-aware coreset (AAC) construction to efficiently communicate compact features to the host device, where ensemble techniques are used to efficiently finish the inferences. Seeker performs HAR with 86.8% accuracy, surpassing the 81.2% accuracy of a state-of-the-art approach. Moreover, by using AAC, it lowers the communication data volume by 8.9×.

preprint, 2020, arXiv

Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

Traditionally, HPC workloads have been deployed in bare-metal clusters, but advances in virtualization have paved the way for these workloads to be deployed in virtualized clusters. However, HPC cluster administrators/providers still face challenges in terms of resource elasticity and virtual machine (VM) provisioning at large scale, due to the lack of coordination between a traditional HPC scheduler and the VM hypervisor (resource management layer). This lack of interaction leads to low cluster utilization and job completion throughput. Furthermore, the VM provisioning delays directly impact the overall performance of jobs in the cluster. Hence, there is a need for effectively provisioning virtualized HPC clusters, which can best utilize the physical hardware with minimal provisioning overheads. Towards this, we propose Multiverse, a VM provisioning framework, which can dynamically spawn VMs for incoming jobs in a virtualized HPC cluster, by integrating the HPC scheduler with the VM resource manager. We have implemented this framework on the Slurm scheduler along with the vSphere VM resource manager. In order to reduce the VM provisioning overheads, we use instant cloning, which shares both the disk and memory with the parent VM, in contrast to full VM cloning, which has to boot up a new VM from scratch. Measurements with real-world HPC workloads demonstrate that instant cloning is 2.5x faster than full cloning in terms of VM provisioning time. Further, it improves resource utilization by up to 40%, and cluster throughput by up to 1.5x, when compared to full cloning for bursty job arrival scenarios.

preprint, 2020, arXiv

Towards Designing a Self-Managed Machine Learning Inference Serving System in Public Cloud

We are witnessing an increasing trend towards using Machine Learning (ML) based prediction systems, spanning across different application domains, including product recommendation systems, personal assistant devices, facial recognition, etc. These applications typically have diverse requirements in terms of accuracy and response latency, that have a direct impact on the cost of deploying them in a public cloud. Furthermore, the deployment cost also depends on the type of resources being procured, which by themselves are heterogeneous in terms of provisioning latencies and billing complexity. Thus, it is strenuous for an inference serving system to choose from this confounding array of resource types and model types to provide low-latency and cost-effective inferences. In this work we quantitatively characterize the cost, accuracy and latency implications of hosting ML inferences on different public cloud resource offerings. In addition, we comprehensively evaluate prior work which tries to achieve cost-effective prediction-serving. Our evaluation shows that prior work does not solve the problem from both dimensions of model and resource heterogeneity. Hence, we argue that to address this problem, we need to holistically solve the issues that arise when trying to combine both model and resource heterogeneity towards optimizing for application constraints. Towards this, we envision developing a self-managed inference serving system, which can optimize the application requirements based on public cloud resource characteristics. In order to solve this complex optimization problem, we explore the high-level design of a reinforcement-learning based system that can efficiently adapt to the changing needs of the system at scale.
