Researcher profile

Enkelejda Kasneci

Enkelejda Kasneci contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
23works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

23 published item(s)

preprint2026arXiv

Exploring Context-aware and LLM-driven Locomotion for Immersive Virtual Reality

Locomotion plays a crucial role in shaping the user experience within virtual reality environments. In particular, hands-free locomotion offers a valuable alternative by supporting accessibility and freeing users from reliance on handheld controllers. To this end, traditional speech-based methods often depend on rigid command sets, limiting the naturalness and flexibility of interaction. In this study, we propose a novel locomotion technique powered by large language models (LLMs), which allows users to navigate virtual environments using natural language with contextual awareness. We evaluate three locomotion methods: controller-based teleportation, voice-based steering, and our language model-driven approach. Our evaluation combines eye-tracking data analysis, including exploratory explainable machine learning analysis with SHAP, and standardized questionnaires (SUS, IPQ, CSQ-VR, NASA-TLX) to examine user experience through both objective gaze-based measures and subjective self-reports of usability, presence, cybersickness, and cognitive load. Our findings show no statistically significant differences in usability, presence, or cybersickness between LLM-driven locomotion and established methods such as teleportation, suggesting its potential as a viable, natural language-based, hands-free alternative. In addition, eye-tracking analysis revealed patterns suggesting tendency toward increased user attention and engagement in the LLM-driven condition. Complementary to these findings, exploratory SHAP analysis revealed that fixation, saccade, and pupil-related features vary across techniques, indicating distinct patterns of visual attention and cognitive processing. Overall, we state that our method can facilitate hands-free locomotion in virtual spaces, especially in supporting accessibility.

preprint2026arXiv

Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks

This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.

preprint2026arXiv

TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types

We present TEyeD, the world's largest unified public data set of eye images taken with head-mounted devices. TEyeD was acquired with seven different head-mounted eye trackers. Among them, two eye trackers were integrated into virtual reality (VR) or augmented reality (AR) devices. The images in TEyeD were obtained from various tasks, including car rides, simulator rides, outdoor sports activities, and daily indoor activities. The data set includes 2D and 3D landmarks, semantic segmentation, 3D eyeball annotation and the gaze vector and eye movement types for all images. Landmarks and semantic segmentation are provided for the pupil, iris and eyelids. Video lengths vary from a few minutes to several hours. With more than 20 million carefully annotated images, TEyeD provides a unique, coherent resource and a valuable foundation for advancing research in the field of computer vision, eye tracking and gaze estimation in modern VR and AR applications. Download: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTEyeDS&mode=list Alternative Download: https://hctlsrva.edu.sot.tum.de/TEyeDS/

preprint2023arXiv

Precise localization of corneal reflections in eye images using deep learning trained on synthetic data

We present a deep learning method for accurately localizing the center of a single corneal reflection (CR) in an eye image. Unlike previous approaches, we use a convolutional neural network (CNN) that was trained solely using simulated data. Using only simulated data has the benefit of completely sidestepping the time-consuming process of manual annotation that is required for supervised training on real eye images. To systematically evaluate the accuracy of our method, we first tested it on images with simulated CRs placed on different backgrounds and embedded in varying levels of noise. Second, we tested the method on high-quality videos captured from real eyes. Our method outperformed state-of-the-art algorithmic methods on real eye images with a 35% reduction in terms of spatial precision, and performed on par with state-of-the-art on simulated images in terms of spatial accuracy.We conclude that our method provides a precise method for CR center localization and provides a solution to the data availability problem which is one of the important common roadblocks in the development of deep learning models for gaze estimation. Due to the superior CR center localization and ease of application, our method has the potential to improve the accuracy and precision of CR-based eye trackers

preprint2022arXiv

A Consistent and Efficient Evaluation Strategy for Attribution Methods

With a variety of local feature attribution methods being proposed in recent years, follow-up work suggested several evaluation strategies. To assess the attribution quality across different attribution techniques, the most popular among these evaluation strategies in the image domain use pixel perturbations. However, recent advances discovered that different evaluation strategies produce conflicting rankings of attribution methods and can be prohibitively expensive to compute. In this work, we present an information-theoretic analysis of evaluation strategies based on pixel perturbations. Our findings reveal that the results are strongly affected by information leakage through the shape of the removed pixels as opposed to their actual values. Using our theoretical insights, we propose a novel evaluation framework termed Remove and Debias (ROAD) which offers two contributions: First, it mitigates the impact of the confounders, which entails higher consistency among evaluation strategies. Second, ROAD does not require the computationally expensive retraining step and saves up to 99% in computational costs compared to the state-of-the-art. We release our source code at https://github.com/tleemann/road_evaluation.

preprint2022arXiv

A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning

Proxy-based Deep Metric Learning (DML) learns deep representations by embedding images close to their class representatives (proxies), commonly with respect to the angle between them. However, this disregards the embedding norm, which can carry additional beneficial context such as class- or image-intrinsic uncertainty. In addition, proxy-based DML struggles to learn class-internal structures. To address both issues at once, we introduce non-isotropic probabilistic proxy-based DML. We model images as directional von Mises-Fisher (vMF) distributions on the hypersphere that can reflect image-intrinsic uncertainties. Further, we derive non-isotropic von Mises-Fisher (nivMF) distributions for class proxies to better represent complex class-specific variances. To measure the proxy-to-image distance between these models, we develop and investigate multiple distribution-to-point and distribution-to-distribution metrics. Each framework choice is motivated by a set of ablational studies, which showcase beneficial properties of our probabilistic approach to proxy-based DML, such as uncertainty-awareness, better-behaved gradients during training, and overall improved generalization performance. The latter is especially reflected in the competitive performance on the standard DML benchmarks, where our approach compares favorably, suggesting that existing proxy-based DML can significantly benefit from a more probabilistic treatment. Code is available at github.com/ExplainableML/Probabilistic_Deep_Metric_Learning.

preprint2022arXiv

User Trust on an Explainable AI-based Medical Diagnosis Support System

Recent research has supported that system explainability improves user trust and willingness to use medical AI for diagnostic support. In this paper, we use chest disease diagnosis based on X-Ray images as a case study to investigate user trust and reliance. Building off explainability, we propose a support system where users (radiologists) can view causal explanations for final decisions. After observing these causal explanations, users provided their opinions of the model predictions and could correct explanations if they did not agree. We measured user trust as the agreement between the model's and the radiologist's diagnosis as well as the radiologists' feedback on the model explanations. Additionally, they reported their trust in the system. We tested our model on the CXR-Eye dataset and it achieved an overall accuracy of 74.1%. However, the experts in our user study agreed with the model for only 46.4% of the cases, indicating the necessity of improving the trust. The self-reported trust score was 3.2 on a scale of 1.0 to 5.0, showing that the users tended to trust the model but the trust still needs to be enhanced.

preprint2022arXiv

Where and What: Driver Attention-based Object Detection

Human drivers use their attentional mechanisms to focus on critical objects and make decisions while driving. As human attention can be revealed from gaze data, capturing and analyzing gaze information has emerged in recent years to benefit autonomous driving technology. Previous works in this context have primarily aimed at predicting "where" human drivers look at and lack knowledge of "what" objects drivers focus on. Our work bridges the gap between pixel-level and object-level attention prediction. Specifically, we propose to integrate an attention prediction module into a pretrained object detection framework and predict the attention in a grid-based style. Furthermore, critical objects are recognized based on predicted attended-to areas. We evaluate our proposed method on two driver attention datasets, BDD-A and DR(eye)VE. Our framework achieves competitive state-of-the-art performance in the attention prediction on both pixel-level and object-level but is far more efficient (75.3 GFLOPs less) in computation.

preprint2021arXiv

A Multimodal Eye Movement Dataset and a Multimodal Eye Movement Segmentation Analysis

We present a new dataset with annotated eye movements. The dataset consists of over 800,000 gaze points recorded during a car ride in the real world and in the simulator. In total, the eye movements of 19 subjects were annotated. In this dataset there are several data sources such as the eyelid closure, the pupil center, the optical vector, and a vector into the pupil center starting from the center of the eye corners. These different data sources are analyzed and evaluated individually as well as in combination with respect to their goodness of fit for eye movement classification. These results will help developers of real-time systems and algorithms to find the best data sources for their application. Also, new algorithms can be trained and evaluated on this data set. The data and the Matlab code can be downloaded here https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/?p=%2FA%20Multimodal%20Eye%20Movement%20Dataset%20and%20...&mode=list

preprint2021arXiv

Artificial Intelligence Methods in In-Cabin Use Cases: A Survey

As interest in autonomous driving increases, efforts are being made to meet requirements for the high-level automation of vehicles. In this context, the functionality inside the vehicle cabin plays a key role in ensuring a safe and pleasant journey for driver and passenger alike. At the same time, recent advances in the field of artificial intelligence (AI) have enabled a whole range of new applications and assistance systems to solve automated problems in the vehicle cabin. This paper presents a thorough survey on existing work that utilizes AI methods for use-cases inside the driving cabin, focusing, in particular, on application scenarios related to (1) driving safety and (2) driving comfort. Results from the surveyed works show that AI technology has a promising future in tackling in-cabin tasks within the autonomous driving aspect.

preprint2021arXiv

Differentiating Surgeon Expertise Solely by Eye Movement Features

Developments in computer science in recent years are moving into hospitals. Surgeons are faced with ever new technical challenges. Visual perception plays a key role in most of these. Diagnostic and training models are needed to optimize the training of young surgeons. In this study, we present a model for classifying experts, 4th-year residents and 3rd-year residents, using only eye movements. We show a model that uses a minimal set of features and still achieve a robust accuracy of 76.46 % to classify eye movements into the correct class. Likewise, in this study, we address the evolutionary steps of visual perception between three expertise classes, forming a first step towards a diagnostic model for expertise.

preprint2021arXiv

Explainable Online Validation of Machine Learning Models for Practical Applications

We present a reformulation of the regression and classification, which aims to validate the result of a machine learning algorithm. Our reformulation simplifies the original problem and validates the result of the machine learning algorithm using the training data. Since the validation of machine learning algorithms must always be explainable, we perform our experiments with the kNN algorithm as well as with an algorithm based on conditional probabilities, which is proposed in this work. For the evaluation of our approach, three publicly available data sets were used and three classification and two regression problems were evaluated. The presented algorithm based on conditional probabilities is also online capable and requires only a fraction of memory compared to the kNN algorithm.

preprint2021arXiv

Eye Movement Feature Classification for Soccer Goalkeeper Expertise Identification in Virtual Reality

The latest research in expertise assessment of soccer players has affirmed the importance of perceptual skills (especially for decision making) by focusing either on high experimental control or on a realistic presentation. To assess the perceptual skills of athletes in an optimized manner, we captured omnidirectional in-field scenes and showed these to 12 expert, 10 intermediate and 13 novice soccer goalkeepers on virtual reality glasses. All scenes were shown from the same natural goalkeeper perspective and ended after the return pass to the goalkeeper. Based on their gaze behavior we classified their expertise with common machine learning techniques. This pilot study shows promising results for objective classification of goalkeepers expertise based on their gaze behaviour and provided valuable insight to inform the design of training systems to enhance perceptual skills of athletes.

preprint2021arXiv

Fully Convolutional Neural Networks for Raw Eye Tracking Data Segmentation, Generation, and Reconstruction

In this paper, we use fully convolutional neural networks for the semantic segmentation of eye tracking data. We also use these networks for reconstruction, and in conjunction with a variational auto-encoder to generate eye movement data. The first improvement of our approach is that no input window is necessary, due to the use of fully convolutional networks and therefore any input size can be processed directly. The second improvement is that the used and generated data is raw eye tracking data (position X, Y and time) without preprocessing. This is achieved by pre-initializing the filters in the first layer and by building the input tensor along the z axis. We evaluated our approach on three publicly available datasets and compare the results to the state of the art.

preprint2021arXiv

Multi Layer Neural Networks as Replacement for Pooling Operations

Pooling operations, which can be calculated at low cost and serve as a linear or nonlinear transfer function for data reduction, are found in almost every modern neural network. Countless modern approaches have already tackled replacing the common maximum value selection and mean value operations, not to mention providing a function that allows different functions to be selected through changing parameters. Additional neural networks are used to estimate the parameters of these pooling functions.Consequently, pooling layers may require supplementary parameters to increase the complexity of the whole model. In this work, we show that one perceptron can already be used effectively as a pooling operation without increasing the complexity of the model. This kind of pooling allows for the integration of multi-layer neural networks directly into a model as a pooling operation by restructuring the data and, as a result, learnin complex pooling operations. We compare our approach to tensor convolution with strides as a pooling operation and show that our approach is both effective and reduces complexity. The restructuring of the data in combination with multiple perceptrons allows for our approach to be used for upscaling, which can then be utilized for transposed convolutions in semantic segmentation.

preprint2021arXiv

Rotated Ring, Radial and Depth Wise Separable Radial Convolutions

Simple image rotations significantly reduce the accuracy of deep neural networks. Moreover, training with all possible rotations increases the data set, which also increases the training duration. In this work, we address trainable rotation invariant convolutions as well as the construction of nets, since fully connected layers can only be rotation invariant with a one-dimensional input. On the one hand, we show that our approach is rotationally invariant for different models and on different public data sets. We also discuss the influence of purely rotational invariant features on accuracy. The rotationally adaptive convolution models presented in this work are more computationally intensive than normal convolution models. Therefore, we also present a depth wise separable approach with radial convolution. Link to CUDA code https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/

preprint2021arXiv

Weight and Gradient Centralization in Deep Neural Networks

Batch normalization is currently the most widely used variant of internal normalization for deep neural networks. Additional work has shown that the normalization of weights and additional conditioning as well as the normalization of gradients further improve the generalization. In this work, we combine several of these methods and thereby increase the generalization of the networks. The advantage of the newer methods compared to the batch normalization is not only increased generalization, but also that these methods only have to be applied during training and, therefore, do not influence the running time during use. Link to CUDA code https://atreus.informatik.uni-tuebingen.de/seafile/d/8e2ab8c3fdd444e1a135/

preprint2020arXiv

Attention Flow: End-to-End Joint Attention Estimation

This paper addresses the problem of understanding joint attention in third-person social scene videos. Joint attention is the shared gaze behaviour of two or more individuals on an object or an area of interest and has a wide range of applications such as human-computer interaction, educational assessment, treatment of patients with attention disorders, and many more. Our method, Attention Flow, learns joint attention in an end-to-end fashion by using saliency-augmented attention maps and two novel convolutional attention mechanisms that determine to select relevant features and improve joint attention localization. We compare the effect of saliency maps and attention mechanisms and report quantitative and qualitative results on the detection and localization of joint attention in the VideoCoAtt dataset, which contains complex social scenes.

preprint2020arXiv

Automated Anonymisation of Visual and Audio Data in Classroom Studies

Understanding students' and teachers' verbal and non-verbal behaviours during instruction may help infer valuable information regarding the quality of teaching. In education research, there have been many studies that aim to measure students' attentional focus on learning-related tasks: Based on audio-visual recordings and manual or automated ratings of behaviours of teachers and students. Student data is, however, highly sensitive. Therefore, ensuring high standards of data protection and privacy has the utmost importance in current practices. For example, in the context of teaching management studies, data collection is carried out with the consent of pupils, parents, teachers and school administrations. Nevertheless, there may often be students whose data cannot be used for research purposes. Excluding these students from the classroom is an unnatural intrusion into the organisation of the classroom. A possible solution would be to request permission to record the audio-visual recordings of all students (including those who do not voluntarily participate in the study) and to anonymise their data. Yet, the manual anonymisation of audio-visual data is very demanding. In this study, we examine the use of artificial intelligence methods to automatically anonymise the visual and audio data of a particular person.

preprint2020arXiv

Deep semantic gaze embedding and scanpath comparison for expertise classification during OPT viewing

Modeling eye movement indicative of expertise behavior is decisive in user evaluation. However, it is indisputable that task semantics affect gaze behavior. We present a novel approach to gaze scanpath comparison that incorporates convolutional neural networks (CNN) to process scene information at the fixation level. Image patches linked to respective fixations are used as input for a CNN and the resulting feature vectors provide the temporal and spatial gaze information necessary for scanpath similarity comparison.We evaluated our proposed approach on gaze data from expert and novice dentists interpreting dental radiographs using a local alignment similarity score. Our approach was capable of distinguishing experts from novices with 93% accuracy while incorporating the image semantics. Moreover, our scanpath comparison using image patch features has the potential to incorporate task semantics from a variety of tasks

preprint2020arXiv

Driver Intention Anticipation Based on In-Cabin and Driving Scene Monitoring

Numerous car accidents are caused by improper driving maneuvers. Serious injuries are however avoidable if such driving maneuvers are detected beforehand and the driver is assisted accordingly. In fact, various recent research has focused on the automated prediction of driving maneuver based on hand-crafted features extracted mainly from in-cabin driver videos. Since the outside view from the traffic scene may also contain informative features for driving maneuver prediction, we present a framework for the detection of the drivers' intention based on both in-cabin and traffic scene videos. More specifically, we (1) propose a Convolutional-LSTM (ConvLSTM)-based auto-encoder to extract motion features from the out-cabin traffic, (2) train a classifier which considers motions from both in- and outside of the cabin jointly for maneuver intention anticipation, (3) experimentally prove that the in- and outside image features have complementary information. Our evaluation based on the publicly available dataset Brain4cars shows that our framework achieves a prediction with the accuracy of 83.98% and F1-score of 84.3%.

preprint2020arXiv

Learning to Validate the Quality of Detected Landmarks

We present a new loss function for the validation of image landmarks detected via Convolutional Neural Networks (CNN). The network learns to estimate how accurate its landmark estimation is. This loss function is applicable to all regression-based location estimations and allows the exclusion of unreliable landmarks from further processing. In addition, we formulate a novel batch balancing approach which weights the importance of samples based on their produced loss. This is done by computing a probability distribution mapping on an interval from which samples can be selected using a uniform random selection scheme. We conducted experiments on the 300W, AFLW, and WFLW facial landmark datasets. In the first experiments, the influence of our batch balancing approach is evaluated by comparing it against uniform sampling. In addition, we evaluated the impact of the validation loss on the landmark accuracy based on uniform sampling. The last experiments evaluate the correlation of the validation signal with the landmark accuracy. All experiments were performed for all three datasets.

preprint2020arXiv

Training Decision Trees as Replacement for Convolution Layers

We present an alternative layer to convolution layers in convolutional neural networks (CNNs). Our approach reduces the complexity of convolutions by replacing it with binary decisions. Those binary decisions are used as indexes to conditional distributions where each weight represents a leaf in a decision tree. This means that only the indices to the weights need to be determined once, thus reducing the complexity of convolutions by the depth of the output tensor. Index computation is performed by simple binary decisions that require fewer cycles compared to conventionally used multiplications. In addition, we show how convolutions can be replaced by binary decisions. These binary decisions form indices in the conditional distributions and we show how they are used to replace 2D weight matrices as well as 3D weight tensors. These new layers can be trained like convolution layers in CNNs based on the backpropagation algorithm, for which we provide a formalization. Our results on multiple publicly available data sets show that our approach performs similar to conventional neuronal networks. Beyond the formalized reduction of complexity and the improved qualitative performance, we show the runtime improvement empirically compared to convolution layers.