Source author record

Vamshi Krishna Bonagiri

Vamshi Krishna Bonagiri appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Computation and Language, Human-Computer Interaction, Social and Information Networks

Catalog footprint

What is connected

2 works

4 topics

4 close collaborators


preprint · 2026 · arXiv

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g., evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure mode to less than 3%. Leveraging the structural stability of the personality space, we show that vectors extracted a priori from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.

preprint · 2022 · arXiv

Are Deepfakes Concerning? Analyzing Conversations of Deepfakes on Reddit and Exploring Societal Implications

Deepfakes are synthetic content generated using advanced deep learning and AI technologies. Advances in this technology have made it much easier for anyone to create and share deepfakes. This may raise societal concerns, depending on how communities engage with them. However, there is limited research on how communities perceive deepfakes. We examined deepfake conversations on Reddit from 2018 to 2021, including major topics and their temporal changes as well as the implications of these conversations. Using a mixed-method approach (topic modeling and qualitative coding), we found 6,638 posts and 86,425 comments discussing concerns about the believable nature of deepfakes and how platforms moderate them. We also found Reddit conversations to be pro-deepfake, building a community that supports creating and sharing deepfake artifacts and building a marketplace regardless of the consequences. Possible implications derived from qualitative codes indicate that deepfake conversations raise societal concerns. We propose that there are implications for Human-Computer Interaction (HCI) to mitigate the harm created by deepfakes.