Source author record

Paul Ginsparg

Paul Ginsparg appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Catalog footprint

What is connected

13 works

23 topics

4 close collaborators


preprint · 2026 · arXiv

LLM hallucinations in the wild: Large-scale evidence from non-existent citations

Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.

preprint · 2021 · arXiv

Attention-based Quantum Tomography

With rapid progress across platforms for quantum systems, the problem of many-body quantum state reconstruction for noisy quantum states becomes an important challenge. Recent works found promise in recasting the problem of quantum state reconstruction to learning the probability distribution of quantum state measurement vectors using generative neural network models. Here we propose the "Attention-based Quantum Tomography" (AQT), a quantum state reconstruction method using an attention mechanism-based generative network that learns the mixed state density matrix of a noisy quantum state. The AQT is based on the model proposed in "Attention is all you need" by Vaswani et al. (2017), which is designed to learn long-range correlations in natural language sentences and thereby outperform previous natural language processing models. We demonstrate not only that AQT outperforms earlier neural-network-based quantum state reconstruction on identical tasks but that AQT can accurately reconstruct the density matrix associated with a noisy quantum state experimentally realized in an IBMQ quantum computer. We speculate that the success of the AQT stems from its ability to model quantum entanglement across the entire quantum system much as the attention model for natural language processing captures the correlations among words in a sentence.
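The attention mechanism this abstract refers to can be illustrated with a minimal sketch of scaled dot-product attention, the core operation of the "Attention is all you need" model. This is not the AQT implementation: the function names and the toy inputs are assumptions made for illustration, and the sketch uses plain Python lists rather than any tensor library.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors.

    Each output is a weighted average of all value vectors, so every
    position can draw on every other position, however far apart.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one component at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example (hypothetical numbers): two positions, two-dimensional vectors.
toy_q = [[1.0, 0.0], [0.0, 1.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0]]
toy_v = [[1.0, 2.0], [3.0, 4.0]]
toy_out, toy_w = scaled_dot_product_attention(toy_q, toy_k, toy_v)
```

Because the weights for each query are computed over all keys at once, the same operation captures correlations between any pair of positions, which is the property the abstract relates to entanglement across the whole system.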

preprint · 2021 · arXiv

Experimental error mitigation using linear rescaling for variational quantum eigensolving with up to 20 qubits

Quantum computers have the potential to help solve a range of physics and chemistry problems, but noise in quantum hardware currently limits our ability to obtain accurate results from the execution of quantum-simulation algorithms. Various methods have been proposed to mitigate the impact of noise on variational algorithms, including several that model the noise as damping expectation values of observables. In this work, we benchmark various methods, including a new method proposed here. We compare their performance in estimating the ground-state energies of several instances of the 1D mixed-field Ising model using the variational-quantum-eigensolver algorithm with up to 20 qubits on two of IBM's quantum computers. We find that several error-mitigation techniques allow us to recover energies to within 10% of the true values for circuits containing up to about 25 ansatz layers, where each layer consists of CNOT gates between all neighboring qubits and Y-rotations on all qubits.
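The idea of modeling noise as damping of expectation values can be sketched as follows. This is a generic illustration, not the new method proposed in the paper: it assumes a single damping factor f such that each noisy expectation value is roughly f times the ideal one, and it assumes that some reference circuits with known ideal values are available for calibration. The function names and the numbers are hypothetical.

```python
def fit_damping_factor(ideal_values, noisy_values):
    """Least-squares fit of f in the model noisy = f * ideal (no offset)."""
    numerator = sum(i * n for i, n in zip(ideal_values, noisy_values))
    denominator = sum(i * i for i in ideal_values)
    if denominator == 0:
        raise ValueError("reference ideal values must not all be zero")
    return numerator / denominator


def rescale(noisy_value, damping_factor):
    """Undo the assumed damping by dividing the measured value by f."""
    if damping_factor == 0:
        raise ValueError("damping factor must be non-zero")
    return noisy_value / damping_factor


# Hypothetical calibration data: known ideal values and their noisy measurements.
reference_ideal = [1.0, -0.5, 0.25]
reference_noisy = [0.8, -0.4, 0.2]
f = fit_damping_factor(reference_ideal, reference_noisy)

# Apply the fitted factor to a noisy energy estimate from the target circuit.
mitigated_energy = rescale(-1.6, f)
```

The fit passes through the origin because a pure damping model has no constant offset. A model with an offset would need a two-parameter linear fit instead.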

preprint · 2020 · arXiv

Sensitivity of collective outcomes identifies pivotal components

A social system is susceptible to perturbation when its collective properties depend sensitively on a few pivotal components. Using the information geometry of minimal models from statistical physics, we develop an approach to identify pivotal components to which coarse-grained, or aggregate, properties are sensitive. As an example, we introduce our approach on a reduced toy model with a median voter who always votes in the majority. The sensitivity of majority-minority divisions to changing voter behaviour pinpoints the unique role of the median. More generally, the sensitivity identifies pivotal components that precisely determine collective outcomes generated by a complex network of interactions. Using perturbations to target pivotal components in the models, we analyse datasets from political voting, finance and Twitter. Across these systems, we find remarkable variety, from systems dominated by a median-like component to those whose components behave more equally. In the context of political institutions such as courts or legislatures, our methodology can help describe how changes in voters map to new collective voting outcomes. For economic indices, differing system response reflects varying fiscal conditions across time. Thus, our information-geometric approach provides a principled, quantitative framework that may help assess the robustness of collective outcomes to targeted perturbation and compare social institutions, or even biological networks, with one another and across time.

preprint · 2015 · arXiv

Text Segmentation based on Semantic Word Embeddings

We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation. By developing a general framework for discussing a class of segmentation objectives, we study the effectiveness of greedy versus exact optimization approaches and suggest a new iterative refinement technique for improving the performance of greedy strategies. We compare our results to known benchmarks, using known metrics. We demonstrate state-of-the-art performance for an untrained method with our Content Vector Segmentation (CVS) on the Choi test set. Finally, we apply the segmentation procedure to an in-the-wild dataset consisting of text extracted from scholarly articles in the arXiv.org database.

preprint · 2014 · arXiv

Kenneth G. Wilson: Renormalized After-Dinner Anecdotes

This is the transcript of the after-dinner talk I gave at the close of the 16 Nov 2013 symposium "Celebrating the Science of Kenneth Geddes Wilson" [1] at Cornell University (see Fig. 1 for the poster). The video of my talk is on-line [2], and this transcript is more or less verbatim, with the slides used included as figures. I've also annotated it with a few clarifying footnotes, and provided references to the source materials where available. The talk itself pulls together anecdotes from various points in his career, discusses my own graduate student experiences with him, and finishes with some video excerpts from an interview he did in 2010.

preprint · 2014 · arXiv

Patterns of Text Reuse in a Scientific Corpus

We consider the incidence of text "reuse" by researchers, via a systematic pairwise comparison of the text content of all articles deposited to arXiv.org from 1991–2012. We measure the global frequencies of three classes of text reuse, and measure how chronic text reuse is distributed among authors in the dataset. We infer a baseline for accepted practice, perhaps surprisingly permissive compared with other societal contexts, and a clearly delineated set of aberrant authors. We find a negative correlation between the amount of reused text in an article and its influence, as measured by subsequent citations. Finally, we consider the distribution of countries of origin of articles containing large amounts of reused text.

preprint · 2011 · arXiv

It was twenty years ago today ...

To mark the 20th anniversary of the (14 Aug 1991) commencement of hep-th@xxx.lanl.gov (now arXiv.org), I've adapted this article from one that first appeared in Physics World (2008), was later reprinted (with permission) in Learned Publishing (2009), but never appeared in arXiv. I trace some historical context and early development of the resource, its later trajectory, and close with some thoughts about the future. This version is closer to my original draft, with some updates for this occasion, plus an astounding $2^5$ added footnotes.

preprint · 2011 · arXiv

Non-Abelian Braiding of Lattice Bosons

We report on a numerical experiment in which we use time-dependent potentials to braid non-abelian quasiparticles. We consider lattice bosons in a uniform magnetic field within the fractional quantum Hall regime, where $ν$, the ratio of particles to flux quanta, is near 1/2, 1 or 3/2. We introduce time-dependent potentials which move quasiparticle excitations around one another, explicitly simulating a braiding operation which could implement part of a gate in a quantum computation. We find that different braids do not commute for $ν$ near $1$ and $3/2$, with Berry matrices respectively consistent with Ising and Fibonacci anyons. Near $ν=1/2$, the braids commute.

preprint · 2010 · arXiv

Last but not Least: Additional Positional Effects on Citation and Readership in arXiv

We continue our investigation of how position in announcements of newly received articles, a single-day artifact, correlates with citations received over the course of ensuing years. Earlier work [arXiv:0907.4740, arXiv:0805.0307] focused on the "visibility" effect for positions near the beginnings of announcements, and on the "self-promotion" effect associated with authors intentionally aiming for these positions, with both found correlated to a later enhanced citation rate. Here we consider a "reverse-visibility" effect for positions near the ends of announcements, and a "procrastination" effect associated with submissions made within the 20 minute period just before the daily deadline. For two large subcommunities of theoretical high energy physics, we find a clear "reverse-visibility" effect, in which articles near the ends of the lists receive a boost in both short-term readership and long-term citations, almost comparable in size to the "visibility" effect documented earlier. For one of those subcommunities, we find an additional "procrastination" effect, in which last position articles submitted shortly before the deadline have an even higher citation rate than those that land more accidentally in that position. We consider and eliminate geographic effects as responsible for the above, and speculate on other possible causes, including "oblivious" and "nightowl" effects.

preprint · 2009 · arXiv

Positional Effects on Citation and Readership in arXiv

arXiv.org mediates contact with the literature for entire scholarly communities, both through provision of archival access and through daily email and web announcements of new materials, potentially many screenlengths long. We confirm and extend a surprising correlation between article position in these initial announcements, ordered by submission time, and later citation impact, due primarily to intentional "self-promotion" on the part of authors. A pure "visibility" effect was also present: the subset of articles accidentally in early positions fared measurably better in the long-term citation record than those lower down. Astrophysics articles announced in position 1, for example, overall received a median number of citations 83% higher, while those there accidentally had a 44% visibility boost. For two large subcommunities of theoretical high energy physics, hep-th and hep-ph articles announced in position 1 had median numbers of citations 50% and 100% larger than for positions 5–15, and the subsets there accidentally had visibility boosts of 38% and 71%. We also consider the positional effects on early readership. The median numbers of early full text downloads for astro-ph, hep-th, and hep-ph articles announced in position 1 were 82%, 61%, and 58% higher than for lower positions, respectively, and those there accidentally had medians visibility-boosted by 53%, 44%, and 46%. Finally, we correlate a variety of readership features with long-term citations, using machine learning methods, thereby extending previous results on the predictive power of early readership in a broader context. We conclude with some observations on impact metrics and dangers of recommender mechanisms.

preprint · 1992 · arXiv

Strings on Curved Spacetimes: Black Holes, Torsion, and Duality

We present a general discussion of strings propagating on noncompact coset spaces $G/H$ in terms of gauged WZW models, emphasizing the role played by isometries in the existence of target space duality. Fixed points of the gauged transformations induce metric singularities and, in the case of abelian subgroups $H$, become horizons in a dual geometry. We also give a classification of models with a single timelike coordinate together with an explicit list for dimensions $D\leq 10$. We study in detail the class of models described by the cosets $SL(2,\mathbb{R})\otimes SO(1,1)^{D-2}/SO(1,1)$. For $D\geq 2$ each coset represents two different spacetime geometries: (2D black hole)$\otimes \mathbb{R}^{D-2}$ and (3D black string)$\otimes \mathbb{R}^{D-3}$ with nonvanishing torsion. They are shown to be dual in such a way that the singularity of the former geometry (which is not due to a fixed point) is mapped to a regular surface (i.e. not even a horizon) in the latter. These cosets also lead to the conformal field theory description of known and new cosmological string models.

preprint · 1986 · arXiv

Desperately Seeking Superstrings

We provide a detailed analysis of the problems and prospects of superstring theory c. 1986, anticipating much of the progress of the decades to follow.