Source author record

Hongmin Li

Hongmin Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning, Artificial Intelligence, Biological Physics, Computation and Language, Computer Vision, nlin.SI

Catalog footprint

What is connected

9 works

6 topics

4 close collaborators

preprint, 2026, arXiv

A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup

Task-agnostic structure proxies are often used to interpret why one pretraining corpus transfers better than another, but such explanations require the proxy to track the structure that matters for the downstream task. We test this requirement in a fixed pretraining-and-probing setup motivated by computationally bounded notions of learned structure, including epiplexity. The core question is whether a proxy ranking of two pretraining datasets must agree with their ranking by OOD probe accuracy. We show that it need not. First, we give a controlled construction in which a formal structure quantity, its operational proxy, and the task-relevant structure for a target family separate. We then instantiate the same mechanism in a synthetic sequence-model experiment: under the primary all-sample evaluation, the OOD accuracy ranking reverses the proxy ranking in two of three seeds, with auxiliary diagnostics and ablations supporting the same interpretation. The counterexample does not reject structure-based explanations in general; it identifies a boundary on strong proxy-based explanations. A proxy for total learned structure can fail to track the task-relevant structure that drives OOD performance, even in a controlled setting.

preprint, 2026, arXiv

FastUMAP: Scalable Dimensionality Reduction via Bipartite Landmark Sampling

Exploratory analysis of high-dimensional data rarely stops at a single embedding. In practice, analysts rerun dimensionality reduction after changing preprocessing, subsets, or hyperparameters, and standard nonlinear methods can quickly become the bottleneck. We introduce FastUMAP (Bipartite Manifold Approximation and Projection), a landmark-based method designed for this repeated-use setting. FastUMAP builds a sparse point-landmark fuzzy graph, computes a Nystrom spectral warm start from the induced landmark affinity, and then refines all sample coordinates with a UMAP-style objective on the bipartite graph. The landmark ratio r = m/n provides a direct way to trade runtime against fidelity. On 9 benchmark datasets spanning 178 to 70,000 samples, FastUMAP has the lowest runtime on 7 datasets in our reported default-implementation comparison on one workstation. On MNIST and Fashion-MNIST (n = 70,000), it runs in about 4.6 seconds, compared with about 73 to 75 seconds for Barnes-Hut t-SNE, while reaching 91.4% mean kNN accuracy versus 94.6% for the strongest accuracy baseline. FastUMAP is therefore best viewed as a fast option for repeated exploratory embedding, rather than as a replacement for accuracy-first methods.
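As a rough illustration of the first step described in this abstract, the sketch below builds a sparse point-landmark graph for a landmark ratio r = m/n. It is not the authors' implementation: the function name `build_point_landmark_graph`, the uniform random landmark choice, the neighbor count `k`, and the exponential weighting are all assumptions made for illustration, and the spectral warm start and UMAP-style refinement are omitted.

```python
import math
import random

def build_point_landmark_graph(points, ratio, k, seed=0):
    """Connect every point to its k nearest landmarks.

    Returns (landmark_idx, edges), where edges[i] is a list of
    (landmark_position, weight) pairs for point i.
    All names and choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(points)
    # m landmarks from the ratio r = m / n, with at least k and at most n.
    m = min(n, max(k, math.ceil(ratio * n)))
    landmark_idx = rng.sample(range(n), m)  # assumption: uniform sampling

    edges = []
    for p in points:
        # Distances from this point to every landmark, keep the k nearest.
        nearest = sorted(
            (math.dist(p, points[j]), pos) for pos, j in enumerate(landmark_idx)
        )[:k]
        rho = nearest[0][0]            # distance to the nearest landmark
        sigma = nearest[-1][0] or 1.0  # local scale; guard against zero
        # Assumed stand-in for a fuzzy membership weight in (0, 1].
        edges.append([(pos, math.exp(-(d - rho) / sigma)) for d, pos in nearest])
    return landmark_idx, edges
```

Each point stores only k edges, so the graph has n * k nonzero entries regardless of m; lowering the ratio reduces the number of landmark distances computed per point, which is one way to read the runtime-versus-fidelity trade-off the abstract mentions.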

preprint, 2026, arXiv

LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

Semi-supervised learning approaches have been investigated as a means to enhance the analysis of social media data in disaster management contexts. In this work, we present the first empirical evaluation of large language model (LLM) guided semi-supervised learning for crisis-related tweet classification. We compare two recent LLM-assisted semi-supervised methods, VerifyMatch and LLM-guided Co-Training (LG-CoTrain), against established semi-supervised baselines. Our results show that LG-CoTrain significantly outperforms classical semi-supervised approaches in low-resource settings with 5, 10 and 25 labeled examples per class, achieving the highest averaged Macro F1 across events. VerifyMatch achieves competitive performance while also demonstrating strong calibration properties. As the number of labeled examples increases, the performance gap narrows and Self Training emerges as a strong baseline. We further observe that compact semi-supervised models can, in some cases, outperform very large LLMs operating in zero-shot settings. This finding highlights the potential of transferring knowledge from LLMs into smaller and more deployable models through LLM-guided semi-supervised learning, offering a practical pathway for real-world disaster response applications. Our project repository on GitHub is here.

preprint, 2026, arXiv

Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model

Shortcut features are often invoked to explain out-of-distribution (OOD) failure, but training correlation, learned shortcut use, and test-time failure need not coincide. We study a minimal binary model with one invariant coordinate and one family-dependent shortcut coordinate. In the deterministic regime, positive average shortcut correlation pulls logistic ERM toward positive shortcut weight, but ridge regularization keeps the classifier invariant-dominated and prevents deterministic OOD failure. When the invariant coordinate is noisy, ridge-logistic ERM switches to the shortcut rule once the training shortcut signal exceeds the invariant signal. Whether that transition causes failure depends on the held-out family: weaker shortcut correlation yields positive excess risk, and sign-flipped families yield above-chance error. Synthetic checks match these analytic regimes and show that the same training-side transition can have different held-out consequences. The model separates shortcut attraction, shortcut-rule transition, and cross-family OOD failure.
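The noisy-invariant regime described in this abstract can be sketched with a small simulation. The encoding below is an assumption, not the paper's exact model: both coordinates are taken to be ±1 and agree with the label with a fixed probability, the function names are hypothetical, and ridge-logistic ERM is approximated by plain gradient descent with illustrative values for `lam`, `lr`, and `steps`.

```python
import math
import random

def sample_family(n, p_invariant, p_shortcut, rng):
    """Draw n examples ((x_inv, x_sc), y) with labels y in {-1, +1}.

    Each coordinate equals y with the given probability, else -y.
    p_shortcut is the family-dependent part (assumed encoding).
    """
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        x_inv = y if rng.random() < p_invariant else -y
        x_sc = y if rng.random() < p_shortcut else -y
        data.append(((x_inv, x_sc), y))
    return data

def fit_ridge_logistic(data, lam=0.1, lr=0.5, steps=300):
    """Minimize mean log(1 + exp(-y w.x)) + (lam / 2) |w|^2 by gradient descent."""
    w = [0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        grad = [lam * w[0], lam * w[1]]
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1])
            coeff = -y / (1.0 + math.exp(margin)) / n
            grad[0] += coeff * x[0]
            grad[1] += coeff * x[1]
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w

def error_rate(w, data):
    """Fraction of examples with a non-positive margin."""
    wrong = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1]) <= 0)
    return wrong / len(data)

rng = random.Random(0)
# Training family: noisy invariant (75% agreement), strong shortcut (95%).
train = sample_family(2000, 0.75, 0.95, rng)
w = fit_ridge_logistic(train)
same_family = sample_family(2000, 0.75, 0.95, rng)
flipped_family = sample_family(2000, 0.75, 0.05, rng)  # shortcut sign flipped
print("weights (invariant, shortcut):", w)
print("error, same family:", error_rate(w, same_family))
print("error, sign-flipped family:", error_rate(w, flipped_family))
```

Under these assumed settings the shortcut weight ends up larger than the invariant weight, so the classifier follows the shortcut coordinate: it does well on a family that shares the training correlation and errs above chance on a sign-flipped family. The same fitted weights are used for both evaluations, which mirrors the abstract's point that one training-side transition can have different held-out consequences.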

preprint, 2026, arXiv

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.

preprint, 2022, arXiv

Divide-and-conquer based Large-Scale Spectral Clustering

Spectral clustering is one of the most popular clustering methods. However, balancing the efficiency and effectiveness of large-scale spectral clustering under limited computing resources has long remained an open problem. In this paper, we propose a divide-and-conquer based large-scale spectral clustering method to strike a good balance between efficiency and effectiveness. In the proposed method, a divide-and-conquer based landmark selection algorithm and a novel approximate similarity matrix approach are designed to construct a sparse similarity matrix with low computational complexity. Clustering results can then be computed quickly through a bipartite graph partition process. The proposed method achieves a lower computational complexity than most existing large-scale spectral clustering methods. Experimental results on ten large-scale datasets demonstrate the efficiency and effectiveness of the proposed method. The MATLAB code of the proposed method and the experimental datasets are available at https://github.com/Li-Hongmin/MyPaperWithCode.

preprint, 2015, arXiv

Real-time Tracking Based on Neuromorphic Vision

Real-time tracking is an important problem in computer vision, and most methods are based on conventional cameras. Neuromorphic vision is a concept defined by incorporating neuromorphic vision sensors, such as silicon retinas, into a vision processing system. With the development of silicon technology, asynchronous event-based silicon retinas that mimic neuro-biological architectures have been developed in recent years. In this work, we combine a vision tracking algorithm from computer vision with the information encoding mechanism of event-based sensors, which is inspired by the neural rate coding mechanism. Real-time tracking of a single object at a high speed of 100 time bins per second is successfully realized. Our method demonstrates that computer vision methods can be used for neuromorphic vision processing, and that fast real-time tracking can be realized using neuromorphic vision sensors compared to a conventional camera.

preprint, 2013, arXiv

A new integrable discrete generalized nonlinear Schrodinger equation and its reductions

A new integrable discrete system is constructed and studied, based on the algebraization of the difference operator. The model is named the discrete generalized nonlinear Schrodinger (GNLS) equation, which can be reduced to the classical discrete nonlinear Schrodinger (NLS) equation. To show the complete integrability of the discrete GNLS equation, the recursion operator, symmetries and conserved quantities are obtained. Furthermore, all reductions of the discrete GNLS equation are given, and the discrete NLS equation is obtained by one of the reductions. At the same time, the recursion operator and symmetries of the continuous GNLS equation are successfully recovered from their corresponding discrete ones.

preprint, 2013, arXiv

Inexpensive hardware and software for photon statistics and correlation spectroscopy

Single-molecule sensitive microscopies and spectroscopies are transforming biophysics and materials science laboratories. Techniques such as fluorescence correlation spectroscopy (FCS) and single-molecule sensitive fluorescence resonance energy transfer (FRET) are now commonly available in research laboratories but are as yet infrequently available in teaching laboratories. We describe inexpensive electronics and open-source software that bridge this gap, making state-of-the-art research measurement capabilities accessible to undergraduates interested in biophysics. We include a pedagogical discussion of the intensity correlation function relevant to FCS and its calculation directly from photon arrival times. We demonstrate the system with a measurement of the hydrodynamic radius of a protein using FCS that is suitable for an undergraduate teaching laboratory. The FPGA-based electronics, which are easy to construct, are suitable for more advanced measurements as well, and several applications are demonstrated. As implemented, the system has 8 ns timing resolution, outputs to control up to four laser sources, and inputs for as many as four photon-counting detectors.
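For readers unfamiliar with the intensity correlation function mentioned in this abstract, a minimal sketch follows. It estimates the normalized correlation g(tau) = <I(t) I(t + tau)> / <I>^2 from a list of photon arrival times. Note the simplification: the abstract describes a calculation directly from arrival times, whereas this sketch first bins the arrivals into fixed-width time bins. The function names, the bin width, and the lag range are illustrative assumptions and are not taken from the paper's software.

```python
def bin_photons(arrival_times, bin_width, duration):
    """Count photons in consecutive bins of width bin_width over [0, duration)."""
    n_bins = int(duration // bin_width)
    counts = [0] * n_bins
    for t in arrival_times:
        idx = int(t // bin_width)
        if 0 <= idx < n_bins:  # ignore arrivals outside the binned window
            counts[idx] += 1
    return counts

def intensity_correlation(counts, max_lag):
    """Return [g(1), ..., g(max_lag)], with the lag measured in bins.

    g(lag) = mean(n_i * n_{i+lag}) / mean(n)^2, so uncorrelated light
    gives values near 1. The delay in time units is lag * bin_width.
    """
    n = len(counts)
    mean = sum(counts) / n
    g = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        product = sum(counts[i] * counts[i + lag] for i in range(pairs)) / pairs
        g.append(product / (mean * mean))
    return g
```

Lag zero is skipped because the product of a bin with itself includes shot noise, which would bias the estimate for a photon-counting signal. A typical call would be `intensity_correlation(bin_photons(times, 1e-6, 1.0), 100)`, with the arrival times and duration in seconds.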

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint