Researcher profile

Stevan Harnad

Stevan Harnad contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
11 works
0 followers
6 topics
4 close collaborators

Published work

11 published items

preprint - 2026 - arXiv

Tolerance Principle and Small Language Model Learning

Modern language models like GPT-3, BERT, and LLaMA require massive training data, yet with sufficient training they reliably learn to distinguish grammatical from ungrammatical sentences. Children as young as 14 months already have the capacity to learn abstract grammar rules from very few exemplars, even in the presence of non-rule-following exceptions. Yang's (2016) Tolerance Principle defines a precise threshold for how many exceptions a rule can tolerate and still be learnable. To test the predictions of the Tolerance Principle, the present study explored the minimal amount and quality of training data necessary for a transformer-based language model to generalize rules. We trained BabyBERTa (Huebner et al. 2021), a transformer model optimized for small datasets, on artificial grammars. The training sets varied in size, number of unique sentence types, and proportion of rule-following versus exception exemplars. We found that, unlike human infants, BabyBERTa's learning dynamics do not align with the Tolerance Principle.
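The threshold in Yang's Tolerance Principle is usually stated as N / ln N: a rule that applies to N items stays productive as long as its exceptions do not exceed that value. The abstract does not spell the formula out, so the following is only a minimal Python sketch of that commonly cited form; the function names are illustrative and do not come from the paper.

```python
import math


def tolerance_threshold(n: int) -> float:
    """Largest number of exceptions a rule over n items can tolerate: n / ln(n)."""
    if n < 2:
        # ln(1) is 0 and ln(n) is undefined for n < 1, so the ratio is not usable.
        raise ValueError("n must be at least 2")
    return n / math.log(n)


def rule_is_productive(n: int, exceptions: int) -> bool:
    """True if the exception count is within the tolerance threshold."""
    return exceptions <= tolerance_threshold(n)
```

For a rule over 100 items the threshold is roughly 21.7, so 21 exceptions are tolerated under this form and 22 are not.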

preprint - 2016 - arXiv

Estimating Open Access Mandate Effectiveness: The MELIBEA Score

MELIBEA is a Spanish database that uses a composite formula with eight weighted conditions to estimate the effectiveness of Open Access mandates (registered in ROARMAP). We analyzed 68 mandated institutions for publication years 2011-2013 to determine how well the MELIBEA score and its individual conditions predict what percentage of published articles indexed by Web of Knowledge is deposited in each institution's OA repository, and when. We found a small but significant positive correlation (0.18) between MELIBEA score and deposit percentage. We also found that for three of the eight MELIBEA conditions (deposit timing, internal use, and opt-outs), one value of each was strongly associated with deposit percentage or deposit latency (immediate deposit required, deposit required for performance evaluation, unconditional opt-out allowed for the OA requirement but no opt-out for the deposit requirement). When we updated the initial values and weights of the MELIBEA formula for mandate effectiveness to reflect the empirical association we had found, the score's predictive power doubled (0.36). There are not yet enough OA mandates to test further mandate conditions that might contribute to mandate effectiveness, but these findings already suggest that it would be useful for future mandates to adopt these three conditions so as to maximize their effectiveness, and thereby the growth of OA.
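The general shape of this analysis, a weighted composite score correlated against deposit percentage, can be sketched in Python. This is an illustration only: the actual MELIBEA conditions, values and weights are not given in the abstract, so the normalised weighted sum below and both function names are assumptions, and any inputs passed to them would be placeholders.

```python
import math
from typing import Dict, List


def composite_score(conditions: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of condition values, divided by the total weight.

    Assumption: the real MELIBEA formula may combine conditions differently.
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * conditions[name] for name in weights) / total_weight


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Re-weighting the score, as the abstract describes, would amount to changing the `weights` mapping and recomputing the correlation with the observed deposit percentages.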

preprint - 2016 - arXiv

The Latent Structure of Dictionaries

How many words (and which ones) are sufficient to define all other words? When dictionaries are analyzed as directed graphs with links from defining words to defined words, they reveal a latent structure. Recursively removing all words that are reachable by definition but that do not define any further words reduces the dictionary to a Kernel of about 10%. This is still not the smallest number of words that can define all the rest. About 75% of the Kernel turns out to be its Core, a Strongly Connected Subset of words with a definitional path to and from any pair of its words and no word's definition depending on a word outside the set. But the Core cannot define all the rest of the dictionary. The 25% of the Kernel surrounding the Core consists of small strongly connected subsets of words: the Satellites. The size of the smallest set of words that can define all the rest (the graph's Minimum Feedback Vertex Set or MinSet) is about 1% of the dictionary, 15% of the Kernel, and half-Core, half-Satellite. But every dictionary has a huge number of MinSets. The Core words are learned earlier, more frequent, and less concrete than the Satellites, which in turn are learned earlier and more frequent but more concrete than the rest of the Dictionary. In principle, only one MinSet's words would need to be grounded through the sensorimotor capacity to recognize and categorize their referents. In a dual-code sensorimotor-symbolic model of the mental lexicon, the symbolic code could do all the rest via re-combinatory definition.
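The recursive pruning that yields the Kernel can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary is assumed to be a mapping from each headword to the set of words used in its definition, and the function name `kernel` and the toy data in the usage note are hypothetical.

```python
from typing import Dict, Set


def kernel(definitions: Dict[str, Set[str]]) -> Set[str]:
    """Repeatedly drop words that are not used to define any remaining word."""
    remaining = set(definitions)
    while True:
        # Words that still appear in the definition of some remaining word.
        used: Set[str] = set()
        for word in remaining:
            used |= definitions[word] & remaining
        removable = remaining - used
        if not removable:
            return remaining
        remaining -= removable
```

On a toy dictionary such as `{"a": {"b"}, "b": {"a"}, "c": {"a", "b"}, "d": {"c"}}`, the word `d` is removed first, then `c`, leaving the mutually defining pair `a` and `b`.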

preprint - 2013 - arXiv

Hidden Structure and Function in the Lexicon

How many words are needed to define all the words in a dictionary? Graph-theoretic analysis reveals that about 10% of a dictionary is a unique Kernel of words that define one another and all the rest, but this is not the smallest such subset. The Kernel consists of one huge strongly connected component (SCC), about half its size, the Core, surrounded by many small SCCs, the Satellites. Core words can define one another but not the rest of the dictionary. The Kernel also contains many overlapping Minimal Grounding Sets (MGSs), each about the same size as the Core, each part-Core, part-Satellite. MGS words can define all the rest of the dictionary. They are learned earlier, more concrete and more frequent than the rest of the dictionary. Satellite words, not correlated with age or frequency, are less concrete (more abstract) words that are also needed for full lexical power.
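The Core/Satellite split described here rests on strongly connected components. The following Python sketch uses Kosaraju's algorithm, chosen for illustration; the abstract does not say which algorithm the authors used. The graph is assumed to map each defining word to the words it helps define, and treating the largest component as the Core is a simplification of the description above.

```python
from typing import Dict, List, Set


def strongly_connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Kosaraju's algorithm; graph maps each node to the nodes it points to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes |= targets

    # Pass 1: iterative depth-first search, recording nodes as they finish.
    order: List[str] = []
    visited: Set[str] = set()
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(sorted(graph.get(start, ()))))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse: Dict[str, Set[str]] = {n: set() for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].add(src)

    # Pass 2: search the reversed graph in decreasing finish order.
    components: List[Set[str]] = []
    assigned: Set[str] = set()
    for root in reversed(order):
        if root in assigned:
            continue
        component: Set[str] = set()
        assigned.add(root)
        stack2 = [root]
        while stack2:
            node = stack2.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack2.append(prev)
        components.append(component)
    return components


def core(graph: Dict[str, Set[str]]) -> Set[str]:
    """Largest strongly connected component (a stand-in for the Core)."""
    return max(strongly_connected_components(graph), key=len)
```

In a toy graph with a three-word cycle feeding a two-word cycle, `core` returns the three-word cycle and the two-word cycle plays the role of a Satellite.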

preprint - 2012 - arXiv

Alan Turing and the "Hard" and "Easy" Problem of Cognition: Doing and Feeling

The "easy" problem of cognitive science is explaining how and why we can do what we can do. The "hard" problem is explaining how and why we feel. Turing's methodology for cognitive science (the Turing Test) is based on doing: Design a model that can do anything a human can do, indistinguishably from a human, to a human, and you have explained cognition. Searle has shown that the successful model cannot be solely computational. Sensory-motor robotic capacities are necessary to ground some, at least, of the model's words, in what the robot can do with the things in the world that the words are about. But even grounding is not enough to guarantee that -- nor to explain how and why -- the model feels (if it does). That problem is much harder to solve (and perhaps insoluble).

preprint - 2012 - arXiv

Green and Gold Open Access Percentages and Growth, by Discipline

Most refereed journal articles today are published in subscription journals, accessible only to subscribing institutions, hence losing considerable research impact. Making articles freely accessible online ("Open Access," OA) maximizes their impact. Articles can be made OA in two ways: by self-archiving them on the web ("Green OA") or by publishing them in OA journals ("Gold OA"). We compared the percentage and growth rate of Green and Gold OA for 14 disciplines in two random samples of 1300 articles per discipline out of the 12,500 journals indexed by Thomson-Reuters-ISI using a robot that trawled the web for OA full-texts. We sampled in 2009 and 2011 for publication year ranges 1998-2006 and 2005-2010, respectively. Green OA (21.4%) exceeds Gold OA (2.4%) in proportion and growth rate in all but the biomedical disciplines, probably because it can be provided for all journal articles and does not require paying extra Gold OA publication fees. The spontaneous overall OA growth rate is still very slow (about 1% per year). If institutions make Green OA self-archiving mandatory, however, this triples the percentage of Green OA and accelerates its growth rate.

preprint - 2012 - arXiv

Testing the Finch Hypothesis on Green OA Mandate Ineffectiveness

We have now tested the Finch Committee's Hypothesis that Green Open Access Mandates are ineffective in generating deposits in institutional repositories. With data from ROARMAP on institutional Green OA mandates and data from ROAR on institutional repositories, we show that deposit number and rate are significantly correlated with mandate strength (classified as 1-12): The stronger the mandate, the more the deposits. The strongest mandates generate deposit rates of 70%+ within 2 years of adoption, compared to the un-mandated deposit rate of 20%. The effect is already detectable at the national level, where the UK, which has the largest proportion of Green OA mandates, has a national OA rate of 35%, compared to the global baseline of 25%. The conclusion is that, contrary to the Finch Hypothesis, Green Open Access Mandates do have a major effect, and the stronger the mandate, the stronger the effect (the Liege ID/OA mandate, linked to research performance evaluation, being the strongest mandate model). RCUK (as well as all universities, research institutions and research funders worldwide) would be well advised to adopt the strongest Green OA mandates and to integrate institutional and funder mandates.

preprint - 2010 - arXiv

Open Access Mandates and the "Fair Dealing" Button

We describe the "Fair Dealing Button," a feature designed for authors who have deposited their papers in an Open Access Institutional Repository but have deposited them as "Closed Access" (meaning only the metadata are visible and retrievable, not the full eprint) rather than Open Access. The Button allows individual users to request and authors to provide a single eprint via semi-automated email. The purpose of the Button is to tide over research usage needs during any publisher embargo on Open Access and, more importantly, to make it possible for institutions to adopt the "Immediate-Deposit/Optional-Access" Mandate, without exceptions or opt-outs, instead of a mandate that allows delayed deposit or deposit waivers, depending on publisher permissions or embargoes (or no mandate at all). This is only "Almost-Open Access," but in facilitating exception-free immediate-deposit mandates it will accelerate the advent of universal Open Access.

preprint - 2010 - arXiv

Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research

Articles whose authors make them Open Access (OA) by self-archiving them online are cited significantly more than articles accessible only to subscribers. Some have suggested that this "OA Advantage" may not be causal but just a self-selection bias, because authors preferentially make higher-quality articles OA. To test this we compared self-selective self-archiving with mandatory self-archiving for a sample of 27,197 articles published 2002-2006 in 1,984 journals. The OA Advantage proved just as high for both. Logistic regression showed that the advantage is independent of other correlates of citations (article age; journal impact factor; number of co-authors, references or pages; field; article type; or country) and greatest for the most highly cited articles. The OA Advantage is real, independent and causal, but skewed. Its size is indeed correlated with quality, just as citations themselves are (the top 20% of articles receive about 80% of all citations). The advantage is greater for the more citeable articles, not because of a quality bias from authors self-selecting what to make OA, but because of a quality advantage, from users self-selecting what to use and cite, freed by OA from the constraints of selective accessibility to subscribers only.

preprint - 1999 - arXiv

The Symbol Grounding Problem

How can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our heads? How can the meanings of the meaningless symbol tokens, manipulated solely on the basis of their (arbitrary) shapes, be grounded in anything but other meaningless symbols? The problem is analogous to trying to learn Chinese from a Chinese/Chinese dictionary alone. A candidate solution is sketched: Symbolic representations must be grounded bottom-up in nonsymbolic representations of two kinds: (1) "iconic representations," which are analogs of the proximal sensory projections of distal objects and events, and (2) "categorical representations," which are learned and innate feature-detectors that pick out the invariant features of object and event categories from their sensory projections. Elementary symbols are the names of these object and event categories, assigned on the basis of their (nonsymbolic) categorical representations. Higher-order (3) "symbolic representations," grounded in these elementary symbols, consist of symbol strings describing category membership relations (e.g., "An X is a Y that is Z").