Source author record

Shantanu Jain

Shantanu Jain appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: astro-ph.SR, Machine Learning, Artificial Intelligence, Computation and Language, Quantitative Methods

Catalog footprint

What is connected

5 works

5 topics

4 close collaborators

Preprint, 2022, arXiv

Maximal growth rate of the ascending phase of a sunspot cycle for predicting its amplitude

Forecasting the solar cycle amplitude is important for a better understanding of the solar dynamo as well as for many space weather applications. We demonstrated a steady relationship between the maximal growth rate of sunspot activity in the ascending phase of a cycle and the subsequent cycle amplitude on the basis of four data sets of solar activity indices: total sunspot numbers, hemispheric sunspot numbers from the new catalogue from 1874 onwards, total sunspot areas, and hemispheric sunspot areas. For all the data sets, a linear regression based on the maximal growth rate precursor shows a significant correlation. Validation of predictions for cycles 1-24 shows high correlations between the true and predicted cycle amplitudes reaching r = 0.93 for the total sunspot numbers. The lead time of the predictions varies from 2 to 49 months, with a mean value of 21 months. Furthermore, we demonstrated that the sum of maximal growth rate indicators determined separately for the north and the south hemispheric sunspot numbers provides more accurate predictions than that using total sunspot numbers. The advantages reach 27% and 11% on average in terms of rms and correlation coefficient, respectively. The superior performance is also confirmed with hemispheric sunspot areas with respect to total sunspot areas. The maximal growth rate of sunspot activity in the ascending phase of a solar cycle serves as a reliable precursor of the subsequent cycle amplitude. Furthermore, our findings provide a strong foundation for supporting regular monitoring, recording, and predictions of solar activity with hemispheric sunspot data, which capture the asymmetric behaviour of the solar activity and solar magnetic field and enhance solar cycle prediction methods.
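The precursor relationship in this abstract can be illustrated with a minimal Python sketch. It assumes the maximal growth rate is read as the largest month-to-month increase of a smoothed activity series, which is only one simple interpretation of the abstract. The function names and all numbers are hypothetical placeholders, not data or code from the paper.

```python
def max_growth_rate(series):
    """Largest increase between consecutive samples of a (smoothed) series."""
    return max(later - earlier for earlier, later in zip(series, series[1:]))


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder values only: growth rates and amplitudes of four imaginary cycles.
growth_rates = [2.0, 4.0, 6.0, 8.0]
amplitudes = [70.0, 110.0, 150.0, 190.0]
slope, intercept = fit_line(growth_rates, amplitudes)

# Hemispheric variant named in the abstract: sum the two hemispheric indicators.
north = [5.0, 9.0, 16.0, 20.0]   # made-up ascending-phase samples
south = [4.0, 6.0, 11.0, 19.0]   # made-up ascending-phase samples
hemispheric_indicator = max_growth_rate(north) + max_growth_rate(south)
```

A predicted amplitude for a new cycle would then be `slope * indicator + intercept`, with the fit redone separately for each data set.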

Preprint, 2022, arXiv

WebGPT: Browser-assisted question-answering with human feedback

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
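The rejection sampling step mentioned in this abstract can be sketched as best-of-n selection: draw several candidate answers and keep the one the reward model scores highest. The sketch below is an assumption-laden illustration, not the authors' implementation. `sample_answer` and `reward` are hypothetical callables standing in for the fine-tuned policy and the learned reward model.

```python
from typing import Callable


def best_of_n(
    question: str,
    sample_answer: Callable[[str], str],   # hypothetical stand-in for the policy
    reward: Callable[[str, str], float],   # hypothetical stand-in for the reward model
    n: int = 4,
) -> str:
    """Sample n candidate answers and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))


# Toy stubs so the sketch runs: a fixed list of answers and a length-based score.
_toy_answers = iter(["short", "a much longer answer", "mid length"])
chosen = best_of_n(
    "Why is the sky blue?",
    sample_answer=lambda q: next(_toy_answers),
    reward=lambda q, a: float(len(a)),
    n=3,
)
```

The selection needs no further gradient updates: the policy is only sampled, and extra inference compute is traded for answer quality.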

Preprint, 2021, arXiv

Hemispheric sunspot numbers 1874–2020

We create a continuous series of daily and monthly hemispheric sunspot numbers (HSNs) from 1874 to 2020, which will be continuously expanded in the future with the HSNs provided by SILSO. Based on the available daily measurements of hemispheric sunspot areas from 1874 to 2016 from Greenwich Royal Observatory and NOAA, we derive the relative fractions of the northern and southern activity. These fractions are applied to the international sunspot number (ISN) to derive the HSNs. This method and obtained data are validated against published HSNs for the period 1945–2020. We provide a continuous data series and catalogue of daily, monthly mean, and 13-month smoothed monthly mean HSNs for the time range 1874–2020 that are consistent with the newly calibrated ISN. Validation of the reconstructed HSNs against the direct data available since 1945 reveals a high level of consistency, with a correlation of r = 0.94 (0.97) for the daily (monthly) data. The cumulative hemispheric asymmetries for cycles 12-24 give a mean value of 16%, with no obvious pattern in north-south predominance over the cycle evolution. The strongest asymmetry occurs for cycle no. 19, in which the northern hemisphere shows a cumulated predominance of 42%. The phase shift between the peaks of solar activity in the two hemispheres may be up to 28 months, with a mean absolute value of 16.4 months. The phase shifts reveal an overall asymmetry of the northern hemisphere reaching its cycle maximum earlier (in 10 out of 13 cases). Relating the ISN and HSN peak growth rates during the cycle rise phase with the cycle amplitude reveals higher correlations when considering the two hemispheres individually, with r = 0.9. Our findings demonstrate that empirical solar cycle prediction methods can be improved by investigating the solar cycle dynamics in terms of the hemispheric sunspot numbers.
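The reconstruction step in this abstract, applying hemispheric area fractions to the ISN, amounts to a proportional split. The sketch below shows the arithmetic for a single day. The function name and the handling of days with zero total area (an even split) are assumptions made for illustration; the abstract does not say how such days are treated.

```python
def hemispheric_numbers(isn, area_north, area_south):
    """Split a total sunspot number by the north/south sunspot-area fractions."""
    total_area = area_north + area_south
    if total_area == 0:
        # Assumption for this sketch only: with no area information,
        # split the total evenly between the hemispheres.
        north_fraction = 0.5
    else:
        north_fraction = area_north / total_area
    hsn_north = isn * north_fraction
    hsn_south = isn - hsn_north  # guarantees the two parts sum to the ISN
    return hsn_north, hsn_south


# Placeholder values, not observations: ISN of 100 with a 30:10 area ratio.
north, south = hemispheric_numbers(100.0, 30.0, 10.0)
```

Computing the southern value as a remainder keeps the two hemispheric numbers consistent with the total by construction.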

Preprint, 2020, arXiv

New mixture models for decoy-free false discovery rate estimation in mass-spectrometry proteomics

Motivation: Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra.

Results: We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs, and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms.
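To make the decoy-free idea concrete, the sketch below computes the FDR at a score threshold from an already-fitted two-component mixture, i.e. the baseline DFA setting described under Motivation. It deliberately swaps in Gaussian components for the skew normal family used in the paper and skips the EM fitting entirely. All parameter names and values are hypothetical.

```python
import math


def normal_sf(t, mu, sigma):
    """Survival function P(score > t) of a normal distribution."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))


def mixture_fdr(threshold, pi_correct, mu_c, sigma_c, mu_i, sigma_i):
    """FDR among PSMs scoring above `threshold` under a two-component mixture.

    pi_correct is the mixing proportion of correct PSMs; (mu_c, sigma_c) and
    (mu_i, sigma_i) describe the correct and incorrect score distributions.
    Gaussian components are a simplification chosen for this sketch.
    """
    accepted_correct = pi_correct * normal_sf(threshold, mu_c, sigma_c)
    accepted_incorrect = (1.0 - pi_correct) * normal_sf(threshold, mu_i, sigma_i)
    accepted = accepted_correct + accepted_incorrect
    return accepted_incorrect / accepted if accepted > 0 else 0.0


# Placeholder parameters: well-separated correct and incorrect score populations.
fdr_at_3 = mixture_fdr(3.0, pi_correct=0.5, mu_c=5.0, sigma_c=1.0, mu_i=0.0, sigma_i=1.0)
```

With skew normal components the same ratio applies; only the survival functions change.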

Preprint, 2016, arXiv

Nonparametric semi-supervised learning of class proportions

The problem of developing binary classifiers from positive and unlabeled data is often encountered in machine learning. A common requirement in this setting is to approximate posterior probabilities of positive and negative classes for a previously unseen data point. This problem can be decomposed into two steps: (i) the development of accurate predictors that discriminate between positive and unlabeled data, and (ii) the accurate estimation of the prior probabilities of positive and negative examples. In this work we primarily focus on the latter subproblem. We study nonparametric class prior estimation and formulate this problem as an estimation of mixing proportions in two-component mixture models, given a sample from one of the components and another sample from the mixture itself. We show that estimation of mixing proportions is generally ill-defined and propose a canonical form to obtain identifiability while maintaining the flexibility to model any distribution. We use insights from this theory to elucidate the optimization surface of the class priors and propose an algorithm for estimating them. To address the problems of high-dimensional density estimation, we provide practical transformations to low-dimensional spaces that preserve class priors. Finally, we demonstrate the efficacy of our method on univariate and multivariate data.
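The two-step decomposition in this abstract can be illustrated by showing how the outputs of steps (i) and (ii) combine into a posterior. The sketch assumes a training set that mixes a labeled positive sample (fraction `c`) with an unlabeled sample, and a classifier output `tau` estimating the probability that a point came from the labeled sample. The symbols `tau`, `alpha`, and `c` are notation chosen here, and the formula follows from Bayes' rule under that sampling assumption rather than being quoted from the paper.

```python
def positive_posterior(tau, alpha, c):
    """Posterior P(positive | x) in the unlabeled population.

    tau   : P(labeled | x) from a positive-vs-unlabeled classifier, step (i)
    alpha : class prior of positives in the unlabeled data, step (ii)
    c     : fraction of training points that are labeled positives

    Derivation under the stated assumption, with f1 the positive density and
    f the unlabeled (mixture) density:
        tau / (1 - tau) = c * f1(x) / ((1 - c) * f(x))
        P(positive | x) = alpha * f1(x) / f(x)
                        = alpha * (1 - c) / c * tau / (1 - tau)
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie in (0, 1)")
    density_ratio = (1.0 - c) / c * tau / (1.0 - tau)
    # Clip because estimation error in tau or alpha can push the value above 1.
    return min(1.0, max(0.0, alpha * density_ratio))


# Placeholder inputs: balanced training sample, prior 0.4, undecided classifier.
posterior = positive_posterior(tau=0.5, alpha=0.4, c=0.5)
```

The formula shows why the class prior matters: an error in `alpha` rescales every posterior by the same factor, regardless of how well the classifier separates the two samples.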