Source author record

Shashwat Goel

Shashwat Goel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Computation and Language; Machine Learning; Artificial Intelligence; astro-ph.EP; astro-ph.IM

Catalog footprint

What is connected

3 works

5 topics

4 close collaborators

Preprint, 2026, arXiv

FutureSim: Replaying World Events to Evaluate Adaptive Agents

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
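
To make the comparison with "making no prediction at all" concrete, here is a minimal sketch of the textbook Brier skill score for binary questions. The abstract does not state how FutureSim scores its questions, so the function names, the binary setup, and the constant 0.5 reference forecast are all assumptions chosen for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumed uninformative baseline forecast
) -> float:
    """Return 1 - BS / BS_ref; positive beats the reference, negative is worse."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0.0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return 1.0 - bs / bs_ref


# Hypothetical outcomes and forecasts, not data from the paper.
outcomes = [1, 0, 1, 1]
well_calibrated = [0.9, 0.1, 0.8, 0.7]
overconfident_wrong = [0.1, 0.9, 0.2, 0.3]

good_bss = brier_skill_score(well_calibrated, outcomes)
bad_bss = brier_skill_score(overconfident_wrong, outcomes)
print(good_bss, bad_bss)
```

Under this definition a forecaster that is confidently wrong gets a negative score, which is the sense in which an agent can do worse than an uninformative baseline.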

Preprint, 2026, arXiv

Scaling Open-Ended Reasoning to Predict the Future

High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing between May and August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

Preprint, 2020, arXiv

A Search for Technosignatures Around 31 Sun-like Stars with the Green Bank Telescope at 1.15-1.73 GHz

We conducted a search for technosignatures in April of 2018 and 2019 with the L-band receiver (1.15-1.73 GHz) of the 100 m diameter Green Bank Telescope. These observations focused on regions surrounding 31 Sun-like stars near the plane of the Galaxy. We present the results of our search for narrowband signals in this data set as well as improvements to our data processing pipeline. Specifically, we applied an improved candidate signal detection procedure that relies on the topographic prominence of the signal power, which nearly doubles the signal detection count of some previously analyzed data sets. We also improved the direction-of-origin filters that remove most radio frequency interference (RFI) to ensure that they uniquely link signals observed in separate scans. We performed a preliminary signal injection and recovery analysis to test the performance of our pipeline. We found that our pipeline recovers 93% of the injected signals over the usable frequency range of the receiver and 98% if we exclude regions with dense RFI. In this analysis, 99.73% of the recovered signals were correctly classified as technosignature candidates. Our improved data processing pipeline classified over 99.84% of the ~26 million signals detected in our data as RFI. Of the remaining candidates, 4539 were detected outside of known RFI frequency regions. The remaining candidates were visually inspected and verified to be of anthropogenic nature. Our search compares favorably to other recent searches in terms of end-to-end sensitivity, frequency drift rate coverage, and signal detection count per unit bandwidth per unit integration time.
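
The idea of detecting candidates by the topographic prominence of signal power can be sketched as follows. This is a generic illustration, not the authors' pipeline: the function names, the threshold, and the toy power values are hypothetical, and prominence is taken in its usual sense, the height of a peak above the higher of the two lowest points that separate it from taller neighbors (or from the edge of the data) on either side.

```python
from typing import Dict, List, Sequence


def peak_prominences(power: Sequence[float]) -> Dict[int, float]:
    """Map the index of each strict local maximum to its prominence."""
    prominences: Dict[int, float] = {}
    for i in range(1, len(power) - 1):
        if not (power[i] > power[i - 1] and power[i] > power[i + 1]):
            continue  # not a strict local maximum
        # Walk left until a higher sample or the edge, tracking the minimum.
        left_min = power[i]
        j = i - 1
        while j >= 0 and power[j] <= power[i]:
            left_min = min(left_min, power[j])
            j -= 1
        # Walk right in the same way.
        right_min = power[i]
        j = i + 1
        while j < len(power) and power[j] <= power[i]:
            right_min = min(right_min, power[j])
            j += 1
        # The higher of the two minima is the base of the peak.
        prominences[i] = power[i] - max(left_min, right_min)
    return prominences


def detect_candidates(power: Sequence[float], min_prominence: float) -> List[int]:
    """Return indices of peaks whose prominence meets the threshold."""
    return [i for i, p in peak_prominences(power).items() if p >= min_prominence]


# Toy power spectrum (hypothetical values, arbitrary units).
spectrum = [0.0, 1.0, 0.0, 5.0, 2.0, 3.0, 0.0]
prominence_by_index = peak_prominences(spectrum)
candidates = detect_candidates(spectrum, min_prominence=2.0)
print(prominence_by_index, candidates)
```

In the toy spectrum the peak of height 3 next to the peak of height 5 has a prominence of only 1, so a prominence threshold rejects it even though its raw power is well above the noise floor.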