Researcher profile

Satoshi Tsutsui

Satoshi Tsutsui contributes to research discovery and scholarly infrastructure.

Researcher, affiliation not imported, open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging, Verification L1, Unclaimed author
9 works
0 followers
8 topics
4 close collaborators

Published work

9 published items

Preprint, 2026, arXiv

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at https://doi.org/10.57967/hf/8143.

Preprint, 2022, arXiv

Action Recognition based on Cross-Situational Action-object Statistics

Machine learning models of visual action recognition are typically trained and tested on data from specific situations where actions are associated with certain objects. It is an open question how action-object associations in the training set influence a model's ability to generalize beyond trained situations. We set out to identify properties of training data that lead to action recognition models with greater generalization ability. To do this, we take inspiration from a cognitive mechanism called cross-situational learning, which states that human learners extract the meaning of concepts by observing instances of the same concept across different situations. We perform controlled experiments with various types of action-object associations, and identify key properties of action-object co-occurrence in training data that lead to better classifiers. Given that these properties are missing in the datasets that are typically used to train action classifiers in the computer vision literature, our work provides useful insights on how best to construct datasets for efficient training and better generalization.

Preprint, 2022, arXiv

AVA-AVD: Audio-Visual Speaker Diarization in the Wild

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD to the training set can produce significantly better diarization models for in-the-wild videos even though the dataset is relatively small. Moreover, this benchmark is challenging due to the diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing the challenges, we design the Audio-Visual Relation Network (AVR-Net), which introduces a simple yet effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Our data and code have been made publicly available at https://github.com/showlab/AVA-AVD.

Preprint, 2022, arXiv

Bayesian Inference on Hamiltonian Selections for Mössbauer Spectroscopy

Mössbauer spectroscopy, which provides knowledge related to electronic states in materials, has been applied to various fields such as condensed matter physics and materials science. In conventional spectral analyses based on least-squares fitting, hyperfine interactions in materials have been determined from the shape of observed spectra. In such analyses, it is difficult to discuss the validity of the hyperfine interactions and the estimated values. We propose a spectral analysis method based on Bayesian inference for the selection of hyperfine interactions and the estimation of Mössbauer parameters. An appropriate Hamiltonian has been selected by comparing Bayesian free energy among possible Hamiltonians. We have estimated the Mössbauer parameters and evaluated their estimated values by calculating the posterior distribution of each Mössbauer parameter with confidence intervals. We have also discussed the accuracy of the spectral analyses to elucidate the noise intensity dependence of numerical experiments.

Preprint, 2022, arXiv

LO-mode phonon of KCl and NaCl at 300 K by inelastic X ray scattering measurements and first principles calculations

Longitudinal-optical (LO) mode phonon branches of KCl and NaCl were measured using inelastic X-ray scattering (IXS) at 300 K and calculated by the first-principles phonon calculation with the stochastic self-consistent harmonic approximation. Spectral shapes of the IXS measurements and calculated spectral functions agreed well. We analyzed the calculated spectral functions that provide higher resolutions of the spectra than the IXS measurements. Due to strong anharmonicity, the spectral functions of these phonon branches have several peaks and the LO modes along $Γ$--L paths are disconnected.

Preprint, 2022, arXiv

Novel View Synthesis for High-fidelity Headshot Scenes

Rendering scenes with a high-quality human face from arbitrary viewpoints is a practical and useful technique for many real-world applications. Recently, Neural Radiance Fields (NeRF), a rendering technique that uses neural networks to approximate classical ray tracing, has been considered one of the promising approaches for synthesizing novel views from a sparse set of images. We find that NeRF can render new views while maintaining geometric consistency, but it does not properly maintain skin details, such as moles and pores. These details are particularly important for faces because when we look at an image of a face, we are much more sensitive to details than when we look at other objects. On the other hand, 3D Morphable Models (3DMMs) based on traditional meshes and textures can perform well in terms of skin detail, although they have less precise geometry and cannot cover the head and the entire scene with background. Based on these observations, we propose a method to use both NeRF and 3DMM to synthesize a high-fidelity novel view of a scene with a face. Our method learns a Generative Adversarial Network (GAN) to mix a NeRF-synthesized image and a 3DMM-rendered image and produces a photorealistic scene with a face preserving the skin details. Experiments with various real-world scenes demonstrate the effectiveness of our approach. The code will be available at https://github.com/showlab/headshot.

Preprint, 2022, arXiv

Reinforcing Generated Images via Meta-learning for One-Shot Fine-Grained Visual Recognition

One-shot fine-grained visual recognition often suffers from the problem of having few training examples for new fine-grained classes. To alleviate this problem, off-the-shelf image generation techniques based on Generative Adversarial Networks (GANs) can potentially create additional training images. However, these GAN-generated images are often not helpful for actually improving the accuracy of one-shot fine-grained recognition. In this paper, we propose a meta-learning framework to combine generated images with original images, so that the resulting "hybrid" training images improve one-shot learning. Specifically, the generic image generator is updated by a few training instances of novel classes, and a Meta Image Reinforcing Network (MetaIRNet) is proposed to conduct one-shot fine-grained recognition as well as image reinforcement. Our experiments demonstrate consistent improvement over baselines on one-shot fine-grained image classification benchmarks. Furthermore, our analysis shows that the reinforced images have more diversity compared to the original and GAN-generated images.

Preprint, 2020, arXiv

A Computational Model of Early Word Learning from the Infant's Point of View

Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the abilities of the models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the perspective of the learner's own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how visual, perceptual, and attentional properties of infants' sensory experiences may affect word learning.

Preprint, 2019, arXiv

Phonon anomalies with doping in superconducting oxychlorides Ca2-xCuO2Cl2

We measure the dispersion of the Cu-O bond-stretching phonon mode in the high-temperature superconducting parent compound Ca$_2$CuO$_2$Cl$_2$. Our density functional theory calculations predict a cosine-shaped bending of the dispersion along both the ($ξ$00) and ($ξξ$0) directions, while comparison with previous results on Ca$_{1.84}$CuO$_2$Cl$_2$ shows it only along ($ξ$00), suggesting an anisotropic effect which is not reproduced in calculation at optimal doping. Comparison with isostructural La$_{2-x}$Sr$_x$CuO$_4$ suggests that these calculations reproduce the overdoped regime well but overestimate the doping effect on the Cu-O bond-stretching mode at optimal doping.