Researcher profile

Jianjun Hu

Jianjun Hu contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
17 works
0 followers
10 topics
4 close collaborators

Published work

17 published items

preprint | 2025 | arXiv

In context learning Foundation models for Materials Property Prediction with Small datasets

Foundation models (FMs) have recently shown remarkable in-context learning (ICL) capabilities across diverse scientific domains. In this work, we introduce a unified in-context learning foundation model (ICL-FM) framework for materials property prediction that integrates both composition-based and structure-aware representations. The proposed approach couples the pretrained TabPFN transformer with graph neural network (GNN)-derived embeddings and our novel MagpieEX descriptors. MagpieEX augments traditional features with cation-anion interaction data to explicitly measure bond ionicity and charge-transfer asymmetry, capturing interatomic bonding characteristics that influence vibrational and thermal transport properties. Comprehensive experiments on the MatBench benchmark suite and a standalone lattice thermal conductivity (LTC) dataset demonstrate that ICL-FM achieves competitive or superior performance to state-of-the-art (SOTA) models with significantly reduced training costs. Remarkably, the training-free ICL-FM outperformed sophisticated SOTA GNN models in five out of six representative composition-based tasks, including a significant 9.93% improvement in phonon frequency prediction. On the LTC dataset, the FM effectively models complex phenomena such as phonon-phonon scattering and atomic mass contrast. t-SNE analysis reveals that the FM acts as a physics-aware feature refiner, transforming raw, disjoint feature clusters into continuous manifolds with gradual property transitions. This restructured latent space enhances interpolative prediction accuracy while aligning learned representations with underlying physical laws. This study establishes ICL-FM as a generalizable, data-efficient paradigm for materials informatics.

preprint | 2023 | arXiv

Discovery of 2D materials using Transformer Network based Generative Design

Two-dimensional (2D) materials have wide applications in superconductors, quantum, and topological materials. However, their rational design is not well established, and currently fewer than 6,000 experimentally synthesized 2D materials have been reported. Recently, deep learning, data-mining, and density functional theory (DFT)-based high-throughput calculations have been widely used to discover potential new materials for diverse applications. Here we propose a generative material design pipeline, namely material transformer generator (MTG), for large-scale discovery of hypothetical 2D materials. We train two 2D materials composition generators using self-learning neural language models based on Transformers with and without transfer learning. The models are then used to generate a large number of candidate 2D compositions, which are fed to known 2D materials templates for crystal structure prediction. Next, we performed DFT computations to study their thermodynamic stability based on energy-above-hull and formation energy. We report four new DFT-verified stable 2D materials with zero energy above hull, including NiCl$_4$, IrSBr, CuBr$_3$, and CoBrCl. Our work thus demonstrates the potential of our MTG generative materials design pipeline in the discovery of novel 2D materials and other functional materials.

preprint | 2022 | arXiv

Crystal Transformer: Self-learning neural language model for Generative and Tinkering Design of Materials

Self-supervised neural language models have recently achieved unprecedented success, from natural language processing to learning the languages of biological sequences and organic molecules. These models have demonstrated superior performance in the generation, structure classification, and functional predictions for proteins and molecules with learned representations. However, most of the masking-based pre-trained language models are not designed for generative design, and their black-box nature makes it difficult to interpret their design logic. Here we propose BLMM Crystal Transformer, a neural network based probabilistic generative model for generative and tinkering design of inorganic materials. Our model is built on the blank filling language model for text generation and has demonstrated unique advantages in learning the "materials grammars" together with high-quality generation, interpretability, and data efficiency. It can generate chemically valid materials compositions with as high as 89.7% charge neutrality and 84.8% balanced electronegativity, which are more than 4 and 8 times higher than a pseudo-random sampling baseline. The probabilistic generation process of BLMM allows it to recommend tinkering operations based on learned materials chemistry and makes it useful for materials doping. Combined with the TCSP crystal structure prediction algorithm, we have applied our model to discover a set of new materials as validated using DFT calculations. Our work thus brings the unsupervised transformer language models based generative artificial intelligence to inorganic materials. A user-friendly web app has been developed for computational materials doping and can be accessed freely at www.materialsatlas.org/blmtinker.
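
The charge-neutrality figure quoted above is a validity rate over generated compositions. The sketch below shows one common way such a check can be done; it is not taken from the paper. The `OXIDATION_STATES` table is a small illustrative subset, and the rule that each element takes a single oxidation state (no mixed valence) is a simplifying assumption.

```python
from itertools import product

# Illustrative subset of common oxidation states (assumption, not the paper's table).
OXIDATION_STATES = {
    "Na": [1],
    "Sr": [2],
    "Fe": [2, 3],
    "Ti": [2, 3, 4],
    "O": [-2],
    "Cl": [-1],
}

def is_charge_neutral(composition):
    """Return True if one oxidation state per element can sum to zero charge.

    `composition` maps element symbols to atom counts, e.g. {"Fe": 2, "O": 3}.
    """
    elements = list(composition)
    choices = [OXIDATION_STATES[el] for el in elements]
    for states in product(*choices):
        total = sum(state * composition[el] for state, el in zip(states, elements))
        if total == 0:
            return True
    return False

def neutral_fraction(compositions):
    """Fraction of compositions that pass the charge-neutrality check."""
    if not compositions:
        return 0.0
    return sum(is_charge_neutral(c) for c in compositions) / len(compositions)
```

For example, `is_charge_neutral({"Fe": 2, "O": 3})` passes because Fe can take the +3 state, while `{"Na": 1, "Cl": 2}` has no neutral assignment in this table.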

preprint | 2022 | arXiv

DeepXRD, a Deep Learning Model for Predicting of XRD spectrum from Materials Composition

One of the long-standing problems in materials science is how to predict a material's structure and then its properties given only its composition. Experimental characterization of crystal structures has been widely used for structure determination, which is however too expensive for high-throughput screening. At the same time, directly predicting crystal structures from compositions remains a challenging unsolved problem. Herein we propose a deep learning algorithm for predicting the XRD spectrum given only the composition of a material, which can then be used to infer key structural features for downstream structural analysis such as crystal system or space group classification or crystal lattice parameter determination or materials property predictions. Benchmark studies on two datasets show that our DeepXRD algorithm can achieve good performance for XRD prediction as evaluated over our test sets. It can thus be used in high-throughput screening in the huge materials composition space for new materials discovery.

preprint | 2022 | arXiv

Designing novel protein structures using sequence generator and AlphaFold2

Protein structures and functions are determined by a contiguous arrangement of amino acid sequences. Designing novel protein sequences and structures with desired geometry and functions is a complex task with large state spaces. Here we develop a novel protein design pipeline consisting of two deep learning algorithms, ProteinSolver and AlphaFold2. ProteinSolver is a deep graph neural network that generates amino acid sequences such that the forces between interacting amino acids are favorable and compatible with the fold while AlphaFold2 is a deep learning algorithm that predicts the protein structures from protein sequences. We present forty de novo designed binding sites of the PTP1B and P53 proteins with high precision, out of which thirty proteins are novel. Using ProteinSolver and AlphaFold2 in conjunction, we can trim the exploration of the large protein conformation space, thus expanding the ability to find novel and diverse de novo protein designs.

preprint | 2022 | arXiv

Genetic programming-based learning of carbon interatomic potential for materials discovery

Efficient and accurate interatomic potential functions are critical to computational study of materials while searching for structures with desired properties. Traditionally, potential functions or energy landscapes are designed by experts based on theoretical or heuristic knowledge. Here, we propose a new approach to leverage strongly typed parallel genetic programming (GP) for potential function discovery. We use a multi-objective evolutionary algorithm with NSGA-III selection to optimize individual age, fitness, and complexity through symbolic regression. With a DFT dataset of 863 unique carbon allotrope configurations drawn from 858 carbon structures, the generated potentials are able to predict total energies within ±7.70 eV at low computational cost while generalizing well across multiple carbon structures. Our code is open source and available at http://www.github.com/usccolumbia/mlpotential

preprint | 2022 | arXiv

Materials Transformers Language Models for Generative Materials Design: a benchmark study

Pre-trained transformer language models on large unlabeled corpora have produced state-of-the-art results in natural language processing, organic molecule design, and protein sequence generation. However, no such models have been applied to learn the composition patterns of inorganic materials. Here we train a series of seven modern transformer language models (GPT, GPT-2, GPT-Neo, GPT-J, BLMM, BART, and RoBERTa) using the expanded formulas from materials deposited in the ICSD, OQMD, and Materials Project databases. Six different datasets with or without non-charge-neutral or balanced electronegativity samples are used to benchmark the performances and uncover the generation biases of modern transformer models for the generative design of materials compositions. Our extensive experiments showed that the causal language models based materials transformers can generate chemically valid materials compositions with as high as 97.54% to be charge neutral and 91.40% to be electronegativity balanced, which is more than 6 times higher enrichment compared to a baseline pseudo-random sampling algorithm. These models also demonstrate high novelty and their potential in new materials discovery has been proved by their capability to recover the leave-out materials. We also find that the properties of the generated samples can be tailored by training the models with selected training sets such as high-bandgap materials. Our experiments also showed that different models each have their own preference in terms of the properties of the generated samples and their running time complexity varies a lot. We have applied our materials transformer models to discover a set of new materials as validated using DFT calculations.

preprint | 2021 | arXiv

Active learning based generative design for the discovery of wide bandgap materials

Active learning has been increasingly applied to screening functional materials from existing materials databases with desired properties. However, the number of known materials deposited in the popular materials databases such as ICSD and Materials Project is extremely limited and consists of just a tiny portion of the vast chemical design space. Herein we present an active generative inverse design method that combines active learning with a deep variational autoencoder neural network and a generative adversarial deep neural network model to discover new materials with a target property in the whole chemical design space. The application of this method has allowed us to discover new thermodynamically stable materials with high band gap (SrYF$_5$) and semiconductors with specified band gap ranges (SrClF$_3$, CaClF$_5$, YCl$_3$, SrC$_2$F$_3$, AlSCl, As$_2$O$_3$), all of which are verified by the first principle DFT calculations. Our experiments show that while active learning itself may sample chemically infeasible candidates, these samples help to train effective screening models for filtering out materials with desired properties from the hypothetical materials created by the generative model. The experiments show the effectiveness of our active generative inverse design approach.

preprint | 2021 | arXiv

Contact Map based Crystal Structure Prediction using Global Optimization

Crystal structure prediction is now playing an increasingly important role in discovery of new materials. Global optimization methods such as genetic algorithms (GA) and particle swarm optimization (PSO) have been combined with first principle free energy calculations to predict crystal structures given composition or only a chemical system. While these approaches can exploit certain crystal patterns such as symmetry and periodicity in their search process, they usually do not exploit the large amount of implicit rules and constraints of atom configurations embodied in the large number of known crystal structures. They currently can only handle crystal structure prediction of relatively small systems. Inspired by the knowledge-rich protein structure prediction approach, herein we explore whether known geometric constraints such as the atomic contact map of a target crystal material can help predict its structure given its space group information. We propose a global optimization based algorithm, CMCrystal, for crystal structure reconstruction based on atomic contact maps. Based on extensive experiments using six global optimization algorithms, we show that it is viable to reconstruct the crystal structure given the atomic contact map for some crystal materials but more constraints are needed for other target materials to achieve successful reconstruction. This implies that atomic interaction information learned from existing materials can be used to improve crystal structure prediction.
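
To make the idea of reconstructing a structure from an atomic contact map concrete, here is a minimal sketch of the two pieces a global optimizer would need: building a binary contact map from coordinates and scoring a candidate against the target map. The distance `cutoff`, the Dice-style score, and the omission of periodic images and lattice parameters are all assumptions made for illustration; the abstract does not specify the fitness function used in CMCrystal.

```python
import math

def contact_map(coords, cutoff):
    """Binary matrix: 1 where two distinct atoms lie within `cutoff` of each other.

    Simplification: Cartesian coordinates only, no periodic boundary conditions.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]

def contact_map_match(predicted, target):
    """Dice coefficient between two binary contact maps (1.0 means identical)."""
    overlap = sum(
        p & t
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )
    total = sum(map(sum, predicted)) + sum(map(sum, target))
    return 2 * overlap / total if total else 1.0
```

An optimizer such as a genetic algorithm or particle swarm would propose atomic positions, call `contact_map` on them, and try to maximize `contact_map_match` against the target map.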

preprint | 2021 | arXiv

Crystal structure prediction using age-fitness multi-objective genetic algorithm and coordination number constraints

Crystal structure prediction (CSP) has emerged as one of the most important approaches for discovering new materials. CSP algorithms based on evolutionary algorithms and particle swarm optimization have discovered a great number of new materials. However, these algorithms based on ab initio calculation of free energy are inefficient. Moreover, they have severe limitations in terms of scalability. We recently proposed a promising crystal structure prediction method based on atomic contact maps, using global optimization algorithms to search for the Wyckoff positions by maximizing the match between the contact map of the predicted structure and the contact map of the true crystal structure. However, our previous contact map based CSP algorithms have two major limitations: (1) the loss of search capability due to getting trapped in local optima; (2) it only uses the connection of atoms in the unit cell to predict the crystal structure, ignoring the chemical environment outside the unit cell, which may lead to unreasonable coordination environments. Herein we propose a novel multi-objective genetic algorithm for contact map-based crystal structure prediction by optimizing three objectives, including contact map match accuracy, the individual age, and the coordination number match. Furthermore, we assign the age values to all the individuals of the GA and try to minimize the age aiming to avoid the premature convergence problem. Our experimental results show that compared to our previous CMCrystal algorithm, our multi-objective crystal structure prediction algorithm (CMCrystalMOO) can reconstruct the crystal structure with higher quality and alleviate the problem of premature convergence.
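
The three objectives named in this abstract (maximize contact map match, minimize age, maximize coordination number match) can be compared with a standard Pareto dominance test. The sketch below is a generic non-dominated filter, not the CMCrystalMOO implementation; the `Candidate` type and its field names are hypothetical.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    """Hypothetical individual in the GA population."""
    name: str
    contact_match: float       # higher is better
    age: int                   # lower is better (younger individuals are protected)
    coordination_match: float  # higher is better

def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    no_worse = (
        a.contact_match >= b.contact_match
        and a.age <= b.age
        and a.coordination_match >= b.coordination_match
    )
    strictly_better = (
        a.contact_match > b.contact_match
        or a.age < b.age
        or a.coordination_match > b.coordination_match
    )
    return no_worse and strictly_better

def pareto_front(population):
    """Keep the individuals that no other individual dominates."""
    return [c for c in population if not any(dominates(o, c) for o in population)]
```

Because age is an objective, a young individual with a modest fitness can stay on the front alongside an older, fitter one, which is how age-fitness schemes delay premature convergence.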

preprint | 2021 | arXiv

NODE-SELECT: A Graph Neural Network Based On A Selective Propagation Technique

While there exists a wide variety of graph neural networks (GNN) for node classification, only a minority of them adopt mechanisms that effectively target noise propagation during the message-passing procedure. Additionally, a very important challenge that significantly affects graph neural networks is the issue of scalability which limits their application to larger graphs. In this paper we propose our method named NODE-SELECT: an efficient graph neural network that uses subsetting layers which only allow the best sharing-fitting nodes to propagate their information. By having a selection mechanism within each layer which we stack in parallel, our proposed method NODE-SELECT is able to both reduce the amount of noise propagated and adapt the restrictive sharing concept observed in real world graphs. Our NODE-SELECT significantly outperformed existing GNN frameworks in noise experiments and matched state-of-the-art results in experiments without noise over different benchmark datasets.

preprint | 2021 | arXiv

SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification

Environmental Sound Classification (ESC) is a challenging field of research in non-speech audio processing. Most current research in ESC focuses on designing deep models with special architectures tailored for specific audio datasets, which usually cannot exploit the intrinsic patterns in the data. However, recent studies have surprisingly shown that transfer learning from models trained on ImageNet is a very effective technique in ESC. Herein, we propose SoundCLR, a supervised contrastive learning method for effective environment sound classification with state-of-the-art performance, which works by learning representations that disentangle the samples of each class from those of other classes. Our deep network models are trained by combining a contrastive loss that contributes to a better probability output by the classification layer with a cross-entropy loss on the output of the classifier layer to map the samples to their respective 1-hot encoded labels. Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline and apply the augmentations on both the sound signals and their log-mel spectrograms before inputting them to the model. Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance. Our extensive benchmark experiments show that our hybrid deep network models trained with combined contrastive and cross-entropy loss achieved the state-of-the-art performance on three benchmark datasets ESC-10, ESC-50, and US8K with validation accuracies of 99.75%, 93.4%, and 86.49% respectively. The ensemble version of our models also outperforms other top ensemble methods. The code is available at https://github.com/alireza-nasiri/SoundCLR.
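
The combination of a supervised contrastive loss with a cross-entropy loss can be sketched in a few lines. This is a plain-Python illustration of the general technique, not the SoundCLR code: the weighting parameter `alpha`, the `temperature` default, and the convex-combination form of `hybrid_loss` are assumptions that the abstract does not confirm.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[y]
    return total / len(labels)

def supervised_contrastive(embeddings, labels, temperature=0.5):
    """Pull same-class embeddings together and push other classes apart."""
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def hybrid_loss(embeddings, logits, labels, alpha=0.5, temperature=0.5):
    """Weighted sum of the contrastive and cross-entropy terms (alpha is assumed)."""
    contrastive = supervised_contrastive(embeddings, labels, temperature)
    return alpha * contrastive + (1 - alpha) * cross_entropy(logits, labels)
```

In a real training loop the embeddings would come from a projection head and the logits from the classifier layer of the same network, and the loss would be computed with an autodiff framework rather than plain lists.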

preprint | 2020 | arXiv

Distance Matrix based Crystal Structure Prediction using Evolutionary Algorithms

Crystal structure prediction (CSP) for inorganic materials is one of the central and most challenging problems in materials science and computational chemistry. This problem can be formulated as a global optimization problem in which global search algorithms such as genetic algorithms (GA) and particle swarm optimization have been combined with first principle free energy calculations to predict crystal structures given only a material composition or only a chemical system. These DFT based ab initio CSP algorithms are computationally demanding and can only be used to predict crystal structures of relatively small systems. The vast coordinate space plus the expensive DFT free energy calculations limit their efficiency and effectiveness. On the other hand, a similar structure prediction problem has been intensively investigated in parallel in the protein structure prediction community of bioinformatics, in which the dominating predictors are knowledge based approaches including homology modeling and threading that exploit known protein structures. Herein we explore whether known geometric constraints such as the pairwise atomic distances of a target crystal material can help predict/reconstruct its structure given its space group and lattice information. We propose DMCrystal, a genetic algorithm based crystal structure reconstruction algorithm based on predicted atomic pairwise distances. Based on extensive experiments, we show that the predicted distance matrix can dramatically help to reconstruct the crystal structure and usually achieves much better performance than CMCrystal, an atomic contact map based crystal structure prediction algorithm. This implies that knowledge of atomic interaction information learned from existing materials can be used to significantly improve the crystal structure prediction in terms of both speed and quality.

preprint | 2020 | arXiv

Global Attention based Graph Convolutional Neural Networks for Improved Materials Property Prediction

Machine learning (ML) methods have gained increasing popularity in exploring and developing new materials. More specifically, graph neural network (GNN) has been applied in predicting material properties. In this work, we develop a novel model, GATGNN, for predicting inorganic material properties based on graph neural networks composed of multiple graph-attention layers (GAT) and a global attention layer. Through the application of the GAT layers, our model can efficiently learn the complex bonds shared among the atoms within each atom's local neighborhood. Subsequently, the global attention layer provides the weight coefficients of each atom in the inorganic crystal material which are used to considerably improve our model's performance. Notably, with the development of our GATGNN model, we show that our method is able to both outperform the previous models' predictions and provide insight into the crystallization of the material.

preprint | 2020 | arXiv

Machine Learning based prediction of noncentrosymmetric crystal materials

Noncentrosymmetric materials play a critical role in many important applications such as laser technology, communication systems, quantum computing, and cybersecurity. However, the experimental discovery of new noncentrosymmetric materials is extremely difficult. Here we present a machine learning model that could predict whether the composition of a potential crystalline structure would be centrosymmetric or not. By evaluating a diverse set of composition features calculated using the matminer featurizer package coupled with different machine learning algorithms, we find that Random Forest Classifiers give the best performance for noncentrosymmetric material prediction, reaching an accuracy of 84.8% when evaluated with 10-fold cross-validation on the dataset with 82,506 samples extracted from Materials Project. A random forest model trained with materials with only 3 elements gives even higher accuracy of 86.9%. We apply our ML model to screen potential noncentrosymmetric materials from 2,000,000 hypothetical materials generated by our inverse design engine and report the top 20 candidate noncentrosymmetric materials with 2 to 4 elements and top 20 borate candidates.

preprint | 2019 | arXiv

Generative adversarial networks (GAN) based efficient sampling of chemical space for inverse design of inorganic materials

A major challenge in materials design is how to efficiently search the vast chemical design space to find the materials with desired properties. One effective strategy is to develop sampling algorithms that can exploit both explicit chemical knowledge and implicit composition rules embodied in the large materials database. Here, we propose a generative machine learning model (MatGAN) based on a generative adversarial network (GAN) for efficient generation of new hypothetical inorganic materials. Trained with materials from the ICSD database, our GAN model can generate hypothetical materials not existing in the training dataset, reaching a novelty of 92.53% when generating 2 million samples. The percentage of chemically valid (charge neutral and electronegativity balanced) samples out of all generated ones reaches 84.5% by our GAN when trained with materials from ICSD even though no such chemical rules are explicitly enforced in our GAN model, indicating its capability to learn implicit chemical composition rules. Our algorithm could be used to speed up inverse design or computational screening of inorganic materials.

preprint | 2018 | arXiv

A Deep Learning Algorithm for One-step Contour Aware Nuclei Segmentation of Histopathological Images

This paper addresses the task of nuclei segmentation in high-resolution histopathological images. We propose an automatic end-to-end deep neural network algorithm for segmentation of individual nuclei. A nucleus-boundary model is introduced to predict nuclei and their boundaries simultaneously using a fully convolutional neural network. Given a color normalized image, the model directly outputs an estimated nuclei map and a boundary map. A simple, fast and parameter-free post-processing procedure is performed on the estimated nuclei map to produce the final segmented nuclei. An overlapped patch extraction and assembling method is also designed for seamless prediction of nuclei in large whole-slide images. We also show the effectiveness of data augmentation methods for the nuclei segmentation task. Our experiments showed our method outperforms prior state-of-the-art methods. Moreover, it is efficient: one 1000×1000 image can be segmented in less than 5 seconds. This makes it possible to precisely segment the whole-slide image in acceptable time.
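
Overlapped patch extraction for a large image comes down to choosing patch origins so that neighbouring patches share a margin and the last patch sits flush with the border. The sketch below shows one way to compute those origins; the `patch` and `stride` values are hypothetical, and the assembling step (for example, averaging predictions where patches overlap) is not shown because the abstract does not describe it.

```python
def tile_starts(length, patch, stride):
    """Start offsets of patches of size `patch` that cover the range [0, length).

    Consecutive patches overlap by `patch - stride` pixels; a final patch is
    added flush with the border if the regular grid does not reach it.
    """
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts

def tile_grid(height, width, patch, stride):
    """(top, left) corners of all patches covering a height x width image."""
    return [
        (top, left)
        for top in tile_starts(height, patch, stride)
        for left in tile_starts(width, patch, stride)
    ]
```

With a 1000×1000 image, a 256-pixel patch, and a 192-pixel stride, each axis gets five origins, so `tile_grid` yields 25 overlapping patches.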