Source author record

Lei Hou

Lei Hou appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Artificial Intelligence; Computation and Language; Machine Learning; astro-ph.GA; cond-mat.mes-hall; physics.optics; physics.soc-ph; Social and Information Networks; astro-ph.CO; Computational Engineering, Finance, and Science; Computer Vision; cond-mat.mtrl-sci; cs.CY; Human-Computer Interaction; Information Retrieval; physics.app-ph

Catalog footprint

What is connected

17 works

16 topics

4 close collaborators

preprint, 2026, arXiv

MAIC-UI: Making Interactive Courseware with Generative UI

Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, which creates a barrier for educators. While generative AI can produce HTML code, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack mechanisms for ensuring pedagogical accuracy. Furthermore, fully regenerating a page for each modification takes 200 to 600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) structured knowledge analysis with multi-modal understanding to ensure pedagogical rigor; (2) a two-stage generate-verify-optimize pipeline separating content alignment from visual refinement; and (3) Click-to-Locate editing with Unified Diff-based incremental generation achieving sub-10-second iteration cycles. A controlled lab study with 40 participants shows MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text-to-HTML generation. A three-month classroom deployment with 53 high school students demonstrates that MAIC-UI fosters learning agency and reduces outcome disparities: the pilot class achieved 9.21-point gains in STEM subjects, compared to -2.32 points in control classes. Our code is available at https://github.com/THU-MAIC/MAIC-UI.
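To make the idea of diff-based incremental editing concrete, the following is a minimal sketch using Python's standard `difflib`. It is not the MAIC-UI implementation: the page content, the file name `courseware.html`, and the helper `make_patch` are all hypothetical. It only illustrates why a unified diff for a small edit is much shorter than the full page it modifies.

```python
import difflib


def make_patch(old_html: str, new_html: str, name: str = "courseware.html") -> str:
    """Return a unified diff that turns old_html into new_html.

    `name` is a hypothetical file name used only for the diff headers.
    """
    old_lines = old_html.splitlines(keepends=True)
    new_lines = new_html.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=f"a/{name}", tofile=f"b/{name}"
    )
    return "".join(diff)


# A toy interactive page (made up for illustration).
OLD_PAGE = """<html>
<body>
<h1>Projectile motion</h1>
<input id="angle" type="range" min="0" max="90" value="45">
<canvas id="sim" width="400" height="300"></canvas>
</body>
</html>
"""

# A small edit: change the slider's default angle.
NEW_PAGE = OLD_PAGE.replace('value="45"', 'value="30"')

patch = make_patch(OLD_PAGE, NEW_PAGE)
print(patch)
```

On a real page with hundreds of lines, the patch would still contain only the changed line plus a few lines of context, which is one plausible reason an edit expressed as a diff can be produced faster than regenerating the whole document.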

preprint, 2023, arXiv

Syntactically Robust Training on Partially-Observed Data for Open Information Extraction

Open Information Extraction models have shown promising results with sufficient supervision. However, these models face a fundamental challenge: the syntactic distribution of the training data is only partially observed compared with that of the real world. In this paper, we propose a syntactically robust training framework that enables models to be trained on a syntactically rich distribution based on diverse paraphrase generation. To tackle the intrinsic problem of knowledge deformation under paraphrasing, two algorithms based on semantic similarity matching and syntactic tree walking are used to restore the knowledge whose expression has been transformed. The training framework can be applied generally to other domains where the syntax is only partially observed. Based on the proposed framework, we build a new evaluation set called CaRB-AutoPara, a syntactically diverse dataset consistent with the real-world setting, for validating the robustness of the models. Experiments, including a thorough analysis, show that model performance degrades as the difference in syntactic distribution increases, while our framework gives a robust boundary. The source code is publicly available at https://github.com/qijimrc/RobustOIE.

preprint, 2022, arXiv

A Roadmap for Big Model

With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks has become a popular paradigm. Researchers have achieved various outcomes in constructing BMs and applying them in many fields. At present, there is a lack of research that surveys the overall progress of BMs and guides follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and the applications of BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory & Interpretability, Commonsense Reasoning, Reliability & Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. For each topic, we clearly summarize the current studies and propose some future research directions. At the end of this paper, we discuss the further development of BMs from a more general view.

preprint, 2022, arXiv

A Wavelet Transform and self-supervised learning-based framework for bearing fault diagnosis with limited labeled data

Traditional supervised bearing fault diagnosis methods rely on massive labelled data, yet annotation may be very time-consuming or infeasible. Fault diagnosis approaches that use limited labelled data are becoming increasingly popular. In this paper, a Wavelet Transform (WT) and self-supervised learning-based bearing fault diagnosis framework is proposed to address the shortage of labelled samples. Using the WT and cubic spline interpolation, the original measured vibration signals are converted to time-frequency maps (TFMs) with a fixed scale, which serve as inputs. The Vision Transformer (ViT) is employed as the encoder for feature extraction, and the self-distillation with no labels (DINO) algorithm is introduced in the proposed framework for self-supervised learning with limited labelled data and sufficient unlabelled data. Two rolling bearing fault datasets are used for validation. When both datasets contain only 1% labelled samples, using the feature vectors extracted by the trained encoder without fine-tuning, over 90% average diagnosis accuracy can be obtained with a simple K-Nearest Neighbor (KNN) classifier. Furthermore, the superiority of the proposed method is demonstrated in comparison with other self-supervised fault diagnosis methods.
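The final classification step described above (a KNN classifier applied to frozen encoder features) can be sketched in a few lines. This is an illustrative sketch only: the 2-D feature vectors and the class names below are made up, whereas in the paper the features come from a ViT encoder trained with DINO on time-frequency maps. The value `k=3` is likewise an assumption.

```python
import math
from collections import Counter


def knn_predict(train_feats, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled features."""
    # Euclidean distance from the query to every labelled feature vector.
    dists = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical "encoder outputs" for a handful of labelled samples.
TRAIN_FEATS = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # cluster of healthy bearings
    (1.0, 1.1), (0.9, 1.0), (1.1, 0.9),   # cluster of faulty bearings
]
TRAIN_LABELS = ["normal"] * 3 + ["inner_race_fault"] * 3

print(knn_predict(TRAIN_FEATS, TRAIN_LABELS, (0.95, 1.0)))
print(knn_predict(TRAIN_FEATS, TRAIN_LABELS, (0.05, 0.1)))
```

Because KNN has no trainable parameters, its accuracy on such features is a direct probe of how well the self-supervised encoder separates the fault classes, which is presumably why it is used as the evaluation classifier.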

preprint, 2022, arXiv

KQA Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base

Complex question answering over knowledge base (Complex KBQA) is challenging because it requires various compositional reasoning capabilities, such as multi-hop inference, attribute comparison, and set operations. Existing benchmarks have some shortcomings that limit the development of Complex KBQA: 1) they only provide QA pairs without explicit reasoning processes; 2) questions are poor in diversity or scale. To this end, we introduce KQA Pro, a dataset for Complex KBQA including ~120K diverse natural language questions. We introduce a compositional and interpretable programming language, KoPL, to represent the reasoning process of complex questions. For each question, we provide the corresponding KoPL program and SPARQL query, so that KQA Pro serves both KBQA and semantic parsing tasks. Experimental results show that SOTA KBQA methods cannot achieve results on KQA Pro as promising as those on current datasets, which suggests that KQA Pro is challenging and Complex KBQA requires further research efforts. We also treat KQA Pro as a diagnostic dataset for testing multiple reasoning skills, conduct a thorough evaluation of existing models and discuss further directions for Complex KBQA. Our code and datasets can be obtained from https://github.com/shijx12/KQAPro_Baselines.

preprint, 2022, arXiv

LEVEN: A Large-Scale Chinese Legal Event Detection Dataset

Recognizing facts is the most fundamental step in making judgments, hence detecting events in legal documents is important to legal case analysis tasks. However, existing Legal Event Detection (LED) datasets cover only a limited set of event types and have limited annotated data, which restricts the development of LED methods and their downstream applications. To alleviate these issues, we present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types. Beyond charge-related events, LEVEN also covers general events, which are critical for legal case understanding but neglected in existing LED datasets. To our knowledge, LEVEN is the largest LED dataset and has dozens of times the data scale of others, which should significantly promote the training and evaluation of LED methods. The results of extensive experiments indicate that LED is challenging and needs further effort. Moreover, we simply utilize legal events as side information to promote downstream applications. The method achieves average improvements of 2.2 points in precision in low-resource judgment prediction, and 1.5 points in mean average precision in unsupervised case retrieval, which suggests that LED is fundamental to these tasks. The source code and dataset can be obtained from https://github.com/thunlp/LEVEN.

preprint, 2022, arXiv

Multimodal Entity Tagging with Multimodal Knowledge Base

To enhance research on multimodal knowledge bases and multimodal information processing, we propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB). We also develop a dataset for the problem using an existing MKB. An MKB contains entities and their associated texts and images. In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair. We solve the task using the information retrieval paradigm and implement several baselines using state-of-the-art methods in NLP and CV. We conduct extensive experiments and analyze the experimental results. The results show that the task is challenging, but current technologies can achieve relatively high performance. We will release the dataset, code, and models for future research.

preprint, 2022, arXiv

Program Transfer for Answering Complex Questions over Knowledge Bases

Program induction for answering complex questions over knowledge bases (KBs) aims to decompose a question into a multi-step program, whose execution against the KB produces the final answer. Learning to induce programs relies on a large number of parallel question-program pairs for the given KB. However, for most KBs, gold program annotations are lacking, making learning difficult. In this paper, we propose the approach of program transfer, which aims to leverage the valuable program annotations on rich-resourced KBs as external supervision signals to aid program induction for low-resourced KBs that lack program annotations. For program transfer, we design a novel two-stage parsing framework with an efficient ontology-guided pruning strategy. First, a sketch parser translates the question into a high-level program sketch, which is a composition of functions. Second, given the question and sketch, an argument parser searches the KB for the detailed arguments of the functions. During the search, we incorporate the KB ontology to prune the search space. The experiments on ComplexWebQuestions and WebQuestionSP show that our method outperforms SOTA methods significantly, demonstrating the effectiveness of program transfer and our framework. Our code and datasets can be obtained from https://github.com/THU-KEG/ProgramTransfer.

preprint, 2022, arXiv

Schema-Free Dependency Parsing via Sequence Generation

Dependency parsing aims to extract the syntactic or semantic dependency structure of sentences. Existing methods suffer from a lack of universality or a heavy reliance on an auxiliary decoder. To remedy these drawbacks, we propose to achieve universal and schema-free Dependency Parsing (DP) via Sequence Generation (SG), termed DPSG, by utilizing only the pre-trained language model (PLM) without any auxiliary structures or parsing algorithms. We first explore different serialization design strategies for converting parsing structures into sequences. Then we design dependency units and concatenate these units into the sequence for DPSG. Thanks to the high flexibility of sequence generation, our DPSG can achieve both syntactic DP and semantic DP using a single model. By concatenating a prefix that indicates the specific schema to the sequence, our DPSG can even accomplish multi-schemata parsing. The effectiveness of our DPSG is demonstrated by experiments on widely used DP benchmarks, i.e., PTB, CODT, SDP15, and SemEval16. DPSG achieves results comparable with the first-tier methods on all the benchmarks and even state-of-the-art (SOTA) performance on CODT and SemEval16. This paper demonstrates that DPSG has the potential to be a new parsing paradigm. We will release our code upon acceptance.

preprint, 2022, arXiv

Towards a General Pre-training Framework for Adaptive Learning in MOOCs

Adaptive learning aims to stimulate and meet the needs of individual learners, which requires sophisticated system-level coordination of diverse tasks, including modeling learning resources, estimating student states, and making personalized recommendations. Existing deep learning methods have achieved great success over statistical models; however, they still generalize poorly across diverse tasks and suffer from insufficient capacity, since they are composed of highly coupled task-specific architectures and rely on small-scale, coarse-grained recommendation scenarios. To realize the idea of general adaptive systems proposed in pedagogical theory, and drawing on emerging pre-training techniques in NLP, we conduct a practical exploration of applying pre-training to adaptive learning and propose a unified framework based on data observation and learning style analysis that properly leverages heterogeneous learning elements. Through a series of downstream tasks (Learning Recommendation, Learning Resource Evaluation, Knowledge Tracing, and Dropout Prediction), we find that course structures, text, and knowledge are helpful for modeling and inherently coherent with students' non-sequential learning behaviors, and that indirectly relevant information included in the pre-training foundation can be shared across downstream tasks to improve effectiveness. Finally, we build a simplified systematic application of adaptive learning and reflect on the insights brought back to pedagogy. The source code and dataset will be released.

preprint, 2019, arXiv

Memories in the Photoluminescence Intermittency of Single Cesium Lead Bromide Nanocrystals

Single cesium lead bromide (CsPbBr3) nanocrystals show strong photoluminescence blinking, with on- and off-dwelling times following power-law distributions. We investigate the memory effect in the photoluminescence blinking of single CsPbBr3 nanocrystals and find positive correlations for successive on-times and successive off-times. This memory effect is not sensitive to the nature of the surface capping ligand and the embedding polymer. These observations suggest that photoluminescence intermittency and its memory are mainly controlled by intrinsic traps in the nanocrystals. These findings will help optimize light-emitting devices based on inorganic perovskite nanocrystals.
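One simple way to quantify a memory effect of this kind is the correlation between each dwell time and the next one in the sequence. The sketch below is an illustration under stated assumptions, not the authors' analysis: the function names are hypothetical, the input sequences are synthetic, and taking logarithms before correlating is our choice, motivated by the heavy-tailed (power-law) dwell-time distributions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant input; report no correlation
    return cov / math.sqrt(var_x * var_y)


def successive_correlation(durations):
    """Correlate log(duration_i) with log(duration_{i+1})."""
    logs = [math.log(d) for d in durations]
    return pearson(logs[:-1], logs[1:])


# Synthetic on-times (arbitrary units): long periods tend to follow long ones.
persistent = [1, 2, 4, 8, 16, 32, 64]
# Synthetic on-times that alternate between short and long.
alternating = [1, 10, 1, 10, 1, 10, 1]

print(successive_correlation(persistent))   # positive: memory
print(successive_correlation(alternating))  # negative: anti-correlation
```

A value near zero would indicate that successive dwell times are independent, as expected for a memoryless switching process.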

preprint, 2016, arXiv

The environmental dependence of the stellar mass fundamental plane of early-type galaxies

Aims. We investigate the environmental dependence of the stellar mass fundamental plane (FP$_*$) using the early-type galaxy sample from the Sloan Digital Sky Survey Data Release 7 (SDSS DR7). Methods. The FP$_*$ is calculated by replacing the luminosity in the fundamental plane (FP) with stellar mass. Based on the SDSS group catalog, we characterize the galaxy environment according to the mass of the host dark matter halo and the position in the halo. In halos with the same mass bin, the color distributions of central and satellite galaxies are different. Therefore, we calculate FP$_*$ coefficients of galaxies in different environments and compare them with those of the FP to study the contribution of the stellar population. Results. We find that coefficient $a$ of the FP$_*$ is systematically larger than that of the FP, but coefficient $b$ of the FP$_*$ is similar to the FP. Moreover, the environmental dependence of the FP$_*$ is similar to that of the FP. For central galaxies, FP$_*$ coefficients are significantly dependent on the halo mass. For satellite galaxies, the correlation between FP$_*$ coefficients and the halo mass is weak. Conclusions. We conclude that the tilt of the FP is not primarily driven by the stellar population.
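For orientation, the relation being modified can be written out. The abstract does not give the explicit form, so the parameterization below is an assumption: it is the conventional fundamental plane in terms of effective radius $R_e$, central velocity dispersion $\sigma_0$, and mean surface brightness $I_e$, with the luminosity $L$ replaced by the stellar mass $M_*$ to obtain the FP$_*$. The coefficients $a$ and $b$ are taken to be the ones the abstract refers to.

```latex
\begin{align}
  \text{FP:}\quad   & \log R_e = a \log \sigma_0 + b \log I_e + c,
                    & I_e &= \frac{L}{2\pi R_e^{2}}, \\
  \text{FP}_*:\quad & \log R_e = a \log \sigma_0 + b \log \Sigma_* + c,
                    & \Sigma_* &= \frac{M_*}{2\pi R_e^{2}}.
\end{align}
```

Under this reading, a difference between the fitted $a$ of the FP and of the FP$_*$ isolates the contribution of the stellar mass-to-light ratio, that is, of the stellar population.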

preprint, 2015, arXiv

Explosive Formation and Dynamics of Vapor Nanobubbles around a Continuously Heated Gold Nanosphere

We form sub-micrometer-sized vapor bubbles around a single laser-heated gold nanoparticle in a liquid and monitor them through optical scattering of a probe laser. The fast, inertia-governed expansion is followed by a slower contraction and disappearance after some tens of nanoseconds. In a narrow range of illumination powers, bubble time traces show a clear echo signature. We attribute it to sound waves released upon the initial explosion and reflected by flat interfaces hundreds of microns away from the particle. Echoes can trigger new explosions. A steady-state nanobubble, with a vapor shell surrounding the heated nanoparticle, can be reached with a suitable time profile of the heating intensity. Stable nanobubbles could have original applications for light modulation and for enhanced optical-acoustic coupling in photoacoustic microscopy.

preprint, 2015, arXiv

Stability of similarity measurements for bipartite networks

Similarity is a fundamental measure in network analyses and machine learning algorithms, with wide applications ranging from personalized recommendation to socio-economic dynamics. We argue that an effective similarity measurement should remain stable even under some information loss. With six bipartite networks, we investigate the stabilities of fifteen similarity measurements by comparing the similarity matrices of two data samples which are randomly divided from the original data sets. Results show that the fifteen measurements can be well classified into three clusters according to their stabilities, and measurements in the same cluster have similar mathematical definitions. In addition, we develop a top-$n$-stability method for personalized recommendation, and find that unstable similarities would recommend false information to users, and that the performance of recommendation would be largely improved by using stable similarity measurements. This work provides a novel dimension to analyze and evaluate similarity measurements, which can further find applications in link prediction, personalized recommendation, clustering algorithms, community detection and so on.
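The split-and-compare procedure described above can be sketched as follows. This is a minimal illustration under several assumptions: the toy user-item network is made up, Jaccard similarity stands in for the fifteen measurements studied, and the Pearson correlation between the two similarity vectors is our stand-in for a stability score, since the abstract does not state the exact metric used.

```python
import itertools
import math
import random


def split_edges(edges, seed=0):
    """Randomly divide the edge list into two halves."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def jaccard(a, b):
    """Jaccard similarity of two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_vector(edges, users, sim=jaccard):
    """Upper triangle of the user-user similarity matrix, flattened."""
    nbrs = {}
    for user, item in edges:
        nbrs.setdefault(user, set()).add(item)
    return [
        sim(nbrs.get(u, set()), nbrs.get(v, set()))
        for u, v in itertools.combinations(users, 2)
    ]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def stability(edges, sim=jaccard, seed=0):
    """Agreement between the similarities computed on the two random halves."""
    users = sorted({u for u, _ in edges})
    half_a, half_b = split_edges(edges, seed)
    return pearson(
        similarity_vector(half_a, users, sim),
        similarity_vector(half_b, users, sim),
    )


# Toy bipartite network: two user groups, each linked to its own block of items.
EDGES = [(f"u{u}", f"i{k}") for u in range(3) for k in range(10)]
EDGES += [(f"u{u}", f"i{k}") for u in range(3, 6) for k in range(10, 20)]

print(stability(EDGES, seed=42))
```

Swapping `jaccard` for another measure (for example, the raw count of common neighbors) and comparing the resulting scores across many seeds would mirror, in miniature, the comparison of measurements the abstract describes.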

preprint, 2014, arXiv

Memory effect of the online user preference

The mechanism by which online user preference evolves is of great significance for understanding online user behaviors and improving the quality of online services. Since users are allowed to rate objects in many online systems, ratings can well reflect the users' preference. With two benchmark datasets from online systems, we uncover the memory effect in users' selecting behavior, which is the sequence of qualities of selected objects, and in the rating behavior, which is the sequence of ratings delivered by each user. Furthermore, the memory duration is presented to describe the length of a memory, which exhibits a power-law distribution, i.e., the probability of a long-duration memory occurring is much higher than in the random case, which follows an exponential distribution. We present a preference model in which a Markovian process is utilized to describe the users' selecting behavior, and the rating behavior depends on the selecting behavior. With only one parameter for each of the user's selecting and rating behavior, the preference model can regenerate any duration distribution ranging from the power-law form (strong memory) to the exponential form (weak memory).

preprint, 2014, arXiv

Supercontinuum generation and carrier envelope offset frequency measurement in a tapered single-mode fiber

We report supercontinuum generation by launching femtosecond Yb fiber laser pulses into a tapered single-mode fiber of 3 µm core diameter. A spectrum of more than one octave, from 550 to 1400 nm, has been obtained with an output power of 1.3 W at a repetition rate of 250 MHz, corresponding to a coupling efficiency of up to 60%. By using a typical f-2f interferometer, the carrier envelope offset frequency was measured and found to have a signal-to-noise ratio of nearly 30 dB.
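The reason an octave-spanning spectrum is needed for an f-2f measurement follows from the standard frequency-comb relation, written out below as a reminder. The symbols $f_{\mathrm{rep}}$ (repetition rate, here 250 MHz), $f_{\mathrm{ceo}}$ (carrier envelope offset frequency) and the mode index $n$ are the usual textbook notation, not notation taken from the paper.

```latex
\begin{align}
  f_n &= n\, f_{\mathrm{rep}} + f_{\mathrm{ceo}}
      && \text{(comb mode } n\text{)} \\
  2 f_n &= 2n\, f_{\mathrm{rep}} + 2 f_{\mathrm{ceo}}
      && \text{(frequency-doubled low-frequency mode)} \\
  2 f_n - f_{2n} &= f_{\mathrm{ceo}}
      && \text{(beat note with mode } 2n\text{)}
\end{align}
```

The beat requires that modes $n$ and $2n$ both exist, so the spectrum must cover at least a factor of two in frequency. The reported range satisfies this: $1400\,\mathrm{nm} / 550\,\mathrm{nm} \approx 2.5 > 2$.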

preprint, 2014, arXiv

The Fundamental Plane Relation of Early-Type Galaxies: Environmental Dependence

Using a sample of 70,793 early-type galaxies from SDSS DR7, we study the environmental dependence of the fundamental plane relation. With the help of the galaxy group catalogue based on SDSS DR7, we calculate the fundamental planes in different dark matter halo mass bins for central and satellite galaxies respectively. We find that the environmental dependence of the fundamental plane coefficients is similar in the $g$, $r$, $i$ and $z$ bands. The environmental dependence for central and satellite galaxies is significantly different. While the fundamental plane coefficients of centrals vary systematically with the halo mass, those of satellites are similar in different halo mass bins. The discrepancy between centrals and satellites is significant in small halos, but negligible in the largest halo mass bins. These results remain the same when we only keep red galaxies, or galaxies with $b/a>0.6$, or galaxies in a specific radius range in the sample. After the correction of the sky background, the results are still similar. We suggest that the different environmental effects of the halo mass on centrals and satellites may arise from their different quenching processes.
