Source author record

Yi Lu

Yi Lu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint

What is connected

33 works

32 topics

4 close collaborators

preprint, 2026, arXiv

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, and that DCI opens a broader interface-design space for agentic search.
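
As a rough illustration of the grep-style access pattern this abstract describes, the Python sketch below scans raw text line by line and keeps only the lines that satisfy every clue, with no embedding model or index involved. The function name `grep_all` and the in-memory `corpus` dictionary are assumptions made for this example; they are not taken from the paper, whose agents use ordinary terminal tools.

```python
import re


def grep_all(corpus, patterns, ignore_case=True):
    """Return (doc_name, line_number, line) for lines matching every pattern.

    `corpus` is a hypothetical stand-in for files on disk: it maps a
    document name to that document's raw text.
    """
    flags = re.IGNORECASE if ignore_case else 0
    compiled = [re.compile(p, flags) for p in patterns]
    hits = []
    for name in sorted(corpus):
        for number, line in enumerate(corpus[name].splitlines(), start=1):
            # A conjunction of clues: every pattern must occur on the line.
            if all(rx.search(line) for rx in compiled):
                hits.append((name, number, line.strip()))
    return hits


# Toy corpus invented for the example.
corpus = {
    "a.txt": "The wall was built in 1887.\nIt was rebuilt in 1921 by the city.",
    "b.txt": "A bridge was rebuilt in 1921.\nNothing else happened.",
}
print(grep_all(corpus, [r"rebuilt", r"\bcity\b"]))
```

Because the search runs over the raw text each time, adding or editing a document changes the results immediately, which is the property the abstract attributes to working without offline indexing.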

preprint, 2023, arXiv

Side-by-Side vs Face-to-Face: Evaluating Colocated Collaboration via a Transparent Wall-sized Display

Traditional wall-sized displays mostly only support side-by-side co-located collaboration, while transparent displays naturally support face-to-face interaction. Many previous works assume transparent displays support collaboration. Yet it is unknown how exactly its afforded face-to-face interaction can support loose or close collaboration, especially compared to the side-by-side configuration offered by traditional large displays. In this paper, we used an established experimental task that operationalizes different collaboration coupling and layout locality, to compare pairs of participants collaborating side-by-side versus face-to-face in each collaborative situation. We compared quantitative measures and collected interview and observation data to further illustrate and explain our observed user behavior patterns. The results showed that the unique face-to-face collaboration brought by transparent display can result in more efficient task performance, different territorial behavior, and both positive and negative collaborative factors. Our findings provided empirical understanding about the collaborative experience supported by wall-sized transparent displays and shed light on its future design.

preprint, 2022, arXiv

Active Learning Over Multiple Domains in Natural Language Tasks

Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis. We ask (1) what family of methods are effective for this task? And, (2) what properties of selected examples and domains achieve strong results? Among 18 acquisition functions from 4 families of methods, we find H-Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. We also show the importance of a diverse allocation of domains, as well as room-for-improvement of existing methods on both domain and example selection. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks.

preprint, 2022, arXiv

An Extended Halo-based Group/Cluster finder: application to the DESI legacy imaging surveys DR8

We extend the halo-based group finder developed by Yang et al. (2005a) to use data simultaneously with either photometric or spectroscopic redshifts. A mock galaxy redshift survey constructed from a high-resolution N-body simulation is used to evaluate the performance of this extended group finder. For galaxies with magnitude ${\rm z} \le 21$ and redshift $0 < z \le 1.0$ in the DESI legacy imaging surveys (the Legacy Surveys), our group finder successfully identifies more than 60% of the members in about 90% of halos with mass $\gtrsim 10^{12.5}\,h^{-1}{\rm M}_\odot$. Detected groups with mass $\gtrsim 10^{12.0}\,h^{-1}{\rm M}_\odot$ have a purity (the fraction of true groups) greater than 90%. The halo mass assigned to each group has an uncertainty of about 0.2 dex at the high mass end $\gtrsim 10^{13.5}\,h^{-1}{\rm M}_\odot$ and 0.40 dex at the low mass end. Groups with more than 10 members have a redshift accuracy of $\sim 0.008$. We apply this group finder to the Legacy Surveys DR8 and find 5.2 million groups with at least 3 members. About 387,000 of these groups have at least 10 members. The resulting catalog, containing 3D coordinates, richness, halo masses, and total group luminosities, is made publicly available.

preprint, 2022, arXiv

COGEDAP: A COmprehensive GEnomic Data Analysis Platform

Non-sharable sensitive data collection and analysis in large-scale consortia for genomic research is complicated. Time consuming issues in installing software arise due to different operating systems, software dependencies and running the software. Therefore, easier, more standardized, automated protocols and platforms can be a solution to overcome these issues. We have developed one such solution for genomic data analysis using software container technologies. The platform, COGEDAP, consists of different software tools placed into Singularity containers with corresponding pipelines and instructions on how to perform genome-wide association studies (GWAS) and other genomic data analysis via corresponding tools. Using a provided helper script written in Python, users can obtain auto-generated scripts to conduct the desired analysis both on high-performance computing (HPC) systems and on personal computers. The analyses can be done by running these auto-generated scripts with the software containers. The helper script also performs minor re-formatting of the input/output data, so that the end user can work with a unified file format regardless of which genetic software is used for the analysis. COGEDAP is actively being used by users from different countries/projects to conduct their genomic data analyses. Thanks to this platform, users can easily run GWAS and other genomic analyses without spending much effort on software installation, data formats, and other technical requirements.

preprint, 2022, arXiv

DePS: An improved deep learning model for de novo peptide sequencing

De novo peptide sequencing from mass spectrometry data is an important method for protein identification. Recently, various deep learning approaches have been applied to de novo peptide sequencing, and DeepNovoV2 is one of the representative models. In this study, we propose an enhanced model, DePS, which can improve the accuracy of de novo peptide sequencing even with missing signal peaks or a large number of noisy peaks in tandem mass spectrometry data. On the same test set as DeepNovoV2, the DePS model achieved 74.22%, 74.21% and 41.68% for amino acid recall, amino acid precision and peptide recall, respectively. Furthermore, the results suggest that DePS outperforms DeepNovoV2 on the cross-species dataset.

preprint, 2022, arXiv

High spectral-resolution interferometry down to 1 micron with Asgard/BIFROST at VLTI: Science drivers and project overview

We present science cases and instrument design considerations for the BIFROST instrument that will open the short-wavelength (Y/J/H-band), high spectral dispersion (up to R=25,000) window for the VLT Interferometer. BIFROST will be part of the Asgard Suite of instruments and unlock powerful avenues for studying accretion & mass-loss processes at the early/late stages of stellar evolution, for detecting accreting protoplanets around young stars, and for probing the spin-orbit alignment in directly-imaged planetary systems and multiple star systems. Our survey on GAIA binaries aims to provide masses and precision ages for a thousand stars, providing a legacy data set for improving stellar evolutionary models as well as for Galactic Archaeology. BIFROST will enable off-axis spectroscopy of exoplanets in the 0.025-1" separation range, enabling high-SNR, high spectral resolution follow-up of exoplanets detected with ELT and JWST. We give an update on the status of the project, outline our key technology choices, and discuss synergies with other instruments in the proposed Asgard Suite of instruments.

preprint, 2022, arXiv

Magnetism in doped infinite-layer NdNiO2 studied by combined density functional theory and dynamical mean-field theory

The recent observation of superconductivity in infinite-layer nickelates has brought intense debate on the established knowledge of unconventional superconductivity based on the cuprates. Despite many similarities, the nickelates differ from the cuprates in many characteristics, most notably in their magnetism. Unlike the undoped cuprates, which are canonical antiferromagnetic Mott insulators from which superconductivity is generally believed to arise upon doping, the undoped nickelates show no sign of magnetic ordering in experiments. Through a combined density functional theory, dynamical mean-field theory, and model study, we show that although the increased energy splitting between the O-$p$ orbital and the Cu/Ni-$d$ orbital ($\Delta_{dp}$) results in a larger magnetic moment in nickelates, it also leads to stronger antiferromagnetism/ferromagnetism competition and weaker magnetic exchange coupling. Meanwhile, the self-doping effect caused by the Nd-$d$ orbital screens the magnetic moment of Ni. The Janus-faced effect of $\Delta_{dp}$ and the self-doping effect together give a systematic understanding of magnetic behavior in nickelates and explain recent experimental observations.

preprint, 2022, arXiv

Matrix Syncer -- A Multi-chain Data Aggregator For Supporting Blockchain-based Metaverses

Due to the rising complexity of the metaverse's business logic and the low-latency nature of the metaverse, developers typically encounter the challenge of effectively reading, writing, and retrieving historical on-chain data in order to facilitate their functional implementations at scale. While it is true that accessing blockchain states is simple, more advanced real-world operations such as search, aggregation, and conditional filtering are not available when interacting directly with blockchain networks, particularly when dealing with requirements for on-chain event reflection. We offer Matrix Syncer, the ultimate middleware that bridges the data access gap between blockchains and end-user applications. Matrix Syncer is designed to facilitate the consolidation of on-chain information into a distributed data warehouse while also enabling customized on-chain state transformation for a scalable storage, access, and retrieval. It offers a unified layer for both on- and off-chain state, as well as a fast and flexible atomic query. Matrix Syncer is easily incorporated into any infrastructure to aggregate data from various blockchains concurrently, such as Ethereum and Flow. The system has been deployed to support several metaverse projects with a total value of more than $15 million USD.

preprint, 2022, arXiv

Nuclear states projected from a pair condensate

Atomic nuclei exhibit deformation, pairing correlations, and rotational symmetries. To meet these competing demands in a computationally tractable formalism, we revisit the use of general pair condensates with good particle number as a trial wave function for even-even nuclei. After minimizing the energy of the condensate, we project out states with good angular momentum with a fast projection technique, allowing for general triaxial deformations. To show applicability, we present example calculations from pair condensates in several model spaces, and compare against projected Hartree-Fock and full configuration-interaction shell model calculations. This approach successfully generates spherical, vibrational and rotational spectra, demonstrating potential for modeling medium- to heavy-mass nuclei.

preprint, 2020, arXiv

BAlN alloy for enhanced two-dimensional electron gas characteristics of GaN-based high electron mobility transistor

The emerging wide bandgap BAlN alloys have potential for improved III-nitride power devices, including the high electron mobility transistor (HEMT). Yet few relevant studies have been carried out. In this work, we have investigated the use of the B0.14Al0.86N alloy as part or the entirety of the interlayer between the GaN buffer and the AlGaN barrier in the conventional GaN-based HEMT. The numerical results show considerable improvement of the two-dimensional electron gas (2DEG) concentration, with small 2DEG leakage into the ternary layer, when the conventional AlN interlayer is replaced by either the B0.14Al0.86N interlayer or the B0.14Al0.86N/AlN hybrid interlayer. Consequently, the transfer characteristics can be improved. The saturation current can be enhanced as well. For instance, the saturation currents for HEMTs with the 0.5 nm B0.14Al0.86N/0.5 nm AlN hybrid interlayer and the 1 nm B0.14Al0.86N interlayer are 5.8% and 2.2% higher, respectively, than that for the AlN interlayer when $V_{GS} - V_{th} = +3$ V.

preprint, 2020, arXiv

BAlN for III-nitride UV light emitting diodes: undoped electron blocking layer

The undoped BAlN electron-blocking layer (EBL) is investigated to replace the conventional AlGaN EBL in light-emitting diodes (LEDs). Numerical studies of the impact of variously doped EBLs on the output characteristics of LEDs demonstrate that the LED performance depends heavily on the p-doping level in the case of the AlGaN EBL, while it depends less on the p-doping level for the BAlN EBL. As a result, we propose an undoped BAlN EBL for LEDs to avoid the p-doping issues, which are a major technical challenge in the AlGaN EBL. Without doping, the proposed BAlN EBL structure still possesses a superior capacity for blocking electrons and improving hole injection compared with the highly doped AlGaN EBL. This study provides a feasible route to addressing electron leakage and insufficient hole injection issues when designing UV LED structures.

preprint, 2020, arXiv

ContourRend: A Segmentation Method for Improving Contours by Rendering

A good object segmentation should contain clear contours and complete regions. However, mask-based segmentation cannot handle contour features well on a coarse prediction grid, which causes blurry edges. Contour-based segmentation provides contours directly but misses contour details. To obtain fine contours, we propose a segmentation method named ContourRend, which adopts a contour renderer to refine segmentation contours. We implement our method on a segmentation model based on a graph convolutional network (GCN). For the single-object segmentation task on the Cityscapes dataset, the GCN-based segmentation contour is used to generate a contour of a single object, then our contour renderer focuses on the pixels around the contour and predicts the category at high resolution. By rendering the contour result, our method reaches 72.41% mean intersection over union (IoU) and surpasses the baseline Polygon-GCN by 1.22%.

preprint, 2020, arXiv

Exact sum rules with approximate ground states

Electromagnetic and weak transitions tell us a great deal about the structure of atomic nuclei. Yet modeling transitions can be difficult: it is often easier to compute the ground state, if only as an approximation, than excited states. One alternative is through transition sum rules, in particular the non-energy-weighted and energy-weighted sum rules, which can be computed as expectation values of operators. We investigate by computing sum rules for a variety of nuclei, comparing the numerically exact full configuration-interaction shell model, as a reference, to Hartree-Fock, projected Hartree-Fock, and the nucleon pair approximation. These approximations yield reasonable agreement, which we explain by prior work on the systematics of transition moments.

preprint, 2020, arXiv

Graph Computing based Distributed State Estimation with PMUs

Power system state estimation plays a fundamental and critical role in the energy management system (EMS). To achieve high-performance and accurate system state estimation, a graph-computing-based distributed state estimation approach is proposed in this paper. First, a power system network is divided into multiple areas. Reference buses are selected for each area, with PMUs installed at these buses. Then, the system network is converted into multiple independent areas. In this way, the power system state estimation can be conducted in parallel for each area, and the estimated system states are obtained without compromising accuracy. The IEEE 118-bus system and the MP 10790-bus system are employed to verify the accuracy of the results and to demonstrate the promising computational performance.

preprint, 2020, arXiv

Graph-FCN for image semantic segmentation

Semantic segmentation with deep learning has achieved great progress in classifying the pixels in an image. However, local location information, which is important for image semantic segmentation, is usually ignored in high-level feature extraction by deep learning. To avoid this problem, we propose a graph model initialized by a fully convolutional network (FCN), named Graph-FCN, for image semantic segmentation. First, the image grid data is extended to graph structure data by a convolutional network, which transforms the semantic segmentation problem into a graph node classification problem. Then we apply a graph convolutional network to solve this graph node classification problem. As far as we know, this is the first application of a graph convolutional network to image semantic segmentation. Our method achieves competitive performance in mean intersection over union (mIOU) on the VOC dataset (about 1.34% improvement) compared to the original FCN model.

preprint, 2020, arXiv

Polar Rectification Effect in Electro-Fatigued SrTiO3 Based Junctions

Rectifying semiconductor junctions are crucial to electronic devices. They convert alternating current into direct current by allowing unidirectional charge flow. In analogy to current-flow rectification for itinerant electrons, a polar rectification based on localized oxygen vacancies (OVs) in a Ti/fatigued-SrTiO3 (fSTO) Schottky junction is demonstrated here for the first time. The fSTO with OVs is produced by an electro-degradation process. The different mobility of localized OVs and itinerant electrons in the fSTO yields a unidirectional electric polarization at the interface of the junction under the combined action of external and built-in electric fields. Moreover, the fSTO displays a pre-ferroelectric state located between the paraelectric and ferroelectric phases. The pre-ferroelectric state has three sub-states and can be easily driven into a ferroelectric state by an external electric field. These observations open up opportunities for potential polar devices and may underpin many useful polar-triggered electronic phenomena.

preprint, 2020, arXiv

Polarity induced electronic and atomic reconstruction at NdNiO2/SrTiO3 interfaces

Superconductivity has recently been observed in Sr-doped NdNiO2 films grown on SrTiO3. Whether it is caused by or related to the interface remains an open question. To address this issue, we use density functional theory calculations and a charge-transfer self-consistent model to study the effects of polar discontinuity on the electronic and atomic reconstruction at the NdNiO2/SrTiO3 interface. We find that a sharp interface with purely electronic reconstruction is energetically unfavorable, and atomic reconstruction is unavoidable. We further propose a possible interface configuration that contains residual apical oxygen. These oxygen atoms lead to hybrids of $d_{z^2}$ and $d_{x^2-y^2}$ states at the Fermi level, which weaken the single-band feature and may be detrimental to superconductivity.

preprint, 2020, arXiv

Populating HI gas in dark matter halos: I. method

We combine data from the Sloan Digital Sky Survey (SDSS) and the Arecibo Legacy Fast ALFA Survey (ALFALFA) to establish an empirical model for the HI gas content within dark matter halos. A cross-match between our SDSS DR7 galaxy group sample and the ALFALFA HI sources provides a catalog of 16,520 HI-galaxy pairs within 14,270 galaxy groups (halos). Using these matched pairs, we model the HI gas mass distributions within halos using two components: 1) in situ galaxy relations that involve the HI masses, colors ($g-r$) and stellar masses; 2) an ex situ dependence of the HI mass on the halo mass/environment. We find that if we use only galaxy-associated scaling relations to predict the HI gas distribution (component 1 alone), the number of HI detections is significantly over-predicted with respect to the ALFALFA observations. We introduce a concept for the survival of the HI masses/members within halos of different masses, labelled the 'efficiency' factor, to describe the probability that a halo retains its HI detections. Taking this into account, we construct a 'halo based HI mass model' which not only predicts the HI masses of galaxies, but also yields number, stellar mass, halo mass and satellite fraction distributions similar to those of the HI detections retrieved from observational data.

preprint, 2020, arXiv

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

We propose a new application of embedding techniques for problem retrieval in adaptive tutoring. The objective is to retrieve problems whose mathematical concepts are similar. There are two challenges: First, like sentences, problems helpful to tutoring are never exactly the same in terms of the underlying concepts. Instead, good problems mix concepts in innovative ways, while still displaying continuity in their relationships. Second, it is difficult for humans to determine a similarity score that is consistent across a large enough training set. We propose a hierarchical problem embedding algorithm, called Prob2Vec, that consists of abstraction and embedding steps. Prob2Vec achieves 96.88% accuracy on a problem similarity test, in contrast to 75% from directly applying state-of-the-art sentence embedding methods. It is interesting that Prob2Vec is able to distinguish very fine-grained differences among problems, an ability humans need time and effort to acquire. In addition, the sub-problem of concept labeling with imbalanced training data set is interesting in its own right. It is a multi-label problem suffering from dimensionality explosion, which we propose ways to ameliorate. We propose the novel negative pre-training algorithm that dramatically reduces false negative and positive ratios for classification, using an imbalanced training data set.

preprint, 2020, arXiv

Topotactic hydrogen in nickelate superconductors and akin infinite-layer oxides ABO2

Superconducting nickelates appear to be difficult to synthesize. Since the chemical reduction of ABO3 (A: rare earth; B: transition metal) with CaH2 may result in both ABO2 and ABO2H, we calculate the topotactic H binding energy by density functional theory (DFT). We find that intercalating H is energetically favorable for LaNiO2 but not for Sr-doped NdNiO2. This has dramatic consequences for the electronic structure as determined by DFT+dynamical mean field theory: that of 3d9 LaNiO2 is similar to (doped) cuprates, whereas 3d8 LaNiO2H is a two-orbital Mott insulator. Topotactic H might hence explain why some nickelates are superconducting and others are not.

preprint, 2016, arXiv

A regional compound Poisson process for hurricane and tropical storm damage

In light of intense hurricane activity along the U.S. Atlantic coast, attention has turned to understanding both the economic impact and behaviour of these storms. The compound Poisson-lognormal process has been proposed as a model for aggregate storm damage, but does not shed light on regional analysis since storm path data are not used. In this paper, we propose a fully Bayesian regional prediction model which uses conditional autoregressive (CAR) models to account for both storm paths and spatial patterns for storm damage. When fitted to historical data, the analysis from our model both confirms previous findings and reveals new insights on regional storm tendencies. Posterior predictive samples can also be used for pricing regional insurance premiums, which we illustrate using three different risk measures.
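
To make the compound Poisson-lognormal idea mentioned in this abstract concrete, here is a minimal Python sketch of the aggregate model only: a Poisson number of storms, each with a lognormal loss, summed into one total. It does not implement the paper's Bayesian regional model or its CAR component, and the function names and parameter values (`lam`, `mu`, `sigma`) are illustrative assumptions rather than fitted quantities.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method.

    Suitable for small rates; chosen here only to stay within the
    standard library.
    """
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def aggregate_damage(lam, mu, sigma, rng):
    """One draw of total damage: a Poisson number of lognormal losses."""
    n_storms = sample_poisson(lam, rng)
    return sum(rng.lognormvariate(mu, sigma) for _ in range(n_storms))


# Hypothetical parameters, not estimates from any data set.
rng = random.Random(0)
draws = [aggregate_damage(3.0, 0.0, 0.5, rng) for _ in range(20000)]
simulated_mean = sum(draws) / len(draws)
# Theoretical mean of a compound Poisson sum: lam * E[X],
# with E[X] = exp(mu + sigma^2 / 2) for a lognormal loss X.
theoretical_mean = 3.0 * math.exp(0.0 + 0.5 ** 2 / 2)
print(simulated_mean, theoretical_mean)
```

The simulated and theoretical means should agree closely, which is a quick sanity check before using such draws for any risk measure.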

preprint, 2016, arXiv

An empirical model to form and evolve galaxies in dark matter halos

Based on the star formation histories (SFH) of galaxies in halos of different masses, we develop an empirical model to grow galaxies in dark matter halos. This model has very few ingredients, each of which can be associated with observational data and thus be efficiently assessed. By applying this model to a very high resolution cosmological $N$-body simulation, we predict a number of galaxy properties that are a very good match to relevant observational data. Namely, for both centrals and satellites, the galaxy stellar mass function (SMF) up to redshift $z\simeq4$ and the conditional stellar mass functions (CSMF) in the local universe are in good agreement with observations. In addition, the 2-point correlation is well predicted in the different stellar mass ranges explored by our model. Furthermore, after applying stellar population synthesis models to our stellar composition as a function of redshift, we find that the luminosity functions in the $^{0.1}u$, $^{0.1}g$, $^{0.1}r$, $^{0.1}i$ and $^{0.1}z$ bands agree quite well with the SDSS observational results down to an absolute magnitude of about -17.0. The SDSS conditional luminosity functions (CLF) are also predicted well. Finally, the cold gas is derived from the star formation rate (SFR) to predict the HI gas mass within each mock galaxy. We find a remarkably good match to observed HI-to-stellar mass ratios. These features ensure that such galaxy/gas catalogs can be used to generate reliable mock redshift surveys.

preprint, 2016, arXiv

Galaxy groups in the 2MASS Redshift Survey

A galaxy group catalog is constructed from the 2MASS Redshift Survey (2MRS) with the use of a halo-based group finder. The halo mass associated with a group is estimated using a 'GAP' method based on the luminosity of the central galaxy and its gap with other member galaxies. Tests using mock samples show that this method is reliable, particularly for poor systems containing only a few members. On average 80% of all the groups have completeness >0.8, and about 65% of the groups have zero contamination. Halo masses are estimated with a typical uncertainty $\sim 0.35\,{\rm dex}$. The application of the group finder to the 2MRS gives 29,904 groups from a total of 43,246 galaxies at $z \leq 0.08$, with 5,286 groups having two or more members. Some basic properties of this group catalog are presented, and comparisons are made with other group catalogs in overlap regions. With a depth to $z\sim 0.08$ and uniformly covering about 91% of the whole sky, this group catalog provides a useful database to study galaxies in the local cosmic web, and to reconstruct the mass distribution in the local Universe.

preprint, 2016, arXiv

Mapping the real space distributions of galaxies in SDSS DR7: I. Two Point Correlation Functions

Using a method to correct redshift space distortion (RSD) for individual galaxies, we map the real space distributions of galaxies in the Sloan Digital Sky Survey (SDSS) Data Release 7 (DR7). We use an ensemble of mock catalogs to demonstrate the reliability of our method. Here, as the first paper in a series, we mainly focus on the two point correlation function (2PCF) of galaxies. Overall the 2PCF measured in the reconstructed real space for galaxies brighter than $^{0.1}{\rm M}_r-5\log h=-19.0$ agrees with the direct measurement to an accuracy better than the measurement error due to cosmic variance, if the reconstruction uses the correct cosmology. Applying the method to the SDSS DR7, we construct a real space version of the main galaxy catalog, which contains 396,068 galaxies in the North Galactic Cap with redshifts in the range $0.01 \leq z \leq 0.12$. The Sloan Great Wall, the largest known structure in the nearby Universe, is not as dominant an over-dense structure as it appears to be in redshift space. We measure the 2PCFs in reconstructed real space for galaxies of different luminosities and colors. All of them show clear deviations from single power-law forms, and reveal clear transitions from 1-halo to 2-halo terms. A comparison with the corresponding 2PCFs in redshift space nicely demonstrates how RSDs boost the clustering power on large scales (by about $40-50\%$ at scales $\sim 10 h^{-1}{\rm {Mpc}}$) and suppress it on small scales (by about $70-80\%$ at a scale of $0.3 h^{-1}{\rm {Mpc}}$).

preprint, 2016, arXiv

Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly "think like a vertex" programming interface to users, and exhibit good horizontal scalability. However, these systems are designed for tasks where the majority of graph vertices participate in computation, but are not suitable for processing light-workload graph queries where only a small fraction of vertices need to be accessed. The programming paradigm adopted by these systems can seriously under-utilize the resources in a cluster for graph query processing. In this work, we develop a new open-source system, called Quegel, for querying big graphs, which treats queries as first-class citizens in the design of its computing model. Users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand using a novel superstep-sharing execution model to effectively utilize the cluster resources. Quegel further provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems. Our experiments verified that Quegel is highly efficient in answering various types of graph queries and is up to orders of magnitude faster than existing systems.

preprint2015arXiv

Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation

Massive graphs, such as online social networks and communication networks, have become common today. To efficiently analyze such large graphs, many distributed graph computing systems have been developed. These systems employ the "think like a vertex" programming paradigm, where a program proceeds in iterations and at each iteration, vertices exchange messages with each other. However, using Pregel's simple message passing mechanism, some vertices may send/receive significantly more messages than others due to either the high degree of these vertices or the logic of the algorithm used. This forms the communication bottleneck and leads to imbalanced workload among machines in the cluster. In this paper, we propose two effective message reduction techniques: (1) vertex mirroring with message combining, and (2) an additional request-respond API. These techniques not only reduce the total number of messages exchanged through the network, but also bound the number of messages sent/received by any single vertex. We theoretically analyze the effectiveness of our techniques, and implement them on top of our open-source Pregel implementation called Pregel+. Our experiments on various large real graphs demonstrate that our message reduction techniques significantly improve the performance of distributed graph computation.
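To make the message-combining idea concrete, here is a minimal sketch of a generic sender-side combiner of the kind used in Pregel-style systems. It merges all messages bound for the same destination vertex before they leave the machine. The function name `combine_messages` and the `(destination, value)` message format are assumptions made for illustration; this is not the Pregel+ API, and vertex mirroring is not shown.

```python
def combine_messages(outgoing, combiner):
    """Merge messages addressed to the same destination vertex.

    outgoing: iterable of (destination_vertex, value) pairs (assumed format).
    combiner: associative, commutative function of two values, e.g. sum or min.
    Returns a dict with exactly one combined value per destination.
    """
    combined = {}
    for dst, value in outgoing:
        if dst in combined:
            combined[dst] = combiner(combined[dst], value)
        else:
            combined[dst] = value
    return combined


# Four messages leave this worker, but only two distinct destinations exist.
outgoing = [(7, 1.0), (7, 2.5), (3, 0.5), (7, 0.5)]
combined = combine_messages(outgoing, lambda a, b: a + b)
```

With a sum combiner, the three messages for vertex 7 collapse into one, so the network carries two messages instead of four. Combining is only valid when the algorithm's message handling is insensitive to the order and grouping of the inputs.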

preprint2015arXiv

Sampling with Walsh Transforms

With the advent of massive data outputs at a regular rate, signal processing technology plays an increasingly key role. Nowadays, signals are not merely restricted to physical sources; they have been extended to digital sources as well. Under the general assumption of discrete statistical signal sources, we propose a practical problem of sampling incomplete noisy signals about which we have no a priori knowledge and for which the sample size is bounded. We approach this sampling problem by Shannon's channel coding theorem. We use an extremal binary channel with high probability of transmission error, which is rare in communication theory. Our main result demonstrates that it is the large Walsh coefficient(s) that characterize(s) discrete statistical signals, regardless of the signal sources. Note that this is a known fact in specific application domains such as images. By the connection of Shannon's theorem, we establish the necessary and sufficient condition for our generic sampling problem for the first time. Finally, we discuss the cryptographic significance of the sparse Walsh transform.
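For readers unfamiliar with Walsh coefficients, the sketch below computes an unnormalized Walsh-Hadamard transform with the standard in-place butterfly algorithm. It only illustrates the transform itself and is not the sampling procedure from the paper; the function name `walsh_hadamard` and the sample vector are our own choices.

```python
def walsh_hadamard(values):
    """Return the unnormalized Walsh-Hadamard transform (natural ordering).

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine blocks of size h pairwise.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
coefficients = walsh_hadamard(signal)
```

Applying the transform twice returns the original vector scaled by its length, which is a convenient sanity check. Coefficients with large absolute value are the ones the abstract refers to as characterizing the signal.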

preprint2015arXiv

Using member galaxy luminosities as halo mass proxies of galaxy groups

Reliable halo mass estimation for a given galaxy system plays an important role in both cosmology and galaxy formation studies. Here we set out to find a way to improve the halo mass estimation for galaxy systems in which only a limited number of the brightest member galaxies have been observed. Using four mock galaxy samples constructed from semi-analytical formation models, the subhalo abundance matching method and the conditional luminosity functions, respectively, we find that the luminosity gap between the brightest and the subsequent brightest member galaxies in a halo (group) can be used to significantly reduce the scatter in the halo mass estimation based on the luminosity of the brightest galaxy alone. Tests show that these corrections can significantly reduce the scatter in the halo mass estimations by $\sim 50\%$ to $\sim 70\%$ in massive halos depending on which member galaxies are considered. Compared to the traditional ranking method, we find that this method works better for groups with fewer than five members, or in observations with a very bright magnitude cut.

preprint2013arXiv

Constraining the Star Formation Histories in Dark Matter Halos: I. Central Galaxies

Using the self-consistent modeling of the conditional stellar mass functions across cosmic time by Yang et al. (2012), we make model predictions for the star formation histories (SFHs) of central galaxies in halos of different masses. The model requires the following two key ingredients: (i) mass assembly histories of central and satellite galaxies, and (ii) local observational constraints of the star formation rates of central galaxies as a function of halo mass. We obtain a universal fitting formula that describes the (median) SFH of central galaxies as a function of halo mass, galaxy stellar mass and redshift. We use this model to make predictions for various aspects of the star formation rates of central galaxies across cosmic time. Our main findings are the following. (1) The specific star formation rate (SSFR) at high $z$ increases rapidly with increasing redshift [$\propto (1+z)^{2.5}$] for halos of a given mass and only slowly with halo mass ($\propto M_h^{0.12}$) at a given $z$, in almost perfect agreement with the specific mass accretion rate of dark matter halos. (2) The ratio between the star formation rate (SFR) in the main-branch progenitor and the final stellar mass of a galaxy peaks roughly at a constant value, $\sim 10^{-9.3} h^2 {\rm yr}^{-1}$, independent of halo mass or the final stellar mass of the galaxy. However, the redshift at which the SFR peaks increases rapidly with halo mass. (3) More than half of the stars in the present-day Universe were formed in halos with $10^{11.1}\,h^{-1}{\rm M}_\odot < M_h < 10^{12.3}\,h^{-1}{\rm M}_\odot$ in the redshift range $0.4 < z < 1.9$. (4) ... [abridged]
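The two scalings quoted in finding (1) can be turned into a small worked calculation. The sketch below evaluates only ratios of SSFR, because the abstract gives the exponents but not a normalization, and it should be read as applying to the high-redshift regime the abstract refers to. The function name `ssfr_ratio` and its parameters are hypothetical.

```python
def ssfr_ratio(z_from, z_to, mass_from=1.0, mass_to=1.0):
    """Ratio SSFR(z_to, mass_to) / SSFR(z_from, mass_from).

    Uses the scalings quoted in the abstract for high redshift:
    SSFR proportional to (1 + z)**2.5 at fixed halo mass and to
    M_h**0.12 at fixed redshift. No absolute normalization is assumed.
    """
    redshift_factor = ((1.0 + z_to) / (1.0 + z_from)) ** 2.5
    mass_factor = (mass_to / mass_from) ** 0.12
    return redshift_factor * mass_factor


# Doubling (1 + z) from z = 3 to z = 7 at fixed halo mass.
boost_from_redshift = ssfr_ratio(3.0, 7.0)
# A tenfold increase in halo mass at fixed redshift.
boost_from_mass = ssfr_ratio(3.0, 3.0, mass_from=1e11, mass_to=1e12)
```

Doubling $(1+z)$ raises the SSFR by a factor of $2^{2.5}$, a bit under 6, whereas a tenfold increase in halo mass changes it by only $10^{0.12}$, roughly 30 percent, which is the contrast between "rapidly" and "slowly" in the abstract.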

preprint2013arXiv

Decay of tails at equilibrium for FIFO join the shortest queue networks

In join the shortest queue networks, incoming jobs are assigned to the shortest queue from among a randomly chosen subset of $D$ queues, in a system of $N$ queues; after completion of service at its queue, a job leaves the network. We also assume that jobs arrive into the system according to a rate-$αN$ Poisson process, $α<1$, with rate-1 service at each queue. When the service at queues is exponentially distributed, it was shown in Vvedenskaya et al. [Probl. Inf. Transm. 32 (1996) 15-29] that the tail of the equilibrium queue size decays doubly exponentially in the limit as $N\rightarrow\infty$. This is a substantial improvement over the case $D=1$, where the queue size decays exponentially. The reasoning in [Probl. Inf. Transm. 32 (1996) 15-29] does not easily generalize to jobs with nonexponential service time distributions. A modularized program for treating general service time distributions was introduced in Bramson et al. [In Proc. ACM SIGMETRICS (2010) 275-286]. The program relies on an ansatz that asserts, in equilibrium, any fixed number of queues become independent of one another as $N\rightarrow\infty$. This ansatz was demonstrated in several settings in Bramson et al. [Queueing Syst. 71 (2012) 247-292], including for networks where the service discipline is FIFO and the service time distribution has a decreasing hazard rate. In this article, we investigate the limiting behavior, as $N\rightarrow \infty$, of the equilibrium at a queue when the service discipline is FIFO and the service time distribution has a power law with a given exponent $-β$, for $β>1$. We show under the above ansatz that, as $N\rightarrow\infty$, the tail of the equilibrium queue size exhibits a wide range of behavior depending on the relationship between $β$ and $D$. In particular, if $β>D/(D-1)$, the tail is doubly exponential and, if $β<D/(D-1)$, the tail has a power law. When $β=D/(D-1)$, the tail is exponentially distributed.
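The three regimes stated at the end of the abstract depend only on how $β$ compares with the threshold $D/(D-1)$. The small helper below encodes that comparison as a lookup. The function name `tail_regime` and the returned labels are our own, and the code restates the abstract's classification without deriving it.

```python
def tail_regime(beta, d):
    """Classify the equilibrium queue-size tail for FIFO service with a
    power-law service time distribution (exponent -beta, beta > 1) and
    D = d sampled queues, following the thresholds quoted in the abstract.
    """
    if d < 2:
        raise ValueError("the threshold D/(D-1) requires D >= 2")
    if beta <= 1:
        raise ValueError("the abstract assumes beta > 1")
    # Compare beta with D/(D-1) without dividing: beta*(D-1) versus D.
    lhs = beta * (d - 1)
    if lhs > d:
        return "doubly exponential"
    if lhs < d:
        return "power law"
    return "exponential"


regimes = [tail_regime(3, 2), tail_regime(1.2, 2), tail_regime(1.5, 3)]
```

For $D=2$ the threshold is $β=2$, and for $D=3$ it drops to $β=1.5$, so sampling more queues widens the range of service time distributions for which the doubly exponential tail survives.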

preprint2010arXiv

Log-Poisson Non-Gaussianity of Ly$α$ Transmitted Flux Fluctuations at High Redshift

We investigate the non-Gaussian features of the IGM at redshift $z\sim 5 - 6$ using the Ly$α$ transmitted flux of quasar absorption spectra and a cosmological hydrodynamic simulation of the concordance $Λ$CDM universe. We show that the neutral hydrogen mass density field and Ly$α$ transmitted flux fluctuations possess all the non-Gaussian features predicted by the log-Poisson hierarchy, which depends only on two dimensionless parameters $β$ and $γ$, describing, respectively, the intermittency and singularity of the random fields. We find that the non-Gaussianity of the Ly$α$ transmitted flux of quasars from $z=4.9$ to $z=6.3$ can be well reconstructed by the hydrodynamical simulation samples. Although the Gunn-Peterson optical depth and its variance underwent a significant evolution in the redshift range of $5 - 6$, the intermittency measured by $β$ is almost redshift-independent in this range. More interestingly, the intermittency of quasar absorption spectra on physical scales $0.1-1$ h$^{-1}$Mpc at redshifts $5 - 6$ is found to be about the same as that on physical scales $1-10$ h$^{-1}$Mpc at redshifts $2 - 4$. Considering that the Jeans length is less than 0.1 h$^{-1}$Mpc at $z\sim 5$, and $1$ h$^{-1}$Mpc at $z\sim 2$, these results imply that the nonlinear evolution at high and low redshifts will lead the cosmic baryon fluid to a state similar to fully developed turbulence. The log-Poisson high order behavior of current high redshift quasar spectra can be explained by a uniform UV background in the redshift range considered. We also studied the log-Poisson non-Gaussianity by considering an inhomogeneous background. With several simplified models of the inhomogeneous background, we found that its effect on the log-Poisson non-Gaussianity is not larger than 1-sigma.

preprint2008arXiv

Log-Poisson Hierarchical Clustering of Cosmic Neutral Hydrogen and Ly-alpha Transmitted Flux of QSO Absorption Spectrum

We study, in this paper, the non-Gaussian features of the mass density field of neutral hydrogen fluid and the Ly-alpha transmitted flux of QSO absorption spectrum from the point of view of the self-similar log-Poisson hierarchy. It has been shown recently that, in the scale range from the onset of nonlinear evolution to dissipation, the velocity and mass density fields of cosmic baryon fluid are extremely well described by She-Leveque's scaling formula, which is due to the log-Poisson hierarchical cascade. Since the mass density ratio of ionized hydrogen to total hydrogen is not uniform in space, the mass density field of the neutral hydrogen component is not given by a similar mapping of the total baryon fluid. Nevertheless, we show, with hydrodynamic simulation samples of the concordance $Λ$CDM universe, that the mass density field of neutral hydrogen is also well described by the log-Poisson hierarchy. We then investigate the field of Ly$α$ transmitted flux of QSO absorption spectrum. Due to redshift distortion, Ly$α$ transmitted flux fluctuations no longer show all features of the log-Poisson hierarchy. However, some non-Gaussian features predicted by the log-Poisson hierarchy are not affected by the redshift distortion. We test these predictions with high resolution and high S/N data of quasar Ly$α$ absorption spectra. All results given by real data, including $β$-hierarchy, high order moments and scale-scale correlation, are found to be well consistent with the log-Poisson hierarchy. We compare the log-Poisson hierarchy with the popular log-normal model of the Ly$α$ transmitted flux. The latter is found to yield too strong non-Gaussianity at high orders, while the log-Poisson hierarchy is in agreement with observed data.

33 published item(s)

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Side-by-Side vs Face-to-Face: Evaluating Colocated Collaboration via a Transparent Wall-sized Display

Active Learning Over Multiple Domains in Natural Language Tasks

An Extended Halo-based Group/Cluster finder: application to the DESI legacy imaging surveys DR8

COGEDAP: A COmprehensive GEnomic Data Analysis Platform

DePS: An improved deep learning model for de novo peptide sequencing

High spectral-resolution interferometry down to 1 micron with Asgard/BIFROST at VLTI: Science drivers and project overview

Magnetism in doped infinite-layer NdNiO2 studied by combined density functional theory and dynamical mean-field theory

Matrix Syncer -- A Multi-chain Data Aggregator For Supporting Blockchain-based Metaverses

Nuclear states projected from a pair condensate

BAlN alloy for enhanced two-dimensional electron gas characteristics of GaN-based high electron mobility transistor

BAlN for III-nitride UV light emitting diodes: undoped electron blocking layer

ContourRend: A Segmentation Method for Improving Contours by Rendering

Exact sum rules with approximate ground states

Graph Computing based Distributed State Estimation with PMUs

Graph-FCN for image semantic segmentation

Polar Rectification Effect in Electro-Fatigued SrTiO3 Based Junctions

Polarity induced electronic and atomic reconstruction at NdNiO2/SrTiO3 interfaces

Populating HI gas in dark matter halos: I. method

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

Topotactic hydrogen in nickelate superconductors and akin infinite-layer oxides ABO2

A regional compound Poisson process for hurricane and tropical storm damage

An empirical model to form and evolve galaxies in dark matter halos

Galaxy groups in the 2MASS Redshift Survey

Mapping the real space distributions of galaxies in SDSS DR7: I. Two Point Correlation Functions

Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation

Sampling with Walsh Transforms

Using member galaxy luminosities as halo mass proxies of galaxy groups

Constraining the Star Formation Histories in Dark Matter Halos: I. Central Galaxies

Decay of tails at equilibrium for FIFO join the shortest queue networks

Log-Poisson Non-Gaussianity of Ly$α$ Transmitted Flux Fluctuations at High Redshift

Log-Poisson Hierarchical Clustering of Cosmic Neutral Hydrogen and Ly-alpha Transmitted Flux of QSO Absorption Spectrum