Source author record

Min Yang

Min Yang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computation and Language; cond-mat.mtrl-sci; Machine Learning; Artificial Intelligence; Cryptography and Security; Computer Vision; cond-mat.mes-hall; Information Retrieval; math.ST; physics.ins-det; Statistics Theory; hep-ex; Human-Computer Interaction; physics.app-ph; physics.class-ph; physics.med-ph; quant-ph; Software Engineering

Catalog footprint

What is connected

42 works

18 topics

4 close collaborators

preprint, 2026, arXiv

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based project-level code synthesis. Existing benchmarks rely on idealized assumptions, in particular well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirement engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.

preprint, 2026, arXiv

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inherently interactive and iterative nature, similar to the peer review and rebuttal process in academic publishing. In this paper, we introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response. Extensive experiments across various LLMs reveal critical insights into model performance, including differences between proprietary and open-source models, as well as task asymmetries between examiner analysis and applicant-side rebuttal. These findings highlight both the potential and current limitations of LLMs in modeling complex, real-world legal reasoning and technical novelty judgment in patent examination. We release our code and dataset to facilitate future research on patent examination modeling.

preprint, 2026, arXiv

Valley3: Scaling Omni Foundation Models for E-commerce

In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.

preprint, 2024, arXiv

DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever

Recently, substantial advancements in pre-trained vision-language models have greatly enhanced the capabilities of multi-modal dialog systems. These models have demonstrated significant improvements by fine-tuning on downstream tasks. However, the existing pre-trained models primarily focus on effectively capturing the alignment between vision and language modalities, often ignoring the intricate nature of dialog context. In this paper, we propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval. Specifically, our approach introduces a multi-modal context prompt generator to learn context features, which are subsequently distilled into prompts within the pre-trained vision-language model CLIP. In addition, we introduce a domain prompt to mitigate the discrepancy from the downstream dialog data. To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to a multi-modal representation space, with each expert being responsible for one specific retrieval type. Extensive experiments show that DialCLIP achieves state-of-the-art performance on two widely recognized benchmark datasets (i.e., PhotoChat and MMDialog) by tuning a mere 0.04% of the total parameters. These results highlight the efficacy and efficiency of our proposed approach, underscoring its potential to advance the field of multi-modal dialog retrieval.

preprint, 2024, arXiv

Unifying Structured Data as Graph for Data-to-Text Pre-Training

Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proved to be powerful in enhancing D2T generation and yields impressive performances. However, previous pre-training methods either oversimplified structured data into a sequence without considering input structures or designed training objectives tailored for a specific data structure (e.g., table or knowledge graph). In this paper, we unify different types of structured data (i.e., table, key-value data, knowledge graph) into the graph format and cast different data-to-text generation tasks as graph-to-text generation. To effectively exploit the structural information of the input graph, we propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer. Concretely, we devise a position matrix for the Transformer, encoding relative positional information of connected nodes in the input graph. In addition, we propose a new attention matrix to incorporate graph structures into the original Transformer by taking the available explicit connectivity structure into account. Extensive experiments on six benchmark datasets show the effectiveness of our model. Our source codes are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/unid2t.

preprint, 2022, arXiv

A Certifiable Security Patch for Object Tracking in Self-Driving Systems via Historical Deviation Modeling

Self-driving cars (SDC) commonly implement a perception pipeline to detect the surrounding obstacles and track their moving trajectories, which lays the ground for the subsequent driving decision-making process. Although the security of obstacle detection in SDC has been intensively studied, only very recently have attackers started to exploit the vulnerability of the tracking module. Compared with solely attacking the object detectors, this new attack strategy influences the driving decision more effectively with a smaller attack budget. However, little is known about whether the revealed vulnerability remains effective in end-to-end self-driving systems and, if so, how to mitigate the threat. In this paper, we present the first systematic research on the security of object tracking in SDC. Through a comprehensive case study on the full perception pipeline of a popular open-source self-driving system, Baidu's Apollo, we prove that the mainstream multi-object tracker (MOT) based on the Kalman Filter (KF) is unsafe even with an enabled multi-sensor fusion mechanism. Our root cause analysis reveals that the vulnerability is innate to the design of KF-based MOT, which is supposed to handle errors in the prediction results from the object detectors, yet the adopted KF algorithm is prone to trust the observation more when its deviation from the prediction is larger. To address this design flaw, we propose a simple yet effective security patch for KF-based MOT, the core of which is an adaptive strategy to balance the focus of KF on observations and predictions according to the anomaly index of the observation-prediction deviation, and which has certified effectiveness against a generalized hijacking attack model. Extensive evaluation on 4 existing KF-based MOT implementations (including 2D and 3D, academic and Apollo ones) validates the defense effectiveness and the trivial performance overhead of our approach.
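The idea of balancing a Kalman filter between observation and prediction according to how anomalous their deviation is can be illustrated with a minimal one-dimensional sketch. This is not the patch from the paper: the function name `adaptive_kalman_update`, the use of the normalized innovation as the anomaly index, the quadratic noise inflation, and the `threshold` parameter are all assumptions chosen for illustration.

```python
import math


def adaptive_kalman_update(x_pred, p_pred, z, r, threshold=3.0):
    """One scalar Kalman measurement update with anomaly-aware gain.

    x_pred, p_pred: predicted state and its variance.
    z, r: observation and its nominal noise variance.
    threshold: hypothetical cutoff (in standard deviations) above which
        the observation is treated as increasingly suspicious.
    """
    innovation = z - x_pred
    # Innovation variance under the nominal model.
    s = p_pred + r
    # Assumed anomaly index: innovation measured in standard deviations.
    anomaly = abs(innovation) / math.sqrt(s)
    # Inflate the measurement noise once the anomaly index passes the
    # threshold, so large deviations shift trust toward the prediction.
    scale = max(1.0, (anomaly / threshold) ** 2)
    r_eff = r * scale
    gain = p_pred / (p_pred + r_eff)
    x_new = x_pred + gain * innovation
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, gain


if __name__ == "__main__":
    # A plausible observation gets the ordinary gain.
    print(adaptive_kalman_update(0.0, 1.0, 0.5, 1.0))
    # A far-off observation is heavily down-weighted.
    print(adaptive_kalman_update(0.0, 1.0, 100.0, 1.0))
```

With a nominal observation the update reduces to the ordinary Kalman step; with an observation far from the prediction the gain shrinks, so a single hijacked detection moves the track only slightly.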

preprint, 2022, arXiv

A Survey of Natural Language Generation

This paper offers a comprehensive review of the research on Natural Language Generation (NLG) over the past two decades, especially in relation to data-to-text generation and text-to-text generation deep learning methods, as well as new applications of NLG technology. This survey aims to (a) give the latest synthesis of deep learning research on the NLG core tasks, as well as the architectures adopted in the field; (b) detail meticulously and comprehensively various NLG tasks and datasets, and draw attention to the challenges in NLG evaluation, focusing on different evaluation methods and their relationships; (c) highlight some future emphasis and relatively recent research issues that arise due to the increasing synergy between NLG and other artificial intelligence areas, such as computer vision, text and computational creativity.

preprint, 2022, arXiv

A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions

Text-to-SQL parsing is an essential and challenging task. The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) query based on the evidence provided by relational databases. Early text-to-SQL parsing systems from the database community achieved noticeable progress at the cost of heavy human engineering and user interactions with the systems. In recent years, deep neural networks have significantly advanced this task through neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query. Subsequently, large pre-trained language models have taken the state of the art of the text-to-SQL parsing task to a new level. In this survey, we present a comprehensive review of deep learning approaches for text-to-SQL parsing. First, we introduce the text-to-SQL parsing corpora, which can be categorized as single-turn and multi-turn. Second, we provide a systematic overview of pre-trained language models and existing methods for text-to-SQL parsing. Third, we present readers with the challenges faced by text-to-SQL parsing and explore some potential future directions in this field.

preprint, 2022, arXiv

Adversarial Momentum-Contrastive Pre-Training

Recently proposed adversarial self-supervised learning methods usually require big batches and long training epochs to extract robust features, which brings heavy computational overhead on platforms with limited resources. In order to help the network learn more powerful feature representations in smaller batches and fewer epochs, this paper proposes a novel adversarial momentum contrastive learning method, which introduces two memory banks corresponding to clean samples and adversarial samples, respectively. These memory banks can be dynamically incorporated into the training process to track invariant features among historical mini-batches. Compared with the previous adversarial pre-training model, our method achieves superior performance with a smaller batch size and fewer training epochs. In addition, the model outperforms some state-of-the-art supervised defensive methods on multiple benchmark datasets after being fine-tuned on downstream classification tasks.

preprint, 2022, arXiv

Cracking White-box DNN Watermarks via Invariant Neuron Transforms

Recently, how to protect the Intellectual Property (IP) of deep neural networks (DNN) has become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activation) of the target model. Sacrificing less functionality and involving more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible and secure against most known watermark removal attacks, with emerging research efforts and applications in the industry. In this paper, we present the first effective removal attack which cracks almost all the existing white-box watermarking schemes with provably no performance overhead and no required prior knowledge. By analyzing these IP protection mechanisms at the granularity of neurons, we for the first time discover their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On 9 state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to be almost random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge of the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.

preprint, 2022, arXiv

DALG: Deep Attentive Local and Global Modeling for Image Retrieval

Deeply learned representations have achieved superior image retrieval performance in a retrieve-then-rerank manner. A recent state-of-the-art single-stage model, which heuristically fuses local and global features, achieves a promising trade-off between efficiency and effectiveness. However, we notice that the efficiency of existing solutions is still restricted because of their multi-scale inference paradigm. In this paper, we follow the single-stage art and obtain a further complexity-effectiveness balance by successfully getting rid of multi-scale testing. To achieve this goal, we abandon the widely used convolution network given its limitation in exploring diverse visual patterns, and resort to a fully attention-based framework for robust representation learning, motivated by the success of the Transformer. Besides applying the Transformer for global feature extraction, we devise a local branch composed of window-based multi-head attention and spatial attention to fully exploit local image patterns. Furthermore, we propose to combine the hierarchical local and global features via a cross-attention module, instead of using heuristic fusion as previous art does. With our Deep Attentive Local and Global modeling framework (DALG), extensive experimental results show that efficiency can be significantly improved while maintaining results competitive with the state of the art.

preprint, 2022, arXiv

GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-Supervised Learning and Explicit Policy Injection

Pre-trained models have proved to be powerful in enhancing task-oriented dialog systems. However, current pre-training methods mainly focus on enhancing dialog understanding and generation tasks while neglecting the exploitation of dialog policy. In this paper, we propose GALAXY, a novel pre-trained dialog model that explicitly learns dialog policy from limited labeled dialogs and large-scale unlabeled dialog corpora via semi-supervised learning. Specifically, we introduce a dialog act prediction task for policy optimization during pre-training and employ a consistency regularization term to refine the learned representation with the help of unlabeled dialogs. We also implement a gating mechanism to weigh suitable unlabeled dialog samples. Empirical results show that GALAXY substantially improves the performance of task-oriented dialog systems, and achieves new state-of-the-art results on benchmark datasets: In-Car, MultiWOZ2.0 and MultiWOZ2.1, improving their end-to-end combined scores by 2.5, 5.3 and 5.5 points, respectively. We also show that GALAXY has a stronger few-shot ability than existing models under various low-resource settings.

preprint, 2022, arXiv

Improve Deep Image Inpainting by Emphasizing the Complexity of Missing Regions

Deep image inpainting research mainly focuses on constructing various neural network architectures or imposing novel optimization objectives. However, on the one hand, building a state-of-the-art deep inpainting model is an extremely complex task, and on the other hand, the resulting performance gains are sometimes very limited. We believe that besides the frameworks of inpainting models, lightweight traditional image processing techniques, which are often overlooked, can actually be helpful to these deep models. In this paper, we enhance the deep image inpainting models with the help of classical image complexity metrics. A knowledge-assisted index composed of missingness complexity and forward loss is presented to guide the batch selection in the training procedure. This index helps find samples that are more conducive to optimization in each iteration and ultimately boost the overall inpainting performance. The proposed approach is simple and can be plugged into many deep inpainting models by changing only a few lines of code. We experimentally demonstrate the improvements for several recently developed image inpainting models on various datasets.

preprint, 2022, arXiv

Linking-Enhanced Pre-Training for Table Semantic Parsing

Recently, pre-training models have significantly improved the performance of various NLP tasks by leveraging large-scale text corpora to improve the contextual representation ability of the neural network. Large pre-trained language models have also been applied in the area of table semantic parsing. However, existing pre-training approaches have not carefully explored explicit interaction relationships between a question and the corresponding database schema, which is a key ingredient for uncovering their semantic and structural correspondence. Furthermore, question-aware representation learning in the schema grounding context has received less attention in pre-training objectives. To alleviate these issues, this paper designs two novel pre-training objectives to impose the desired inductive bias into the learned representations for table pre-training. We further propose a schema-aware curriculum learning approach to mitigate the impact of noise and learn effectively from the pre-training data in an easy-to-hard manner. We evaluate our pre-trained framework by fine-tuning it on two benchmarks, Spider and SQUALL. The results demonstrate the effectiveness of our pre-training objectives and curriculum compared to a variety of baselines.

preprint, 2022, arXiv

Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model

In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models which memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- With almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- After downloading the published carrier model, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness -- Almost all the recovered models have performance similar to what they would achieve if trained independently on the private data; (iv) Robustness -- Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness -- A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.

preprint, 2022, arXiv

MetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting

For model piracy forensics, previous model fingerprinting schemes are commonly based on adversarial examples constructed for the owner's model as the fingerprint, and verify whether a suspect model is indeed pirated from the original model by matching the behavioral patterns of the two models on the fingerprint examples. However, these methods rely heavily on the characteristics of classification tasks, which inhibits their application to more general scenarios. To address this issue, we present MetaV, the first task-agnostic model fingerprinting framework, which enables fingerprinting on a much wider range of DNNs independent of the downstream learning task, and exhibits strong robustness against a variety of ownership obfuscation techniques. Specifically, we generalize previous schemes into two critical design components in MetaV: the adaptive fingerprint and the meta-verifier, which are jointly optimized such that the meta-verifier learns to determine whether a suspect model is stolen based on the concatenated outputs of the suspect model on the adaptive fingerprint. As a key to being task-agnostic, the full process makes no assumption about the internals of the models in the ensemble, as long as they have the same input and output dimensions. Spanning classification, regression and generative modeling, extensive experimental results validate the substantially improved performance of MetaV over the state-of-the-art fingerprinting schemes and demonstrate the enhanced generality of MetaV for providing task-agnostic fingerprinting. For example, on fingerprinting ResNet-18 trained for skin cancer diagnosis, MetaV simultaneously achieves 100% true positives and 100% true negatives on a diverse test set of 70 suspect models, achieving a relative improvement of about 220% in ARUC in comparison to the optimal baseline.

preprint, 2022, arXiv

Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing

The importance of building text-to-SQL parsers which can be applied to new databases has long been acknowledged, and a critical step to achieve this goal is schema linking, i.e., properly recognizing mentions of unseen columns or tables when generating SQLs. In this work, we propose a novel framework to elicit relational structures from large-scale pre-trained language models (PLMs) via a probing procedure based on Poincaré distance metric, and use the induced relations to augment current graph-based parsers for better schema linking. Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences, even when surface forms of mentions and entities differ. Moreover, our probing procedure is entirely unsupervised and requires no additional parameters. Extensive experiments show that our framework sets new state-of-the-art performance on three benchmarks. We empirically verify that our probing procedure can indeed find desired relational structures through qualitative analysis. Our code can be found at https://github.com/AlibabaResearch/DAMO-ConvAI.
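For readers unfamiliar with the Poincaré distance metric mentioned in this abstract, the sketch below computes the standard hyperbolic distance between two points of the open unit ball. It illustrates the metric only; how Proton applies it to PLM representations is not described in this record, and the function name `poincare_distance` is a hypothetical choice.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between points u and v in the Poincaré ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    Both points must lie strictly inside the unit ball.
    """
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


if __name__ == "__main__":
    origin = [0.0, 0.0]
    # Distances grow rapidly as a point approaches the boundary.
    for radius in (0.5, 0.9, 0.99):
        print(radius, poincare_distance(origin, [radius, 0.0]))
```

Because distances blow up near the boundary of the ball, this metric is often used to embed hierarchical or tree-like relations in few dimensions.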

preprint, 2022, arXiv

Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search

Sequential recommender systems (SRS) have become a research hotspot due to their power in modeling users' dynamic interests and sequential behavioral patterns. To maximize model expressive ability, a default choice is to apply a larger and deeper network architecture, which, however, often brings high network latency when generating online recommendations. Naturally, we argue that compressing heavy recommendation models into middle- or light-weight neural networks is of great importance for practical production systems. To realize such a goal, we propose AdaRec, a knowledge distillation (KD) framework which compresses the knowledge of a teacher model into a student model adaptively according to its recommendation scene by using differentiable Neural Architecture Search (NAS). Specifically, we introduce a target-oriented distillation loss to guide the structure search process for finding the student network architecture, and a cost-sensitive loss as a constraint on model size, which achieves a superior trade-off between recommendation effectiveness and efficiency. In addition, we leverage Earth Mover's Distance (EMD) to realize many-to-many layer mapping during knowledge distillation, which enables each intermediate student layer to learn from other intermediate teacher layers adaptively. Extensive experiments on real-world recommendation datasets demonstrate that our model achieves competitive or better accuracy with notable inference speedup compared to strong counterparts, while discovering diverse neural architectures for sequential recommender models under different recommendation scenes.

preprint, 2022, arXiv

SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding

Pre-training methods with contrastive learning objectives have shown remarkable success in dialog understanding tasks. However, current contrastive learning solely considers the self-augmented dialog samples as positive samples and treats all other dialog samples as negative ones, which enforces dissimilar representations even for dialogs that are semantically related. In this paper, we propose SPACE-2, a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora via semi-supervised contrastive pre-training. Concretely, we first define a general semantic tree structure (STS) to unify the inconsistent annotation schema across different dialog datasets, so that the rich structural information stored in all labeled data can be exploited. Then we propose a novel multi-view score function to increase the relevance of all possible dialogs that share similar STSs and only push away other completely different dialogs during supervised contrastive pre-training. To fully exploit unlabeled dialogs, a basic self-supervised contrastive loss is also added to refine the learned representations. Experiments show that our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks. For reproducibility, we release the code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/space-2.

preprint, 2022, arXiv

SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation

Recently, pre-training methods have shown remarkable success in task-oriented dialog (TOD) systems. However, most existing pre-trained models for TOD focus on either dialog understanding or dialog generation, but not both. In this paper, we propose SPACE-3, a novel unified semi-supervised pre-trained conversation model learning from large-scale dialog corpora with limited annotations, which can be effectively fine-tuned on a wide range of downstream dialog tasks. Specifically, SPACE-3 consists of four successive components in a single transformer to maintain a task-flow in TOD systems: (i) a dialog encoding module to encode dialog history, (ii) a dialog understanding module to extract semantic vectors from either user queries or system responses, (iii) a dialog policy module to generate a policy vector that contains high-level semantics of the response, and (iv) a dialog generation module to produce appropriate responses. We design a dedicated pre-training objective for each component. Concretely, we pre-train the dialog encoding module with span mask language modeling to learn contextualized dialog information. To capture the structured dialog semantics, we pre-train the dialog understanding module via a novel tree-induced semi-supervised contrastive learning objective with the help of extra dialog annotations. In addition, we pre-train the dialog policy module by minimizing the L2 distance between its output policy vector and the semantic vector of the response for policy optimization. Finally, the dialog generation model is pre-trained by language modeling. Results show that SPACE-3 achieves state-of-the-art performance on eight downstream dialog benchmarks, including intent prediction, dialog state tracking, and end-to-end dialog modeling. We also show that SPACE-3 has a stronger few-shot ability than existing models under the low-resource setting.

preprint, 2021, arXiv

Relaxation and darkening of excitonic complexes in electrostatically-doped monolayer semiconductors: Roles of exciton-electron and trion-electron interactions

We present photoluminescence measurements in monolayer WSe$_2$, which point to the importance of the interaction between charged particles and excitonic complexes. The theoretical analysis highlights the key role played by exchange scattering, referring to cases wherein the particle composition of the complex changes after the interaction. For example, exchange scattering renders bright excitonic complexes dark in monolayer WSe$_2$ on account of the unique valley-spin configuration in this material. In addition to the ultrafast energy relaxation of hot excitonic complexes following their interaction with electrons or holes, our analysis sheds light on several key features that are commonly seen in the photoluminescence of this monolayer semiconductor. In particular, we can understand why the photoluminescence intensity of the neutral bright exciton is strongest when the monolayer is hole-doped rather than charge neutral or electron-doped, as well as the reason for the dramatic increase of the photoluminescence intensity of negatively-charged excitons (trions) as soon as electrons are added to the monolayer. To self-consistently explain the findings, we further study the photoluminescence spectra at different excitation energies and analyze the behavior of the elusive indirect exciton.

preprint2020arXiv

A Generic Network Compression Framework for Sequential Recommender Systems

Sequential recommender systems (SRS) have become the key technology for capturing users' dynamic interests and generating high-quality recommendations. Current state-of-the-art sequential recommender models are typically based on a sandwich-structured deep neural network, where one or more middle (hidden) layers are placed between the input embedding layer and the output softmax layer. In general, these models require a large number of parameters (such as a large embedding dimension or a deep network architecture) to obtain their optimal performance. Despite this effectiveness, at some point further increasing the model size makes deployment on resource-constrained devices harder, resulting in longer response times and a larger memory footprint. To resolve these issues, we propose a compressed sequential recommendation framework, termed CpRec, in which two generic model shrinking techniques are employed. Specifically, we first propose a block-wise adaptive decomposition to approximate the input and softmax matrices by exploiting the fact that items in SRS obey a long-tailed distribution. To reduce the parameters of the middle layers, we introduce three layer-wise parameter sharing schemes. We instantiate CpRec using a deep convolutional neural network with dilated kernels, giving consideration to both recommendation accuracy and efficiency. Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve compression rates of 4$\sim$8 times on real-world SRS datasets. Meanwhile, CpRec is faster during training/inference, and in most cases outperforms its uncompressed counterpart.

preprint2020arXiv

Amorphous Mo-Ta oxide nanotubes for long-term stable Mo oxide based supercapacitors

With the large-scale usage of portable electric appliances, a high demand for energy storage devices of increasingly high density has emerged. MoO3 has, in principle, a large potential as negative electrode material in supercapacitive devices, due to the high charge densities that can be obtained from its reversible redox reactions. Nevertheless, the extremely poor electrochemical stability of MoO3 in aqueous electrolytes prevents its practical use in high capacitance devices. In this work, we describe how to overcome this severe stability issue by forming amorphous molybdenum oxide/tantalum oxide nanotubes by anodic oxidation of a Mo-Ta alloy. The presence of a critical amount of Ta-oxide (> 20 at-%) prevents the electrochemical decay of the MoO3 phase and thus yields an extremely high stability. Due to the protection provided by tantalum oxide, no capacitance losses are measurable after 10000 charging/discharging cycles.

preprint2020arXiv

Discovering Protagonist of Sentiment with Aspect Reconstructed Capsule Network

Most recent aspect-term level sentiment analysis (ATSA) approaches combine various neural network models with delicately carved attention mechanisms built upon a given aspect and context to generate refined sentence representations for better predictions. In these methods, aspect terms are always provided in both the training and testing process, which may degrade aspect-level analysis into sentence-level prediction. However, the annotated aspect term might be unavailable in real-world scenarios, which may challenge the applicability of the existing methods. In this paper, we aim to improve ATSA by discovering the potential aspect terms of the predicted sentiment polarity when the aspect terms of a test sentence are unknown. We approach this goal by proposing a capsule network based model named CAPSAR. In CAPSAR, sentiment categories are denoted by capsules and aspect term information is injected into sentiment capsules through a sentiment-aspect reconstruction procedure during training. As a result, coherent patterns between aspects and sentimental expressions are encapsulated by these sentiment capsules. Experiments on three widely used benchmarks demonstrate that these patterns have potential for discovering aspect terms in a test sentence when only the sentence is fed to the model. Meanwhile, the proposed CAPSAR can clearly outperform SOTA methods in standard ATSA tasks.

preprint2020arXiv

Empirical Evaluation of Multi-task Learning in Deep Neural Networks for Natural Language Processing

Multi-Task Learning (MTL) aims at boosting the overall performance of each individual task by leveraging useful information contained in multiple related tasks. It has shown great success in natural language processing (NLP). Currently, a number of MTL architectures and learning mechanisms have been proposed for various NLP tasks. However, there has been no systematic, in-depth exploration and comparison of the different MTL architectures and learning mechanisms to explain their strong performance. In this paper, we conduct a thorough examination of typical MTL methods on a broad range of representative NLP tasks. Our primary goal is to understand the merits and demerits of existing MTL methods in NLP tasks, thus devising new hybrid architectures intended to combine their strengths.

preprint2020arXiv

Measurement of Conduction and Valence Bands g-factors in a Transition Metal Dichalcogenide Monolayer

The electron valley and spin degrees of freedom in monolayer transition-metal dichalcogenides can be manipulated in optical and transport measurements performed in magnetic fields. The key parameter for determining the Zeeman splitting, namely the separate contributions of the electron and hole g-factors, is inaccessible in most measurements. Here we present an original method that gives access to the respective contributions of the conduction and valence bands to the measured Zeeman splitting. It exploits the optical selection rules of exciton complexes, in particular the ones involving inter-valley phonons, avoiding strong renormalization effects that compromise single particle g-factor determination in transport experiments. These studies yield a direct determination of single band g-factors. We measure gc1 = 0.86 (gc2 = 3.84) for the bottom (top) conduction band and gv = 6.1 for the valence band of monolayer WSe2. These measurements are helpful for quantitative interpretation of optical and transport measurements performed in magnetic fields. In addition, the measured g-factors are valuable input parameters for optimizing band structure calculations of these 2D materials.

preprint2020arXiv

Semi-Supervised Recognition under a Noisy and Fine-grained Dataset

Semi-Supervised Recognition Challenge-FGVC7 is a challenging fine-grained recognition competition. One of the difficulties of this competition is how to use unlabeled data. We adopted pseudo-tag data mining to increase the amount of training data. The other difficulty is how to distinguish similar birds that differ only slightly, especially those that have a relatively tiny main body in the examples. We combined generic image recognition and fine-grained image recognition methods to solve the problem. All generic image recognition models were trained using PaddleClas. Using the combination of these two kinds of deep recognition models, we finally won third place in the competition.

preprint2020arXiv

Valley Phonons and Exciton Complexes in a Monolayer Semiconductor

The coupling between spin, charge, and lattice degrees of freedom plays an important role in a wide range of fundamental phenomena. Monolayer semiconducting transition metal dichalcogenides have emerged as an outstanding platform for studying these coupling effects because they possess unique spin-valley locking physics for hosting rich excitonic species and reduced screening for strong Coulomb interactions. Here, we report the observation of multiple valley phonons, phonons with momentum vectors pointing to the corners of the hexagonal Brillouin zone, and the resulting exciton complexes in the monolayer semiconductor WSe2. From Landé g-factor and polarization analyses of photoluminescence peaks, we find that these valley phonons lead to efficient intervalley scattering of quasiparticles in both exciton formation and relaxation. This leads to a series of photoluminescence peaks as valley phonon replicas of dark trions. Using the identified valley phonons, we also uncover an intervalley exciton near charge neutrality, and extract its short-range electron-hole exchange interaction to be about 10 meV. Our work not only identifies a number of previously unknown 2D excitonic species, but also shows that monolayer WSe2 is a prime candidate for studying interactions between spin, pseudospin, and zone-edge phonons.

preprint2020arXiv

Weak Links in Authentication Chains: A Large-scale Analysis of Email Sender Spoofing Attacks

As a fundamental communication service, email plays an important role in both individual and corporate communications, which also makes it one of the most frequent attack vectors. An email's authenticity is based on an authentication chain involving multiple protocols, roles and services, the inconsistency among which creates security threats. Thus, it depends on the weakest link of the chain, as any failed part can break the whole chain-based defense. This paper systematically analyzes the transmission of an email and identifies a series of new attacks capable of bypassing SPF, DKIM, DMARC and user-interface protections. In particular, by conducting a "cocktail" joint attack, more realistic emails can be forged to penetrate well-known email services, such as Gmail and Outlook. We conduct a large-scale experiment on 30 popular email services and 23 email clients, and find that all of them are vulnerable to certain types of new attacks. We have duly reported the identified vulnerabilities to the related email service providers, and received positive responses from 11 of them, including Gmail, Yahoo, iCloud and Alibaba. Furthermore, we propose key mitigating measures to defend against the new attacks. Therefore, this work is of great value for identifying email spoofing attacks and improving the email ecosystem's overall security.

preprint2019arXiv

Exciton valley depolarization in monolayer transition-metal dichalcogenides

The valley degree of freedom is a sought-after quantum number in monolayer transition-metal dichalcogenides. Similar to optical spin orientation in semiconductors, the helicity of absorbed photons can be relayed to the valley (pseudospin) quantum number of photoexcited electrons and holes. Also similar to the quantum-mechanical spin, the valley quantum number is not a conserved quantity. Valley depolarization of excitons in monolayer transition-metal dichalcogenides due to long-range electron-hole exchange typically takes a few ps at low temperatures. Exceptions to this behavior are monolayers MoSe$_2$ and MoTe$_2$ wherein the depolarization is much faster. We elucidate the enigmatic anomaly of these materials, finding that it originates from Rashba-induced coupling of the dark and bright exciton branches next to their degeneracy point. When photoexcited excitons scatter during their energy relaxation between states next to the degeneracy region, they reach the light cone after losing the initial helicity. The valley depolarization is not as fast in monolayers WSe$_2$, WS$_2$ and likely MoS$_2$ wherein the Rashba-induced coupling is negligible.

preprint2016arXiv

Acoustic Coherent Perfect Absorbers as Sensitive Null Detectors

We report the experimental realization of acoustic coherent perfect absorption (CPA) in four symmetric scatterers of very different structures. The only conditions necessary for these scatterers to exhibit CPA are that both the reflection and transmission amplitudes of the scatterers are 0.5 under one incident wave, and that there are two collinear and counter-propagating incident waves with appropriate relative amplitude and phase. A modulation of the output power by a factor of nearly 1000 has been demonstrated by changing the relative phase of the incident waves over 180°. We further demonstrate that these scatterers are sensitive devices for detecting small differences between two nearly equal incident waves. A 27% change in the strength of the scattered wave has been demonstrated for every degree of phase deviation from the optimum condition between the incident waves.
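
The two-wave condition in this abstract can be illustrated numerically. The following is a minimal sketch, not the authors' model: it assumes a symmetric scatterer whose reflection and transmission amplitudes are both real and equal to 0.5 with the same sign, unit-amplitude incident waves, and a hypothetical helper name `output_power`. It only shows how the total outgoing power depends on the relative phase of the two counter-propagating waves.

```python
import cmath
import math

R = 0.5  # reflection amplitude under a single incident wave (value from the abstract)
T = 0.5  # transmission amplitude under a single incident wave (value from the abstract)


def output_power(phase_deg, amp_ratio=1.0):
    """Total outgoing power for two counter-propagating incident waves.

    Hypothetical helper: wave 1 has amplitude 1, wave 2 has amplitude
    `amp_ratio` and relative phase `phase_deg`. Taking R and T real with
    the same sign is an illustrative assumption of this sketch.
    """
    a = 1.0
    b = amp_ratio * cmath.exp(1j * math.radians(phase_deg))
    out_left = R * a + T * b   # reflected part of wave 1 plus transmitted part of wave 2
    out_right = T * a + R * b  # transmitted part of wave 1 plus reflected part of wave 2
    return abs(out_left) ** 2 + abs(out_right) ** 2


if __name__ == "__main__":
    for phase in (0, 90, 170, 179, 180):
        print(f"relative phase {phase:3d} deg -> output power {output_power(phase):.6f}")
```

Under these assumptions the outgoing power is 2cos²(φ/2) for relative phase φ, so it vanishes at 180° and any residual output reflects only the mismatch between the two incident waves. The sketch does not attempt to reproduce the measured modulation depth or sensitivity figures.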

preprint2016arXiv

Context-aware System Service Call-oriented Symbolic Execution of Android Framework with Application to Exploit Generation

Android Framework is a layer of software that exists in every Android system managing resources of all Android apps. A vulnerability in Android Framework can lead to severe hacks, such as destroying user data and leaking private information. With tens of millions of Android devices unpatched due to Android fragmentation, vulnerabilities in Android Framework certainly attract attackers to exploit them. So far, enormous manual effort is needed to craft such exploits. To our knowledge, no research has been done on automatic generation of exploits that take advantage of Android Framework vulnerabilities. We make a first step towards this goal by applying symbolic execution of Android Framework to finding bugs and generating exploits. Several challenges have been raised by the task. (1) The information of an app flows to Android Framework in multiple intricate steps, making it difficult to identify symbolic inputs. (2) Android Framework has a complex initialization phase, which exacerbates the state space explosion problem. (3) A straightforward design that builds the symbolic executor as a layer inside the Android system will not work well: not only does the implementation have to ensure the compatibility with the Android system, but it needs to be maintained whenever Android gets updated. We present novel ideas and techniques to resolve the challenges, and have built the first system for symbolic execution of Android Framework. It fundamentally changes the state of the art in exploit generation on the Android system, and has been applied to constructing new techniques for finding vulnerabilities.

preprint2016arXiv

Local unitary equivalence of quantum states and simultaneous orthogonal equivalence

The correspondence between local unitary equivalence of bipartite quantum states and simultaneous orthogonal equivalence is thoroughly investigated and strengthened. It is proved that local unitary equivalence can be studied through simultaneous similarity under projective orthogonal transformations, and four parametrization independent algorithms are proposed to judge when two density matrices on $\mathbb C^{d_1}\otimes \mathbb C^{d_2}$ are locally unitary equivalent in connection with trace identities, Weierstrass pencils, Albert determinants and Smith normal forms.

preprint2016arXiv

Self-ordered Mo-oxide Nanotube Arrays as Precursor for Aligned MoOx/MoS2 Core-Shell Nanotubular Structures with a High Density of Reactive Sites

In the present work we demonstrate the self-organized formation of anodic Mo-oxide nanotube arrays grown on a Mo sheet under suitable electrochemical conditions in glycerol/NH4F electrolytes. The resulting amorphous tubes can be crystallized by annealing to MoO2 or MoO3. The tube walls can then be further sulfurized fully or partially to Mo-sulfide to form well-ordered arrays of vertically aligned MoOx/MoS2 nanotubes. Under optimized conditions, defined MoS2 sheets form on the oxide walls in a layer-by-layer, low-angle zig-zag arrangement that provides a high density of reactive stacking faults. These core-shell nanotube arrays, consisting of tubes with a conductive suboxide core and a functional high defect density MoS2 coating, are highly promising for applications such as electrocatalysis (hydrogen evolution) or ion insertion devices.

preprint2016arXiv

Telepresence Interaction by Touching Live Video Images

This paper presents a telepresence interaction framework based on touchscreen and telepresence-robot technologies. The core of the framework is a new user interface, the Touchable live video Image based User Interface (TIUI). The TIUI allows a remote operator not just to drive the telepresence robot but also to operate and interact with real objects by touching their live video images on a pad with finger touch gestures. We implemented a telepresence interaction system composed of a telepresence robot and tele-interactive objects located in a local space, the TIUI on a pad located in a remote space, and the wireless networks connecting the two spaces. Our system can serve as a perfect embodiment of a remote operator for most daily living tasks, such as opening a door, drawing a curtain, pushing a wheelchair, and other similar tasks. The evaluation and demonstration results show the effectiveness and promising applications of our system.

preprint2015arXiv

Ordering-sensitive and Semantic-aware Topic Modeling

Topic modeling of textual corpora is an important and challenging problem. Most previous work makes the "bag-of-words" assumption, which ignores the ordering of words. This assumption simplifies the computation, but it unrealistically loses the ordering information and the semantics of words in context. In this paper, we present a Gaussian Mixture Neural Topic Model (GMNTM) which incorporates both the ordering of words and the semantic meaning of sentences into topic modeling. Specifically, we represent each topic as a cluster of multi-dimensional vectors and embed the corpus into a collection of vectors generated by the Gaussian mixture model. Each word is affected not only by its topic, but also by the embedding vectors of its surrounding words and the context. The Gaussian mixture components and the topics of documents, sentences and words can be learnt jointly. Extensive experiments show that our model can learn better topics and more accurate word distributions for each topic. Quantitatively, compared with state-of-the-art topic modeling approaches, GMNTM obtains significantly better performance in terms of perplexity, retrieval accuracy and classification accuracy.

preprint2015arXiv

Sound Absorption by Subwavelength Membrane Structures: A Generalized Perspective

A decorated membrane, comprising a thin layer of elastic film with small rigid platelets fixed on top, has been found to be an efficient absorber of low frequency sound. In this work we consider the problem of sound absorption from a perspective aimed at deriving upper bounds under different scenarios, i.e., whether the sound is incident from one side only or from both sides, and whether there is a reflecting surface on the back side of the membrane. By considering the negligible thickness of the membrane, usually on the order of a fraction of one millimeter, we derive a relation showing that the sum of the incoming sound waves' (complex) pressure amplitudes, averaged over the area of the membrane, must be equal to that of the outgoing waves. By using this relation, and without going into any details of the wave solutions, it is shown that the maximum absorption achievable for one-sided incidence is 50%, while the maximum absorption with a back reflecting surface can reach 100%. The latter was attained by the hybridized resonances. All the results are shown to be in excellent agreement with the experiments. This generalized perspective, when used together with the Green function formalism, can be useful in gaining insights and delineating the constraints on what is achievable in scattering and absorption by thin film structures.
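
The 50% bound for one-sided incidence can be checked with a short calculation. As a sketch under stated assumptions: for a unit-amplitude wave incident from one side, read the area-averaged amplitude relation as 1 = r + t, where r and t are the complex reflection and transmission amplitudes (these symbols, and this reading of the relation, are assumptions of the sketch and not taken from the abstract). The absorption is then A = 1 - |r|² - |t|² = 1/2 - 2|r - 1/2|², which cannot exceed 1/2. The hypothetical functions below confirm this by brute force.

```python
def absorption(r):
    """Absorption for one-sided incidence, assuming t = 1 - r.

    `r` is a complex reflection amplitude; the constraint t = 1 - r is this
    sketch's reading of the amplitude relation stated in the abstract.
    """
    t = 1 - r
    return 1 - abs(r) ** 2 - abs(t) ** 2


def max_absorption(steps=201):
    """Brute-force search over a grid of complex r in the square [-1, 1] x [-1, 1]."""
    best_a, best_r = float("-inf"), None
    for i in range(steps):
        for j in range(steps):
            r = complex(-1 + 2 * i / (steps - 1), -1 + 2 * j / (steps - 1))
            a = absorption(r)
            if a > best_a:
                best_a, best_r = a, r
    return best_a, best_r


if __name__ == "__main__":
    a, r = max_absorption()
    print(f"max absorption {a:.4f} at r = {r:.3f}")
```

The maximum sits at r = t = 1/2. The case with a back reflecting surface, where the abstract reports that 100% is reachable, is not modeled here.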

preprint2015arXiv

Subwavelength total acoustic absorption with degenerate resonators

We report the experimental realization of perfect sound absorption by sub-wavelength monopole and dipole resonators that exhibit degenerate resonant frequencies. This is achieved through the destructive interference of two resonators' transmission responses, while the matching of their averaged impedances to that of air implies no backscattering, thereby leading to total absorption. Two examples, both using decorated membrane resonators (DMRs) as the basic units, are presented. The first is a flat panel comprising a DMR and a pair of coupled DMRs, while the second one is a ventilated short tube containing a DMR in conjunction with a sidewall DMR backed by a cavity. In both examples, near perfect absorption, up to 99.7%, has been observed with the airborne wavelength up to 1.2 m, which is at least an order of magnitude larger than the composite absorber. Excellent agreement between theory and experiment is obtained.

preprint2014arXiv

Saturated locally optimal designs under differentiable optimality criteria

We develop general theory for finding locally optimal designs in a class of single-covariate models under any differentiable optimality criterion. Yang and Stufken [Ann. Statist. 40 (2012) 1665-1681] and Dette and Schorning [Ann. Statist. 41 (2013) 1260-1267] gave complete class results for optimal designs under such models. Based on their results, saturated optimal designs exist; however, how to find such designs has not been addressed. We develop tools to find saturated optimal designs, and also prove their uniqueness under mild conditions.

preprint2013arXiv

Angular Reconstruction of a Lead Scintillating-Fiber Sandwiched Electromagnetic Calorimeter

A new method called Neighbor Cell Deposited Energy Ratio (NCDER) is proposed to reconstruct the incidence position in a single layer for a 3-dimensional imaging electromagnetic calorimeter (ECAL). This method was applied to reconstruct the ECAL test beam data for the Alpha Magnetic Spectrometer-02 (AMS-02). The results show that this method can achieve an angular resolution of $(7.36 \pm 0.08)/\sqrt{E} \oplus (0.28 \pm 0.02)$ degrees in the determination of the photon direction, which is much more precise than that obtained with the commonly adopted Center of Gravity (COG) method ($(8.4 \pm 0.1)/\sqrt{E} \oplus (0.8 \pm 0.3)$ degrees). Furthermore, since it uses only the properties of electromagnetic showers, this new method could also be used for other types of fine-grain sampling calorimeters.
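
To make the two resolution formulas easier to compare, here is a small sketch that evaluates them at a few energies. It reads the ⊕ symbol as a sum in quadrature, which is the usual convention for calorimeter resolutions, and uses only the central values quoted in the abstract. The function name `angular_resolution` is hypothetical, and the energy unit is an assumption because the abstract does not state it.

```python
import math

# (stochastic term, constant term) in degrees, central values from the abstract
NCDER = (7.36, 0.28)
COG = (8.4, 0.8)


def angular_resolution(energy, stochastic, constant):
    """Return stochastic / sqrt(E) (+) constant in degrees, with (+) as a quadrature sum.

    `energy` must be in whatever unit the fit used; the abstract does not say.
    """
    if energy <= 0:
        raise ValueError("energy must be positive")
    return math.hypot(stochastic / math.sqrt(energy), constant)


if __name__ == "__main__":
    for energy in (1, 10, 100):
        ncder = angular_resolution(energy, *NCDER)
        cog = angular_resolution(energy, *COG)
        print(f"E = {energy:>3}: NCDER {ncder:.2f} deg, COG {cog:.2f} deg")
```

With this reading, the constant term dominates at high energy, which is where the smaller NCDER constant term (0.28 against 0.8 degrees) matters most.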

preprint2013arXiv

Automatic Calibration Method of Voxel Size for Cone-beam 3D-CT Scanning System

For a cone-beam three-dimensional computed tomography (3D-CT) scanning system, voxel size is an important indicator for guaranteeing the accuracy of data analysis and feature measurement based on 3D-CT images. However, the voxel size changes with the movement of the rotary table along the X-ray direction. In order to realize automatic calibration of the voxel size, a new, easily implemented method is proposed. According to this method, several projections of a spherical phantom are captured at different imaging positions and the corresponding voxel size values are calculated by non-linear least-squares fitting. From these fitted values, a linear equation is obtained, which reflects the relationship between the rotary table's displacement from its nominal zero position and the voxel size. Finally, the linear equation is imported into the calibration module of the 3D-CT scanning system, and when the rotary table moves along the X-ray direction, the accurate value of the voxel size is dynamically exported. The experimental results show that this method meets the requirements of the actual CT scanning system and has the virtues of easy implementation and high accuracy.
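
The final step described in this abstract, turning per-position voxel sizes into a linear relation, can be sketched as an ordinary least-squares line fit. This is an illustration only and not the authors' calibration module: the function names are hypothetical, and the calibration points are made-up numbers chosen to lie on a straight line. A straight line is a plausible model because, in an idealised cone-beam geometry, the voxel size scales with the source-to-object distance.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("displacements must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def voxel_size_at(displacement_mm, slope, intercept):
    """Evaluate the calibrated linear relation at a rotary table displacement."""
    return slope * displacement_mm + intercept


# Hypothetical calibration points, NOT measured data: table displacement (mm)
# and the voxel size (mm) estimated from the phantom at that position.
displacements = [0.0, 50.0, 100.0, 150.0]
voxel_sizes = [0.100, 0.110, 0.120, 0.130]

slope, intercept = fit_line(displacements, voxel_sizes)

if __name__ == "__main__":
    print(f"voxel size = {slope:.6f} * d + {intercept:.6f}")
    print(f"at d = 75 mm: {voxel_size_at(75.0, slope, intercept):.4f} mm")
```

Once `slope` and `intercept` are stored, the voxel size for any table position follows from a single multiplication and addition, which is what makes dynamic export during table movement cheap.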

preprint2012arXiv

Identifying locally optimal designs for nonlinear models: A simple extension with profound consequences

We extend the approach in [Ann. Statist. 38 (2010) 2499-2524] for identifying locally optimal designs for nonlinear models. Conceptually the extension is relatively simple, but the consequences in terms of applications are profound. As we will demonstrate, we can obtain results for locally optimal designs under many optimality criteria and for a larger class of models than has been done hitherto. In many cases the results lead to optimal designs with the minimal number of support points.

42 published item(s)

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

Valley3: Scaling Omni Foundation Models for E-commerce

DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever

Unifying Structured Data as Graph for Data-to-Text Pre-Training

A Certifiable Security Patch for Object Tracking in Self-Driving Systems via Historical Deviation Modeling

A Survey of Natural Language Generation

A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions

Adversarial Momentum-Contrastive Pre-Training

Cracking White-box DNN Watermarks via Invariant Neuron Transforms

DALG: Deep Attentive Local and Global Modeling for Image Retrieval

GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-Supervised Learning and Explicit Policy Injection

Improve Deep Image Inpainting by Emphasizing the Complexity of Missing Regions

Linking-Enhanced Pre-Training for Table Semantic Parsing

Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model

MetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting

Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing

Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search

SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding

SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation

Relaxation and darkening of excitonic complexes in electrostatically-doped monolayer semiconductors: Roles of exciton-electron and trion-electron interactions

A Generic Network Compression Framework for Sequential Recommender Systems

Amorphous Mo-Ta oxide nanotubes for long-term stable Mo oxide based supercapacitors

Discovering Protagonist of Sentiment with Aspect Reconstructed Capsule Network

Empirical Evaluation of Multi-task Learning in Deep Neural Networks for Natural Language Processing

Measurement of Conduction and Valence Bands g-factors in a Transition Metal Dichalcogenide Monolayer

Semi-Supervised Recognition under a Noisy and Fine-grained Dataset

Valley Phonons and Exciton Complexes in a Monolayer Semiconductor

Weak Links in Authentication Chains: A Large-scale Analysis of Email Sender Spoofing Attacks

Exciton valley depolarization in monolayer transition-metal dichalcogenides

Acoustic Coherent Perfect Absorbers as Sensitive Null Detectors

Context-aware System Service Call-oriented Symbolic Execution of Android Framework with Application to Exploit Generation

Local unitary equivalence of quantum states and simultaneous orthogonal equivalence

Self-ordered Mo-oxide Nanotube Arrays as Precursor for Aligned MoOx/MoS2 Core-Shell Nanotubular Structures with a High Density of Reactive Sites

Telepresence Interaction by Touching Live Video Images

Ordering-sensitive and Semantic-aware Topic Modeling

Sound Absorption by Subwavelength Membrane Structures: A Generalized Perspective

Subwavelength total acoustic absorption with degenerate resonators

Saturated locally optimal designs under differentiable optimality criteria

Angular Reconstruction of a Lead Scintillating-Fiber Sandwiched Electromagnetic Calorimeter

Automatic Calibration Method of Voxel Size for Cone-beam 3D-CT Scanning System

Identifying locally optimal designs for nonlinear models: A simple extension with profound consequences