Source author record

Diego Klabjan

Diego Klabjan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Information Retrieval Computation and Language math.OC Artificial Intelligence Computer Vision cs.CY Human-Computer Interaction Social and Information Networks Applications Cryptography and Security math.ST Neural and Evolutionary Computing

Catalog footprint

What is connected

32works

13topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Model-based Bootstrap of Controlled Markov Chains

We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.

preprint2023arXiv

Robust softmax aggregation on blockchain based federated learning with convergence guarantee

Blockchain based federated learning is a distributed learning scheme that allows model training without participants sharing their local data sets, where the blockchain components eliminate the need for a trusted central server compared to traditional Federated Learning algorithms. In this paper we propose a softmax aggregation blockchain based federated learning framework. First, we propose a new blockchain based federated learning architecture that utilizes the well-tested proof-of-stake consensus mechanism on an existing blockchain network to select validators and miners to aggregate the participants' updates and compute the blocks. Second, to ensure the robustness of the aggregation process, we design a novel softmax aggregation method based on approximated population loss values that relies on our specific blockchain architecture. Additionally, we show our softmax aggregation technique converges to the global minimum in the convex setting with non-restricting assumptions. Our comprehensive experiments show that our framework outperforms existing robust aggregation algorithms in various settings by large margins.

preprint2022arXiv

Open-Set Recognition of Breast Cancer Treatments

Open-set recognition generalizes a classification task by classifying test samples as one of the known classes from training or "unknown." As novel cancer drug cocktails with improved treatment are continually discovered, predicting cancer treatments can naturally be formulated in terms of an open-set recognition problem. Drawbacks, due to modeling unknown samples during training, arise from straightforward implementations of prior work in healthcare open-set learning. Accordingly, we reframe the problem methodology and apply a recent existing Gaussian mixture variational autoencoder model, which achieves state-of-the-art results for image datasets, to breast cancer patient data. Not only do we obtain more accurate and robust classification results, with a 24.5% average F1 increase compared to a recent method, but we also reexamine open-set recognition in terms of deployability to a clinical setting.

preprint2022arXiv

Topic Analysis for Text with Side Data

Although latent factor models (e.g., matrix factorization) obtain good performance in predictions, they suffer from several problems including cold-start, non-transparency, and suboptimal recommendations. In this paper, we employ text with side data to tackle these limitations. We introduce a hybrid generative probabilistic model that combines a neural network with a latent topic model, which is a four-level hierarchical Bayesian model. In the model, each document is modeled as a finite mixture over an underlying set of topics and each topic is modeled as an infinite mixture over an underlying set of topic probabilities. Furthermore, each topic probability is modeled as a finite mixture over side data. In the context of text, the neural network provides an overview distribution about side data for the corresponding text, which is the prior distribution in LDA to help perform topic grouping. The approach is evaluated on several different datasets, where the model is shown to outperform standard LDA and Dirichlet-multinomial regression (DMR) in terms of topic grouping, model perplexity, classification and comment generation.

preprint2022arXiv

Tricks and Plugins to GBM on Images and Sequences

Convolutional neural networks (CNNs) and transformers, which are composed of multiple processing layers and blocks to learn the representations of data with multiple abstract levels, are the most successful machine learning models in recent years. However, millions of parameters and many blocks make them difficult to be trained, and sometimes several days or weeks are required to find an ideal architecture or tune the parameters. Within this paper, we propose a new algorithm for boosting Deep Convolutional Neural Networks (BoostCNN) to combine the merits of dynamic feature selection and BoostCNN, and another new family of algorithms combining boosting and transformers. To learn these new models, we introduce subgrid selection and importance sampling strategies and propose a set of algorithms to incorporate boosting weights into a deep learning architecture based on a least squares objective function. These algorithms not only reduce the required manual effort for finding an appropriate network architecture but also result in superior performance and lower running time. Experiments show that the proposed methods outperform benchmarks on several fine-grained classification tasks.

preprint2021arXiv

Classification Models for Partially Ordered Sequences

Many models such as Long Short Term Memory (LSTMs), Gated Recurrent Units (GRUs) and transformers have been developed to classify time series data with the assumption that events in a sequence are ordered. On the other hand, fewer models have been developed for set based inputs, where order does not matter. There are several use cases where data is given as partially-ordered sequences because of the granularity or uncertainty of time stamps. We introduce a novel transformer based model for such prediction tasks, and benchmark against extensions of existing order invariant models. We also discuss how transition probabilities between events in a sequence can be used to improve model performance. We show that the transformer-based equal-time model outperforms extensions of existing set models on three data sets.

preprint2021arXiv

Layer Flexible Adaptive Computational Time

Deep recurrent neural networks perform well on sequence data and are the model of choice. However, it is a daunting task to decide the structure of the networks, i.e. the number of layers, especially considering different computational needs of a sequence. We propose a layer flexible recurrent neural network with adaptive computation time, and expand it to a sequence to sequence model. Different from the adaptive computation time model, our model has a dynamic number of transmission states which vary by step and sequence. We evaluate the model on a financial data set and Wikipedia language modeling. Experimental results show the performance improvement of 7\% to 12\% and indicate the model's ability to dynamically change the number of layers along with the computational steps.

preprint2021arXiv

Predicting Shot Making in Basketball Learnt from Adversarial Multiagent Trajectories

In this paper, we predict the likelihood of a player making a shot in basketball from multiagent trajectories. Previous approaches to similar problems center on hand-crafting features to capture domain specific knowledge. Although intuitive, recent work in deep learning has shown this approach is prone to missing important predictive features. To circumvent this issue, we present a convolutional neural network (CNN) approach where we initially represent the multiagent behavior as an image. To encode the adversarial nature of basketball, we use a multi-channel image which we then feed into a CNN. Additionally, to capture the temporal aspect of the trajectories we "fade" the player trajectories. We find that this approach is superior to a traditional FFN model. By using gradient ascent to create images using an already trained CNN, we discover what features the CNN filters learn. Last, we find that a combined CNN+FFN is the best performing network with an error rate of 39%.

preprint2020arXiv

A Simple and Fast Algorithm for L1-norm Kernel PCA

We present an algorithm for L1-norm kernel PCA and provide a convergence analysis for it. While an optimal solution of L2-norm kernel PCA can be obtained through matrix decomposition, finding that of L1-norm kernel PCA is not trivial due to its non-convexity and non-smoothness. We provide a novel reformulation through which an equivalent, geometrically interpretable problem is obtained. Based on the geometric interpretation of the reformulated problem, we present a fixed-point type algorithm that iteratively computes a binary weight for each observation. As the algorithm requires only inner products of data vectors, it is computationally efficient and the kernel trick is applicable. In the convergence analysis, we show that the algorithm converges to a local optimal solution in a finite number of steps. Moreover, we provide a rate of convergence analysis, which has been never done for any L1-norm PCA algorithm, proving that the sequence of objective values converges at a linear rate. In numerical experiments, we show that the algorithm is robust in the presence of entry-wise perturbations and computationally scalable, especially in a large-scale setting. Lastly, we introduce an application to outlier detection where the model based on the proposed algorithm outperforms the benchmark algorithms.

preprint2020arXiv

Efficient Architecture Search for Continual Learning

Continual learning with neural networks is an important learning framework in AI that aims to learn a sequence of tasks well. However, it is often confronted with three challenges: (1) overcome the catastrophic forgetting problem, (2) adapt the current network to new tasks, and meanwhile (3) control its model complexity. To reach these goals, we propose a novel approach named as Continual Learning with Efficient Architecture Search, or CLEAS in short. CLEAS works closely with neural architecture search (NAS) which leverages reinforcement learning techniques to search for the best neural architecture that fits a new task. In particular, we design a neuron-level NAS controller that decides which old neurons from previous tasks should be reused (knowledge transfer), and which new neurons should be added (to learn new knowledge). Such a fine-grained controller allows one to find a very concise architecture that can fit each new task well. Meanwhile, since we do not alter the weights of the reused neurons, we perfectly memorize the knowledge learned from previous tasks. We evaluate CLEAS on numerous sequential classification tasks, and the results demonstrate that CLEAS outperforms other state-of-the-art alternative methods, achieving higher classification accuracy while using simpler neural architectures.

preprint2020arXiv

Keyword-based Topic Modeling and Keyword Selection

Certain type of documents such as tweets are collected by specifying a set of keywords. As topics of interest change with time it is beneficial to adjust keywords dynamically. The challenge is that these need to be specified ahead of knowing the forthcoming documents and the underlying topics. The future topics should mimic past topics of interest yet there should be some novelty in them. We develop a keyword-based topic model that dynamically selects a subset of keywords to be used to collect future documents. The generative process first selects keywords and then the underlying documents based on the specified keywords. The model is trained by using a variational lower bound and stochastic gradient optimization. The inference consists of finding a subset of keywords where given a subset the model predicts the underlying topic-word matrix for the unknown forthcoming documents. We compare the keyword topic model against a benchmark model using viral predictions of tweets combined with a topic model. The keyword-based topic model outperforms this sophisticated baseline model by 67%.

preprint2020arXiv

Listwise Learning to Rank by Exploring Unique Ratings

In this paper, we propose new listwise learning-to-rank models that mitigate the shortcomings of existing ones. Existing listwise learning-to-rank models are generally derived from the classical Plackett-Luce model, which has three major limitations. (1) Its permutation probabilities overlook ties, i.e., a situation when more than one document has the same rating with respect to a query. This can lead to imprecise permutation probabilities and inefficient training because of selecting documents one by one. (2) It does not favor documents having high relevance. (3) It has a loose assumption that sampling documents at different steps is independent. To overcome the first two limitations, we model ranking as selecting documents from a candidate set based on unique rating levels in decreasing order. The number of steps in training is determined by the number of unique rating levels. We propose a new loss function and associated four models for the entire sequence of weighted classification tasks by assigning high weights to the selected documents with high ratings for optimizing Normalized Discounted Cumulative Gain (NDCG). To overcome the final limitation, we further propose a novel and efficient way of refining prediction scores by combining an adapted Vanilla Recurrent Neural Network (RNN) model with pooling given selected documents at previous steps. We encode all of the documents already selected by an RNN model. In a single step, we rank all of the documents with the same ratings using the last cell of the RNN multiple times. We have implemented our models using three settings: neural networks, neural networks with gradient boosting, and regression trees with gradient boosting. We have conducted experiments on four public datasets. The experiments demonstrate that the models notably outperform state-of-the-art learning-to-rank models.

preprint2020arXiv

Mixture-based Multiple Imputation Model for Clinical Data with a Temporal Dimension

The problem of missing values in multivariable time series is a key challenge in many applications such as clinical data mining. Although many imputation methods show their effectiveness in many applications, few of them are designed to accommodate clinical multivariable time series. In this work, we propose a multiple imputation model that capture both cross-sectional information and temporal correlations. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to handle the variance of predictive confidence of Gaussian process models. The proposed model is compared with several state-of-the-art imputation algorithms on both real-world and synthetic datasets. Experiments show that our best model can provide more accurate imputation than the benchmarks on all of our datasets.

preprint2020arXiv

Neural Network Retraining for Model Serving

We propose incremental (re)training of a neural network model to cope with a continuous flow of new data in inference during model serving. As such, this is a life-long learning process. We address two challenges of life-long retraining: catastrophic forgetting and efficient retraining. If we combine all past and new data it can easily become intractable to retrain the neural network model. On the other hand, if the model is retrained using only new data, it can easily suffer catastrophic forgetting and thus it is paramount to strike the right balance. Moreover, if we retrain all weights of the model every time new data is collected, retraining tends to require too many computing resources. To solve these two issues, we propose a novel retraining model that can select important samples and important weights utilizing multi-armed bandits. To further address forgetting, we propose a new regularization term focusing on synapse and neuron importance. We analyze multiple datasets to document the outcome of the proposed retraining methods. Various experiments demonstrate that our retraining methodologies mitigate the catastrophic forgetting problem while boosting model performance.

preprint2020arXiv

Open Set Domain Adaptation by Extreme Value Theory

Common domain adaptation techniques assume that the source domain and the target domain share an identical label space, which is problematic since when target samples are unlabeled we have no knowledge on whether the two domains share the same label space. When this is not the case, the existing methods fail to perform well because the additional unknown classes are also matched with the source domain during adaptation. In this paper, we tackle the open set domain adaptation problem under the assumption that the source and the target label spaces only partially overlap, and the task becomes when the unknown classes exist, how to detect the target unknown classes and avoid aligning them with the source domain. We propose to utilize an instance-level reweighting strategy for domain adaptation where the weights indicate the likelihood of a sample belonging to known classes and to model the tail of the entropy distribution with Extreme Value Theory for unknown class detection. Experiments on conventional domain adaptation datasets show that the proposed method outperforms the state-of-the-art models.

preprint2020arXiv

Open-Set Recognition with Gaussian Mixture Variational Autoencoders

In inference, open-set classification is to either classify a sample into a known class from training or reject it as an unknown class. Existing deep open-set classifiers train explicit closed-set classifiers, in some cases disjointly utilizing reconstruction, which we find dilutes the latent representation's ability to distinguish unknown classes. In contrast, we train our model to cooperatively learn reconstruction and perform class-based clustering in the latent space. With this, our Gaussian mixture variational autoencoder (GMVAE) achieves more accurate and robust open-set classification results, with an average F1 improvement of 29.5%, through extensive experiments aided by analytical results.

preprint2020arXiv

Scale Invariant Power Iteration

Power iteration has been generalized to solve many interesting problems in machine learning and statistics. Despite its striking success, theoretical understanding of when and how such an algorithm enjoys good convergence property is limited. In this work, we introduce a new class of optimization problems called scale invariant problems and prove that they can be efficiently solved by scale invariant power iteration (SCI-PI) with a generalized convergence guarantee of power iteration. By deriving that a stationary point is an eigenvector of the Hessian evaluated at the point, we show that scale invariant problems indeed resemble the leading eigenvector problem near a local optimum. Also, based on a novel reformulation, we geometrically derive SCI-PI which has a general form of power iteration. The convergence analysis shows that SCI-PI attains local linear convergence with a rate being proportional to the top two eigenvalues of the Hessian at the optimum. Moreover, we discuss some extended settings of scale invariant problems and provide similar convergence results for them. In numerical experiments, we introduce applications to independent component analysis, Gaussian mixtures, and non-negative matrix factorization. Experimental results demonstrate that SCI-PI is competitive to state-of-the-art benchmark algorithms and often yield better solutions.

preprint2020arXiv

Subset Selection for Multiple Linear Regression via Optimization

Subset selection in multiple linear regression aims to choose a subset of candidate explanatory variables that tradeoff fitting error (explanatory power) and model complexity (number of variables selected). We build mathematical programming models for regression subset selection based on mean square and absolute errors, and minimal-redundancy-maximal-relevance criteria. The proposed models are tested using a linear-program-based branch-and-bound algorithm with tailored valid inequalities and big M values and are compared against the algorithms in the literature. For high dimensional cases, an iterative heuristic algorithm is proposed based on the mathematical programming models and a core set concept, and a randomized version of the algorithm is derived to guarantee convergence to the global optimum. From the computational experiments, we find that our models quickly find a quality solution while the rest of the time is spent to prove optimality; the iterative algorithms find solutions in a relatively short time and are competitive compared to state-of-the-art algorithms; using ad-hoc big M values is not recommended.

preprint2020arXiv

The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent

The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models, in particular deep learning models. We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks, by focusing on the variance of the gradients, which is the first study of this nature. In the linear regression case, we show that in each iteration the norm of the gradient is a decreasing function of the mini-batch size $b$ and thus the variance of the stochastic gradient estimator is a decreasing function of $b$. For deep neural networks with $L_2$ loss we show that the variance of the gradient is a polynomial in $1/b$. The results back the important intuition that smaller batch sizes yield lower loss function values which is a common believe among the researchers. The proof techniques exhibit a relationship between stochastic gradient estimators and initial weights, which is useful for further research on the dynamics of SGD. We empirically provide further insights to our results on various datasets and commonly used deep network structures.

preprint2017arXiv

Semantic Document Distance Measures and Unsupervised Document Revision Detection

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for a large scale corpus and implemented in Apache Spark. We demonstrate that our system can more precisely detect revisions than state-of-the-art methods by utilizing the Wikipedia revision dumps https://snap.stanford.edu/data/wiki-meta.html and simulated data sets.

preprint2017arXiv

Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling

In this paper, we present hierarchical relationbased latent Dirichlet allocation (hrLDA), a data-driven hierarchical topic model for extracting terminological ontologies from a large number of heterogeneous documents. In contrast to traditional topic models, hrLDA relies on noun phrases instead of unigrams, considers syntax and document structures, and enriches topic hierarchies with topic relations. Through a series of experiments, we demonstrate the superiority of hrLDA over existing topic models, especially for building hierarchies. Furthermore, we illustrate the robustness of hrLDA in the settings of noisy data sets, which are likely to occur in many practical scenarios. Our ontology evaluation results show that ontologies extracted from hrLDA are very competitive with the ontologies created by domain experts.

preprint2016arXiv

A Semi-supervised learning approach to enhance health care Community-based Question Answering: A case study in alcoholism

Community-based Question Answering (CQA) sites play an important role in addressing health information needs. However, a significant number of posted questions remain unanswered. Automatically answering the posted questions can provide a useful source of information for online health communities. In this study, we developed an algorithm to automatically answer health-related questions based on past questions and answers (QA). We also aimed to understand information embedded within online health content that are good features in identifying valid answers. Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA. In order to rank these candidates, we implemented a semi-supervised leaning algorithm that extracts the best answer to a question. We assessed this approach on a curated corpus from Yahoo! Answers and compared against a rule-based string similarity baseline. On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. UMLS-based (health-related) features used in the model enhance the algorithm's performance by proximately 8 %. A reasonably high rate of accuracy is obtained given that the data is considerably noisy. Important features distinguishing a valid answer from an invalid answer include text length, number of stop words contained in a test question, a distance between the test question and other questions in the corpus as well as a number of overlapping health-related terms between questions. Overall, our automated QA system based on historical QA pairs is shown to be effective according to the data set in this case study. It is developed for general use in the health care domain which can also be applied to other CQA sites.

preprint2016arXiv

Clustering Time-Series Energy Data from Smart Meters

Investigations have been performed into using clustering methods in data mining time-series data from smart meters. The problem is to identify patterns and trends in energy usage profiles of commercial and industrial customers over 24-hour periods, and group similar profiles. We tested our method on energy usage data provided by several U.S. power utilities. The results show accurate grouping of accounts similar in their energy usage patterns, and potential for the method to be utilized in energy efficiency programs.

preprint2016arXiv

Evaluating the Performance of Offensive Linemen in the NFL

How does one objectively measure the performance of an individual offensive lineman in the NFL? The existing literature proposes various measures that rely on subjective assessments of game film, but has yet to develop an objective methodology to evaluate performance. Using a variety of statistics related to an offensive lineman's performance, we develop a framework to objectively analyze the overall performance of an individual offensive lineman and determine specific linemen who are overvalued or undervalued relative to their salary. We identify eight players across the 2013-2014 and 2014-2015 NFL seasons that are considered to be overvalued or undervalued and corroborate the results with existing metrics that are based on subjective evaluation. To the best of our knowledge, the techniques set forth in this work have not been utilized in previous works to evaluate the performance of NFL players at any position, including offensive linemen.

preprint2016arXiv

Going Out of Business: Auction House Behavior in the Massively Multi-Player Online Game

The in-game economies of massively multi-player online games (MMOGs) are complex systems that have to be carefully designed and managed. This paper presents the results of an analysis of auction house data from the MMOG Glitch, across a 14 month time period, the entire lifetime of the game. The data comprise almost 3 million data points, over 20,000 unique players and more than 650 products. Furthermore, an interactive visualization, based on Sankey flow diagrams, is presented which shows the proportion of the different clusters across each time bin, as well as the flow of players between clusters. The diagram allows evaluation of migration of players between clusters as a function of time, as well as churn analysis. The presented work provides a template analysis and visualization model for progression-based or temporal-based analysis of player behavior broadly applicable to games.

preprint2016arXiv

Iteratively Reweighted Least Squares Algorithms for L1-Norm Principal Component Analysis

Principal component analysis (PCA) is often used to reduce the dimension of data by selecting a few orthonormal vectors that explain most of the variance structure of the data. L1 PCA uses the L1 norm to measure error, whereas the conventional PCA uses the L2 norm. For the L1 PCA problem minimizing the fitting error of the reconstructed data, we propose an exact reweighted and an approximate algorithm based on iteratively reweighted least squares. We provide convergence analyses, and compare their performance against benchmark algorithms in the literature. The computational experiment shows that the proposed algorithms consistently perform best.

preprint2016arXiv

Predicting litigation likelihood and time to litigation for patents

Patent lawsuits are costly and time-consuming. An ability to forecast a patent litigation and time to litigation allows companies to better allocate budget and time in managing their patent portfolios. We develop predictive models for estimating the likelihood of litigation for patents and the expected time to litigation based on both textual and non-textual features. Our work focuses on improving the state-of-the-art by relying on a different set of features and employing more sophisticated algorithms with more realistic data. The rate of patent litigations is very low, which consequently makes the problem difficult. The initial model for predicting the likelihood is further modified to capture a time-to-litigation perspective.

preprint2016arXiv

Predictive Analytics Using Smartphone Sensors for Depressive Episodes

The behaviors of patients with depression are usually difficult to predict because the patients demonstrate the symptoms of a depressive episode without a warning at unexpected times. The goal of this research is to build algorithms that detect signals of such unusual moments so that doctors can be proactive in approaching already diagnosed patients before they fall in depression. Each patient is equipped with a smartphone with the capability to track its sensors. We first find the home location of a patient, which is then augmented with other sensor data to identify sleep patterns and select communication patterns. The algorithms require two to three weeks of training data to build standard patterns, which are considered normal behaviors; and then, the methods identify any anomalies in day-to-day data readings of sensors. Four smartphone sensors, including the accelerometer, the gyroscope, the location probe and the communication log probe are used for anomaly detection in sleeping and communication patterns.

preprint2016arXiv

Rapid Prediction of Player Retention in Free-to-Play Mobile Games

Predicting and improving player retention is crucial to the success of mobile Free-to-Play games. This paper explores the problem of rapid retention prediction in this context. Heuristic modeling approaches are introduced as a way of building simple rules for predicting short-term retention. Compared to common classification algorithms, our heuristic-based approach achieves reasonable and comparable performance using information from the first session, day, and week of player activity.

preprint2016arXiv

Semantic Properties of Customer Sentiment in Tweets

An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering in particular tweets semantic patterns in consumers' discussions on social media. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities, two forms of positive vs. negative opinions and 2) driving actual content from the textual data that has a semantic trend. The considered tweets include consumers opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study which discover semantic properties of textual data in consumption context beyond sentiment analysis. In addition to major findings, we apply LDA (Latent Dirichlet Allocations) to the same data and drew latent topics that represent consumers' positive opinions and negative opinions on social media.

preprint2016arXiv

Skill-Based Differences in Spatio-Temporal Team Behavior in Defence of The Ancients 2

Multiplayer Online Battle Arena (MOBA) games are among the most played digital games in the world. In these games, teams of players fight against each other in arena environments, and the gameplay is focused on tactical combat. Mastering MOBAs requires extensive practice, as is exemplified in the popular MOBA Defence of the Ancients 2 (DotA 2). In this paper, we present three data-driven measures of spatio-temporal behavior in DotA 2: 1) Zone changes; 2) Distribution of team members and: 3) Time series clustering via a fuzzy approach. We present a method for obtaining accurate positional data from DotA 2. We investigate how behavior varies across these measures as a function of the skill level of teams, using four tiers from novice to professional players. Results indicate that spatio-temporal behavior of MOBA teams is related to team skill, with professional teams having smaller within-team distances and conducting more zone changes than amateur teams. The temporal distribution of the within-team distances of professional and high-skilled teams also generally follows patterns distinct from lower skill ranks.

preprint2016arXiv

Temporal Topic Analysis with Endogenous and Exogenous Processes

We consider the problem of modeling temporal textual data taking endogenous and exogenous processes into account. Such text documents arise in real world applications, including job advertisements and economic news articles, which are influenced by the fluctuations of the general economy. We propose a hierarchical Bayesian topic model which imposes a "group-correlated" hierarchical structure on the evolution of topics over time incorporating both processes, and show that this model can be estimated from Markov chain Monte Carlo sampling methods. We further demonstrate that this model captures the intrinsic relationships between the topic distribution and the time-dependent factors, and compare its performance with latent Dirichlet allocation (LDA) and two other related models. The model is applied to two collections of documents to illustrate its empirical performance: online job advertisements from DirectEmployers Association and journalists' postings on BusinessInsider.com.

Diego Klabjan

What is connected

Connect this record

See the researcher in context

Building this map preview

32 published item(s)

Model-based Bootstrap of Controlled Markov Chains

Robust softmax aggregation on blockchain based federated learning with convergence guarantee

Open-Set Recognition of Breast Cancer Treatments

Topic Analysis for Text with Side Data

Tricks and Plugins to GBM on Images and Sequences

Classification Models for Partially Ordered Sequences

Layer Flexible Adaptive Computational Time

Predicting Shot Making in Basketball Learnt from Adversarial Multiagent Trajectories

A Simple and Fast Algorithm for L1-norm Kernel PCA

Efficient Architecture Search for Continual Learning

Keyword-based Topic Modeling and Keyword Selection

Listwise Learning to Rank by Exploring Unique Ratings

Mixture-based Multiple Imputation Model for Clinical Data with a Temporal Dimension

Neural Network Retraining for Model Serving

Open Set Domain Adaptation by Extreme Value Theory

Open-Set Recognition with Gaussian Mixture Variational Autoencoders

Scale Invariant Power Iteration

Subset Selection for Multiple Linear Regression via Optimization

The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent

Semantic Document Distance Measures and Unsupervised Document Revision Detection

Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling

A Semi-supervised learning approach to enhance health care Community-based Question Answering: A case study in alcoholism

Clustering Time-Series Energy Data from Smart Meters

Evaluating the Performance of Offensive Linemen in the NFL

Going Out of Business: Auction House Behavior in the Massively Multi-Player Online Game

Iteratively Reweighted Least Squares Algorithms for L1-Norm Principal Component Analysis

Predicting litigation likelihood and time to litigation for patents

Predictive Analytics Using Smartphone Sensors for Depressive Episodes

Rapid Prediction of Player Retention in Free-to-Play Mobile Games

Semantic Properties of Customer Sentiment in Tweets

Skill-Based Differences in Spatio-Temporal Team Behavior in Defence of The Ancients 2

Temporal Topic Analysis with Endogenous and Exogenous Processes