Source author record

Guy Lebanon

Guy Lebanon appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning, Computation and Language, Information Retrieval, Computer Vision, Cryptography and Security, Digital Libraries, Graphics, Human-Computer Interaction

Catalog footprint

What is connected

20 works

8 topics

4 close collaborators


preprint · 2014 · arXiv

Fast Spammer Detection Using Structural Rank

Comments on products and news articles are growing rapidly and have become a medium for measuring the quality of products or services. Consequently, spammers have emerged in this area to bias comments in their favor. In this paper, we propose an efficient spammer detection method using the structural rank of author-specific term-document matrices. The use of structural rank was found to be effective and far faster than similar methods.
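For readers unfamiliar with the term, the structural rank of a sparse matrix is the size of a maximum matching between rows and columns over its nonzero pattern, and it is an upper bound on the numerical rank. The sketch below computes it with simple augmenting paths. The function name and the toy matrices are illustrative assumptions, and this is not the detection pipeline described in the abstract.

```python
def structural_rank(rows):
    """Size of a maximum row-column matching over the nonzero pattern.

    `rows` is a list of rows, each a list of numbers (dense, for clarity).
    """
    # Column indices holding a nonzero entry, per row.
    pattern = [[j for j, v in enumerate(row) if v != 0] for row in rows]
    match_of_col = {}  # column index -> row index currently matched to it

    def try_assign(i, seen):
        # Try to match row i, re-routing earlier rows if necessary.
        for j in pattern[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_of_col or try_assign(match_of_col[j], seen):
                match_of_col[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(rows)))


# Hypothetical term-document count matrices (rows: terms, columns: documents).
spread_out = [[1, 0, 0],
              [0, 2, 0],
              [0, 1, 3]]
# All nonzeros sit in one row, so at most one row-column pair can be matched.
concentrated = [[1, 1, 1],
                [0, 0, 0],
                [0, 0, 0]]
print(structural_rank(spread_out), structural_rank(concentrated))
```

Because only the sparsity pattern is inspected, no floating-point factorization is needed, which is consistent with the speed advantage the abstract reports.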

preprint · 2013 · arXiv

A Linear Approximation to the chi^2 Kernel with Geometric Convergence

We propose a new analytical approximation to the $\chi^2$ kernel that converges geometrically. The analytical approximation is derived with elementary methods and adapts to the input distribution for optimal convergence rate. Experiments show the new approximation leads to improved performance in image classification and semantic segmentation tasks using a random Fourier feature approximation of the $\exp$-$\chi^2$ kernel. Besides, out-of-core principal component analysis (PCA) methods are introduced to reduce the dimensionality of the approximation and achieve better performance at the expense of only an additional constant factor to the time complexity. Moreover, when PCA is performed jointly on the training and unlabeled testing data, further performance improvements can be obtained. Experiments conducted on the PASCAL VOC 2010 segmentation and the ImageNet ILSVRC 2010 datasets show statistically significant improvements over alternative approximation methods.
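As background, the exact kernels that such approximations target can be written in a few lines. The sketch below uses one common convention for histogram inputs; scaling constants differ between papers, and the `gamma` parameter and function names are assumptions here. It does not reproduce the paper's linear approximation.

```python
import math


def additive_chi2_kernel(x, y):
    """k(x, y) = sum_i 2 x_i y_i / (x_i + y_i), skipping empty bins."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_distance(x, y):
    """chi^2(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i), skipping empty bins."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def exp_chi2_kernel(x, y, gamma=1.0):
    """exp(-gamma * chi^2(x, y)); gamma is an assumed bandwidth parameter."""
    return math.exp(-gamma * chi2_distance(x, y))


# Two hypothetical normalized histograms.
x = [0.5, 0.25, 0.25, 0.0]
y = [0.25, 0.25, 0.0, 0.5]
print(additive_chi2_kernel(x, y), chi2_distance(x, y), exp_chi2_kernel(x, y))
```

For histograms that each sum to one, the two quantities are linked by chi2_distance(x, y) = 2 * (1 - additive_chi2_kernel(x, y)) under this convention.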

preprint · 2013 · arXiv

Beyond Sentiment: The Manifold of Human Emotions

Sentiment analysis predicts the presence of positive or negative emotions in a text document. In this paper we consider higher dimensional extensions of the sentiment concept, which represent a richer set of human emotions. Our approach goes beyond previous work in that our model contains a continuous manifold rather than a finite set of human emotions. We investigate the resulting model, compare it to psychological observations, and explore its predictive capabilities. Besides obtaining significant improvements over a baseline without the manifold, we are also able to visualize different notions of positive sentiment in different domains.

preprint · 2013 · arXiv

Local Space-Time Smoothing for Version Controlled Documents

Unlike static documents, version controlled documents are continuously edited by one or more authors. Such a collaborative revision process makes traditional modeling and visualization techniques inappropriate. In this paper we propose a new representation based on local space-time smoothing that captures important revision patterns. We demonstrate the applicability of our framework using experiments on synthetic and real-world data.

preprint · 2013 · arXiv

Matrix Approximation under Local Low-Rank Assumption

Matrix approximation is a common tool in machine learning for building accurate prediction models for recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low-rank. We propose a new matrix approximation model where we assume instead that the matrix is only locally of low-rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy in recommendation tasks.
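The "weighted sum of low-rank matrices" idea can be illustrated with a toy combination step. In the sketch below the local models are rank-1 matrices and the per-entry weights are simply given as inputs; how the models are fitted and how the weights are derived is not shown, and every name and number is a hypothetical choice rather than the paper's procedure.

```python
def outer(u, v):
    """Rank-1 matrix u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]


def combine_local_models(models, weights):
    """Entry-wise weighted average of several local models.

    models[t][i][j]  : prediction of local model t for entry (i, j)
    weights[t][i][j] : non-negative weight of model t at entry (i, j)
    """
    n_rows, n_cols = len(models[0]), len(models[0][0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            total = sum(w[i][j] for w in weights)
            mixed = sum(w[i][j] * m[i][j] for w, m in zip(weights, models))
            out[i][j] = mixed / total
    return out


# Two hypothetical rank-1 local models and their per-entry weights.
model_a = outer([1.0, 2.0], [1.0, 1.0])   # [[1, 1], [2, 2]]
model_b = outer([1.0, 1.0], [3.0, 5.0])   # [[3, 5], [3, 5]]
weights_a = [[1.0, 1.0], [1.0, 3.0]]
weights_b = [[1.0, 3.0], [1.0, 1.0]]
estimate = combine_local_models([model_a, model_b], [weights_a, weights_b])
print(estimate)
```

Although each local model has rank 1, the combined estimate in this example has rank 2, which is the sense in which a locally low-rank model can be more expressive than a single global low-rank factorization.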

preprint · 2013 · arXiv

The Manifold of Human Emotions

Sentiment analysis predicts the presence of positive or negative emotions in a text document. In this paper, we consider higher dimensional extensions of the sentiment concept, which represent a richer set of human emotions. Our approach goes beyond previous work in that our model contains a continuous manifold rather than a finite set of human emotions. We investigate the resulting model, compare it to psychological observations, and explore its predictive capabilities.

preprint · 2012 · arXiv

A Comparative Study of Collaborative Filtering Algorithms

Collaborative filtering is a rapidly advancing research area. Every year several new techniques are proposed and yet it is not clear which of the techniques work best and under what conditions. In this paper we conduct a study comparing several collaborative filtering techniques -- both classic and recent state-of-the-art -- in a variety of experimental contexts. Specifically, we report conclusions controlling for number of items, number of users, sparsity level, performance criteria, and computational complexity. Our conclusions identify what algorithms work well and in what conditions, and contribute both to the industrial deployment of collaborative filtering algorithms and to the research community.

preprint · 2012 · arXiv

An Extended Cencov-Campbell Characterization of Conditional Information Geometry

We formulate and prove an axiomatic characterization of conditional information geometry, for both the normalized and the nonnormalized cases. This characterization extends the axiomatic derivation of the Fisher geometry by Cencov and Campbell to the cone of positive conditional models, and as a special case to the manifold of conditional distributions. Due to the close connection between the conditional I-divergence and the product Fisher information metric, the characterization provides a new axiomatic interpretation of the primal problems underlying logistic regression and AdaBoost.

preprint · 2012 · arXiv

Cumulative Revision Map

Unlike static documents, version-controlled documents are edited by one or more authors over a certain period of time. Examples include large scale computer code, papers authored by a team of scientists, and online discussion boards. Such a collaborative revision process makes traditional document modeling and visualization techniques inappropriate. In this paper we propose a new visualization technique for version-controlled documents that reveals interesting authoring patterns in papers, computer code and Wikipedia articles. The revealed authoring patterns are useful for readers, participants in the authoring process, and supervisors.

preprint · 2012 · arXiv

Domain Knowledge Uncertainty and Probabilistic Parameter Constraints

Incorporating domain knowledge into the modeling process is an effective way to improve learning accuracy. However, as it is provided by humans, domain knowledge can only be specified with some degree of uncertainty. We propose to explicitly model such uncertainty through probabilistic constraints over the parameter space. In contrast to hard parameter constraints, our approach is effective also when the domain knowledge is inaccurate and generally results in superior modeling accuracy. We focus on generative and conditional modeling where the parameters are assigned a Dirichlet or Gaussian prior and demonstrate the framework with experiments on both synthetic and real-world data.

preprint · 2012 · arXiv

Learning Riemannian Metrics

We propose a solution to the problem of estimating a Riemannian metric associated with a given differentiable manifold. The metric learning problem is based on minimizing the relative volume of a given set of points. We derive the details for a family of metrics on the multinomial simplex. The resulting metric has applications in text classification and bears some similarity to the TF-IDF representation of text documents.

preprint · 2012 · arXiv

Sequential Document Representations and Simplicial Curves

The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present a continuous and differentiable sequential document representation that goes beyond the bag of words assumption, and yet is efficient and effective. This representation employs smooth curves in the multinomial simplex to account for sequential information. We discuss the representation and its geometric properties and demonstrate its applicability for the task of text classification.
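One way to picture a curve in the multinomial simplex is to compute a kernel-weighted word histogram at each relative position in the document. The sketch below does this with a Gaussian kernel. The kernel choice, the `sigma` bandwidth, the additive `smoothing` constant, and the function name are all assumptions made for illustration, not the construction used in the paper.

```python
import math


def local_histogram(words, vocab, mu, sigma=0.2, smoothing=0.01):
    """Word distribution around relative position mu in [0, 1].

    Each word occurrence is weighted by a Gaussian kernel on the distance
    between its relative position and mu. Every word must be in `vocab`.
    """
    n = len(words)
    counts = {w: smoothing for w in vocab}
    for pos, w in enumerate(words):
        t = (pos + 0.5) / n  # relative position of this occurrence
        counts[w] += math.exp(-0.5 * ((t - mu) / sigma) ** 2)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# A hypothetical document whose tone changes halfway through.
doc = "good good good bad bad bad".split()
vocab = ["good", "bad"]
curve = [local_histogram(doc, vocab, mu / 4) for mu in range(5)]
for point in curve:
    print(point)
```

A plain bag of words would assign this document equal mass on both words, while the sequence of local histograms moves from mostly "good" to mostly "bad" as the position advances.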

preprint · 2012 · arXiv

Smooth Sparse Coding via Marginal Regression for Learning Sparse Representations

We propose and analyze a novel framework for learning sparse representations, based on two statistical techniques: kernel smoothing and marginal regression. The proposed approach provides a flexible framework for incorporating feature similarity or temporal information present in data sets, via non-parametric kernel smoothing. We provide generalization bounds for dictionary learning using smooth sparse coding and show how the sample complexity depends on the L1 norm of the kernel function used. Furthermore, we propose using marginal regression for obtaining sparse codes, which significantly improves the speed and allows one to scale to large dictionary sizes easily. We demonstrate the advantages of the proposed approach, in terms of both accuracy and speed, by extensive experimentation on several real data sets. In addition, we demonstrate how the proposed approach could be used for improving semi-supervised sparse coding.
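Marginal regression, in its generic form, regresses the signal on each dictionary atom separately and then thresholds the resulting coefficients, which avoids solving a joint optimization problem. The sketch below keeps the `k` largest coefficients in absolute value; that thresholding rule, the function name, and the toy dictionary are assumptions, and the kernel-smoothing part of the framework is not shown.

```python
def marginal_regression_codes(x, dictionary, k):
    """Sparse code for x: one univariate regression per atom, keep top k."""
    coefs = []
    for atom in dictionary:
        norm_sq = sum(a * a for a in atom)
        # Least-squares coefficient of x regressed on this atom alone.
        coefs.append(sum(a * b for a, b in zip(atom, x)) / norm_sq)
    # Indices of the k coefficients with the largest magnitude.
    keep = set(sorted(range(len(coefs)),
                      key=lambda j: abs(coefs[j]),
                      reverse=True)[:k])
    return [c if j in keep else 0.0 for j, c in enumerate(coefs)]


# Hypothetical dictionary of four atoms in three dimensions.
atoms = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
signal = [3.0, 0.5, -2.0]
codes = marginal_regression_codes(signal, atoms, k=2)
print(codes)
```

Each coefficient costs one inner product, so the work grows linearly with the number of atoms, which is in line with the scalability the abstract describes.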

preprint · 2012 · arXiv

Statistical Translation, Heat Kernels and Expected Distances

High dimensional structured data such as text and images is often poorly understood and misrepresented in statistical modeling. The standard histogram representation suffers from high variance and performs poorly in general. We explore novel connections between statistical translation, heat kernels on manifolds and graphs, and expected distances. These connections provide a new framework for unsupervised metric learning for text documents. Experiments indicate that the resulting distances are generally superior to their more standard counterparts.

preprint · 2012 · arXiv

The Landmark Selection Method for Multiple Output Prediction

Conditional modeling $x \to y$ is a central problem in machine learning. A substantial research effort is devoted to such modeling when $x$ is high dimensional. We consider, instead, the case of a high dimensional $y$, where $x$ is either low dimensional or high dimensional. Our approach is based on selecting a small subset $y_L$ of the dimensions of $y$, and proceeding by modeling (i) $x \to y_L$ and (ii) $y_L \to y$. Composing these two models, we obtain a conditional model $x \to y$ that possesses convenient statistical properties. Multi-label classification and multivariate regression experiments on several datasets show that this model outperforms the one vs. all approach as well as several sophisticated multiple output prediction methods.
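The two-stage composition can be sketched on a toy regression problem. The code below assumes a scalar input, a single landmark dimension chosen in advance, and simple linear models for both stages; the paper's landmark selection step is not reproduced, and all names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with scalar x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def fit_landmark_model(xs, ys, landmark):
    """Fit x -> y_L, then y_L -> y, and return the composed predictor."""
    y_l = [y[landmark] for y in ys]
    stage1 = fit_line(xs, y_l)                       # (i)  x -> y_L
    stage2 = [fit_line(y_l, [y[d] for y in ys])      # (ii) y_L -> each y_d
              for d in range(len(ys[0]))]

    def predict(x):
        a, b = stage1
        y_l_hat = a * x + b
        return [c * y_l_hat + d for c, d in stage2]

    return predict


# Hypothetical data: every output dimension is a linear function of the first.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [[2 * x + 1, 4 * x + 2, -2 * x - 1] for x in xs]
predict = fit_landmark_model(xs, ys, landmark=0)
print(predict(4.0))
```

Only the first stage sees the input, so when the input is expensive to model, the cost of that stage depends on the number of landmarks rather than on the full output dimension.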

preprint · 2010 · arXiv

Asymptotic Analysis of Generative Semi-Supervised Learning

Semi-supervised learning has emerged as a popular framework for improving modeling accuracy while controlling labeling cost. Based on an extension of stochastic composite likelihood we quantify the asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free analysis by providing an alternative framework to measure the value associated with different labeling policies and resolve the fundamental question of how much data to label and in what manner. We demonstrate our approach with both simulation studies and real world experiments using naive Bayes for text classification and MRFs and CRFs for structured prediction in NLP.

preprint · 2010 · arXiv

Estimating Probabilities in Recommendation Systems

Recommendation systems are emerging as an important business application with significant economic impact. Currently popular systems include Amazon's book recommendations, Netflix's movie recommendations, and Pandora's music recommendations. In this paper we address the problem of estimating probabilities associated with recommendation system data using non-parametric kernel smoothing. In our estimation we interpret missing items as randomly censored observations and obtain efficient computation schemes using combinatorial properties of generating functions. We demonstrate our approach with several case studies involving real world movie recommendation data. The results are comparable with state-of-the-art techniques while also providing probabilistic preference estimates outside the scope of traditional recommender systems.

preprint · 2010 · arXiv

Linguistic Geometries for Unsupervised Dimensionality Reduction

Text documents are complex high dimensional objects. To effectively visualize such data it is important to reduce its dimensionality and visualize the low dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore dimensionality reduction methods that draw upon domain knowledge in order to achieve a better low dimensional embedding and visualization of documents. We consider the use of geometries specified manually by an expert, geometries derived automatically from corpus statistics, and geometries computed from linguistic resources.

preprint · 2010 · arXiv

Statistical and Computational Tradeoffs in Stochastic Composite Likelihood

Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolves the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators and provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness, thereby increasing statistical accuracy.

preprint · 2010 · arXiv

Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels

Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled dataset. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional datasets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever.
