Topic overview

Information Retrieval

3870 works, 11775 researchers

preprint, 2017, arXiv

Pyndri: a Python Interface to the Indri Search Engine

We introduce pyndri, a Python interface to the Indri search engine. Pyndri provides access to Indri indexes from Python at two levels: (1) the dictionary and tokenized document collection, and (2) query evaluation on the index. We hope that the release of pyndri will stimulate reproducible, open and fast-paced IR research.

preprint, 2016, arXiv

PAMPO: using pattern matching and pos-tagging for effective Named Entities recognition in Portuguese

This paper deals with the entity extraction task (named entity recognition) of a text mining process that aims at unveiling non-trivial semantic structures, such as relationships and interaction between entities or communities. In this paper we present a simple and efficient named entity extraction algorithm. The method, named PAMPO (PAttern Matching and POs tagging based algorithm for NER), relies on flexible pattern matching, part-of-speech tagging and lexical-based rules. It was developed to process texts written in Portuguese; however, it is potentially applicable to other languages as well. We compare our approach with current alternatives that support Named Entity Recognition (NER) for content written in Portuguese: Alchemy, Zemanta and Rembrandt. Evaluation of the efficacy of the entity extraction method on several texts written in Portuguese indicates a considerable improvement in recall and $F_1$ measures.

preprint, 2017, arXiv

Self-Taught Convolutional Neural Networks for Short Text Clustering

Short text clustering is a challenging problem due to the sparseness of its text representation. Here we propose a flexible Self-Taught Convolutional neural network framework for Short Text Clustering (dubbed STC^2), which can flexibly and successfully incorporate more useful semantic features and learn non-biased deep text representations in an unsupervised manner. In our framework, the original raw text features are first embedded into compact binary codes by using an existing unsupervised dimensionality reduction method. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, while the output units are used to fit the pre-trained binary codes in the training process. Finally, we get the optimal clusters by employing K-means to cluster the learned representations. Extensive experimental results demonstrate that the proposed framework is effective and flexible, and that it outperforms several popular clustering methods when tested on three public short text datasets.

preprint, 2017, arXiv

Mixed one-bit compressive sensing with applications to overexposure correction for CT reconstruction

When a measurement falls outside the quantization or measurable range, it becomes saturated and cannot be used in classical reconstruction methods. For example, in C-arm angiography systems, which provide projection radiography, fluoroscopy, and digital subtraction angiography, and are widely used for medical diagnoses and interventions, the limited dynamic range of C-arm flat detectors leads to overexposure in some projections during an acquisition, such as when imaging relatively thin body parts (e.g., the knee). Aiming at overexposure correction for computed tomography (CT) reconstruction, we propose in this paper a mixed one-bit compressive sensing (M1bit-CS) method to acquire information from both regular and saturated measurements. This method is inspired by the recent progress on one-bit compressive sensing, which deals with only sign observations. Its successful applications imply that information carried by saturated measurements is useful to improve recovery quality. For the proposed M1bit-CS model, an alternating direction method of multipliers is developed and an iterative saturation detection scheme is established. Then we evaluate M1bit-CS on one-dimensional signal recovery tasks. In s

preprint, 2017, arXiv

Leveraging Multi-aspect Time-related Influence in Location Recommendation

Point-Of-Interest (POI) recommendation aims to mine a user's visiting history and find her/his potentially preferred places. Although location recommendation methods have been studied and improved extensively, challenges in employing various influences, including the temporal aspect, still remain. Inspired by the fact that time includes numerous granular slots (e.g. minute, hour, day, week, etc.), in this paper we define a new problem: performing recommendation by exploiting all diversified temporal factors. In particular, we argue that most existing methods only focus on a limited number of time-related features and neglect others. Furthermore, considering a specific granularity (e.g. time of a day) in recommendation cannot always apply to each user or each dataset. To address these challenges, we propose a probabilistic generative model, named Multi-aspect Time-related Influence (MATI), to promote POI recommendation. We also develop a novel optimization algorithm based on Expectation Maximization (EM). Our MATI model first detects a user's temporal multivariate orientation using her check-in log in Location-based Social Networks (LBSNs). It then performs reco

preprint, 2017, arXiv

Interactive Movie Recommendation Through Latent Semantic Analysis and Storytelling

Recommendation has become one of the most important components of online services for improving sales records; however, visualization work for online recommendation is still very limited. This paper presents an interactive recommendation approach with the following two components. First, rating records are the most widely used data for online recommendation, but they are often processed in high-dimensional spaces that cannot be easily understood or interacted with. We propose a Latent Semantic Model (LSM) that captures the statistical features of semantic concepts on 2D domains and abstracts user preferences for personal recommendation. Second, we propose an interactive recommendation approach through a storytelling mechanism for promoting the communication between the user and the recommendation system. Our approach emphasizes interactivity, explicit user input, and the conveying of semantic information; thus it can be used by general users without any knowledge of recommendation or visualization algorithms. We validate our model with data statistics and demonstrate our approach with case studies from the MovieLens100K dataset. Our approaches of latent semantic analysis and interactive recom

preprint, 2016, arXiv

Automatic Data Deformation Analysis on Evolving Folksonomy Driven Environment

The Folksodriven framework makes it possible for data scientists to define an ontology environment in which they can search for buried patterns that have some kind of predictive power, and so build predictive models more effectively. It accomplishes this through abstractions that isolate the parameters of the predictive modeling process: searching for patterns and designing the feature set. To reflect the evolving knowledge, this paper considers ontologies based on folksonomies according to a new concept structure called "Folksodriven" to represent folksonomies. The studies on the transformational regulation of the Folksodriven tags are therefore regarded as important for adaptive folksonomy classifications in an evolving environment used by Intelligent Systems to represent knowledge sharing. Folksodriven tags are used to categorize salient data points, "featurizing" the data so that they can be fed to a machine-learning system.

preprint, 2017, arXiv

Collaborative Filtering with Recurrent Neural Networks

We show that collaborative filtering can be viewed as a sequence prediction problem, and that given this interpretation, recurrent neural networks offer a very competitive approach. In particular we study how the long short-term memory (LSTM) can be applied to collaborative filtering, and how it compares to standard nearest neighbors and matrix factorization methods on movie recommendation. We show that the LSTM is competitive in all aspects, and largely outperforms other methods in terms of item coverage and short term predictions.
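
The sequence-prediction view described in this abstract can be illustrated with a small data-preparation sketch. It only shows how timestamped interactions might be turned into (prefix, next item) training pairs; the event tuples, identifiers and function names are hypothetical, and no LSTM is trained here.

```python
from collections import defaultdict

def user_sequences(events):
    """Group (user, item, timestamp) events into per-user item lists ordered by time."""
    by_user = defaultdict(list)
    for user, item, _ts in sorted(events, key=lambda e: e[2]):
        by_user[user].append(item)
    return dict(by_user)

def next_item_pairs(sequence):
    """Turn one item sequence into (prefix, next item) training pairs."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

# Hypothetical interaction log: (user id, movie id, timestamp).
events = [("u1", "m3", 30), ("u1", "m1", 10), ("u2", "m2", 15), ("u1", "m2", 20)]
sequences = user_sequences(events)
pairs = next_item_pairs(sequences["u1"])
```

A sequence model would then be trained to predict the second element of each pair from the first; which model and loss to use is left open here.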

preprint, 2016, arXiv

Who Ordered This?: Exploiting Implicit User Tag Order Preferences for Personalized Image Tagging

What makes a person pick certain tags over others when tagging an image? Does the order that a person presents tags for a given image follow an implicit bias that is personal? Can these biases be used to improve existing automated image tagging systems? We show that tag ordering, which has been largely overlooked by the image tagging community, is an important cue in understanding user tagging behavior and can be used to improve auto-tagging systems. Inspired by the assumption that people order their tags, we propose a new way of measuring tag preferences, and also propose a new personalized tagging objective function that explicitly considers a user's preferred tag orderings. We also provide a (partially) greedy algorithm that produces good solutions to our new objective and under certain conditions produces an optimal solution. We validate our method on a subset of Flickr images that spans 5000 users, over 5200 tags, and over 90,000 images. Our experiments show that exploiting personalized tag orders improves the average performance of state-of-art approaches both on per-image and per-user bases.

preprint, 2016, arXiv

Audio-based Distributional Semantic Models for Music Auto-tagging and Similarity Measurement

The recent development of Audio-based Distributional Semantic Models (ADSMs) enables the computation of audio and lexical vector representations in a joint acoustic-semantic space. In this work, these joint representations are applied to the problem of automatic tag generation. The predicted tags together with their corresponding acoustic representation are exploited for the construction of acoustic-semantic clip embeddings. The proposed algorithms are evaluated on the task of similarity measurement between music clips. Acoustic-semantic models are shown to outperform the state-of-the-art for this task and produce high quality tags for audio/music clips.

preprint, 2016, arXiv

Distributed Real-Time Sentiment Analysis for Big Data Social Streams

The big data trend has forced data-centric systems to handle continuous fast data streams. In recent years, real-time analytics on stream data has formed into a new research field, which aims to answer queries about what-is-happening-now with a negligible delay. The real challenge with real-time stream data processing is that it is impossible to store instances of data, and therefore online analytical algorithms are utilized. To perform real-time analytics, pre-processing of data should be performed in a way that only a short summary of the stream is stored in main memory. In addition, due to the high speed of arrival, the average processing time for each instance of data should be such that incoming instances are not lost without being captured. Lastly, the learner needs to provide high analytical accuracy measures. Sentinel is a distributed system written in Java that aims to solve this challenge by enforcing both the processing and learning process to be done in distributed form. Sentinel is built on top of Apache Storm, a distributed computing platform. Sentinel's learner, Vertical Hoeffding Tree, is a parallel decision tree-learning algorithm based on the VFDT, with ability of ena

preprint, 2016, arXiv

Condensedly: comprehending article contents through condensed texts

Summary: Abstracts in biomedical articles can provide a quick overview of the articles, but detailed information cannot be obtained without reading the full-text contents. Full-text articles certainly provide more information and content; however, accessing full-text documents is usually time consuming. Condensedly is a web-based application that provides readers with an easy and efficient way to access full-text paragraphs, using sentences in abstracts as fishing bait to retrieve the big fish that reside in the full text. Condensedly is based on a paragraph ranking algorithm, which evaluates and ranks full-text paragraphs based on their association scores with sentences in abstracts. Availability: http://140.116.247.185/~research/Condensedly
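
The idea of ranking full-text paragraphs against an abstract sentence can be sketched as follows. The abstract does not define its association score, so this sketch substitutes a plain Jaccard overlap between token sets as a stand-in; the function names and the example texts are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def association(sentence, paragraph):
    """Jaccard overlap between token sets (a stand-in score, not the paper's)."""
    a, b = tokens(sentence), tokens(paragraph)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_paragraphs(abstract_sentence, paragraphs):
    """Return (score, paragraph) pairs sorted from most to least associated."""
    scored = [(association(abstract_sentence, p), p) for p in paragraphs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical full-text paragraphs and abstract sentence.
paragraphs = [
    "The gene was expressed in liver tissue under stress conditions.",
    "Funding was provided by a national grant.",
]
ranking = rank_paragraphs("Gene expression in liver tissue", paragraphs)
```

Any other sentence-to-paragraph similarity could be dropped into `association` without changing the ranking step.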

preprint, 2016, arXiv

Finding Influential Institutions in Bibliographic Information Networks

Ranking in bibliographic information networks is a widely studied problem due to its many applications such as advertisement industry, funding, search engines, etc. Most of the existing works on ranking in bibliographic information network are based on ranking of research papers and their authors. But the bibliographic information network can be used for solving other important problems as well. The KDD Cup 2016 competition considers one such problem, which is to measure the impact of research institutions, i.e. to perform ranking of research institutions. The competition took place in three phases. In this paper, we discuss our solutions for ranking institutions in each phase. We participated under team name "anu@TASL" and our solutions achieved the average NDCG@20 score of 0.7483, ranking in eleventh place in the contest.
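
For readers unfamiliar with the NDCG@20 figure quoted above, here is a minimal sketch of NDCG@k. It assumes the common formulation with linear gain and a log2 rank discount; the competition may have used a different gain function, and the relevance values below are made up for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k normalised by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores, listed in the order a system ranked the items.
ranked_relevances = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked_relevances, 20)
```

A perfectly ordered list scores 1.0, and any misordering of items with different relevance lowers the score.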

preprint, 2016, arXiv

A deep learning approach for predicting the quality of online health expert question-answering services

Currently, a growing number of health consumers are asking health-related questions online, at any time and from anywhere, which effectively lowers the cost of health care. The most common approach is using online health expert question-answering (HQA) services, as health consumers are more willing to trust answers from professional physicians. However, these answers can be of varying quality depending on circumstance. In addition, as the available HQA services grow, how to predict the answer quality of HQA services via machine learning becomes increasingly important and challenging. In an HQA service, answers are normally short texts, which are severely affected by the data sparsity problem. Furthermore, HQA services lack community features such as best answer and user votes. Therefore, the wisdom of the crowd is not available to rate answer quality. To address these problems, in this paper, the prediction of HQA answer quality is defined as a classification task. First, based on the characteristics of HQA services and feedback from medical experts, a standard for HQA service answer quality evaluation is defined. Next, based on the characteristics of HQA services, several novel no

preprint, 2016, arXiv

JU_KS_Group@FIRE 2016: Consumer Health Information Search

In this paper, we describe the methodology used and the results we obtained for the shared task on Consumer Health Information Search (CHIS), collocated with the Forum for Information Retrieval Evaluation (FIRE) 2016, ISI Kolkata. The shared task consists of two sub-tasks: (1) task 1: given a query and a document or set of documents associated with that query, classify the sentences in the document as relevant to the query or not; and (2) task 2: further classify the relevant sentences as supporting or opposing the claim made in the query. We participated in both sub-tasks. The accuracy obtained by our system for task 1 was 73.39%, which was the third highest among the 9 teams that participated in the shared task.

preprint, 2016, arXiv

Low-dimensional Query Projection based on Divergence Minimization Feedback Model for Ad-hoc Retrieval

Low-dimensional word vectors have long been used in a wide range of applications in natural language processing. In this paper we shed light on estimating query vectors in ad-hoc retrieval, where only limited information is available in the original query. Pseudo-relevance feedback (PRF) is a well-known technique for updating query language models and expanding the queries with a number of relevant terms. We formulate the query updating in low-dimensional spaces first by rotating the query vector and then by scaling it. These consecutive steps are embedded in a query-specific projection matrix capturing both angle and scaling. In this paper we propose a new, though not necessarily the most effective, technique for PRF in language modeling, based on the query projection algorithm. We learn an embedded coefficient matrix for each query, whose aim is to improve the vector representation of the query by transforming it to a more reliable space, and then update the query language model. The proposed embedded coefficient divergence minimization model (ECDMM) takes top-ranked documents retrieved by the query and obtains a couple of positive and negative sample sets; these samples are used for l

preprint, 2016, arXiv

Classification and Learning-to-rank Approaches for Cross-Device Matching at CIKM Cup 2016

In this paper, we propose two methods for tackling the problem of cross-device matching for online advertising at CIKM Cup 2016. The first method treats the matching problem as a binary classification task and solves it using ensemble learning techniques. The second method defines the matching problem as a ranking task and solves it using learning-to-rank algorithms. The results are promising for both methods, with the ranking-based method outperforming the classification-based method on the task.

preprint, 2016, arXiv

Development of UMLS Based Health Care Web Services for Android Platform

In this fast-developing world of information, the amount of medical knowledge is rising at an exponential rate. The UMLS (Unified Medical Language System) is a rich knowledge base consisting of files and software that provide many health and biomedical vocabularies and standards. A Web service is a web solution to facilitate machine-to-machine interaction over a network. A few UMLS web services are currently available for portable devices, but most of them lack efficiency and performance. It is proposed to develop Android-based web services for healthcare systems built on the rich knowledge source of UMLS. An experimental evaluation was carried out to analyse efficiency and performance with and without the designed prototype. Understandability and interaction were greater for participants who used the prototype than for those who used alternate sources to obtain the answers to their questions. The overall performance indicates that the system is convenient and easy to use. The results of the evaluation indicate that the designed system retrieves pertinent information better than syntactic searches.

preprint, 2016, arXiv

Latent Tree Models for Hierarchical Topic Detection

We present a novel method for hierarchical topic detection where topics are obtained by clustering documents in multiple ways. Specifically, we model document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relati

preprint, 2016, arXiv

Video Stream Retrieval of Unseen Queries using Semantic Memory

Retrieval of live, user-broadcast video streams is an under-addressed and increasingly relevant challenge. The on-line nature of the problem requires temporal evaluation and the unforeseeable scope of potential queries motivates an approach which can accommodate arbitrary search queries. To account for the breadth of possible queries, we adopt a no-example approach to query retrieval, which uses a query's semantic relatedness to pre-trained concept classifiers. To adapt to shifting video content, we propose memory pooling and memory welling methods that favor recent information over long past content. We identify two stream retrieval tasks, instantaneous retrieval at any particular time and continuous retrieval over a prolonged duration, and propose means for evaluating them. Three large scale video datasets are adapted to the challenge of stream retrieval. We report results for our search methods on the new stream retrieval tasks, as well as demonstrate their efficacy in a traditional, non-streaming video task.

preprint, 2016, arXiv

Exploiting sparsity to build efficient kernel based collaborative filtering for top-N item recommendation

The increasing availability of implicit feedback datasets has raised interest in developing effective collaborative filtering techniques able to deal asymmetrically with unambiguous positive feedback and ambiguous negative feedback. In this paper, we propose a principled kernel-based collaborative filtering method for top-N item recommendation with implicit feedback. We present an efficient implementation using the linear kernel, and we show how to generalize it to kernels of the dot product family while preserving the efficiency. We also investigate the elements that influence the sparsity of a standard cosine kernel. This analysis shows that the sparsity of the kernel strongly depends on the properties of the dataset, in particular on the long tail distribution. We compare our method with state-of-the-art algorithms, achieving good results in terms of both efficiency and effectiveness.
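
To make the linear-kernel idea concrete, here is a minimal sketch of top-N recommendation from implicit feedback using item-item dot products. It is a generic item-based scorer, not the method proposed in the paper; the matrix, function names and tie-breaking rule are assumptions chosen for illustration.

```python
def linear_kernel(ratings):
    """Item-item dot products over a binary user-item matrix (rows are users)."""
    n_items = len(ratings[0])
    return [[sum(row[i] * row[j] for row in ratings) for j in range(n_items)]
            for i in range(n_items)]

def top_n(ratings, user, n):
    """Rank a user's unseen items by summed kernel similarity to the seen items."""
    kernel = linear_kernel(ratings)
    seen = ratings[user]
    scores = {
        j: sum(kernel[i][j] for i, flag in enumerate(seen) if flag)
        for j, flag in enumerate(seen) if not flag
    }
    # Highest score first; ties broken by item index for determinism.
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]

# Hypothetical implicit feedback: 1 = interaction observed, 0 = unknown.
ratings = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
recommended = top_n(ratings, user=0, n=2)
```

In this toy matrix most kernel entries involving the last item are zero, which is the kind of sparsity an efficient implementation can exploit.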

preprint, 2016, arXiv

A Scalable Document-based Architecture for Text Analysis

Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps (e.g., stem or lemma extraction, part-of-speech tagging, named entities recognition...), and performance and scaling issues. Existing text analysis architectures partly solve these issues, providing restrictive data schemas, addressing only one aspect of text preprocessing and focusing on one single task when dealing with performance optimization. As a result, no definite solution is currently available. Thus, we propose in this paper a new generic text analysis architecture, where document structure is flexible, many preprocessing techniques are integrated and textual datasets are indexed for efficient access. We implement our conceptual architecture using both a relational and a document-oriented database. Our experiments demonstrate the feasibility of our approach and the superiority of the document-oriented logical and physical implementation.

preprint, 2016, arXiv

Data-Driven Relevance Judgments for Ranking Evaluation

Ranking evaluation metrics are a fundamental element of design and improvement efforts in information retrieval. We observe that most popular metrics disregard information portrayed in the scores used to derive rankings, when available. This may pose a numerical scaling problem, causing an under- or over-estimation of the evaluation depending on the degree of divergence between the scores of ranked items. The purpose of this work is to propose a principled way of quantifying multi-graded relevance judgments of items and enable a more accurate penalization of ordering errors in rankings. We propose a data-driven generation of relevance functions based on the degree of the divergence amongst a set of items' scores and its application in the evaluation metric Normalized Discounted Cumulative Gain (nDCG). We use synthetic data to demonstrate the interest of our proposal and a combination of data on news items from Google News and their respective popularity in Twitter to show its performance in comparison to the standard nDCG. Results show that our proposal is capable of providing a more fine-grained evaluation of rankings when compared to the standard nDCG, and that the latter fre

preprint, 2016, arXiv

Towards End-to-End Audio-Sheet-Music Retrieval

This paper demonstrates the feasibility of learning to retrieve short snippets of sheet music (images) when given a short query excerpt of music (audio) -- and vice versa --, without any symbolic representation of music or scores. This would be highly useful in many content-based musical retrieval scenarios. Our approach is based on Deep Canonical Correlation Analysis (DCCA) and learns correlated latent spaces allowing for cross-modality retrieval in both directions. Initial experiments with relatively simple monophonic music show promising results.