Source author record

Lucas Hosseini

Lucas Hosseini appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Computation and Language Information Retrieval math.CO Data Structures and Algorithms Databases Machine Learning Performance

Catalog footprint

What is connected

6works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Improving Wikipedia Verifiability with AI

Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not support a given claim or become obsolete once the original source is updated or deleted. Hence, maintaining and improving the quality of Wikipedia references is an important challenge and there is a pressing need for better tools to assist humans in this effort. Here, we show that the process of improving references can be tackled with the help of artificial intelligence (AI). We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web. We train this model on existing Wikipedia references, therefore learning from the contributions and combined wisdom of thousands of Wikipedia editors. Using crowd-sourcing, we observe that for the top 10% most likely citations to be tagged as unverifiable by our system, humans prefer our system's suggested alternatives compared to the originally cited reference 70% of the time. To validate the applicability of our system, we built a demo to engage with the English-speaking Wikipedia community and find that Side's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims according to Side. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia. More generally, we hope that our work can be used to assist fact checking efforts and increase the general trustworthiness of information online.

preprint2022arXiv

Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points~\citep{Benchmark}. However, deploying recent advances in embedding based techniques for search, recommendation and ranking at scale require ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis their hardware cost. This competition compares ANNS algorithms at billion-scale by hardware cost, accuracy and performance. We set up an open source evaluation framework and leaderboards for both standardized and specialized hardware. The competition involves three tracks. The standard hardware track T1 evaluates algorithms on an Azure VM with limited DRAM, often the bottleneck in serving billion-scale indices, where the embedding data can be hundreds of GigaBytes in size. It uses FAISS~\citep{Faiss17} as the baseline. The standard hardware track T2 additional allows inexpensive SSDs in addition to the limited DRAM and uses DiskANN~\citep{DiskANN19} as the baseline. The specialized hardware track T3 allows any hardware configuration, and again uses FAISS as the baseline. We compiled six diverse billion-scale datasets, four newly released for this competition, that span a variety of modalities, data types, dimensions, deep learning models, distance functions and sources. The outcome of the competition was ranked leaderboards of algorithms in each track based on recall at a query throughput threshold. Additionally, for track T3, separate leaderboards were created based on recall as well as cost-normalized and power-normalized query throughput.

preprint2022arXiv

Unsupervised Dense Information Retrieval with Contrastive Learning

Recently, information retrieval has seen the emergence of dense retrievers, using neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new applications with no training data, and are outperformed by unsupervised term-frequency methods such as BM25. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers and show that it leads to strong performance in various retrieval settings. On the BEIR benchmark our unsupervised model outperforms BM25 on 11 out of 15 datasets for the Recall@100. When used as pre-training before fine-tuning, either on a few thousands in-domain examples or on the large MS~MARCO dataset, our contrastive model leads to improvements on the BEIR benchmark. Finally, we evaluate our approach for multi-lingual retrieval, where training data is even scarcer than for English, and show that our approach leads to strong unsupervised performance. Our model also exhibits strong cross-lingual transfer when fine-tuned on supervised English data only and evaluated on low resources language such as Swahili. We show that our unsupervised models can perform cross-lingual retrieval between different scripts, such as retrieving English documents from Arabic queries, which would not be possible with term matching methods.

preprint2020arXiv

A Memory Efficient Baseline for Open Domain Question Answering

Recently, retrieval systems based on dense representations have led to important improvements in open-domain question answering, and related tasks. While very effective, this approach is also memory intensive, as the dense vectors for the whole knowledge source need to be kept in memory. In this paper, we study how the memory footprint of dense retriever-reader systems can be reduced. We consider three strategies to reduce the index size: dimension reduction, vector quantization and passage filtering. We evaluate our approach on two question answering benchmarks: TriviaQA and NaturalQuestions, showing that it is possible to get competitive systems using less than 6Gb of memory.

preprint2016arXiv

Treeable Graphings Are Local Limits of Finite Graphs

Let $\mathbf G$ be a graphing, that is a Borel graph defined by $d$ measure preserving involutions. We prove that if $\mathbf G$ is {\em treeable} then it arises as the local limit of some sequence $(G_n)_{n\in\mathbb{N}}$ of graphs with maximum degree at most $d$. This extends a result by Elek [G. Elek, Note on limits of finite graphs, Combinatorica 27 (2007)] (for $\mathbf G$ a treeing) and consequently extends the domain of the graphings for which Aldous-Lyons conjecture is known to be true.

preprint2015arXiv

Limits of Structures and the Example of Tree-Semilattices

The notion of left convergent sequences of graphs introduced by Lov\' asz et al. (in relation with homomorphism densities for fixed patterns and Szemerédi's regularity lemma) got increasingly studied over the past $10$ years. Recently, Ne\v set\v ril and Ossona de Mendez introduced a general framework for convergence of sequences of structures. In particular, the authors introduced the notion of $QF$-convergence, which is a natural generalization of left-convergence. In this paper, we initiate study of $QF$-convergence for structures with functional symbols by focusing on the particular case of tree semi-lattices. We fully characterize the limit objects and give an application to the study of left convergence of $m$-partite cographs, a generalization of cographs.

Lucas Hosseini

What is connected

Connect this record

See the researcher in context

Building this map preview

6 published item(s)

Improving Wikipedia Verifiability with AI

Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

Unsupervised Dense Information Retrieval with Contrastive Learning

A Memory Efficient Baseline for Open Domain Question Answering

Treeable Graphings Are Local Limits of Finite Graphs

Limits of Structures and the Example of Tree-Semilattices