Researcher profile

Wlodek Zadrozny

Wlodek Zadrozny contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
10 works
0 followers
6 topics
4 close collaborators

Published work

10 published items

Preprint · 2020 · arXiv

A Novel Method of Extracting Topological Features from Word Embeddings

In recent years, topological data analysis has been applied to a wide range of problems involving high-dimensional, noisy data. Although text representations are often high-dimensional and noisy, only a few works apply topological data analysis to natural language processing. In this paper, we introduce a novel algorithm that extracts topological features from word embedding representations of text for use in text classification. Working on word embeddings, topological data analysis can interpret the high-dimensional embedding space and discover relations among different embedding dimensions. We use persistent homology, the most commonly used tool from topological data analysis, for our experiments. Examining our topological algorithm on long textual documents, we show that the topological features we define may outperform conventional text mining features.

Preprint · 2020 · arXiv

Causal Knowledge Extraction from Scholarly Papers in Social Sciences

The scale and scope of scholarly articles today overwhelm human researchers who seek to digest and synthesize knowledge in a timely manner. In this paper, we develop natural language processing (NLP) models to accelerate the extraction of relationships from scholarly papers in the social sciences, identify hypotheses in these papers, and extract the cause-and-effect entities. Specifically, we develop models to 1) classify sentences in scholarly documents in business and management as hypotheses (hypothesis classification), 2) classify these hypotheses as causal relationships or not (causality classification), and, if they are causal, 3) extract the cause and effect entities from these hypotheses (entity extraction). We achieve high performance on all three tasks using different modeling techniques. Our approach may generalize to scholarly documents in a wide range of social sciences, as well as to other types of textual materials.
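The three stages listed in this abstract can be illustrated with a minimal sketch. The abstract describes trained models; the regular expressions below are hypothetical rule-based stand-ins used only to show how the stages chain together, and the names `HYPOTHESIS_RE`, `CAUSAL_RE` and `process` are assumptions, not taken from the paper.

```python
import re
from typing import Dict, Optional

# Stage 1 stand-in: treat sentences labelled "Hypothesis 1:" or "H2a." as hypotheses.
HYPOTHESIS_RE = re.compile(r"^\s*(hypothesis\s+\d+\w*|h\d+\w*)\s*[:.]", re.IGNORECASE)

# Stages 2 and 3 stand-in: a small, illustrative list of causal verbs.
CAUSAL_RE = re.compile(
    r"^(?P<cause>.+?)\s+(?:increases|decreases|leads to|reduces|improves)\s+"
    r"(?P<effect>.+?)\.?$",
    re.IGNORECASE,
)


def process(sentence: str) -> Optional[Dict[str, object]]:
    """Run one sentence through the three stages; None if not a hypothesis."""
    # 1) Hypothesis classification.
    if not HYPOTHESIS_RE.match(sentence):
        return None
    body = HYPOTHESIS_RE.sub("", sentence, count=1).strip()

    # 2) Causality classification: causal only if a causal verb pattern matches.
    match = CAUSAL_RE.match(body)

    # 3) Entity extraction: only attempted for causal hypotheses.
    return {
        "hypothesis": body,
        "causal": match is not None,
        "cause": match.group("cause") if match else None,
        "effect": match.group("effect") if match else None,
    }


if __name__ == "__main__":
    print(process("Hypothesis 1: Employee training increases firm productivity."))
    print(process("H2: Firm size is associated with innovation."))
    print(process("We collected data from 200 firms."))
```

In a realistic setting each stage would be replaced by a learned classifier or sequence tagger; the control flow (filter, classify, then extract) is the only part this sketch is meant to convey.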

Preprint · 2020 · arXiv

Computing Conceptual Distances between Breast Cancer Screening Guidelines: An Implementation of a Near-Peer Epistemic Model of Medical Disagreement

Using natural language processing tools, we investigate the differences in recommendations among medical guidelines for the same decision problem: breast cancer screening. We show that these differences arise from the knowledge brought to the problem by different medical societies, as reflected in the conceptual vocabularies used by the different groups of authors. The computational models we build and analyze agree with the near-peer epistemic model of expert disagreement proposed by Garbayo. Even though the article is a case study focused on one set of guidelines, the proposed methodology is broadly applicable. In addition to proposing a novel graph-based similarity model for comparing collections of documents, we perform an extensive analysis of the model's performance. In a series of a few dozen experiments, in three broad categories, we show, at a very high statistical significance level of 3-4 standard deviations for our best models, that the high similarity between the expert-annotated model and our concept-based, automatically created computational models is not accidental. Our best model achieves roughly 70% similarity. We also describe possible extensions of this work.

Preprint · 2020 · arXiv

Topological Data Analysis in Text Classification: Extracting Features with Additive Information

While the strength of Topological Data Analysis has been explored in many studies on high-dimensional numeric data, applying it to text remains a challenging task. Because the primary goal of topological data analysis is to define and quantify shapes in numeric data, defining shapes in text is much more challenging, even though the geometries of vector spaces and conceptual spaces are clearly relevant for information retrieval and semantics. In this paper, we examine two different methods of extracting topological features from text, using as the underlying word representations the two most popular methods, namely word embeddings and TF-IDF vectors. To extract topological features from the word embedding space, we interpret the embedding of a text document as a high-dimensional time series, and we analyze the topology of the underlying graph whose vertices correspond to different embedding dimensions. For topological data analysis with the TF-IDF representations, we analyze the topology of the graph whose vertices come from the TF-IDF vectors of different blocks in the textual document. In both cases, we apply homological persistence to reveal the geometric structures at different distance resolutions. Our results show that these topological features carry some exclusive information that is not captured by conventional text mining methods. In our experiments, we observe that adding topological features to the conventional features in ensemble models improves the classification results (by up to 5%). On the other hand, as expected, topological features by themselves may not be sufficient for effective classification. It is an open problem whether TDA features from word embeddings might be sufficient, as they seem to perform within a few points of the top results obtained with a linear support vector classifier.
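The word embedding variant described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the distance between two embedding dimensions is taken to be 1 - |Pearson correlation| of their time series (an assumption; the abstract does not specify the metric), and only 0-dimensional persistence (connected components) is computed, using a union-find structure. The function names are hypothetical.

```python
import math
from itertools import combinations


def dimension_distances(doc_embeddings):
    """Pairwise distances between embedding dimensions.

    doc_embeddings is a list of word vectors in reading order, so each
    dimension forms a time series over the tokens. The distance used here,
    1 - |Pearson correlation|, is an assumption made for illustration.
    """
    n_dims = len(doc_embeddings[0])
    series = [[vec[d] for vec in doc_embeddings] for d in range(n_dims)]

    def corr(a, b):
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
        norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
        return cov / (norm_a * norm_b) if norm_a and norm_b else 0.0

    return {
        (i, j): 1.0 - abs(corr(series[i], series[j]))
        for i, j in combinations(range(n_dims), 2)
    }


def h0_persistence(n_vertices, distances):
    """Death times of connected components as the distance threshold grows.

    Every vertex (embedding dimension) is born at threshold 0. Edges are
    added in order of increasing distance; each edge that joins two
    separate components kills one of them at that distance.
    """
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    for (i, j), dist in sorted(distances.items(), key=lambda kv: kv[1]):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(dist)
    return deaths  # n_vertices - 1 finite values; one component never dies


if __name__ == "__main__":
    # Toy "document": 4 tokens, 3 embedding dimensions.
    doc = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]
    print(h0_persistence(3, dimension_distances(doc)))
```

The sorted death times could then serve as a fixed-length feature vector for a classifier. Higher-dimensional persistence (loops and voids) would need a dedicated library and is left out of this sketch.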

Preprint · 2020 · arXiv

UNCC Biomedical Semantic Question Answering Systems. BioASQ: Task-7B, Phase-B

In this paper, we detail our submission to the 2019 BioASQ competition, the challenge's seventh year. We present our approach for Task-7b, Phase B, the Exact Answering Task. These Question Answering (QA) tasks include Factoid, Yes/No, and List type question answering. Our system is based on a contextual word embedding model. We used a system based on Bidirectional Encoder Representations from Transformers (BERT), fine-tuned for the biomedical question answering task using BioBERT. In the third test batch set, our system achieved the highest MRR score for the Factoid question answering task. For the List type question answering task, our system achieved the highest recall score in the fourth test batch set. Along with our detailed approach, we present the results of our submissions, highlight the shortcomings we identified in our current approach, and suggest ways to address them in future experiments.

Preprint · 2014 · arXiv
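The MRR (mean reciprocal rank) score mentioned in this abstract rewards systems that rank a correct answer near the top of their candidate list. The sketch below follows the standard definition of the metric; the function name, the case-insensitive matching and the example data are assumptions for illustration and do not reproduce the official BioASQ evaluation code.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """Average of 1/rank of the first correct candidate per question.

    ranked_answers: one ranked candidate list per question.
    gold_answers:   one list of acceptable answers per question.
    A question with no correct candidate contributes 0.
    """
    if not ranked_answers:
        return 0.0
    total = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        gold_set = {g.lower() for g in gold}
        for rank, candidate in enumerate(candidates, start=1):
            if candidate.lower() in gold_set:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


if __name__ == "__main__":
    # Hypothetical system output for three factoid questions.
    ranked = [["a", "b"], ["x", "y", "z"], ["q"]]
    gold = [["b"], ["x"], ["none"]]
    print(mean_reciprocal_rank(ranked, gold))
```

Here the correct answers appear at ranks 2 and 1 for the first two questions and not at all for the third, so the score is (1/2 + 1 + 0) / 3.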

Watsonsim: Overview of a Question Answering Engine

The objective of the project is to design and run a system similar to Watson, designed to answer Jeopardy questions. In the course of a semester, we developed an open source question answering system using the Indri, Lucene, Bing and Google search engines, Apache UIMA, OpenNLP and CoreNLP, and Weka, among other modules. By the end of the semester, we achieved 18% accuracy on Jeopardy questions, and work has not stopped since then.

Preprint · 1996 · arXiv

Natural Language Processing: Structure and Complexity

We introduce a method for analyzing the complexity of natural language processing tasks, and for predicting the difficulty of new NLP tasks. Our complexity measures are derived from the Kolmogorov complexity of a class of automata, called meaning automata, whose purpose is to extract relevant pieces of information from sentences. Natural language semantics is defined only relative to the set of questions an automaton can answer. The paper shows examples of complexity estimates for various NLP programs and tasks, and some recipes for complexity management. It positions natural language processing as a subdomain of software engineering, and lays down its formal foundation.

Preprint · 1995 · arXiv

Context and ontology in understanding of dialogs

We present a model of NLP in which ontology and context are directly included in a grammar. The model is based on the concept of construction, consisting of a set of features of form, a set of semantic and pragmatic conditions describing its application context, and a description of its meaning. In this model ontology is embedded into the grammar; e.g. the hierarchy of np constructions is based on the corresponding ontology. Ontology is also used in defining contextual parameters; e.g. [current_question time(_)]. A parser based on this model allowed us to build a set of dialog understanding systems that include an on-line calendar, a banking machine, and an insurance quote system. The proposed approach is an alternative to the standard "pipeline" design of morphology-syntax-semantics-pragmatics; the account of meaning conforms to our intuitions about compositionality, but there is no homomorphism from syntax to semantics.

Preprint · 1995 · arXiv

NL Understanding with a Grammar of Constructions

We present an approach to natural language understanding based on a computable grammar of constructions. A "construction" consists of a set of features of form and a description of meaning in a context. A grammar is a set of constructions. This kind of grammar is the key element of Mincal, an implemented natural language, speech-enabled interface to an on-line calendar system. The system consists of an NL grammar, a parser, an on-line calendar, a domain knowledge base (about dates, times and meetings), an application knowledge base (about the calendar), a speech recognizer, a speech generator, and the interfaces between those modules. We claim that this architecture should work in general for spoken interfaces in small domains. In this paper we present two novel aspects of the architecture: (a) the use of constructions, integrating descriptions of form, meaning and context into one whole; and (b) the separation of domain knowledge from application knowledge. We describe the data structures for encoding constructions, the structure of the knowledge bases, and the interactions of the key modules of the system.
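One way to picture a construction as described in this abstract is a record that bundles form, context and meaning. The sketch below is hypothetical: the field names, the dictionary encoding and the `applicable` helper are assumptions made for illustration and are not the data structures used in Mincal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Construction:
    """A form/context/meaning bundle (illustrative encoding only)."""
    name: str
    form: Dict[str, str]                                     # features of form
    context: Dict[str, str] = field(default_factory=dict)    # conditions on the context
    meaning: Dict[str, str] = field(default_factory=dict)    # description of meaning


def applicable(grammar: List[Construction],
               utterance_features: Dict[str, str],
               context: Dict[str, str]) -> List[Construction]:
    """Constructions whose form and context conditions are all satisfied."""
    return [
        c for c in grammar
        if all(utterance_features.get(k) == v for k, v in c.form.items())
        and all(context.get(k) == v for k, v in c.context.items())
    ]


if __name__ == "__main__":
    # A grammar is a set of constructions; two toy entries for a calendar domain.
    grammar = [
        Construction(
            name="time_answer",
            form={"category": "np", "head": "time_expression"},
            context={"current_question": "time"},
            meaning={"act": "set_meeting_time"},
        ),
        Construction(
            name="date_answer",
            form={"category": "np", "head": "date_expression"},
            context={"current_question": "date"},
            meaning={"act": "set_meeting_date"},
        ),
    ]
    features = {"category": "np", "head": "time_expression"}
    print([c.name for c in applicable(grammar, features, {"current_question": "time"})])
```

The point of the encoding is that the same noun phrase can select different constructions depending on the dialog context, which is why context sits in the same record as form and meaning.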

Preprint · 1995 · arXiv

The Compactness of Construction Grammars

We present an argument for construction grammars based on the minimum description length (MDL) principle (a formal version of Ockham's Razor). The argument consists in using linguistic and computational evidence to set up a formal model, and then applying the MDL principle to prove its superiority with respect to alternative models. We show that construction-based representations are at least an order of magnitude more compact than the corresponding lexicalized representations of the same linguistic data. The result is significant for our understanding of the relationship between syntax and semantics, and consequently for choosing NLP architectures. For instance, it bears on whether the processing should proceed in a pipeline from syntax to semantics to pragmatics, and whether all linguistic information should be combined in a set of constraints. From a broader perspective, this paper not only argues for a certain model of processing, but also provides a methodology for determining the advantages of different approaches to NLP.