Source author record

Juan-Manuel Torres-Moreno

Juan-Manuel Torres-Moreno appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Computation and Language, Information Retrieval, Artificial Intelligence, cs.CY, Information Theory, math.CO, math.IT, math.OC, Networking and Internet Architecture

Catalog footprint

What is connected

23 works

9 topics

4 close collaborators

preprint, 2026, arXiv

Classifying several dialectal Nawatl varieties

Mexico is a country with a large number of indigenous languages, among which the most widely spoken is Nawatl, with more than two million current speakers (mainly in North and Central America). Despite its rich cultural heritage, which dates back to the 15th century, Nawatl is a language with few computational resources. The problem is compounded when it comes to its dialectal varieties, of which approximately 30 are recognised, not counting the different spellings in the written forms of the language. In this research work, we address the problem of classifying Nawatl varieties using Machine Learning and Neural Networks.

preprint, 2020, arXiv

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Multi-Sentence Compression (MSC) aims to generate a short sentence with the key information from a cluster of similar sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes an Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords, with the goal of generating more informative sentences while maintaining their grammaticality. Our system is of good quality and outperforms the state of the art in evaluations conducted on news datasets in three languages: French, Portuguese and Spanish. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. In additional tests, which take advantage of the fact that the length of compressions can be modulated, we still improve ROUGE scores with shorter output sentences.

preprint, 2020, arXiv

Audio Summarization with Audio Features and Probability Distribution Divergence

The automatic summarization of multimedia sources is an important task that helps a person understand a source by condensing it while preserving the relevant information. In this paper we focus on audio summarization based on audio features and probability distribution divergence. Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached. It takes into account the segment's length, position and informativeness value. The informativeness of each segment is obtained by mapping a set of audio features derived from its Mel-frequency Cepstral Coefficients and their corresponding Jensen-Shannon divergence score. Results over a multi-evaluator scheme show that our approach provides understandable and informative summaries.

preprint, 2020, arXiv

Automatic Discourse Segmentation: an evaluation in French

In this article, we describe some discursive segmentation methods as well as a preliminary evaluation of the segmentation quality. Although our experiments were carried out on documents in French, we have developed three discursive segmentation models based solely on resources simultaneously available in several languages: marker lists and statistical POS tagging. We have also carried out automatic evaluations of these systems against the Annodis corpus, which is a manually annotated reference. The results obtained are very encouraging.

preprint, 2020, arXiv

Automatic Discourse Segmentation: Review and Perspectives

Multilingual discourse parsing is a very prominent research topic. The first stage of discourse parsing is discourse segmentation. The study reported in this article reviews two discourse segmenters available online (for English and Portuguese). We evaluate the possibility of developing similar discourse segmenters for Spanish, French and African languages.

preprint, 2020, arXiv

Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish

Semantic neologisms (SN) are defined as words that acquire a new meaning while maintaining their form. Given the nature of this kind of neologism, the task of identifying these new word meanings is currently performed manually by specialists at observatories of neology. To detect SN in a semi-automatic way, we developed a system that implements a combination of the following strategies: topic modeling, keyword extraction, and word sense disambiguation. The role of topic modeling is to detect the themes that are treated in the input text. Themes within a text give clues about the particular meaning of the words that are used; for example, viral has one meaning in the context of computer science (CS) and another when talking about health. To extract keywords, we used TextRank with POS tag filtering. With this method, we can obtain relevant words that are already part of the Spanish lexicon. We use a deep learning model to determine if a given keyword could have a new meaning. Embeddings that are different from all the known meanings (or topics) indicate that a word might be a valid SN candidate. In this study, we examine the following word embedding models: Word2Vec, Sense2Vec, and FastText. The models were trained with equivalent parameters using Wikipedia in Spanish as the corpus. Then we used a list of words and their concordances (obtained from our database of neologisms) to show the different embeddings that each model yields. Finally, we present a comparison of these outcomes with the concordances of each word to show how we can determine if a word could be a valid candidate for SN.

preprint, 2020, arXiv

Extending Text Informativeness Measures to Passage Interestingness Evaluation (Language Model vs. Word Embedding)

Standard informativeness measures used to evaluate Automatic Text Summarization mostly rely on n-gram overlap between the automatic summary and the reference summaries. These measures differ in the metric they use (cosine, ROUGE, Kullback-Leibler, Logarithm Similarity, etc.) and the bag of terms they consider (single words, word n-grams, entities, nuggets, etc.). Recent word embedding approaches offer a continuous alternative to discrete approaches based on the presence/absence of a text unit. Informativeness measures have been extended to Focus Information Retrieval evaluation involving a user's information need represented by short queries. In particular, for the CLEF-INEX Tweet Contextualization task, tweet contents have been considered as queries. In this paper we define the concept of Interestingness as a generalization of Informativeness, whereby the information need is diverse and formalized as an unknown set of implicit queries. We then study the ability of state-of-the-art Informativeness measures to cope with this generalization. We show that, within this new framework, standard word embeddings outperform discrete measures only on uni-grams, whereas bi-grams seem to be a key point of interestingness evaluation. Lastly, we show that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.

preprint, 2020, arXiv

Generación automática de frases literarias en español

In this work we present a survey of the state of the art in the area of Computational Creativity (CC). In particular, we address the automatic generation of literary sentences in Spanish. We propose three models of text generation based mainly on statistical algorithms and shallow parsing analysis. We also present some rather encouraging preliminary results.

preprint, 2020, arXiv

Intweetive Text Summarization

The amount of user-generated content from various social media allows analysts to keep a wide view of conversations on several topics related to their business. Nevertheless, keeping up to date with this amount of information is not humanly feasible. Automatic Summarization then provides an interesting means of digesting the dynamics and the sheer volume of content. In this paper, we address the issue of tweet summarization, which remains scarcely explored. We propose to automatically generate summaries of Micro-Blog conversations dealing with the e-reputation of public figures. These summaries are generated using keyword queries or a sample tweet and offer a focused view of the whole Micro-Blog network. Since the state of the art is lacking on this point, we conduct and evaluate our experiments on the multilingual CLEF RepLab Topic-Detection dataset according to an experimental evaluation process.

preprint, 2020, arXiv

LiSSS: A toy corpus of Spanish Literary Sentences for Emotions detection

In this work we present a new small dataset in the Computational Creativity (CC) field, the Spanish Literary Sentences for emotions detection corpus (LISSS). This corpus of literary sentences is intended for evaluating or designing algorithms for emotion classification and detection. We built the corpus by manually classifying the sentences into a set of emotions: Love, Fear, Happiness, Anger and Sadness/Pain. We also present some baseline classification algorithms applied to our corpus. The LISSS corpus will be available to the community as a free resource to evaluate or create CC-like algorithms.

preprint, 2020, arXiv

Predicting Personalized Academic and Career Roads: First Steps Toward a Multi-Uses Recommender System

Nobody knows what they will do in the future, and everyone would give a different answer to the question: how do you see yourself five years after your current job/diploma? In this paper we introduce concepts, large categories of fields of study or job domains, in order to represent the user's vision of their future trajectory. Then, we show how these concepts can influence the prediction when proposing a set of next steps to the user.

preprint, 2020, arXiv

Visual Simplified Characters' Emotion Emulator Implementing OCC Model

In this paper, we present a visual emulator of the emotions seen in characters in stories. This system is based on a simplified view of the cognitive structure of emotions proposed by Ortony, Clore and Collins (OCC Model). The goal of this paper is to provide a visual platform that allows us to observe changes in the characters' different emotions, and the intricate interrelationships between: 1) each character's emotions, 2) their affective relationships and actions, 3) the events that take place in the development of a plot, and 4) the objects of desire that make up the emotional map of any story. This tool was tested on stories with a contrasting variety of emotional and affective environments: Othello, Twilight, and Harry Potter. It behaved sensibly and in keeping with the atmosphere in which the characters were immersed.

preprint, 2016, arXiv

LIA-RAG: a system based on graphs and divergence of probabilities applied to Speech-To-Text Summarization

This paper introduces a new algorithm for automatic speech-to-text summarization based on statistical divergences of probabilities and graphs. The input is a noisy text transcribed from speech conversations, and the output is a compact text summary. Our results on the French corpus of the CCCS Multiling 2015 pilot task are very encouraging.

preprint, 2015, arXiv

Optimisation using Natural Language Processing: Personalized Tour Recommendation for Museums

This paper proposes a new method to provide personalized tour recommendations for museum visits. It combines an optimization of visitors' preference criteria with an automatic extraction of artwork importance from museum information, based on Natural Language Processing using textual energy. This project includes researchers from computer and social sciences. Results obtained with numerical experiments show that our model clearly improves the satisfaction of a visitor who follows the proposed tour. This work foreshadows some interesting outcomes and applications for on-demand personalized museum visits in the very near future.

preprint, 2015, arXiv

Regroupement sémantique de définitions en espagnol

This article focuses on the description and evaluation of a new unsupervised learning method for clustering definitions in Spanish according to their semantics. Textual Energy was used as a clustering measure, and we study an adaptation of Precision and Recall to evaluate our method.

preprint, 2015, arXiv

Trivergence of Probability Distributions, at glance

In this paper we introduce the intuitive notion of trivergence of probability distributions (TPD). This notion allows us to calculate the similarity among triplets of objects. For this computation, we can use well-known measures of probability divergence such as Kullback-Leibler and Jensen-Shannon. Divergence measures may be used in Information Retrieval tasks such as Automatic Text Summarization and Text Classification, among many others.
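
The two pairwise divergences named in this abstract can be sketched in a few lines of Python. This is only an illustration of the standard Kullback-Leibler and Jensen-Shannon definitions (base-2 logarithms); the abstract does not say how the trivergence combines them over a triplet, so that step is deliberately not shown. The function names `kl_divergence` and `js_divergence` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes p and q are same-length discrete distributions and that
    q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by 1 bit."""
    # m is the midpoint distribution; it is positive wherever p or q is.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.3, 0.5]
    print(js_divergence(p, q), js_divergence(q, p))
```

Unlike Kullback-Leibler, the Jensen-Shannon form stays finite when one distribution assigns zero probability to an event, which is convenient for sparse word distributions.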

preprint, 2015, arXiv

Un résumeur à base de graphes, indépéndant de la langue

In this paper we present REG, a graph-based approach for studying a fundamental problem of Natural Language Processing (NLP): automatic text summarization. The algorithm maps a document to a graph, then computes the weight of its sentences. We have applied this approach to summarize documents in three languages.

preprint, 2012, arXiv

Artex is AnotheR TEXt summarizer

This paper describes Artex, another algorithm for Automatic Text Summarization. In order to rank sentences, a simple inner product is calculated between each sentence, a document vector (text topic) and a lexical vector (vocabulary used by a sentence). Summaries are then generated by assembling the highest-ranked sentences. No rule-based linguistic post-processing is necessary in order to obtain summaries. Tests over several datasets (coming from Document Understanding Conferences (DUC), Text Analysis Conferences (TAC), evaluation campaigns, etc.) in French, English and Spanish have shown that the summarizer achieves interesting results.
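
The ranking idea in this abstract can be illustrated with a rough sketch. The abstract does not give the exact formula, so the code below makes explicit assumptions: the document ("topic") vector is taken as the mean frequency of each word over all sentences, the lexical weight of a sentence as its mean word frequency over the vocabulary, and the score as the inner product of the sentence with the topic vector multiplied by that lexical weight. The function name `artex_like_scores` and the whitespace tokenization are hypothetical simplifications.

```python
from collections import Counter


def artex_like_scores(sentences):
    """Score sentences with an inner product against a topic vector.

    Assumed scheme (not confirmed by the abstract):
      topic[w]   = mean count of word w over all sentences
      lexical[i] = mean count over the vocabulary for sentence i
      score[i]   = (sentence_i . topic) * lexical[i]
    """
    counts = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted({w for c in counts for w in c})
    if not vocab:
        return [0.0] * len(sentences)

    n_sent, n_words = len(sentences), len(vocab)
    topic = {w: sum(c[w] for c in counts) / n_sent for w in vocab}

    scores = []
    for c in counts:
        lexical = sum(c.values()) / n_words
        inner = sum(freq * topic[w] for w, freq in c.items())
        scores.append(inner * lexical)
    return scores


if __name__ == "__main__":
    docs = ["cats chase mice", "cats sleep", "dogs bark"]
    for sentence, score in zip(docs, artex_like_scores(docs)):
        print(f"{score:.3f}  {sentence}")
```

A summary would then be assembled from the top-scoring sentences, kept in their original order.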

preprint, 2012, arXiv

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

In Automatic Text Summarization, preprocessing is an important phase to reduce the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even using normalization on large texts, the curse of dimensionality can disturb the performance of summarizers. This paper describes a new method for normalization of words to further reduce the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced with this representation, but can often dramatically improve the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.
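
The normalization described here, reducing each word to its initial letters, is simple enough to sketch directly. The snippet below is a minimal illustration, assuming lowercasing and whitespace tokenization; the prefix length `n` and the function names are hypothetical choices, not parameters stated in the abstract.

```python
def ultra_stem(word: str, n: int = 1) -> str:
    """Reduce a word to its first n letters (lowercased)."""
    return word.lower()[:n]


def ultra_stem_text(text: str, n: int = 1) -> list[str]:
    """Ultra-stem every purely alphabetic token of a text."""
    return [ultra_stem(tok, n) for tok in text.split() if tok.isalpha()]


if __name__ == "__main__":
    text = "Summarizers summarize summaries of summarized texts"
    tokens = text.lower().split()
    stems = ultra_stem_text(text, n=4)
    # The vocabulary shrinks because related forms share a prefix.
    print(len(set(tokens)), "distinct words ->", len(set(stems)), "distinct stems")
```

With a very short prefix many unrelated words collapse together, which is exactly the aggressive reduction of the representation space that the abstract refers to.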

preprint, 2012, arXiv

Condensés de textes par des méthodes numériques

Since information in electronic form is already standard, and the variety and quantity of information keep growing, methods for summarizing or automatically condensing texts are a critical phase of text analysis. This article describes CORTEX, a system based on numerical methods that produces a condensation of a text independently of the topic and of the length of the text. The structure of the system enables it to produce abstracts in French or Spanish in very short times.

preprint, 2012, arXiv

Sentence Compression in Spanish driven by Discourse Segmentation and Language Models

Previous work demonstrated that Automatic Text Summarization (ATS) by sentence extraction may be improved using sentence compression. In this work we present a sentence compression approach guided by sentence-level discourse segmentation and probabilistic language models (LM). The results presented here show that the proposed solution is able to generate coherent summaries with grammatical compressed sentences. The approach is simple enough to be transposed to other languages.

preprint, 2010, arXiv

Improving Update Summarization by Revisiting the MMR Criterion

This paper describes a method for multi-document update summarization that relies on a double maximization criterion. A Maximal Marginal Relevance-like criterion, modified and called Smmr, is used to select sentences that are close to the topic and, at the same time, distant from sentences in documents already read. Summaries are then generated by assembling the highest-ranked material and applying some rule-based linguistic post-processing in order to reduce length and maintain coherence. Through participation in the Text Analysis Conference (TAC) 2008 evaluation campaign, we have shown that our method achieves promising results.
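
The selection principle described here (close to the topic, far from what was already read) can be sketched with the classic linear MMR trade-off. This is not the Smmr criterion itself, whose exact form the abstract does not give; the bag-of-words cosine similarity, the weight `lam`, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def mmr_rank(candidates, topic, history, lam=0.7):
    """Rank candidates by topic relevance minus redundancy with history.

    score(s) = lam * sim(s, topic) - (1 - lam) * max_h sim(s, h)
    """
    def score(sentence):
        redundancy = max((cosine(sentence, h) for h in history), default=0.0)
        return lam * cosine(sentence, topic) - (1 - lam) * redundancy

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    ranked = mmr_rank(
        candidates=["the flood hit the city", "flood damage estimates rise"],
        topic="flood damage city",
        history=["the flood hit the city"],
    )
    print(ranked)
```

In the usage example, the sentence already present in the history is pushed down even though it matches the topic, which is the behaviour an update summarizer needs.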

preprint, 2010, arXiv

Solving the Frequency Assignment Problem by Site Availability and Constraint Programming

The efficient use of bandwidth for radio communications becomes more and more crucial when developing new information technologies and their applications. The core issues are addressed by the so-called Frequency Assignment Problems (FAP). Our work investigates static FAP, where an attempt is first made to configure a kernel of links. We study the problem based on the concepts and techniques of Constraint Programming and integrate the site availability concept. Numerical simulations conducted on scenarios provided by CELAR are very promising.
