Source author record

Taraka Rama

Taraka Rama appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language Artificial Intelligence Computation

Catalog footprint

What is connected

9works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

What do complexity measures measure? Correlating and validating corpus-based measures of morphological complexity

We present an analysis of eight measures used for quantifying morphological complexity of natural languages. The measures we study are corpus-based measures of morphological complexity with varying requirements for corpus annotation. We present similarities and differences between these measures visually and through correlation analyses, as well as their relation to the relevant typological variables. Our analysis focuses on whether these `measures' are measures of the same underlying variable, or whether they measure more than one dimension of morphological complexity. The principal component analysis indicates that the first principal component explains 92.62 % of the variation in eight measures, indicating a strong linear dependence between the complexity measures studied.

preprint2021arXiv

Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling?

Development of language proficiency models for non-native learners has been an active area of interest in NLP research for the past few years. Although language proficiency is multidimensional in nature, existing research typically considers a single "overall proficiency" while building models. Further, existing approaches also considers only one language at a time. This paper describes our experiments and observations about the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification. We report experiments with three languages -- German, Italian, and Czech -- and model seven dimensions of proficiency ranging from vocabulary control to sociolinguistic appropriateness. Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, none of the features achieve consistently best performance for all dimensions of language proficiency. All code, data and related supplementary material can be found at: https://github.com/nishkalavallabhi/MultidimCEFRScoring.

preprint2016arXiv

Chinese Restaurant Process for cognate clustering: A threshold free approach

In this paper, we introduce a threshold free approach, motivated from Chinese Restaurant Process, for the purpose of cognate clustering. We show that our approach yields similar results to a linguistically motivated cognate clustering system known as LexStat. Our Chinese Restaurant Process system is fast and does not require any threshold and can be applied to any language family of the world.

preprint2016arXiv

Siamese convolutional networks based on phonetic features for cognate identification

In this paper, we explore the use of convolutional networks (ConvNets) for the purpose of cognate identification. We compare our architecture with binary classifiers based on string similarity measures on different language families. Our experiments show that convolutional networks achieve competitive results across concepts and across language families at the task of cognate identification.

preprint2014arXiv

Does Syntactic Knowledge help English-Hindi SMT?

In this paper we explore various parameter settings of the state-of-art Statistical Machine Translation system to improve the quality of the translation for a `distant' language pair like English-Hindi. We proposed new techniques for efficient reordering. A slight improvement over the baseline is reported using these techniques. We also show that a simple pre-processing step can improve the quality of the translation significantly.

preprint2014arXiv

Empirical Evaluation of Tree distances for Parser Evaluation

In this empirical study, I compare various tree distance measures -- originally developed in computational biology for the purpose of tree comparison -- for the purpose of parser evaluation. I will control for the parser setting by comparing the automatically generated parse trees from the state-of-the-art parser Charniak, 2000) with the gold-standard parse trees. The article describes two different tree distance measures (RF and QD) along with its variants (GRF and GQD) for the purpose of parser evaluation. The article will argue that RF measure captures similar information as the standard EvalB metric (Sekine and Collins, 1997) and the tree edit distance (Zhang and Shasha, 1989) applied by Tsarfaty et al. (2011). Finally, the article also provides empirical evidence by reporting high correlations between the different tree distances and EvalB metric's scores.

preprint2014arXiv

Gap-weighted subsequences for automatic cognate identification and phylogenetic inference

In this paper, we describe the problem of cognate identification and its relation to phylogenetic inference. We introduce subsequence based features for discriminating cognates from non-cognates. We show that subsequence based features perform better than the state-of-the-art string similarity measures for the purpose of cognate identification. We use the cognate judgments for the purpose of phylogenetic inference and observe that these classifiers infer a tree which is close to the gold standard tree. The contribution of this paper is the use of subsequence features for cognate identification and to employ the cognate judgments for phylogenetic inference.

preprint2014arXiv

Properties of phoneme N -grams across the world's language families

In this article, we investigate the properties of phoneme N-grams across half of the world's languages. We investigate if the sizes of three different N-gram distributions of the world's language families obey a power law. Further, the N-gram distributions of language families parallel the sizes of the families, which seem to obey a power law distribution. The correlation between N-gram distributions and language family sizes improves with increasing values of N. We applied statistical tests, originally given by physicists, to test the hypothesis of power law fit to twelve different datasets. The study also raises some new questions about the use of N-gram distributions in linguistic research, which we answer by running a statistical test.

preprint2014arXiv

Quantitative methods for Phylogenetic Inference in Historical Linguistics: An experimental case study of South Central Dravidian

In this paper we examine the usefulness of two classes of algorithms Distance Methods, Discrete Character Methods (Felsenstein and Felsenstein 2003) widely used in genetics, for predicting the family relationships among a set of related languages and therefore, diachronic language change. Applying these algorithms to the data on the numbers of shared cognates- with-change and changed as well as unchanged cognates for a group of six languages belonging to a Dravidian language sub-family given in Krishnamurti et al. (1983), we observed that the resultant phylogenetic trees are largely in agreement with the linguistic family tree constructed using the comparative method of reconstruction with only a few minor differences. Furthermore, we studied these minor differences and found that they were cases of genuine ambiguity even for a well-trained historical linguist. We evaluated the trees obtained through our experiments using a well-defined criterion and report the results here. We finally conclude that quantitative methods like the ones we examined are quite useful in predicting family relationships among languages. In addition, we conclude that a modest degree of confidence attached to the intuition that there could indeed exist a parallelism between the processes of linguistic and genetic change is not totally misplaced.

Taraka Rama

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

What do complexity measures measure? Correlating and validating corpus-based measures of morphological complexity

Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling?

Chinese Restaurant Process for cognate clustering: A threshold free approach

Siamese convolutional networks based on phonetic features for cognate identification

Does Syntactic Knowledge help English-Hindi SMT?

Empirical Evaluation of Tree distances for Parser Evaluation

Gap-weighted subsequences for automatic cognate identification and phylogenetic inference

Properties of phoneme N -grams across the world's language families

Quantitative methods for Phylogenetic Inference in Historical Linguistics: An experimental case study of South Central Dravidian