Researcher profile

Nisheeth Joshi

Nisheeth Joshi contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
24works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

24 published item(s)

preprint2022arXiv

Design and Development of Rule-based open-domain Question-Answering System on SQuAD v2.0 Dataset

Human mind is the palace of curious questions that seek answers. Computational resolution of this challenge is possible through Natural Language Processing techniques. Statistical techniques like machine learning and deep learning require a lot of data to train and despite that they fail to tap into the nuances of language. Such systems usually perform best on close-domain datasets. We have proposed development of a rule-based open-domain question-answering system which is capable of answering questions of any domain from a corresponding context passage. We have used 1000 questions from SQuAD 2.0 dataset for testing the developed system and it gives satisfactory results. In this paper, we have described the structure of the developed system and have analyzed the performance.

preprint2015arXiv

Classifier-Based Text Simplification for Improved Machine Translation

Machine Translation is one of the research fields of Computational Linguistics. The objective of many MT Researchers is to develop an MT System that produce good quality and high accuracy output translations and which also covers maximum language pairs. As internet and Globalization is increasing day by day, we need a way that improves the quality of translation. For this reason, we have developed a Classifier based Text Simplification Model for English-Hindi Machine Translation Systems. We have used support vector machines and Naïve Bayes Classifier to develop this model. We have also evaluated the performance of these classifiers.

preprint2014arXiv

Quality Estimation Of Machine Translation Outputs Through Stemming

Machine Translation is the challenging problem for Indian languages. Every day we can see some machine translators being developed, but getting a high quality automatic translation is still a very distant dream . The correct translated sentence for Hindi language is rarely found. In this paper, we are emphasizing on English-Hindi language pair, so in order to preserve the correct MT output we present a ranking system, which employs some machine learning techniques and morphological features. In ranking no human intervention is required. We have also validated our results by comparing it with human ranking.

preprint2014arXiv

Shiva: A Framework for Graph Based Ontology Matching

Since long, corporations are looking for knowledge sources which can provide structured description of data and can focus on meaning and shared understanding. Structures which can facilitate open world assumptions and can be flexible enough to incorporate and recognize more than one name for an entity. A source whose major purpose is to facilitate human communication and interoperability. Clearly, databases fail to provide these features and ontologies have emerged as an alternative choice, but corporations working on same domain tend to make different ontologies. The problem occurs when they want to share their data/knowledge. Thus we need tools to merge ontologies into one. This task is termed as ontology matching. This is an emerging area and still we have to go a long way in having an ideal matcher which can produce good results. In this paper we have shown a framework to matching ontologies using graphs.

preprint2014arXiv

Shiva++: An Enhanced Graph based Ontology Matcher

With the web getting bigger and assimilating knowledge about different concepts and domains, it is becoming very difficult for simple database driven applications to capture the data for a domain. Thus developers have come out with ontology based systems which can store large amount of information and can apply reasoning and produce timely information. Thus facilitating effective knowledge management. Though this approach has made our lives easier, but at the same time has given rise to another problem. Two different ontologies assimilating same knowledge tend to use different terms for the same concepts. This creates confusion among knowledge engineers and workers, as they do not know which is a better term then the other. Thus we need to merge ontologies working on same domain so that the engineers can develop a better application over it. This paper shows the development of one such matcher which merges the concepts available in two ontologies at two levels; 1) at string level and 2) at semantic level; thus producing better merged ontologies. We have used a graph matching technique which works at the core of the system. We have also evaluated the system and have tested its performance with its predecessor which works only on string matching. Thus current approach produces better results.

preprint2013arXiv

Analysing Quality of English-Hindi Machine Translation Engine Outputs Using Bayesian Classification

This paper considers the problem for estimating the quality of machine translation outputs which are independent of human intervention and are generally addressed using machine learning techniques.There are various measures through which a machine learns translations quality. Automatic Evaluation metrics produce good co-relation at corpus level but cannot produce the same results at the same segment or sentence level. In this paper 16 features are extracted from the input sentences and their translations and a quality score is obtained based on Bayesian inference produced from training data.

preprint2013arXiv

Automatic Ranking of MT Outputs using Approximations

Since long, research on machine translation has been ongoing. Still, we do not get good translations from MT engines so developed. Manual ranking of these outputs tends to be very time consuming and expensive. Identifying which one is better or worse than the others is a very taxing task. In this paper, we show an approach which can provide automatic ranks to MT outputs (translations) taken from different MT Engines and which is based on N-gram approximations. We provide a solution where no human intervention is required for ranking systems. Further we also show the evaluations of our results which show equivalent results as that of human ranking.

preprint2013arXiv

Development of a Hindi Lemmatizer

We live in a translingual society, in order to communicate with people from different parts of the world we need to have an expertise in their respective languages. Learning all these languages is not at all possible; therefore we need a mechanism which can do this task for us. Machine translators have emerged as a tool which can perform this task. In order to develop a machine translator we need to develop several different rules. The very first module that comes in machine translation pipeline is morphological analysis. Stemming and lemmatization comes under morphological analysis. In this paper we have created a lemmatizer which generates rules for removing the affixes along with the addition of rules for creating a proper root word.

preprint2013arXiv

Development of Marathi Part of Speech Tagger Using Statistical Approach

Part-of-speech (POS) tagging is a process of assigning the words in a text corresponding to a particular part of speech. A fundamental version of POS tagging is the identification of words as nouns, verbs, adjectives etc. For processing natural languages, Part of Speech tagging is a prominent tool. It is one of the simplest as well as most constant and statistical model for many NLP applications. POS Tagging is an initial stage of linguistics, text analysis like information retrieval, machine translator, text to speech synthesis, information extraction etc. In POS Tagging we assign a Part of Speech tag to each word in a sentence and literature. Various approaches have been proposed to implement POS taggers. In this paper we present a Marathi part of speech tagger. It is morphologically rich language. Marathi is spoken by the native people of Maharashtra. The general approach used for development of tagger is statistical using Unigram, Bigram, Trigram and HMM Methods. It presents a clear idea about all the algorithms with suitable examples. It also introduces a tag set for Marathi which can be used for tagging Marathi text. In this paper we have shown the development of the tagger as well as compared to check the accuracy of taggers output. The three Marathi POS taggers viz. Unigram, Bigram, Trigram and HMM gives the accuracy of 77.38%, 90.30%, 91.46% and 93.82% respectively.

preprint2013arXiv

HEVAL: Yet Another Human Evaluation Metric

Machine translation evaluation is a very important activity in machine translation development. Automatic evaluation metrics proposed in literature are inadequate as they require one or more human reference translations to compare them with output produced by machine translation. This does not always give accurate results as a text can have several different translations. Human evaluation metrics, on the other hand, lacks inter-annotator agreement and repeatability. In this paper we have proposed a new human evaluation metric which addresses these issues. Moreover this metric also provides solid grounds for making sound assumptions on the quality of the text produced by a machine translation.

preprint2013arXiv

Human and Automatic Evaluation of English-Hindi Machine Translation

For the past 60 years, Research in machine translation is going on. For the development in this field, a lot of new techniques are being developed each day. As a result, we have witnessed development of many automatic machine translators. A manager of machine translation development project needs to know the performance increase/decrease, after changes have been done in his system. Due to this reason, a need for evaluation of machine translation systems was felt. In this article, we shall present the evaluation of some machine translators. This evaluation will be done by a human evaluator and by some automatic evaluation metrics, which will be done at sentence, document and system level. In the end we shall also discuss the comparison between the evaluations.

preprint2013arXiv

Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration

Machine Translation for Indian languages is an emerging research area. Transliteration is one such module that we design while designing a translation system. Transliteration means mapping of source language text into the target language. Simple mapping decreases the efficiency of overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming assisted transliteration.We have shown that much of the content in Gujarati gets transliterated while being processed for translation to Hindi language.

preprint2013arXiv

Improving the Quality of MT Output using Novel Name Entity Translation Scheme

This paper presents a novel approach to machine translation by combining the state of art name entity translation scheme. Improper translation of name entities lapse the quality of machine translated output. In this work, name entities are transliterated by using statistical rule based approach. This paper describes the translation and transliteration of name entities from English to Punjabi. We have experimented on four types of name entities which are: Proper names, Location names, Organization names and miscellaneous. Various rules for the purpose of syllabification have been constructed. Transliteration of name entities is accomplished with the help of Probability calculation. N-Gram probabilities for the extracted syllables have been calculated using statistical machine translation toolkit MOSES.

preprint2013arXiv

Part of Speech Tagging of Marathi Text Using Trigram Method

In this paper we present a Marathi part of speech tagger. It is a morphologically rich language. It is spoken by the native people of Maharashtra. The general approach used for development of tagger is statistical using trigram Method. The main concept of trigram is to explore the most likely POS for a token based on given information of previous two tags by calculating probabilities to determine which is the best sequence of a tag. In this paper we show the development of the tagger. Moreover we have also shown the evaluation done.

preprint2013arXiv

Quality Estimation of English-Hindi Outputs using Naive Bayes Classifier

In this paper we present an approach for estimating the quality of machine translation system. There are various methods for estimating the quality of output sentences, but in this paper we focus on Naïve Bayes classifier to build model using features which are extracted from the input sentences. These features are used for finding the likelihood of each of the sentences of the training data which are then further used for determining the scores of the test data. On the basis of these scores we determine the class labels of the test data.

preprint2013arXiv

Rule Based Stemmer in Urdu

Urdu is a combination of several languages like Arabic, Hindi, English, Turkish, Sanskrit etc. It has a complex and rich morphology. This is the reason why not much work has been done in Urdu language processing. Stemming is used to convert a word into its respective root form. In stemming, we separate the suffix and prefix from the word. It is useful in search engines, natural language processing and word processing, spell checkers, word parsing, word frequency and count studies. This paper presents a rule based stemmer for Urdu. The stemmer that we have discussed here is used in information retrieval. We have also evaluated our results by verifying it with a human expert.

preprint2013arXiv

Rule Based Transliteration Scheme for English to Punjabi

Machine Transliteration has come out to be an emerging and a very important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper transliteration of name entities plays a very significant role in improving the quality of machine translation. In this paper we are doing machine transliteration for English-Punjabi language pair using rule based approach. We have constructed some rules for syllabification. Syllabification is the process to extract or separate the syllable from the words. In this we are calculating the probabilities for name entities (Proper names and location). For those words which do not come under the category of name entities, separate probabilities are being calculated by using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities we are transliterating our input text from English to Punjabi.

preprint2013arXiv

Subjective and Objective Evaluation of English to Urdu Machine Translation

Machine translation is research based area where evaluation is very important phenomenon for checking the quality of MT output. The work is based on the evaluation of English to Urdu Machine translation. In this research work we have evaluated the translation quality of Urdu language which has been translated by using different Machine Translation systems like Google, Babylon and Ijunoon. The evaluation process is done by using two approaches - Human evaluation and Automatic evaluation. We have worked for both the approaches where in human evaluation emphasis is given to scales and parameters while in automatic evaluation emphasis is given to some automatic metric such as BLEU, GTM, METEOR and ATEC.

preprint2012arXiv

Design of English-Hindi Translation Memory for Efficient Translation

Developing parallel corpora is an important and a difficult activity for Machine Translation. This requires manual annotation by Human Translators. Translating same text again is a useless activity. There are tools available to implement this for European Languages, but no such tool is available for Indian Languages. In this paper we present a tool for Indian Languages which not only provides automatic translations of the previously available translation but also provides multiple translations, in cases where a sentence has multiple translations, in ranked list of suggestive translations for a sentence. Moreover this tool also lets translators have global and local saving options of their work, so that they may share it with others, which further lightens the task.

preprint2012arXiv

Evaluation of Computational Grammar Formalisms for Indian Languages

Natural Language Parsing has been the most prominent research area since the genesis of Natural Language Processing. Probabilistic Parsers are being developed to make the process of parser development much easier, accurate and fast. In Indian context, identification of which Computational Grammar Formalism is to be used is still a question which needs to be answered. In this paper we focus on this problem and try to analyze different formalisms for Indian languages.

preprint2012arXiv

Input Scheme for Hindi Using Phonetic Mapping

Written Communication on Computers requires knowledge of writing text for the desired language using Computer. Mostly people do not use any other language besides English. This creates a barrier. To resolve this issue we have developed a scheme to input text in Hindi using phonetic mapping scheme. Using this scheme we generate intermediate code strings and match them with pronunciations of input text. Our system show significant success over other input systems available.

preprint2012arXiv

OntoAna: Domain Ontology for Human Anatomy

Today, we can find many search engines which provide us with information which is more operational in nature. None of the search engines provide domain specific information. This becomes very troublesome to a novice user who wishes to have information in a particular domain. In this paper, we have developed an ontology which can be used by a domain specific search engine. We have developed an ontology on human anatomy, which captures information regarding cardiovascular system, digestive system, skeleton and nervous system. This information can be used by people working in medical and health care domain.

preprint2012arXiv

Plagiarism Detection: Keeping Check on Misuse of Intellectual Property

Today, Plagiarism has become a menace. Every journal editor or conference organizers has to deal with this problem. Simply Copying or rephrasing of text without giving due credit to the original author has become more common. This is considered to be an Intellectual Property Theft. We are developing a Plagiarism Detection Tool which would deal with this problem. In this paper we discuss the common tools available to detect plagiarism and their short comings and the advantages of our tool over these tools.