
Evaluating Persian Tokenizers

Tokenization plays a significant role in lexical analysis. Tokens become the input for other natural language processing tasks, such as semantic parsing and language modeling. Natural language processing in Persian is challenging because of language-specific cases such as half-spaces, so a precise tokenizer for Persian is crucial. This article introduces the most widely used tokenizers for Persian and compares and evaluates their performance on Persian texts using a simple algorithm with a pre-tagged Persian dependency dataset. Evaluated by F1 score, the hybrid version of Farsi Verb and Hazm with bounded morphemes fixing showed the best performance, with an F1 score of 98.97%.
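The abstract does not spell out the evaluation algorithm, so the following is only a minimal sketch of one common way to score a tokenizer against gold tokens with F1: a predicted token counts as correct when its character span matches a gold span exactly. The function names, the span-matching criterion, and the choice to ignore whitespace and half-spaces (U+200C) when computing offsets are all assumptions made for illustration, not the authors' method.

```python
import re

# Assumption: whitespace and the half-space (ZWNJ, U+200C) are ignored when
# computing offsets, so tokenizers that treat them differently stay aligned.
_IGNORED = re.compile(r"[\s\u200c]")


def token_spans(tokens):
    """Return the (start, end) character span of each token over the
    concatenation of all tokens, after dropping ignored characters."""
    spans = []
    pos = 0
    for tok in tokens:
        length = len(_IGNORED.sub("", tok))
        spans.append((pos, pos + length))
        pos += length
    return spans


def tokenization_f1(gold_tokens, pred_tokens):
    """Span-level F1: a predicted token is a true positive only if a gold
    token covers exactly the same characters."""
    gold = set(token_spans(gold_tokens))
    pred = set(token_spans(pred_tokens))
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example: the gold token keeps the half-space inside one
    # word, while the prediction wrongly splits it into two tokens.
    gold = ["می\u200cروم", "به", "خانه"]
    pred = ["می", "روم", "به", "خانه"]
    print(f"F1 = {tokenization_f1(gold, pred):.3f}")
```

In this example two of the four predicted tokens match gold spans, so precision is 2/4 and recall is 2/3. A corpus-level score would normally sum true positives and token counts over all sentences before computing F1, rather than averaging per-sentence scores.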

Authors: Danial Kamali, Behrooz Janfada, Mohammad Ebrahim Shenasa, Behrouz Minaei-Bidgoli
Topics: Artificial Intelligence, Computation and Language


preprint / 2022
