A Comprehensive Evaluation Of Shekar: An Open-Source Python Framework For State-Of-The-Art Persian Natural Language Processing And Computational Linguistics
Abstract
Purpose: This study introduces and comprehensively evaluates Shekar, an open-source Python toolkit engineered to address the persistent challenges in processing the morphologically rich and low-resource Persian language. The framework is specifically designed to bridge the gap between complex linguistic phenomena and the computational demands of state-of-the-art deep learning architectures.
Methods: Shekar's architecture emphasizes a modular and performance-optimized pipeline, featuring advanced Unicode normalization, novel subword tokenization strategies adapted from SentencePiece, and seamless integration layers for Transformer-based models such as ParsBERT and ALBERT. Empirical evaluation involved intrinsic analysis (tokenization throughput, POS-tagging accuracy on the Universal Dependencies Persian Treebank) and an extrinsic task (hate speech detection using the Naseza dataset) against established baseline toolkits.
Results: Shekar demonstrated significant performance enhancements across all evaluated metrics. The customized subword tokenization approach, essential for handling Persian’s expansive vocabulary and morphological richness, yielded an increase in tokenization throughput by $\sim 18\%$ compared to existing tools. Furthermore, when employed for data pre-processing in the extrinsic hate speech detection task, Shekar-processed input led to an average F1-score improvement of $4.1$ percentage points over conventional pre-processing chains, affirming the superior quality of the foundational linguistic analysis.
Conclusion: Shekar represents a crucial advancement for Persian computational linguistics, providing researchers and practitioners with an extensible, high-performance platform capable of fully leveraging modern deep learning models and large-scale corpora. Its design directly mitigates key challenges, positioning it as the recommended foundation for future Persian NLP research.
Keywords
References
Similar Articles
- Dr. Mateo Alvarez, Integrative Perspectives On Identity, Authentication, And Privacy: From RFID Security Protocols To Facial Biometric Representations , International Journal of Advanced Artificial Intelligence Research: Vol. 3 No. 01 (2026): Volume 03 Issue 01
You may also start an advanced similarity search for this article.