A Comprehensive Evaluation of Shekar: An Open-Source Python Framework for State-of-the-Art Persian Natural Language Processing and Computational Linguistics
Keywords:
Persian Natural Language Processing, Computational Linguistics, Python Toolkit, Transformer Models
Abstract
Purpose: This study introduces and comprehensively evaluates Shekar, an open-source Python toolkit engineered to address the persistent challenges in processing the morphologically rich and low-resource Persian language. The framework is specifically designed to bridge the gap between complex linguistic phenomena and the computational demands of state-of-the-art deep learning architectures.
Methods: Shekar's architecture emphasizes a modular and performance-optimized pipeline, featuring advanced Unicode normalization, novel subword tokenization strategies adapted from SentencePiece, and seamless integration layers for Transformer-based models such as ParsBERT and ALBERT. Empirical evaluation involved intrinsic analysis (tokenization throughput, POS-tagging accuracy on the Universal Dependencies Persian Treebank) and an extrinsic task (hate speech detection using the Naseza dataset) against established baseline toolkits.
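The paper does not reproduce Shekar's API in this abstract, so the following Python sketch is only a rough, hypothetical illustration of the SentencePiece-style subword tokenization the Methods describe: it trains a unigram model on a Persian text file and segments a normalized sentence. The file names, vocabulary size, and other settings are placeholders, not Shekar's actual configuration.

import sentencepiece as spm

# Train a unigram subword model on a plain-text Persian corpus.
# "persian_corpus.txt", the vocabulary size, and the coverage value are
# illustrative placeholders, not Shekar's actual training settings.
spm.SentencePieceTrainer.train(
    input="persian_corpus.txt",
    model_prefix="fa_subword",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,  # retain rare Perso-Arabic characters
)

# Load the trained model and segment a Persian sentence into the kind of
# subword pieces a Transformer tokenizer would consume.
sp = spm.SentencePieceProcessor(model_file="fa_subword.model")
pieces = sp.encode("زبان فارسی از نظر صرفی بسیار غنی است", out_type=str)
print(pieces)

In a pipeline of the kind described here, such a subword vocabulary would be built once from a large corpus and then reused by downstream Transformer models such as ParsBERT or ALBERT.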
Results: Shekar improved performance on every evaluated metric. Its customized subword tokenization, designed for Persian’s expansive vocabulary and morphological richness, increased tokenization throughput by approximately 18% over existing tools. When used for pre-processing in the extrinsic hate speech detection task, Shekar-processed input raised the average F1-score by 4.1 percentage points relative to conventional pre-processing chains, confirming the higher quality of the underlying linguistic analysis.
Conclusion: Shekar represents a substantial advance for Persian computational linguistics, giving researchers and practitioners an extensible, high-performance platform that can fully leverage modern deep learning models and large-scale corpora. Its design directly mitigates the morphological complexity and resource scarcity that have long hindered Persian NLP, positioning it as a strong foundation for future research.
References
Amirivojdan, A. (2025). Naseza: A large-scale dataset for Persian hate speech and offensive language detection (Version v1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17355123
Eslami, M., Atashgah, M. S., Alizadeh, L., & Zandi, T. (2004). Persian generative lexicon. The First Workshop on Persian Language and Computer. Tehran, Iran.
Farahani, M., Gharachorloo, M., Farahani, M., & Manthouri, M. (2021). ParsBERT: Transformer-based model for Persian language understanding. Neural Processing Letters, 53(6), 3831–3847. https://doi.org/10.1007/s11063-021-10528-4
Durgam, S. (2025). CICD automation for financial data validation and deployment pipelines. Journal of Information Systems Engineering and Management, 10(45s), 645–664. https://doi.org/10.52783/jisem.v10i45s.8900
Jafari, S., Farsi, F., Ebrahimi, N., Sajadi, M. B., & Eetemadi, S. (2025). DadmaTools V2: An adapter-based natural language processing toolkit for the Persian language. Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 37–43.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv Preprint arXiv:1808.06226. https://doi.org/10.48550/arXiv.1808.06226
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv Preprint arXiv:1909.11942. https://doi.org/10.48550/arXiv.1909.11942
Mohtaj, S., Roshanfekr, B., Zafarian, A., & Asghari, H. (2018). Parsivar: A language processing toolkit for Persian. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Rasooli, M. S., Safari, P., Moloodi, A., & Nourian, A. (2020). The Persian dependency treebank made universal. arXiv Preprint arXiv:2009.10205. https://doi.org/10.48550/arXiv.2009.10205
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1–20. https://doi.org/10.1002/9780470689646.ch1
Sabouri, S., Rahmati, E., Gooran, S., & Sameti, H. (2022). Naab: A ready-to-use plug-and-play corpus for Farsi. arXiv Preprint arXiv:2208.13486. https://doi.org/10.22034/jaiai.2024.480062.1016
Chandra, R. (2025). Security and privacy testing automation for LLM-enhanced applications in mobile devices. International Journal of Networks and Security, 5(2), 30–41. https://doi.org/10.55640/ijns-05-02-02
Samantapudi, R. K. R. (2025). Advantages & impact of fine tuning large language models for ecommerce search. Journal of Information Systems Engineering and Management, 10(45s), 600–622. https://doi.org/10.52783/jisem.v10i45s.8898