A Comprehensive Evaluation of Shekar: An Open-Source Python Framework for State-of-the-Art Persian Natural Language Processing and Computational Linguistics
Keywords:
Persian Natural Language Processing, Computational Linguistics, Python Toolkit, Transformer Models
Abstract
Purpose: This study introduces and comprehensively evaluates Shekar, an open-source Python toolkit engineered to address the persistent challenges in processing the morphologically rich and low-resource Persian language. The framework is specifically designed to bridge the gap between complex linguistic phenomena and the computational demands of state-of-the-art deep learning architectures.
Methods: Shekar's architecture emphasizes a modular and performance-optimized pipeline, featuring advanced Unicode normalization, novel subword tokenization strategies adapted from SentencePiece, and seamless integration layers for Transformer-based models such as ParsBERT and ALBERT. Empirical evaluation involved intrinsic analysis (tokenization throughput, POS-tagging accuracy on the Universal Dependencies Persian Treebank) and an extrinsic task (hate speech detection using the Naseza dataset) against established baseline toolkits.
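The paper does not reproduce Shekar's API in this abstract, so the following Python sketch is only a rough, hypothetical illustration of the SentencePiece-style subword tokenization the Methods describe: it trains a unigram model on a Persian text file and segments a normalized sentence. The file names, vocabulary size, and other settings are placeholders, not Shekar's actual configuration.

import sentencepiece as spm

# Train a unigram subword model on a plain-text Persian corpus.
# "persian_corpus.txt", the vocabulary size, and the coverage value are
# illustrative placeholders, not Shekar's actual training settings.
spm.SentencePieceTrainer.train(
    input="persian_corpus.txt",
    model_prefix="fa_subword",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,  # retain rare Perso-Arabic characters
)

# Load the trained model and segment a Persian sentence into the kind of
# subword pieces a Transformer tokenizer would consume.
sp = spm.SentencePieceProcessor(model_file="fa_subword.model")
pieces = sp.encode("زبان فارسی از نظر صرفی بسیار غنی است", out_type=str)
print(pieces)

In a pipeline of the kind described here, such a subword vocabulary would be built once from a large corpus and then reused by downstream Transformer models such as ParsBERT or ALBERT.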
Results: Shekar improved performance on every evaluated metric. Its customized subword tokenization, designed for Persian’s expansive vocabulary and morphological richness, increased tokenization throughput by approximately 18% over existing tools. When used for pre-processing in the extrinsic hate speech detection task, Shekar-processed input raised the average F1-score by 4.1 percentage points relative to conventional pre-processing chains, confirming the higher quality of the underlying linguistic analysis.
Conclusion: Shekar represents a substantial advance for Persian computational linguistics, giving researchers and practitioners an extensible, high-performance platform that can fully leverage modern deep learning models and large-scale corpora. Its design directly mitigates the morphological complexity and resource scarcity that have long hindered Persian NLP, positioning it as a strong foundation for future research.
References
Amirivojdan, A. (2025). Naseza: A large-scale dataset for Persian hate speech and offensive language detection (Version v1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17355123
Eslami, M., Atashgah, M. S., Alizadeh, L., & Zandi, T. (2004). Persian generative lexicon. The First Workshop on Persian Language and Computer. Tehran, Iran.
Farahani, M., Gharachorloo, M., Farahani, M., & Manthouri, M. (2021). ParsBERT: Transformer-based model for Persian language understanding. Neural Processing Letters, 53(6), 3831–3847. https://doi.org/10.1007/s11063-021-10528-4
Durgam, S. (2025). CICD automation for financial data validation and deployment pipelines. Journal of Information Systems Engineering and Management, 10(45s), 645–664. https://doi.org/10.52783/jisem.v10i45s.8900
Jafari, S., Farsi, F., Ebrahimi, N., Sajadi, M. B., & Eetemadi, S. (2025). DadmaTools V2: An adapter-based natural language processing toolkit for the Persian language. Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 37–43.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv Preprint arXiv:1808.06226. https://doi.org/10.48550/arXiv.1808.06226
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv Preprint arXiv:1909.11942. https://doi.org/10.48550/arXiv.1909.11942
Mohtaj, S., Roshanfekr, B., Zafarian, A., & Asghari, H. (2018). Parsivar: A language processing toolkit for Persian. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Rasooli, M. S., Safari, P., Moloodi, A., & Nourian, A. (2020). The Persian dependency treebank made universal. arXiv Preprint arXiv:2009.10205. https://doi.org/10.48550/arXiv.2009.10205
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1–20. https://doi.org/10.1002/9780470689646.ch1
Sabouri, S., Rahmati, E., Gooran, S., & Sameti, H. (2022). Naab: A ready-to-use plug-and-play corpus for Farsi. arXiv Preprint arXiv:2208.13486. https://doi.org/10.22034/jaiai.2024.480062.1016
Chandra, R. (2025). Security and privacy testing automation for LLM-enhanced applications in mobile devices. International Journal of Networks and Security, 5(2), 30–41. https://doi.org/10.55640/ijns-05-02-02
Samantapudi, R. K. R. (2025). Advantages & impact of fine tuning large language models for ecommerce search. Journal of Information Systems Engineering and Management, 10(45s), 600–622. https://doi.org/10.52783/jisem.v10i45s.8898