International Journal of Advanced Artificial Intelligence Research


A Comprehensive Evaluation Of Shekar: An Open-Source Python Framework For State-Of-The-Art Persian Natural Language Processing And Computational Linguistics

Authors

  • Bagus Candra, Department of Computer Science, Faculty of Engineering, Universitas Indonesia, Depok, Indonesia
  • Minh Thu Nguyen, School of Electrical Engineering and Information Technology, Hanoi University of Science and Technology, Hanoi, Vietnam

Keywords:

Persian Natural Language Processing, Computational Linguistics, Python Toolkit, Transformer Models

Abstract

Purpose: This study introduces and evaluates Shekar, an open-source Python toolkit built to address persistent challenges in processing Persian, a morphologically rich and low-resource language. The framework is designed to bridge the gap between complex linguistic phenomena and the computational demands of state-of-the-art deep learning architectures.

Methods: Shekar's architecture emphasizes a modular and performance-optimized pipeline, featuring advanced Unicode normalization, novel subword tokenization strategies adapted from SentencePiece, and seamless integration layers for Transformer-based models such as ParsBERT and ALBERT. Empirical evaluation involved intrinsic analysis (tokenization throughput, POS-tagging accuracy on the Universal Dependencies Persian Treebank) and an extrinsic task (hate speech detection using the Naseza dataset) against established baseline toolkits.
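The Unicode normalization step mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library and is not Shekar's actual API: the function name `normalize_persian` and the specific character mappings are assumptions chosen for illustration, based on substitutions commonly applied to Persian text (Arabic Yeh and Kaf to their Persian forms, Arabic-Indic digits to Persian digits).

```python
import unicodedata

# Hypothetical mapping table (an assumption, not taken from Shekar):
# Arabic code points that are commonly replaced by Persian counterparts.
_CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9).
_CHAR_MAP.update({chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)})

_TRANSLATION = str.maketrans(_CHAR_MAP)


def normalize_persian(text: str) -> str:
    """Return a normalized copy of `text` (illustrative sketch only)."""
    # NFKC folds compatibility characters such as Arabic presentation forms
    # into their base letters.
    text = unicodedata.normalize("NFKC", text)
    # Replace Arabic variants with Persian code points.
    text = text.translate(_TRANSLATION)
    # Collapse runs of whitespace. The zero-width non-joiner (U+200C) is not
    # whitespace in Python, so it is left in place.
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "\u0643\u062A\u0627\u0628  \u0639\u0644\u064A"
    print(normalize_persian(raw))
```

Keeping the zero-width non-joiner intact matters because it separates morphemes inside many Persian words; a real toolkit would likely apply further rules (diacritics, punctuation, spacing around affixes) that this sketch omits.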

Results: Shekar improved on the baselines across all evaluated metrics. The customized subword tokenization approach, essential for handling Persian’s expansive vocabulary and morphological richness, increased tokenization throughput by $\sim 18\%$ relative to existing tools. In the extrinsic hate speech detection task, pre-processing the input with Shekar raised the average F1-score by $4.1$ percentage points over conventional pre-processing chains, which suggests that the quality of the underlying linguistic analysis carries over to downstream performance.
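The two kinds of figure reported above are computed differently: the throughput gain is a relative change, while the F1 gain is an absolute difference in percentage points. The sketch below shows both calculations. The function names and every input number are hypothetical placeholders chosen for illustration; they are not measurements from the study.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def percentage_point_diff(baseline_f1: float, improved_f1: float) -> float:
    """Absolute difference between two scores in [0, 1], in percentage points."""
    return (improved_f1 - baseline_f1) * 100.0


if __name__ == "__main__":
    # Hypothetical throughputs in tokens per second (not from the paper).
    print(relative_gain_percent(baseline=1000.0, improved=1180.0))

    # Hypothetical precision/recall pairs (not from the paper).
    base = f1_score(precision=0.80, recall=0.80)
    new = f1_score(precision=0.85, recall=0.85)
    print(percentage_point_diff(base, new))
```

A gain stated in percentage points does not depend on the baseline value, whereas a relative percentage does, so the two figures in the abstract are not directly comparable.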

Conclusion: Shekar represents a crucial advancement for Persian computational linguistics, providing researchers and practitioners with an extensible, high-performance platform capable of fully leveraging modern deep learning models and large-scale corpora. Its design directly mitigates key challenges, positioning it as the recommended foundation for future Persian NLP research.

References

Amirivojdan, A. (2025). Naseza: A large-scale dataset for Persian hate speech and offensive language detection (Version v1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17355123

Eslami, M., Atashgah, M. S., Alizadeh, L., & Zandi, T. (2004). Persian generative lexicon. The First Workshop on Persian Language and Computer. Tehran, Iran.

Farahani, M., Gharachorloo, M., Farahani, M., & Manthouri, M. (2021). ParsBERT: Transformer-based model for Persian language understanding. Neural Processing Letters, 53(6), 3831–3847. https://doi.org/10.1007/s11063-021-10528-4

Durgam, S. (2025). CICD automation for financial data validation and deployment pipelines. Journal of Information Systems Engineering and Management, 10(45s), 645–664. https://doi.org/10.52783/jisem.v10i45s.8900

Jafari, S., Farsi, F., Ebrahimi, N., Sajadi, M. B., & Eetemadi, S. (2025). DadmaTools V2: An adapter-based natural language processing toolkit for the Persian language. Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, 37–43.

Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv Preprint arXiv:1808.06226. https://doi.org/10.48550/arXiv.1808.06226

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv Preprint arXiv:1909.11942. https://doi.org/10.48550/arXiv.1909.11942

Mohtaj, S., Roshanfekr, B., Zafarian, A., & Asghari, H. (2018). Parsivar: A language processing toolkit for Persian. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Rasooli, M. S., Safari, P., Moloodi, A., & Nourian, A. (2020). The Persian dependency treebank made universal. arXiv Preprint arXiv:2009.10205. https://doi.org/10.48550/arXiv.2009.10205

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1–20. https://doi.org/10.1002/9780470689646.ch1

Sabouri, S., Rahmati, E., Gooran, S., & Sameti, H. (2022). Naab: A ready-to-use plug-and-play corpus for Farsi. arXiv Preprint arXiv:2208.13486. https://doi.org/10.22034/jaiai.2024.480062.1016

Chandra, R. (2025). Security and privacy testing automation for LLM-enhanced applications in mobile devices. International Journal of Networks and Security, 5(2), 30–41. https://doi.org/10.55640/ijns-05-02-02

Samantapudi, R. K. R. (2025). Advantages & impact of fine tuning large language models for ecommerce search. Journal of Information Systems Engineering and Management, 10(45s), 600–622. https://doi.org/10.52783/jisem.v10i45s.8898

Published

2025-10-29

How to Cite

A Comprehensive Evaluation Of Shekar: An Open-Source Python Framework For State-Of-The-Art Persian Natural Language Processing And Computational Linguistics. (2025). International Journal of Advanced Artificial Intelligence Research, 2(10), 15-24. https://aimjournals.com/index.php/ijaair/article/view/309
