DEVELOPING AND VALIDATING A COMPREHENSIVE DISCOURSE ANNOTATION GUIDELINE FOR LOW-RESOURCE LANGUAGES

Prof. Kai O. Chen

Open Access

Prof. Kai O. Chen ¹

¹ School of Information Science and Technology, Tsinghua University, Beijing, China

Abstract

Background: The development of robust Natural Language Processing (NLP) systems for low-resource languages (LRLs) is severely hampered by a scarcity of annotated linguistic data, particularly for high-level structures like discourse. Existing annotation guidelines, often derived from English-centric frameworks like Rhetorical Structure Theory (RST), frequently prove ill-suited and yield low inter-annotator agreement (IAA) due to the non-isomorphic nature of discourse relations across disparate languages.

Methods: This study addresses the resource bottleneck by introducing a novel, simplified, and linguistically adapted annotation guideline. We detail the iterative development process, carried out with native-speaker linguists, which included systematic schema pruning based on typological analysis and the principle of Functional Load. We propose a corpus creation methodology that uses an Active Learning (AL) bootstrap strategy to prioritize the $30\%$ most informative samples for human review. Guideline validation followed a two-tiered approach: quantitative IAA calculation ($\kappa$) and a qualitative analysis of annotator disagreement patterns to guide refinement.
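The AL prioritization step can be illustrated with a minimal sketch. The abstract does not state which informativeness criterion was used, so the entropy-based uncertainty sampling below is an assumption, as are the function names `entropy` and `select_for_review` and the toy probabilities in `pool`.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_review(predictions, fraction=0.30):
    """Return indices of the most uncertain samples, highest entropy first.

    predictions: per-sample label distributions from a seed model
                 (hypothetical; the source does not describe the model).
    fraction:    share of the pool routed to human annotators.
    """
    budget = max(1, round(fraction * len(predictions)))
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:budget]


# Toy pool: ten segments with made-up probabilities over three relation labels.
pool = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
    [0.50, 0.50, 0.00],
    [0.40, 0.30, 0.30],
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [1.00, 0.00, 0.00],
    [0.60, 0.20, 0.20],
    [0.95, 0.03, 0.02],
]

selected = select_for_review(pool)
print(selected)
```

With ten segments and a 30% budget, three indices are returned, ordered from most to least uncertain. Other criteria (margin sampling, query-by-committee) would slot into the same `key` function.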

Results: Application of the guideline to a sample LRL corpus (LRL-A) demonstrated a reliable quantitative IAA ($\kappa > 0.75$), which is competitive with published IAA figures for high-resource languages. The qualitative analysis confirmed that linguistic ambiguities specific to the LRL's implicit and functional markers were systematically addressed. Furthermore, the AL strategy provided a clear $30\%$ reduction in required annotation effort, optimizing limited resources.
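For readers unfamiliar with the agreement statistic, the following sketch computes $\kappa$ for two annotators. It assumes Cohen's two-rater formulation, $\kappa = (p_o - p_e) / (1 - p_e)$; the abstract does not say whether Cohen's or a multi-rater variant was used. The relation labels and both annotation lists are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten discourse segments.
annotator_a = ["cause", "cause", "contrast", "elab", "elab",
               "elab", "cause", "contrast", "elab", "cause"]
annotator_b = ["cause", "cause", "contrast", "elab", "elab",
               "elab", "cause", "elab", "elab", "cause"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # about 0.84
```

Here the annotators disagree on one of ten segments, giving $p_o = 0.9$ and $p_e = 0.38$, so $\kappa \approx 0.84$, which would clear the 0.75 threshold reported above.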

Conclusion: The validated guideline provides a resource-efficient and adaptable framework for creating foundational discourse corpora for LRLs. The findings strongly suggest that simpler, function-based annotation schemas and AL techniques are essential for overcoming data scarcity and enhancing the transferability of discourse resources to underrepresented languages.

Keywords

Discourse Annotation, Low-Resource Languages (LRLs), Rhetorical Structure, Active Learning

International Journal of Intelligent Data and Machine Learning
