CODE-SWITCHED RELATION EXTRACTION: A NOVEL DATASET AND TRAINING METHODOLOGY
DOI:
https://doi.org/10.55640/ijmcsit-v02i02-01
Keywords:
Code-switched text, relation extraction, multilingual NLP, dataset creation
Abstract
Relation Extraction (RE) is a fundamental task in Natural Language Processing (NLP) crucial for constructing knowledge graphs and enhancing information retrieval. While significant progress has been made in monolingual and cross-lingual RE, the unique challenges posed by code-switched (mix-lingual) text remain largely underexplored due to a scarcity of dedicated datasets and tailored methodologies. This paper introduces a novel, large-scale dataset specifically designed for code-switched relation extraction. Furthermore, we propose an effective training methodology tailored to capture the complexities of inter- and intra-sentential code-switching phenomena. Our comprehensive experiments demonstrate that this new dataset and the proposed approach significantly advance the state-of-the-art in extracting relations from mix-lingual content, providing a valuable resource and benchmark for future research in this challenging domain.
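To make the task concrete, the sketch below shows one hypothetical way a code-switched RE instance could be represented and prepared for a classifier by wrapping the head and tail entities in marker tokens, a common preprocessing step in relation extraction. The field names, the example sentence, and the relation label are illustrative assumptions and are not drawn from the dataset described in this paper.

```python
# Minimal sketch (assumed format): a code-switched RE instance and
# entity-marker preprocessing often used before feeding a relation classifier.
# Field names, the example sentence, and the relation label are illustrative.

from typing import Dict, List

example = {
    # Intra-sentential English-Chinese code-switching (illustrative).
    "tokens": ["Yao", "Ming", "出生于", "Shanghai", "。"],
    "head": {"span": [0, 2], "type": "PERSON"},    # token indices [start, end)
    "tail": {"span": [3, 4], "type": "LOCATION"},
    "relation": "place_of_birth",
}

def insert_entity_markers(tokens: List[str], head: Dict, tail: Dict) -> List[str]:
    """Wrap the head and tail mentions in marker tokens, e.g. [E1] ... [/E1]."""
    spans = sorted(
        [(head["span"], "E1"), (tail["span"], "E2")],
        key=lambda item: item[0][0],
        reverse=True,  # insert from the right so earlier indices stay valid
    )
    marked = list(tokens)
    for (start, end), tag in spans:
        marked[end:end] = [f"[/{tag}]"]
        marked[start:start] = [f"[{tag}]"]
    return marked

print(" ".join(insert_entity_markers(example["tokens"],
                                     example["head"], example["tail"])))
# [E1] Yao Ming [/E1] 出生于 [E2] Shanghai [/E2] 。
```

The marked sequence would then be passed to a multilingual encoder or a prompted large language model, which predicts the relation label for the entity pair.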
License
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if they are not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.