Arabic Nested Noun Compound Extraction Based on Linguistic Features and Statistical Measures

Authors

  • Nazlia Omar Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia
  • Qasem Al-Tashi Computer and Information Sciences, Universiti Teknologi PETRONAS

DOI:

https://doi.org/10.17576/gema-2018-1802-07

Keywords:

Arabic multi-word expressions, noun compound, nested noun compound, association measures, POS tagging

Abstract

The extraction of Arabic nested noun compounds is important for several research areas, such as sentiment analysis, text summarization, word categorization, grammar checking, and machine translation. Many studies have addressed the extraction of Arabic noun compounds using linguistic approaches, statistical methods, or a hybrid of both. Most existing approaches concentrate on bigram or trigram noun compounds. Extracting 4-gram or 5-gram nested noun compounds, however, remains challenging because of morphological, orthographic, syntactic, and semantic variation. Several features, such as unit-hood, contextual information, and term-hood, strongly affect how well noun compounds are extracted. Hence, there is a need to improve the effectiveness of Arabic nested noun compound extraction. This paper therefore proposes a hybrid of a linguistic approach and a statistical method to enhance the extraction of Arabic nested noun compounds. The pre-processing phases comprise transformation, tokenization, and normalisation. The linguistic approach consists of part-of-speech tagging and named-entity patterns, whereas the statistical methods consist of the NC-value, the NTC-value, the NLC-value, and a combination of these association measures. The results show that the combined association measures outperform the NLC-value, NTC-value, and NC-value in nested noun compound extraction, achieving 90%, 88%, 87%, and 81% for bigrams, trigrams, 4-grams, and 5-grams, respectively.
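
To make the idea of scoring nested candidates concrete, the following is a minimal sketch of the standard C-value measure, which the NC-value extends with context information. It is not the paper's implementation: the function names `is_nested` and `c_value`, the transliterated example tokens, and the frequencies are all illustrative assumptions, and the NTC-value and NLC-value variants are not shown.

```python
import math


def is_nested(short, long):
    """Return True if `short` occurs as a contiguous sub-sequence of `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Compute C-value scores for candidate n-grams.

    `freq` maps each candidate (a tuple of tokens) to its corpus frequency.
    A candidate that appears inside longer candidates is penalised by the
    mean frequency of those longer candidates.
    """
    scores = {}
    for cand, f in freq.items():
        # Frequencies of longer candidates that contain this candidate.
        containers = [
            f_b for b, f_b in freq.items()
            if len(b) > len(cand) and is_nested(cand, b)
        ]
        if containers:
            adjusted = f - sum(containers) / len(containers)
        else:
            adjusted = f
        # log2 of the candidate length favours longer compounds.
        scores[cand] = math.log2(len(cand)) * adjusted
    return scores


# Hypothetical candidates (transliterated) with made-up frequencies.
candidates = {
    ("majlis", "al-amn"): 5,
    ("majlis", "al-amn", "al-dawli"): 2,
}
scores = c_value(candidates)
for cand, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(cand), round(score, 3))
```

In this sketch the bigram is discounted because it also occurs inside the trigram, which is the behaviour that lets a measure of this kind separate a nested compound from the longer compound that contains it.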

Author Biographies

Nazlia Omar, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia

Nazlia Omar is currently an Associate Professor at the Centre for AI Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM). She holds a PhD from the University of Ulster, UK. Her main research interests lie in Natural Language Processing and Computational Linguistics.

Qasem Al-Tashi, Computer and Information Sciences, Universiti Teknologi PETRONAS

Qasem Al-Tashi is currently pursuing a PhD in Information Technology at Universiti Teknologi PETRONAS. He holds a Master's degree from Universiti Kebangsaan Malaysia (UKM) and a bachelor's degree in computer science from Universiti Teknologi Malaysia (UTM). His main research interests are Artificial Intelligence, Swarm Intelligence, Feature Selection, Natural Language Processing, and Data Mining.

Published

2018-05-30

Section

Articles