Incremental Stemming Based on Morphological Rules for Indonesian Text
Abstract
Stemming, the extraction of a word's base form, is a complex process in Indonesian because several of the language's 13 known prefixes, 3 infixes, and 19 suffixes can be attached to a single word at once. Stemming is also non-deterministic: it does not always yield exactly one base word, because some Indonesian words can be read either as a base word or as an affixed word, for example "beruang" (the base word meaning "bear", or the prefix "ber-" attached to "uang", "money"). Previous studies rely on impossible prefix-suffix combinations and apply heuristics to select the base word. This study proposes an incremental stemming method in which, following a specified order, suffix and prefix particles are stripped from a word in alternation until a base word is obtained. If several candidate base words are found, one of them is selected. The method was tested on 6,464 documents from the Indonesian translation of the Quran, using a dictionary of 5,000 words randomly sampled from the Kamus Besar Bahasa Indonesia. Of the 3,432 unique words processed, the base word of 94.7% could be extracted directly, and only 5.3% required further processing because more than one candidate base word was found. Compared with manual selection of the base word, the method chose the correct base word in up to 79.12% of cases.
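The level-by-level stripping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix lists below are a small hypothetical subset of the 13 prefixes and 19 suffixes, the dictionary is a toy stand-in for the 5,000-word sample, infixes and spelling changes are ignored, and instead of following the paper's stripping order (which the abstract does not specify) the sketch simply explores every single-affix removal at each level.

```python
from typing import Set

# Hypothetical, heavily reduced affix lists (assumption: the paper's full
# inventory and its stripping order are not given in the abstract).
PREFIXES = ("ber", "me", "di", "ke", "se", "ter", "pe")
SUFFIXES = ("kan", "an", "i", "nya", "lah", "kah")

# Toy dictionary standing in for the 5,000-word sample (assumption).
DICTIONARY = {"beruang", "uang", "makan", "ajar", "baca"}

MIN_STEM_LENGTH = 3  # assumption: discard fragments shorter than this


def candidate_stems(word: str, dictionary: Set[str] = DICTIONARY) -> Set[str]:
    """Return every dictionary word reachable by removing one affix per level."""
    found: Set[str] = set()
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        if current in dictionary:
            found.add(current)
        # One level: remove a single suffix or a single prefix.
        stripped = [current[: -len(s)] for s in SUFFIXES if current.endswith(s)]
        stripped += [current[len(p):] for p in PREFIXES if current.startswith(p)]
        for shorter in stripped:
            if len(shorter) >= MIN_STEM_LENGTH and shorter not in seen:
                seen.add(shorter)
                frontier.append(shorter)
    return found


def needs_disambiguation(word: str) -> bool:
    """True when more than one candidate base word is found."""
    return len(candidate_stems(word)) > 1


if __name__ == "__main__":
    for w in ("dimakan", "dibacakan", "beruang"):
        print(w, sorted(candidate_stems(w)), needs_disambiguation(w))
```

In this sketch "dimakan" and "dibacakan" each resolve to a single candidate, while "beruang" yields two candidates and would be passed on to a selection step; how the paper chooses among such candidates is not described in the abstract, so no selection rule is shown here.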
Article Details
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.