Neural Network-Based Bilingual Lexicon Induction for Indonesian Ethnic Languages

Resiandi, Kartika and Murakami, Yohei and Nasution, Arbi Haza (2023) Neural Network-Based Bilingual Lexicon Induction for Indonesian Ethnic Languages. Applied Sciences, 13 (15). pp. 1-15. ISSN 2076-3417

J5_Neural Network.pdf - Published Version

Official URL: https://www.mdpi.com/2076-3417/13/15/8666

Abstract

Indonesia has a variety of ethnic languages, most of which belong to the same language family: the Austronesian languages. Because of this shared ancestry, words in Indonesian ethnic languages are often very similar. However, previous research suggests that these languages are endangered. To help preserve them, we propose creating bilingual dictionaries between ethnic languages using a neural network approach that extracts transformation rules, employing character-level embedding and Bi-LSTM in a sequence-to-sequence model. The model has an encoder and a decoder. The encoder reads the input sequence character by character and summarizes it as a context representation. The decoder produces the output sequence one character per timestep, with each output character conditioned on the previously generated character. The first experiment focuses on Indonesian and Minangkabau, with 10,277 word pairs. Model performance was evaluated with five-fold cross-validation. The character-level seq2seq method (Bi-LSTM as the encoder and LSTM as the decoder) achieved an average precision of 83.92%, outperforming SentencePiece byte pair encoding (vocabulary size of 33), which achieved an average precision of 79.56%. Furthermore, to evaluate how well the neural network model finds transformation patterns, a rule-based approach was used as the baseline. The neural network approach obtained 542 more correct translations than the baseline. We then applied the best setting (character-level embedding with Bi-LSTM as the encoder and LSTM as the decoder) to four other Indonesian ethnic languages: Malay, Palembang, Javanese, and Sundanese, whose input dictionaries are half the size of the Indonesian-Minangkabau dictionary. The average precision scores for these languages are 65.08%, 62.52%, 59.69%, and 58.46%, respectively. This shows that the neural network approach identifies transformation patterns from Indonesian to closely related languages (such as Malay and Palembang) better than to more distantly related languages (such as Javanese and Sundanese).
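
The evaluation protocol described in the abstract (character-level input, five-fold cross-validation, precision as the share of exactly matching translations) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `<s>`/`</s>` markers, and the copy-based placeholder model are assumptions made here, and the ten word pairs are common Indonesian-Minangkabau cognates chosen for illustration rather than entries taken from the paper's dictionary. A real run would replace the placeholder with the trained seq2seq model.

```python
import random

# Illustrative Indonesian -> Minangkabau cognate pairs (assumption: these are
# NOT drawn from the paper's 10,277-pair dictionary).
PAIRS = [
    ("mata", "mato"), ("tiga", "tigo"), ("lima", "limo"), ("apa", "apo"),
    ("kuda", "kudo"), ("dua", "duo"), ("kita", "kito"), ("nama", "namo"),
    ("bunga", "bungo"), ("siapa", "siapo"),
]


def char_tokens(word):
    """Split a word into characters, framed by hypothetical start/end markers."""
    return ["<s>"] + list(word) + ["</s>"]


def k_fold_splits(pairs, k=5, seed=0):
    """Yield (train, test) lists so that every pair is tested exactly once."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, folds[i]


def precision(model, test_pairs):
    """Fraction of test words whose prediction equals the reference exactly."""
    correct = sum(1 for src, tgt in test_pairs if model(src) == tgt)
    return correct / len(test_pairs)


def train_copy_baseline(train_pairs):
    """Placeholder 'model': look up seen words, otherwise copy the input."""
    table = dict(train_pairs)
    return lambda src: table.get(src, src)


def cross_validate(pairs, train_fn, k=5):
    """Return the mean precision over k folds and the per-fold scores."""
    scores = [precision(train_fn(train), test)
              for train, test in k_fold_splits(pairs, k)]
    return sum(scores) / len(scores), scores


if __name__ == "__main__":
    print(char_tokens("mata"))
    mean_precision, fold_scores = cross_validate(PAIRS, train_copy_baseline)
    print(mean_precision, fold_scores)
```

Because `train_fn` is passed in as a parameter, the same loop could score a character-level model and a subword model on identical folds, which is the kind of comparison the abstract reports.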

Item Type: Article
Uncontrolled Keywords: natural language processing; low-resource language; Indonesian ethnic languages; bilingual lexicon induction; sequence-to-sequence model
Subjects: T Technology > T Technology (General)
Divisions: Teknik Informatika
Depositing User: Monika Winda Monika
Date Deposited: 19 May 2025 08:19
Last Modified: 19 May 2025 08:19
URI: http://repository.uir.ac.id/id/eprint/24661
