
Benchmarking Open-Source Large Language Models for Sentiment and Emotion Classification in Indonesian Tweets

Nasution, Arbi Haza and Onan, Aytug and Murakami, Yohei and Monika, Winda (2025) Benchmarking Open-Source Large Language Models for Sentiment and Emotion Classification in Indonesian Tweets. IEEE Access, 13 (-). pp. 94009-94025. ISSN 2169-3536

Text
J6_Benchmarking_Open-Source_Large_Language_Models_for_Sentiment_and_Emotion_Classification_in_Indonesian_Tweets.pdf
Download (2MB)

Abstract

We benchmark 22 open-source large language models (LLMs) against ChatGPT-4 and human annotators on two NLP tasks—sentiment analysis and emotion classification—for Indonesian tweets. This study contributes to NLP in a relatively low-resource language (Bahasa Indonesia) by evaluating zero-shot classification performance on a labeled tweet corpus. The dataset includes sentiment labels (Positive, Negative, Neutral) and emotion labels (Love, Happiness, Sadness, Anger, Fear). We compare model predictions to human annotations and report precision, recall, and F1-score, along with an inference time analysis. ChatGPT-4 achieves the highest macro F1-score (0.84) on both tasks, slightly outperforming human annotators. The best-performing open-source models—such as LLaMA3.1_70B and Gemma2_27B—achieve over 90% of ChatGPT-4’s performance, while smaller models lag behind. Notably, some mid-sized models (e.g., Phi-4 at 14B parameters) perform comparably to much larger models on select categories. However, certain classes—particularly Neutral sentiment and Fear emotion—remain challenging, with lower agreement even among human annotators. Inference time varies significantly: optimized models complete predictions in under an hour, while some large models require several days. Our findings show that state-of-the-art open models can approach closed-source LLMs like ChatGPT-4 on Indonesian classification tasks, though efficiency and consistency in edge cases remain open challenges. Future work should explore fine-tuning multilingual LLMs on Indonesian data and practical deployment strategies in real-world applications.
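The evaluation described above reduces to comparing each model's zero-shot labels against the human annotations and computing macro-averaged precision, recall, and F1 over the sentiment and emotion label sets. The sketch below illustrates that computation with scikit-learn; it is not the authors' code, and the CSV path, column names, and model keys are illustrative assumptions (only the label sets are taken from the abstract).

```python
# Minimal sketch (not the authors' code) of reproducing the reported metrics
# from a table of gold labels and per-model predictions.
# File and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

SENTIMENT_LABELS = ["Positive", "Negative", "Neutral"]
EMOTION_LABELS = ["Love", "Happiness", "Sadness", "Anger", "Fear"]

def macro_scores(gold, pred, labels):
    """Macro-averaged precision, recall, and F1 over the given label set."""
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, labels=labels, average="macro", zero_division=0
    )
    return {"precision": p, "recall": r, "f1": f1}

# Hypothetical file: one row per tweet, gold labels plus one column per model.
df = pd.read_csv("indonesian_tweets_predictions.csv")

for model in ["chatgpt4", "llama3.1_70b", "gemma2_27b", "phi4_14b"]:
    sent = macro_scores(df["gold_sentiment"], df[f"{model}_sentiment"], SENTIMENT_LABELS)
    emo = macro_scores(df["gold_emotion"], df[f"{model}_emotion"], EMOTION_LABELS)
    print(f"{model}: sentiment F1={sent['f1']:.2f}, emotion F1={emo['f1']:.2f}")
```

Macro averaging weights each class equally, which is why hard minority classes such as Neutral sentiment and Fear emotion pull the overall scores down even when common classes are predicted well.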

Item Type: Article
Uncontrolled Keywords: Annotation quality, emotion classification, sentiment analysis, Indonesian language processing, language models, low-resource languages, natural language processing.
Subjects: T Technology > T Technology (General)
Divisions: Teknik Informatika
Depositing User: Monika Winda Monika
Date Deposited: 26 Jun 2025 01:54
Last Modified: 26 Jun 2025 01:54
URI: https://repository.uir.ac.id/id/eprint/24931
