Syntax Literate: Jurnal Ilmiah Indonesia
p-ISSN: 2541-0849 e-ISSN: 2548-1398
Vol. 7, No. 6, Juni 2022
DETECTION OF NEGATIVE CONTENT (HOAX) ON MICROBLOG DATA THAT CONTAINS COVID-19 INFORMATION
Putra Tresna Linge, Alfan Farizki Wicaksono
Magister Teknologi Informasi, Universitas Indonesia, Jakarta, Indonesia
Email: [email protected], [email protected]
Abstract
Over the past few years, the volume of information being disseminated has increased, especially since the advent of social media. Some of the information in circulation is negative content, or hoaxes, which can have harmful effects such as social division caused by incorrect information. According to the 2018 Kominfo performance report, Twitter is the social media platform that contributes most to the spread of hoaxes. To reduce the impact of hoaxes, a method is needed to detect them on Twitter so that preventive action, such as taking down hoax tweets, can be taken. The purpose of this research is to develop a model that detects negative content (hoaxes) automatically and to examine the correlation between hoax content and sentiment orientation. The result is a machine learning-based model using a decision tree algorithm with an accuracy of 97.2%, a precision of 85.4%, a recall of 81.4%, and an F1-score of 83%. In addition, the analysis shows that tweets identified as hoaxes by the model are dominated by positive sentiment orientation, which accounts for 52.64% of the data identified as hoaxes.
Keywords: Hoax Detection, Twitter, Sentiment Orientation Classification, Machine Learning, Text Analysis
Introduction
Currently, information is exchanged rapidly and in massive volumes. Negative content (hoaxes) within this circulating information can have various harmful effects. One example is the division seen during general elections, when each faction creates hoaxes to bring down the other (Juditha, 2019; Sutantohadi & Rokhimatul Wakhidah, 2017). Hoaxes can also cause terror or fear, as happened with information on COVID-19 that led to "panic buying" behavior (Alamsyah, 2020; Somantri, 2020; Wardani, 2017). COVID-19 is caused by a new virus, and factual information about it is still not widely known. This has given rise to several hoaxes related to COVID-19. The hoaxes that appear most often concern whether COVID-19 exists at all and the COVID-19 vaccine.
Reports on hoax handling in recent years show an increase in the spread of health-related hoaxes on social media, especially on Twitter. In response, the government has monitored circulating information, clarified the facts, and taken action against hoax spreaders (Kemkominfo, 2021).
The purpose of this study is to develop a model that can automatically detect hoaxes in tweets on Twitter in order to reduce their spread.
The Cross-Industry Standard Process for Data Mining, or CRISP-DM, is a standard process model for data mining. CRISP-DM was conceived in 1996 for use in data mining, analytics, and data science projects (Sihombing, Jayadi, Chandra, & Liu, 2020). In CRISP-DM, the data mining process is divided into six stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment, as shown in Figure 1.
Figure 1. Illustration of stages in CRISP-DM
B. DATA PREPARATION
Data Acquisition, Labeling, and Preprocessing
Preprocessing steps:
• Case folding: converting all characters to lowercase.
• Filtering: removing URLs, hashtags, hyperlinks, and emoticons.
• Tokenizing: dividing text into parts based on punctuation marks, numbers, words, and other units.
• Translating (normalization): mapping words that contain repeated letters to a single form, so that "noo", "no", and "Noooooo" are all treated as the same word, namely "no".
• Stopword removal: removing words that carry no meaning on their own, such as "at", "to", "which", and others.
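The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expressions, the function name `preprocess`, and the tiny English `STOPWORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical, tiny stopword list used only for illustration.
STOPWORDS = {"at", "to", "which", "the", "is"}

def preprocess(tweet, stopwords=STOPWORDS):
    """Apply case folding, filtering, tokenizing, normalization, stopword removal."""
    text = tweet.lower()                                # case folding
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # filtering: URLs and hyperlinks
    text = re.sub(r"#\w+", " ", text)                   # filtering: hashtags
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # filtering: emoticons, punctuation
    tokens = text.split()                               # tokenizing on whitespace
    # Normalization: collapse runs of a repeated letter ("noooooo" -> "no").
    tokens = [re.sub(r"([a-z])\1+", r"\1", t) for t in tokens]
    return [t for t in tokens if t not in stopwords]    # stopword removal

print(preprocess("Noooooo! The virus is at https://t.co/abc #covid19 :)"))
```

Collapsing every repeated letter is a crude rule: it also shortens legitimate double letters (for example "vaccine" becomes "vacine"), so a real pipeline would likely need a dictionary check or a gentler rule.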
C. Synthetic Minority Over-sampling Technique (SMOTE)
The Synthetic Minority Over-sampling Technique, or SMOTE, is used to handle imbalanced data sets (Chawla, Bowyer, Hall, & Kegelmeyer, 2002; Pangastuti, 2018). A data set is said to be imbalanced if its classes are not equally represented (Chawla et al., 2002). SMOTE balances a data set by increasing the number of samples in the minority class until it matches the majority class. The additional minority samples are synthetic data generated from each sample's nearest neighbors (k-nearest neighbors).
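The idea of generating synthetic minority samples from nearest neighbors can be sketched in pure Python as below. This is a simplified illustration of the SMOTE principle, not the implementation used in the study; the function name `smote`, its parameters, and the toy data are assumptions.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # random position on the segment between x and neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class with two numeric features (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote(minority, n_new=3, k=2))
```

For the data in this study (1,586 non-hoax and 231 hoax tweets), balancing the classes would mean generating 1,586 − 231 = 1,355 synthetic hoax samples on the numeric feature vectors of the tweets.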
The Naïve Bayes classification method combines supervised learning (learning from labeled sample data in order to classify new data) with probabilistic classification (Parsian, 2015). The basic principle of Naïve Bayes is to apply Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions (Lesmana, 2013). In general, the formula for the Naïve Bayes algorithm is as follows (Walia, Rana, & Kansal, 2018):
$$P(H \mid A) = \frac{P(A \mid H)\,P(H)}{P(A)}$$
• P(H|A): probability of the hypothesis H given data set A (posterior probability)
• P(A|H): probability of data set A given the hypothesis H
• P(H): probability of the hypothesis (prior probability)
• P(A): probability of the observed sample data
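As a small worked example of Bayes' theorem with the quantities defined above, the sketch below computes a posterior for the hypothesis "this tweet is a hoax". The prior is taken from the labeled class counts reported later in the paper (231 hoax tweets out of 1,817); the two likelihood values are purely hypothetical and chosen only to show the arithmetic.

```python
def posterior(prior_h, p_a_given_h, p_a_given_not_h):
    """P(H|A) = P(A|H) * P(H) / P(A), with P(A) expanded over H and not-H."""
    evidence = p_a_given_h * prior_h + p_a_given_not_h * (1.0 - prior_h)
    return p_a_given_h * prior_h / evidence

prior_hoax = 231 / (231 + 1586)  # P(H): share of hoax tweets in the labeled sample
p_word_given_hoax = 0.30         # P(A|H): hypothetical value
p_word_given_non_hoax = 0.02     # P(A|not H): hypothetical value

print(posterior(prior_hoax, p_word_given_hoax, p_word_given_non_hoax))
```

A Naïve Bayes classifier repeats this calculation with the likelihood of a whole tweet, which the independence assumption lets it write as a product of per-word likelihoods.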
The Support Vector Machine, or SVM, was first described by Vapnik, Bernhard Boser, and Isabelle Guyon in 1992 (Ritonga & Purwaningsih, 2018). SVM is an algorithm that uses a nonlinear mapping to transform the original training data into a higher dimension. The model formed by SVM takes the form of a hyperplane, which is a function used to separate two different classes.
Figure 2 Illustration of support vector machine
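To make the notion of a separating hyperplane concrete, the sketch below classifies points by the sign of a linear function w·x + b. It shows only the prediction step for an already given hyperplane; the weights `w` and bias `b` are hypothetical, and neither the training procedure nor the nonlinear mapping is shown.

```python
def decision_function(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Assign class +1 on one side of the hyperplane and -1 on the other."""
    return 1 if decision_function(w, b, x) >= 0 else -1

w, b = [1.0, -1.0], 0.0            # hypothetical hyperplane: x1 - x2 = 0
print(predict(w, b, [2.0, 1.0]))   # point below the line x2 = x1
print(predict(w, b, [1.0, 2.0]))   # point above the line x2 = x1
```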
Decision Tree
A decision tree is a decision-making method based on a flow diagram shaped like a tree, consisting of a root node, internal nodes, and leaf nodes, as illustrated in Figure 3 (Sá et al., 2016). The input data enter at the root node, the internal nodes contain statements (tests) about the data, and each leaf node represents a solution or decision.
Figure 3 Illustration of decision tree components
Random forest was developed by Breiman in 2001 with the aim of improving the prediction process of the bagging method (Pangastuti, 2018). The random forest method is a collection of decision trees combined into one model, as shown in Figure 4 (Al-Ash, Putri, Mursanto, & Bustamam, 2019). The individual trees are expected to be only weakly correlated, so that the combined estimate has a smaller variance.
Figure 4 Illustration of random forest
K-Fold Cross Validation is a method for evaluating the performance of a model. The evaluation is carried out by dividing the data into k partitions of equal size, where k is the number of partitions (Hulu, 2020). For each partition, a model is built and its performance is tested. Figure 5 illustrates the k-fold cross validation process for a data set divided into 5 partitions. In the 1st iteration, partition 1 is the test data and the other partitions are the training data. The 2nd iteration proceeds in the same way, except that partition 2 is the test data and all partitions other than partition 2 are the training data. The test and training partitions are rotated in this way until the last partition has served as the test data.
Figure 5 Illustration of K-Fold Cross Validation
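The rotation of test and training partitions described above can be sketched as an index generator. This is an illustrative pure-Python version; the function name `k_fold_indices` is an assumption, and the data are not shuffled here, which a real evaluation would usually do first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # current partition
        train = indices[:start] + indices[start + size:]   # all other partitions
        yield train, test
        start += size

for i, (train, test) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"iteration {i}: test={test} train={train}")
```

In each iteration a model would be trained on `train` and scored on `test`, and the k scores are then averaged.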
A confusion matrix is a table that summarizes and shows the performance of a machine learning algorithm (Widaretna, Tirtawangsa, & Romadhony, 2021). It consists of four components: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
• TP: positive data predicted correctly as positive
• TN: negative data predicted correctly as negative
• FP: negative data predicted as positive data
• FN: positive data predicted as negative data
The confusion matrix is intended to provide information about the types of errors made by a prediction model. Several performance measures of a prediction model can be derived from it, including accuracy, precision, and recall.
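The measures derived from the confusion matrix follow standard definitions, sketched below together with the F1-score used later in the paper. The function name `classification_metrics` and the counts in the usage example are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives that are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a small test set of 100 tweets.
print(classification_metrics(tp=8, tn=85, fp=2, fn=5))
```

As a consistency check on the reported results, the harmonic mean of a precision of 85.4 and a recall of 81.4 is about 83.4, which is in line with the F1-score of 83 reported for the decision tree with k-fold (averaging per-fold scores can give a slightly different figure).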
Research Methods
At this stage, problems are identified using the gap analysis technique. Problem identification draws on documents that are available on the internet and accessible to anyone: the Kominfo Strategic Plan, the Kominfo Annual Report, the Kominfo Performance Report, and statistical data from surveys by several survey institutions. The data used are tweets on Twitter that contain information on COVID-19 or Corona. The data were collected by crawling with the API provided by Twitter.
Results and Discussion
The labeling results show that the number of non-hoax tweets (1,586) far exceeds the number of hoax tweets (231), so the sample data are imbalanced. To address this, the SMOTE method was applied so that the hoax identification model would produce better results.
The hoax identification model is obtained by machine learning on the sample data after SMOTE has been applied, which serve as both training data and test data. Several algorithms were used to build hoax identification models: Gaussian Naïve Bayes, Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Decision Tree, Random Forest, and Support Vector Machine.
In building the hoax identification model, the sample data were divided into training data and test data under two treatments. The first treatment split the data into training and test sets at a ratio of 70:30, hereinafter referred to as modeling without the application of k-fold. The second treatment used the k-fold cross validation method with k=5.
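The first treatment, a single 70:30 split, can be sketched as below. The shuffling, the fixed seed, and the function name `holdout_split` are assumptions for illustration; the paper does not state how the split was drawn.

```python
import random

def holdout_split(samples, test_ratio=0.3, seed=42):
    """Shuffle the samples and split them into (train, test) by test_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(100))
print(len(train), len(test))
```

Unlike k-fold cross validation, this evaluates the model on one test set only, so the resulting scores depend more strongly on which samples happen to land in that set.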
To determine the best hoax identification model, a confusion matrix is used to measure the performance of each algorithm. Because the data are imbalanced, the F1-score of each model is taken as the measure of its performance.
The evaluation results of the hoax identification model for each algorithm under the first treatment, based on the confusion matrix, can be seen in Table 5.1 and Figure 5.1. The SVM algorithm achieved the highest F1-score, with a value of 96% and precision, recall, and accuracy of 95%, 98%, and 99%.
Table 1 Results of hoax identification modeling without applying K-Fold
Algorithm | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%)
Gaussian Naïve Bayes | 87 | 100 | 93 | 98
Multinomial Naïve Bayes | 85 | 100 | 92 | 98
Bernoulli Naïve Bayes | 100 | 88 | 94 | 99
Support Vector Machine | 95 | 98 | 96 | 99
Decision Tree | 27 | 93 | 42 | 71
Random Forest | 100 | 88 | 94 | 99
Figure 1 The performance of the hoax detection model without the application of k-fold
In the second treatment, the evaluation results of the hoax identification model for each algorithm can be seen in Table 5.2 and Figure 5.2. Based on these results, the highest F1-score was obtained by the decision tree model, with a value of 83% and precision, recall, and accuracy of 85.4%, 81.4%, and 97.2%.
Table 2 Results of Hoax Identification Modeling by Applying K-Fold
Algorithm | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%)
Gaussian Naïve Bayes | 75.2 | 84 | 80.2 | 95.6
Multinomial Naïve Bayes | 77.8 | 87.8 | 79.4 | 87
Bernoulli Naïve Bayes | 79.2 | 57.4 | 64.8 | 87.8
Support Vector Machine | 60 | 1.2 | 3 | 89
Decision Tree | 85.4 | 81.4 | 83 | 97.2
Random Forest | 80 | 71.2 | 75.4 | 96.6
Figure 2 The performance of the hoax detection model with the application of k-fold
The best hoax identification model is then applied to all of the crawled data. The model used is the best model obtained with the k-fold method, because it is more reliable. As shown in Figure 5.3, of the 18,170 tweets processed, 10,104 were identified as non-hoaxes and 8,066 were identified as hoaxes.
Figure 3 Hoax identification results
Tweets identified as hoaxes then go through a sentiment orientation classification process, with the results shown in Figure 5.4. Of these tweets, 3,820 are classified as having negative sentiment orientation and 4,246 as having positive sentiment orientation.
Figure 4 The results of the classification of sentiment orientation on Twitter hoaxes
The authors expected tweets identified as hoaxes to be dominated by tweets with negative sentiment orientation, because hoaxes are characteristically provocative and emotional.
Conclusion
Among the models built without the application of k-fold, the best hoax classification model for detecting potential hoaxes in tweets on Twitter uses the SVM algorithm, with precision, recall, F1-score, and accuracy of 95%, 98%, 96%, and 99%.
Sentiment orientation classification of the tweets identified as hoaxes by the best model (with the application of k-fold) yielded more tweets with positive sentiment orientation (52.64%). Several factors may explain this: the sentiment orientation classification method is simple (lexicon-based), and the data used are less reliable because they come from a prediction model that is not 100% accurate.
References
Al-Ash, Herley Shaori, Putri, Mutia Fadhila, Mursanto, Petrus, & Bustamam, Alhadi. (2019). Ensemble Learning Approach on Indonesian Fake News Classification. ICICOS 2019 - 3rd International Conference on Informatics and Computational Sciences: Accelerating Informatics and Computational Research for Smarter Society in The Era of Industry 4.0, Proceedings. https://doi.org/10.1109/ICICoS48119.2019.8982409
Alamsyah, Syahdan. (2020). Heboh Isu Pasien Suspect Corona di Sukabumi, RS Belum Buka Suara.
Chawla, Nitesh V., Bowyer, Kevin W., Hall, Lawrence O., & Kegelmeyer, W. Philip. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16(Sept. 28), 321–357.
Hulu, Sitefanus. (2020). Analisis Kinerja Metode Cross Validation Dan K-Nearest Neighbor Dalam Klasifikasi Data. In Universitas Sumatera Utara. Medan.
Juditha, Christiany. (2019). Literasi Informasi Melawan Hoaks Bidang Kesehatan di Komunitas Online. Jurnal ILMU KOMUNIKASI, 16(1), 77. https://doi.org/10.24002/jik.v16i1.1857
Kemkominfo. (2021). Laporan Tahunan 2020 (Indonesia Tekoneksi: Semakin Digital, Semakin Maju). Jakarta.
Lesmana, Pekik Indra. (2013). Analisis Sentimen Pengguna Layanan Media Sosial Twitter di Indonesia. Jakarta.
Pangastuti, Sinta Septi. (2018). Perbandingan Metode Ensemble Random Forest Dengan Smote-Boosting Dan Smote-Bagging Pada Klasifikasi Data Mining Untuk Kelas Imbalance a Comparison of the Ensemble Random Forest Methods With Smote-Boosting and Smote-Bagging on Data Mining Classification Fo. Surabaya.
Parsian, Mahmoud. (2015). Data Algorithms: Recipes for Scaling Up with Hadoop and Spark (1st ed.). O'Reilly Media, Inc.
Ritonga, Alven Safik, & Purwaningsih, Endah Supeni. (2018). Penerapan Metode Support Vector Machine (SVM) Dalam Klasifikasi Kualitas Pengelasan Smaw (Shield Metal Arc Welding). Ilmiah Edutic, 5(1), 17–25.
Sá, J. A. S., Almeida, A. C., Rocha, B. R. P., Mota, M. A. S., Souza, J. R. S., & Dentel, L. M. (2016). Lightning Forecast Using Data Mining Techniques On Hourly Evolution Of The Convective Available Potential Energy. (March), 1–5. https://doi.org/10.21528/cbic2011-27.1
Sihombing, Pangondian Prederikus, Jayadi, Riyanto, Chandra, Edward, & Liu, Stefanie. (2020). Support vector machine-based hoax detection on indonesian online news. International Journal of Advanced Trends in Computer Science and Engineering, 9(4), 6202–6207. https://doi.org/10.30534/ijatcse/2020/297942020
Somantri, Andri. (2020). Jangan Panik Corona! Warga Sukabumi Diminta Tak Usah Serbu Pasar.
Sutantohadi, Alief, & Rokhimatul Wakhidah. (2017). Bahaya Berita Hoax Dan Ujaran Kebencian Pada Media Sosial Terhadap Toleransi Bermasyarakat. DIKEMAS (Jurnal Pengabdian Kepada Masyarakat), 1(1), 1–5. https://doi.org/10.32486/jd.v1i1.153
Walia, Himdweep, Rana, Ajay, & Kansal, Vineet. (2018). A Naïve Bayes Approach for working on Gurmukhi Word Sense Disambiguation. 2017 6th International Conference on Reliability, Infocom Technologies and Optimization: Trends and Future Directions, ICRITO 2017, 2018-Janua, 432–435. https://doi.org/10.1109/ICRITO.2017.8342465
Wardani, Maria Magdalena Sinta. (2017). Manipulasi Bahasa dalam Teror Kabar Bohong (Hoax). Sintesis, 11(2), 87–94.
Widaretna, Titi, Tirtawangsa, Jimmy, & Romadhony, Ade. (2021). Hoax Identification on Tweets in Indonesia Using Doc2Vec. 2021 9th International Conference on Information and Communication Technology, ICoICT 2021, 456–461. https://doi.org/10.1109/ICoICT52021.2021.9527515
Copyright holder: Putra Tresna Linge, Alfan Farizki Wicaksono (2022)
First publication right: Syntax Literate: Jurnal Ilmiah Indonesia