Data Analysis
Statistical Analysis
The code for this section can be found in apply_chi_square.py
Research Question: Is there a significant difference between tweet disinformation groups in terms of sentiment towards Leni Robredo?
Contingency Table
Note: this is the tabular form of this graph.
| Group / Leni Sentiment | Negative | Neutral | Positive |
|---|---|---|---|
| Disinformation | 92 | 111 | 0 |
| Non-disinformation | 0 | 218 | 47 |
Hypothesis Testing
We used a chi-square test of independence to decide whether to reject the null hypothesis that sentiment towards Leni Robredo is independent of the tweet group.
| Chi Square Statistic | 169.01 |
|---|---|
| p-value | 2.00e-37 |
Reject the null hypothesis: There is a significant difference between the disinformation groups in terms of sentiment towards Leni Robredo.
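As a rough illustration of the test above, the sketch below computes the chi-square statistic directly from the counts in the contingency table using only the Python standard library. It is not the code in apply_chi_square.py; a library routine such as `scipy.stats.chi2_contingency` is the more usual choice, and whether the original script uses it is an assumption. The function name `chi_square_independence` is hypothetical. The printed values depend on the counts entered and may differ slightly from the reported figures.

```python
import math

# Observed counts copied from the contingency table above
# (rows: Disinformation, Non-disinformation; columns: Negative, Neutral, Positive).
observed = [
    [92, 111, 0],
    [0, 218, 47],
]


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom, expected_counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected


statistic, dof, expected = chi_square_independence(observed)

# With exactly 2 degrees of freedom, the chi-square survival function has the
# closed form exp(-x / 2), so no statistics library is needed for this table.
p_value = math.exp(-statistic / 2)

print(f"chi2 = {statistic:.2f}, dof = {dof}, p = {p_value:.2e}")
```

The closed-form p-value only holds for 2 degrees of freedom, which is what a 2 x 3 table gives; for other table shapes a statistics library would be needed.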
Disinformation Classifier
We created a disinformation classifier that predicts whether a given tweet is disinformation. Its purpose is to identify recurring patterns or features among disinformation tweets; in other words, to find the best predictors of a disinformation tweet.
The code for this section can be found in disinfo_classifier.py.
Model Structure
We used gradient-boosted trees as our model, implemented with the XGBoost library. The features are the vectorized tweet tokens, leni_sentiment, and marcos_sentiment. The target variable is whether a tweet is disinformation.
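To make the feature layout concrete, here is a minimal sketch of how token counts and the two sentiment labels could be combined into one feature row per tweet. The vectorizer actually used in disinfo_classifier.py is not stated, so the bag-of-words counts, the toy tweets, the `SENTIMENT_CODE` encoding, and the helper `to_feature_row` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy tweets; the real dataset and its column layout may differ.
tweets = [
    {"tokens": ["fake", "educate"], "leni_sentiment": "negative", "marcos_sentiment": "neutral"},
    {"tokens": ["igorot", "fake", "fake"], "leni_sentiment": "negative", "marcos_sentiment": "positive"},
    {"tokens": ["educate"], "leni_sentiment": "neutral", "marcos_sentiment": "neutral"},
]

# Assumed ordinal encoding for the sentiment labels.
SENTIMENT_CODE = {"negative": -1, "neutral": 0, "positive": 1}

# Vocabulary: every distinct token, sorted so that column order is stable.
vocabulary = sorted({token for tweet in tweets for token in tweet["tokens"]})


def to_feature_row(tweet):
    """Concatenate token counts with the two encoded sentiment labels."""
    counts = Counter(tweet["tokens"])
    token_part = [counts[token] for token in vocabulary]
    sentiment_part = [
        SENTIMENT_CODE[tweet["leni_sentiment"]],
        SENTIMENT_CODE[tweet["marcos_sentiment"]],
    ]
    return token_part + sentiment_part


feature_names = vocabulary + ["leni_sentiment", "marcos_sentiment"]
X = [to_feature_row(tweet) for tweet in tweets]

for name, column in zip(feature_names, zip(*X)):
    print(name, column)
```

Keeping `feature_names` aligned with the columns of `X` is what later allows an importance score to be traced back to a specific token or sentiment label.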
Here, we used 201 disinformation tweets and 265 non-disinformation tweets as our samples. We performed 5-fold cross-validation on this dataset.
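The splitting step of 5-fold cross-validation can be sketched as follows. This is a plain shuffled split written with the standard library; whether the original code shuffles or stratifies by class is not stated, so the shuffling, the seed, and the helper name `k_fold_indices` are assumptions.

```python
import random

N_SAMPLES = 201 + 265  # disinformation + non-disinformation tweets
N_FOLDS = 5


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


folds = list(k_fold_indices(N_SAMPLES, N_FOLDS))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [94, 93, 93, 93, 93]
```

Each tweet lands in exactly one test fold, so every sample is used for evaluation once and for training four times.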
Classification Report
The table below shows the classification report of our model, averaged over the five rounds of cross-validation. The model performed reasonably well, with F1 scores of about 0.8 for both classes and an accuracy of 0.81.
| | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| Disinformation | 0.82 | 0.85 | 0.84 | 53 |
| Non-disinformation | 0.81 | 0.76 | 0.78 | 40.2 |
| Accuracy | | | 0.81 | 93 |
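
For reference, the metrics in the report are defined from the counts of a single fold's confusion matrix as sketched below. The counts passed in are hypothetical and are not taken from our results; the function name `classification_metrics` is also an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class of one fold."""
    precision = tp / (tp + fp)   # of tweets predicted positive, how many were right
    recall = tp / (tp + fn)      # of truly positive tweets, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for one fold, with "disinformation" as the positive class.
metrics = classification_metrics(tp=45, fp=10, fn=5, tn=40)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because the report averages each metric over five folds, the averaged F1 score need not equal the F1 computed from the averaged precision and recall.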
Confusion Matrix

Most Important Features
The following features contribute most to the disinformation classification. The ranking is based on XGBoost's feature importance, specifically gain: the improvement in accuracy that a feature brings to the branches it is used to split.
- leni_sentiment (label)
- igorot (token)
- educate (token)
- jillian (token)
- fake (token)
- facewithtearsofjoy (emoji token)
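
To give a sense of what gain measures, the sketch below evaluates the split gain formula from XGBoost's documentation for a single candidate split under logistic loss. XGBoost's gain importance is the average of this quantity over all splits that use a feature. The toy labels and the helper names `split_gain` and `logistic_grad_hess` are hypothetical; the defaults `reg_lambda=1.0` and `gamma=0.0` mirror XGBoost's default regularization settings.

```python
def split_gain(grad_left, hess_left, grad_right, hess_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one node into a left and a right child."""
    def score(grad, hess):
        return grad * grad / (hess + reg_lambda)

    parent = score(grad_left + grad_right, hess_left + hess_right)
    children = score(grad_left, hess_left) + score(grad_right, hess_right)
    return 0.5 * (children - parent) - gamma


def logistic_grad_hess(labels, prob=0.5):
    """Summed first and second derivatives of log loss at a constant prediction."""
    grad = sum(prob - y for y in labels)
    hess = prob * (1 - prob) * len(labels)
    return grad, hess


# Hypothetical toy labels (1 = disinformation, 0 = non-disinformation).
# An informative feature separates the classes perfectly ...
gain_informative = split_gain(
    *logistic_grad_hess([1, 1, 1, 1]),
    *logistic_grad_hess([0, 0, 0, 0]),
)
# ... while an uninformative one leaves both children equally mixed.
gain_uninformative = split_gain(
    *logistic_grad_hess([1, 0, 1, 0]),
    *logistic_grad_hess([1, 0, 1, 0]),
)

print(gain_informative, gain_uninformative)  # 2.0 0.0
```

A feature such as leni_sentiment ranks highly when splits on it produce children that are much purer than their parent, as in the first case.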
Insights
While the tweets' primary targets were the Robredo siblings, many of the tweets also malign the image of ex-VP Robredo. As we have just seen, a tweet's sentiment towards ex-VP Robredo is an important predictor of whether it is disinformation: a tweet with negative sentiment towards the ex-VP will most likely be a disinformation tweet. This is consistent with our findings in Statistical Analysis.
This observation, combined with the fact that most of the disinformation tweets were posted just one month before the election, suggests that the disinformation tweets targeting the Robredo siblings were linked to the national elections. In particular, we conjecture that their primary goal was to ruin the image of the ex-VP because she was running for president. In other words, trolls and fake news peddlers used the Robredo siblings to damage the ex-VP's image and reduce her chances of winning the national elections.