Data Collection

The code for the processes discussed on this page can be found at this link: scraper/

Web Scraping

Using the web scraping tool snscrape, we scraped 2,730 unique tweets. Unfortunately, due to time constraints, we were not able to review every scraped tweet and classify it as disinformation or not. However, from these tweets, we identified 203 disinformation tweets from 158 accounts. We also identified 265 non-disinformation tweets to serve as a control dataset for our data analysis.
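As an illustration of what "unique" could mean here, the following is a minimal sketch that keeps one row per tweet_id. It assumes each scraped tweet is available as a dictionary with a tweet_id key; the function name deduplicate and the sample rows are hypothetical and are not taken from the scraper/ code.

```python
def deduplicate(tweets):
    """Keep the first occurrence of each tweet_id, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["tweet_id"] in seen:
            continue  # same tweet already collected, e.g. by another keyword search
        seen.add(tweet["tweet_id"])
        unique.append(tweet)
    return unique


# Hypothetical sample rows: the same tweet matched by two different keywords.
sample = [
    {"tweet_id": "1", "keywords": "keyword a"},
    {"tweet_id": "2", "keywords": "keyword a"},
    {"tweet_id": "1", "keywords": "keyword b"},
]
unique_tweets = deduplicate(sample)
```

Keeping the first occurrence is only one possible choice; merging the keywords of duplicate rows would be another.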

Data Columns

The following are all the columns produced automatically by the scraper.

tweet_id | tweet_url | keywords | account_handle | account_name
account_bio | account_bio_rendered | account_verified | joined | following
followers | location | tweet | tweet_rendered | date_posted
likes | replies | retweets | quote_tweets | views
source_url | source_label | links_url | media | retweeted_tweet_id
quoted_tweet_id | in_reply_to_tweet_id | in_reply_to_user_id | conversation_id

Data Labeling

Aside from the columns above, we also added new columns which we manually labeled. These new columns are:

  • leni_sentiment - The tweet's sentiment (negative, neutral, positive) towards former Vice President Robredo.
  • marcos_sentiment - The tweet's sentiment (negative, neutral, positive) towards President Marcos Jr.
  • incident - The incident associated with the disinformation tweet. The incidents are described below.
  • account_type - Indicates whether the account is anonymous, identified, or media.
  • tweet_type - Indicates whether the tweet is text, reply, image, URL, video, or a combination of these.
  • content_type - Indicates whether the tweet is rational, emotional, transactional, or a combination of these.
  • country - Indicates the country of the account based on its profile location field.
  • has_leni_ref - Indicates whether the tweet contains references to former Vice President Robredo (labeled as 1 if there is a reference, 0 otherwise).
  • alt-text - The alt-text of the tweet in case it contains videos, images, or articles.
We added these manually labeled columns to the columns produced by the scraper. In other words, the tweet data that we used in data exploration (and will use in data analysis) contains all the columns listed above: the union of the scraper's columns and our manually labeled columns.
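The union of columns described above can be sketched as follows. This is only an illustration under stated assumptions: rows are represented as dictionaries, manual labels are keyed by tweet_id, and the function names merge_labels and validate, as well as the sample values, are hypothetical. Combination labels (for tweet_type and content_type) are not checked because their exact encoding is not described here.

```python
# Allowed values taken from the column descriptions above.
SENTIMENTS = {"negative", "neutral", "positive"}
ACCOUNT_TYPES = {"anonymous", "identified", "media"}


def merge_labels(scraped_rows, manual_labels):
    """Attach manual labels to scraped rows, matching on tweet_id."""
    merged = []
    for row in scraped_rows:
        labels = manual_labels.get(row["tweet_id"])
        if labels is None:
            continue  # tweet was not reviewed, so it has no labels
        merged.append({**row, **labels})  # union of both column sets
    return merged


def validate(row):
    """Return a list of problems found in the manually labeled fields."""
    problems = []
    for col in ("leni_sentiment", "marcos_sentiment"):
        if row.get(col) not in SENTIMENTS:
            problems.append(f"{col}: unexpected value {row.get(col)!r}")
    if row.get("account_type") not in ACCOUNT_TYPES:
        problems.append(f"account_type: unexpected value {row.get('account_type')!r}")
    if row.get("has_leni_ref") not in (0, 1):
        problems.append(f"has_leni_ref: unexpected value {row.get('has_leni_ref')!r}")
    return problems


# Hypothetical sample data, not taken from the real dataset.
scraped = [
    {"tweet_id": "1", "tweet": "example text", "likes": 3},
    {"tweet_id": "2", "tweet": "unreviewed tweet", "likes": 0},
]
labels = {
    "1": {
        "leni_sentiment": "negative",
        "marcos_sentiment": "neutral",
        "incident": "Baguio",
        "account_type": "anonymous",
        "has_leni_ref": 1,
    }
}
dataset = merge_labels(scraped, labels)
```

Running validate on each merged row before analysis is a cheap way to catch typos introduced during manual labeling.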

The Allegations/Incidents

We have identified five disinformation topics/incidents about the Robredo sisters.

  • Jillian Robredo heckling at Baguio (codename: Baguio)
  • Alleged ladder incident involving Tricia Robredo (codename: Ladder)
  • Alleged sensitive videos of Aika and Tricia Robredo (codename: Scandal)
  • Alleged quarantine violation by the Robredos (codename: Quarantine)
  • Other topics, including "dissemination of anti-BBM flyers" and "accusing Leni of using public funds for her daughter's Harvard tuition" (codename: Others)
The codenames indicated above are used as the labels in the incident column.
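To show how the codenames might be used when working with the incident column, here is a minimal sketch that counts tweets per incident and rejects labels outside the five codenames. The function name count_incidents and the sample rows are hypothetical; only the codenames themselves come from the list above.

```python
from collections import Counter

# The five codenames listed above.
INCIDENT_CODENAMES = ("Baguio", "Ladder", "Scandal", "Quarantine", "Others")


def count_incidents(rows):
    """Count tweets per incident codename, rejecting unknown labels."""
    counts = Counter()
    for row in rows:
        incident = row["incident"]
        if incident not in INCIDENT_CODENAMES:
            raise ValueError(f"unknown incident codename: {incident!r}")
        counts[incident] += 1
    return counts


# Hypothetical sample rows, not taken from the real dataset.
sample_rows = [
    {"tweet_id": "1", "incident": "Baguio"},
    {"tweet_id": "2", "incident": "Ladder"},
    {"tweet_id": "3", "incident": "Baguio"},
]
incident_counts = count_incidents(sample_rows)
```

Raising an error on an unknown label, rather than silently counting it, makes misspelled codenames visible early.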

Chismisinformation Team