Twitter News Dataset

Abstract

This dataset consists of 5234 news events obtained from Twitter.

Download

Description

Events

The file events.csv.gz contains a CSV file, called events.csv, with all the news events captured from Twitter from August 2013 to June 2014. Each line of the file has the following format:

<event ID>,<date>,<total keywords>,<total tweets>,<keywords>

Where:

  • <event ID> is an integer that identifies the corresponding event. There are 5234 events, so <event ID> ranges from 1 to 5234.
  • <date> is the date of the event or connected component. The format is YYYY-MM-DD.
  • <total keywords> is an integer indicating how many keywords are in the event or connected component.
  • <total tweets> is an integer indicating how many tweets belong to this event.
  • <keywords> is a string containing <total keywords> keywords, separated by semicolons.
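
As an illustration of the format above, here is a minimal Python sketch that reads events.csv.gz line by line. The names Event, parse_event_line and read_events are hypothetical helpers, not part of the dataset. The sketch assumes the file has no header row, is UTF-8 encoded, and uses no CSV quoting; none of these is confirmed by the description above.

```python
import gzip
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    event_id: int
    day: date
    total_tweets: int
    keywords: list


def parse_event_line(line):
    """Parse one '<event ID>,<date>,<total keywords>,<total tweets>,<keywords>' line."""
    # Split on the first four commas only, so the keywords field stays whole
    # even if a keyword happens to contain a comma (an assumption, not a known case).
    fields = line.rstrip("\r\n").split(",", 4)
    event_id, day, total_keywords, total_tweets, keywords = fields
    words = keywords.split(";")
    # Cross-check the keyword count announced in the third field.
    if len(words) != int(total_keywords):
        raise ValueError(f"event {event_id}: expected {total_keywords} keywords, got {len(words)}")
    return Event(int(event_id), date.fromisoformat(day), int(total_tweets), words)


def read_events(path="events.csv.gz"):
    """Yield one Event per non-empty line of the gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_event_line(line)
```

Calling list(read_events()) would load every event into memory; iterating over the generator keeps only one event at a time.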

Tweets

The file tweets.csv.gz (available upon request via email to the authors) contains a CSV file, called tweets.csv, with all the tweet IDs corresponding to each event in events.csv. Each line of the file has the following format:

<tweet ID>, <event ID>

Where:

  • <tweet ID> is a long integer giving the Twitter ID of the tweet. All the information about the tweet can be retrieved from the Twitter REST API using this ID.
  • <event ID> corresponds to the event ID of the given tweet.
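
The pairs above can be grouped by event with a short Python sketch such as the following. The function names are hypothetical, and the sketch assumes there is no header row and that whitespace around the comma (as in the format line above) may or may not be present.

```python
import gzip
from collections import defaultdict


def parse_tweet_line(line):
    """Parse one '<tweet ID>, <event ID>' line into a pair of integers."""
    tweet_id, event_id = line.split(",")
    # int() ignores surrounding whitespace, including the space after the comma
    # and the trailing newline.
    return int(tweet_id), int(event_id)


def group_tweets_by_event(lines):
    """Return a dict mapping each event ID to the list of its tweet IDs."""
    tweets = defaultdict(list)
    for line in lines:
        if line.strip():
            tweet_id, event_id = parse_tweet_line(line)
            tweets[event_id].append(tweet_id)
    return dict(tweets)


def load_tweets(path="tweets.csv.gz"):
    """Read the gzipped file and group its tweet IDs by event."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return group_tweets_by_event(handle)
```

Python integers have arbitrary precision, so long tweet IDs are stored exactly. In languages with fixed-width numbers, a 64-bit integer type or a string would be the safer choice.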

Clusters

The files cluster_labels.txt and time_resolutions.txt contain the cluster labels for each event and the time resolutions learned from all events, respectively.

  • cluster_labels.txt contains one integer per line, from 0 to 19. The integer in line i is the cluster label of the event whose event ID is i.
  • time_resolutions.txt contains one floating-point number per line, each giving a time resolution, in minutes, learned from all events. There are 20 numbers in the file, in increasing order, with at most 13 digits after the decimal point.
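
A minimal Python sketch for loading both files follows. The function names are hypothetical, and the sketch assumes that line numbers start at 1, matching the event ID range. Whether cluster label k refers to the k-th time resolution is not stated above, so the sketch deliberately keeps the two files separate.

```python
def read_cluster_labels(lines):
    """Map each event ID (the 1-based line number) to its cluster label."""
    labels = {}
    for event_id, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines without shifting the numbering
        label = int(line)
        if not 0 <= label <= 19:
            raise ValueError(f"line {event_id}: label {label} is outside 0..19")
        labels[event_id] = label
    return labels


def read_time_resolutions(lines):
    """Return the time resolutions, in minutes, in file order."""
    return [float(line) for line in lines if line.strip()]


def load_clusters(labels_path="cluster_labels.txt",
                  resolutions_path="time_resolutions.txt"):
    """Load both files from disk and return (labels, resolutions)."""
    with open(labels_path, encoding="utf-8") as handle:
        labels = read_cluster_labels(handle)
    with open(resolutions_path, encoding="utf-8") as handle:
        resolutions = read_time_resolutions(handle)
    return labels, resolutions
```

With the files in the working directory, load_clusters() returns a dictionary keyed by event ID and a list of 20 floats.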

Last Modified: 16 November 2016