Twitter News Dataset
Abstract
This dataset consists of 5234 news events obtained from Twitter.
Download
- events.csv.gz (98KB, 263KB decompressed)
- cluster_labels.txt (14KB)
- time_resolutions.txt (445B)
Description
Events
The file events.csv.gz contains a CSV file, called events.csv, with all the news events captured from Twitter from August 2013 to June 2014. Each line of the file has the following format:
<event ID>,<date>,<total keywords>,<total tweets>,<keywords>
Where:
- <event ID> is an integer that identifies the corresponding event. There are 5234 events, so <event ID> ranges from 1 to 5234.
- <date> is the date of the event or connected component, in the format YYYY-MM-DD.
- <total keywords> is an integer indicating how many keywords are in the event or connected component.
- <total tweets> is an integer indicating how many tweets belong to this event.
- <keywords> is a string containing <total keywords> keywords. Consecutive keywords are separated by a semicolon.
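The line format described above can be read with a few lines of Python. The following is a minimal sketch, not code distributed with the dataset: the names Event, parse_event_line and read_events are illustrative, and it assumes the file is UTF-8 encoded and has no header row. Each line is split on the first four commas only, so the keyword field is kept whole even if a keyword happens to contain a comma.

```python
import gzip
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    event_id: int
    day: date
    total_keywords: int
    total_tweets: int
    keywords: list


def parse_event_line(line):
    """Parse one '<event ID>,<date>,<total keywords>,<total tweets>,<keywords>' line."""
    # Split on the first four commas only; the remainder is the keyword field.
    event_id, day, n_keywords, n_tweets, keywords = line.rstrip("\r\n").split(",", 4)
    return Event(
        event_id=int(event_id),
        day=date.fromisoformat(day),      # YYYY-MM-DD
        total_keywords=int(n_keywords),
        total_tweets=int(n_tweets),
        keywords=keywords.split(";"),     # semicolon-separated keywords
    )


def read_events(path="events.csv.gz"):
    """Read every event from the gzip-compressed CSV (assumed UTF-8, no header)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [parse_event_line(line) for line in handle if line.strip()]
```

A simple consistency check after loading is that len(event.keywords) equals event.total_keywords for every event.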
Tweets
The file tweets.csv.gz (available upon request via email to the authors) contains a CSV file, called tweets.csv, with all the tweet IDs corresponding to each event in events.csv. Each line of the file has the following format:
<tweet ID>, <event ID>
Where:
- <tweet ID> is a long integer giving the Twitter ID of the tweet. The Twitter REST API can be used to retrieve all the information about the tweet.
- <event ID> is the ID of the event to which the tweet belongs.
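To recover the tweets of each event, the lines of tweets.csv can be grouped by event ID. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the file is assumed to be UTF-8 encoded with no header row, and the space after the comma shown in the format is tolerated because int() ignores surrounding whitespace.

```python
import gzip
from collections import defaultdict


def group_tweets_by_event(lines):
    """Map each event ID to the list of tweet IDs that belong to it."""
    tweets_by_event = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        tweet_id, event_id = line.split(",")
        # int() accepts surrounding whitespace, so "123, 1\n" parses cleanly.
        tweets_by_event[int(event_id)].append(int(tweet_id))
    return dict(tweets_by_event)


def read_tweets(path="tweets.csv.gz"):
    """Read the gzip-compressed tweets file (assumed UTF-8, no header)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return group_tweets_by_event(handle)
```

The number of tweet IDs collected for an event can then be compared with the <total tweets> field of the same event in events.csv.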
Clusters
The files cluster_labels.txt and time_resolutions.txt contain the cluster labels for each event and the time resolutions learned from all events, respectively.
- cluster_labels.txt contains one integer per line, from 0 to 19. The label on line i is the cluster label of the event whose ID is i.
- time_resolutions.txt contains one floating-point number per line, indicating the time resolution learned for all events, in minutes. There are 20 numbers in the file, in increasing order, each with at most 13 digits after the decimal point.
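Because line i of cluster_labels.txt refers to event ID i, the labels can be loaded into a dictionary keyed by event ID, counting lines from 1. The following sketch uses illustrative function names and assumes each file holds exactly one number per line. The description does not state whether cluster label k refers to the k-th time resolution, so the sketch loads the two files separately and does not link them.

```python
from pathlib import Path


def parse_cluster_labels(text):
    """Return {event ID: cluster label}; line i (1-based) belongs to event ID i."""
    return {event_id: int(token) for event_id, token in enumerate(text.split(), start=1)}


def parse_time_resolutions(text):
    """Return the time resolutions, in minutes, in file order."""
    return [float(token) for token in text.split()]


def load_clusters(labels_path="cluster_labels.txt",
                  resolutions_path="time_resolutions.txt"):
    """Load both files from disk and return (labels, resolutions)."""
    labels = parse_cluster_labels(Path(labels_path).read_text())
    resolutions = parse_time_resolutions(Path(resolutions_path).read_text())
    return labels, resolutions
```

With the full files, the labels dictionary should contain 5234 entries with values between 0 and 19, and the resolutions list should contain 20 values in increasing order.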
Last Modified: 16 November 2016