Twitter cascade dataset

This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. This dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having...

Bibliographic Details
Main Author:	Living Analytics Research Centre
Format:	text
Published:	Institutional Knowledge at Singapore Management University 2017
Subjects:	Computer Sciences
Online Access:	https://ink.library.smu.edu.sg/researchdata/20 https://larc.smu.edu.sg/twitter-cascade-dataset
Institution:	Singapore Management University

id	sg-smu-ink.researchdata-1021
record_format	dspace
institution	Singapore Management University
building	SMU Libraries
country	Singapore
collection	InK@SMU
topic	Computer Sciences
description	This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. The dataset was collected via the Twitter REST and streaming APIs as follows. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. This yielded a total of 184,794 Twitter user accounts. We then crawled the tweets of these users from 1 April to 31 August 2012, obtaining 32,479,134 tweets in all. To identify cascades, we extracted all the URL links and hashtags from these tweets and used them as the identities of cascades. In other words, all the tweets that contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e., cascade C = {<u1, t1>, <u2, t2>, <u1, t3>, <u3, t4>, <u4, t5>}; note that the same user (here u1) can appear more than once. For evaluation, the dataset was split into two parts: the first four months of data for training and the final month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, and the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
format	text
author	Living Analytics Research Centre
title	Twitter cascade dataset
publisher	Institutional Knowledge at Singapore Management University
publishDate	2017
url	https://ink.library.smu.edu.sg/researchdata/20 https://larc.smu.edu.sg/twitter-cascade-dataset
_version_	1681132637464297472
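
The cascade construction described in the record can be illustrated with a minimal Python sketch. It is not the Living Analytics Research Centre's actual pipeline: the regular expressions, the lowercasing of hashtags, the function names build_cascades and split_by_time, the sample tweets, and the choice to split each cascade's pairs at 1 August 2012 are all assumptions made for illustration. The record only states that tweets sharing a URL or hashtag form a cascade of user-timestamp pairs, and that four months are used for training and the last month for testing.

```python
import re
from collections import defaultdict
from datetime import datetime

# Simplified patterns (assumption); a real crawl would more likely rely on
# the entity metadata returned by the Twitter API.
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def build_cascades(tweets):
    """Group (user, timestamp, text) tweets into cascades keyed by hashtag or URL."""
    cascades = defaultdict(list)
    for user, timestamp, text in tweets:
        # Lowercasing hashtags is an assumption, not stated in the record.
        keys = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        keys |= set(URL_RE.findall(text))
        for key in keys:
            cascades[key].append((user, timestamp))
    # Order each cascade's user-timestamp pairs by time.
    return {key: sorted(pairs, key=lambda p: p[1]) for key, pairs in cascades.items()}


def split_by_time(cascades, cutoff):
    """Split each cascade's pairs into those before the cutoff and those at or after it."""
    train, test = {}, {}
    for key, pairs in cascades.items():
        early = [p for p in pairs if p[1] < cutoff]
        late = [p for p in pairs if p[1] >= cutoff]
        if early:
            train[key] = early
        if late:
            test[key] = late
    return train, test


# Hypothetical sample tweets with anonymized user identifiers.
tweets = [
    ("u1", datetime(2012, 4, 2, 9, 0), "Haze again #sghaze"),
    ("u2", datetime(2012, 4, 2, 9, 5), "RT #sghaze http://example.com/a"),
    ("u1", datetime(2012, 8, 3, 12, 0), "Still here #sghaze"),
    ("u3", datetime(2012, 8, 4, 8, 0), "Read this http://example.com/a"),
]

cascades = build_cascades(tweets)
train, test = split_by_time(cascades, cutoff=datetime(2012, 8, 1))
```

In this sketch the same user can appear several times in one cascade, matching the example C in the description, where u1 occurs at both t1 and t3. How the original split treats cascades that span the training and testing periods is not stated in the record, so the per-pair split here is only one possible reading.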

Similar Items