Datasets

This page is a collection of datasets used in my research activity.

When you contact me with a data request, please state your academic status (home institution and current position) and briefly explain how you will use the data. Datasets will not be shared with parties that have commercial purposes. Two terms of use apply:

  • The appropriate papers must be cited in any research product based on these datasets.
  • The datasets may not be redistributed without my permission.

Individual Performance in Team-based Online Games

The code used to obtain the results presented in the paper is openly available in two Jupyter notebooks:

  1. Performance Analysis Notebook
  2. Engagement Prediction Notebook

The dataset used in this study has been deposited in the Harvard Dataverse repository (doi:10.7910/DVN/B0GRWX) and is available at the following URL:

Access to the League of Legends Dataset

Extremist Propaganda Dataset

This dataset is associated with the paper “Contagion dynamics of extremist propaganda in social networks” (Information Sciences). PDF

Instagram Dataset

Source: Public media and user information from Instagram.com (collected through the Instagram API).
Crawling period: Jan 20 – Feb 17, 2014.
Description: Each record in the media dataset contains the anonymized media ID, the anonymized ID of the user who created the media, the timestamp of media creation, the set of tags assigned to the media, the number of likes, and the number of comments it received. The anonymized user network contains asymmetric relations (A follows B); each edge carries the number of likes given by A to media created by B, the number of comments, and the list of the comments’ timestamps.
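The record layout described above can be pictured with a minimal Python sketch. All class names, field names, and types here are hypothetical, chosen only for illustration; the actual file layout of the dataset is not specified on this page, and the timestamp is assumed to be an integer.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MediaRecord:
    """One media item (hypothetical field names)."""
    media_id: str                  # anonymized media ID
    user_id: str                   # anonymized ID of the creator
    created_at: int                # creation timestamp (integer assumed)
    tags: Set[str] = field(default_factory=set)
    n_likes: int = 0
    n_comments: int = 0


@dataclass
class FollowEdge:
    """Directed edge: follower A follows followee B (hypothetical field names)."""
    follower: str
    followee: str
    n_likes: int = 0               # likes by A on media created by B
    n_comments: int = 0
    comment_timestamps: List[int] = field(default_factory=list)


def likes_per_tag(media):
    """Sum the likes of all media items carrying each tag."""
    totals = {}
    for item in media:
        for tag in item.tags:
            totals[tag] = totals.get(tag, 0) + item.n_likes
    return totals


# Toy records, invented for this example only.
sample = [
    MediaRecord("m1", "u1", 1390200000, {"sunset", "beach"}, 10, 2),
    MediaRecord("m2", "u2", 1390300000, {"beach"}, 5, 0),
]
tag_likes = likes_per_tag(sample)
edge = FollowEdge("u2", "u1", n_likes=1, n_comments=1,
                  comment_timestamps=[1390250000])
```

Keeping the tags as a set mirrors the "set of tags" wording of the description; a list would work equally well if tag order matters.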

Size

Media dataset: 1.7M media associated with 2K users, with 9M tags, 1200M likes, and 41M comments.
User network: about 45K vertices and 678K edges.

Request data
Please cite:

Emilio Ferrara, Roberto Interdonato, Andrea Tagarelli.
Online Popularity and Topical Interests through the lens of Instagram.
In Proc. 25th ACM Conference on Hypertext and Social Media, September 1–4, 2014, Santiago, Chile.

Facebook Datasets

In August 2010, I collected two samples of the Facebook friendship graph using two techniques:

  1. Breadth First Search (BFS) traversal algorithm. It contains about 7 million nodes and 12 million edges.
  2. Uniform (UNI) sampling approach (rejection sampling). It contains about 7 million nodes and 7 million edges.
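The difference between the two techniques can be illustrated with a minimal Python sketch on a toy graph. This is not the crawler used to collect the data: the function names, the `neighbors` and `exists` callbacks, and the `max_draws` cap are all assumptions made for the example. The rejection-sampling part assumes that candidate user IDs are drawn uniformly from a known ID space and discarded when no such user exists.

```python
import random
from collections import deque


def bfs_sample(neighbors, seed, max_nodes):
    """Collect nodes in breadth-first order from `seed`, up to `max_nodes`."""
    visited = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        for nb in neighbors(node):
            if nb not in visited:
                visited.add(nb)
                order.append(nb)
                queue.append(nb)
                if len(order) >= max_nodes:
                    break
    return order


def uniform_rejection_sample(id_space, exists, n_samples, rng, max_draws=10_000):
    """Draw IDs uniformly from range(id_space) and keep those that exist.

    Stops after `max_draws` draws so the loop always terminates,
    even if fewer than `n_samples` valid IDs exist.
    """
    accepted = set()
    for _ in range(max_draws):
        if len(accepted) >= n_samples:
            break
        candidate = rng.randrange(id_space)
        if exists(candidate):      # reject IDs with no matching user
            accepted.add(candidate)
    return accepted


# Toy friendship graph, invented for this example.
toy = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
bfs_nodes = bfs_sample(lambda v: toy[v], seed=0, max_nodes=4)

# Toy ID space of 100 slots in which only four IDs belong to real users.
existing = {3, 17, 42, 99}
uni_nodes = uniform_rejection_sample(100, existing.__contains__, 2,
                                     random.Random(0))
```

A BFS sample explores the region around the seed densely, whereas rejection sampling gives every existing ID the same chance of being picked regardless of its position in the graph.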
Related Papers

The following papers are based on these datasets.
Please cite those which are relevant to your research if you use any Facebook dataset.

  1. P. De Meo, E. Ferrara, G. Fiumara, A. Provetti.
    On Facebook, most ties are weak.
    Communications of the ACM 57 (11), 78-84, 2014.
    Useful links: PDF | CACM
  2. S. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti.
    Extraction and analysis of Facebook friendship relations.
    Computational Social Networks: Mining and Visualization, pp 291-324, 2012.
    Useful links: PDF | Book page
  3. S. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti.
    Crawling Facebook for social network analysis purposes.
    WIMS ’11: Proceedings of the International Conference on Web Intelligence, Mining and Semantics, 2011.
    Useful links: PDF | ACM | arXiv
  4. E. Ferrara.
    Community structure discovery in Facebook.
    International Journal of Social Network Mining, 1(1):67-90, 2012.
    Useful links: PDF | Journal page
  5. E. Ferrara.
    A large-scale community structure analysis in Facebook.
    EPJ Data Science, 1(9):1-30, 2012.
    Useful links: PDF | Journal page | arXiv

Download Facebook datasets

Data are provided in “edge list” format, with one edge per line and tab-separated fields.
User IDs are anonymized.
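As a minimal sketch of how a tab-separated edge list can be loaded, the Python snippet below builds an adjacency map using only the standard library. It assumes each line holds exactly two anonymized user IDs and that friendships are treated as undirected; the function name and the in-memory demo data are hypothetical, and a real file would be opened with `open(path, newline="")` instead.

```python
import csv
import io
from collections import defaultdict


def read_edge_list(handle):
    """Build an undirected adjacency map from tab-separated 'u<TAB>v' lines."""
    adjacency = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:           # skip blank or malformed lines
            continue
        u, v = row[0], row[1]
        adjacency[u].add(v)
        adjacency[v].add(u)        # assumption: friendship is symmetric
    return adjacency


# In-memory stand-in for a downloaded sample file (invented data).
demo = io.StringIO("1\t2\n2\t3\n1\t3\n3\t4\n")
graph = read_edge_list(demo)
n_nodes = len(graph)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
```

For samples with millions of edges, a dedicated graph library may be more memory-efficient than a dictionary of sets.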

BFS sample

UNI sample