This page is a collection of datasets used in my research activity.

When you contact me with a data request, please state your academic status (home institution and current position) and briefly explain how you will use the data. Datasets will not be shared for commercial purposes. Two terms of use apply:

The appropriate papers must be cited in any research product based on these datasets.
The datasets may not be redistributed without my permission.

COVID-19 Tweets

Since January 2020, we have collected hundreds of millions of tweets related to COVID-19.

Dataset: https://github.com/echen102/COVID-19-TweetIDs
Paper: https://publichealth.jmir.org/2020/2/e19273/

2020 U.S. Election Tweets

We collected hundreds of millions of election-related tweets for the 2020 U.S. Presidential election.

Dataset: https://github.com/echen102/us-pres-elections-2020
Paper: https://arxiv.org/abs/2010.00600

Individual Performance in Team-based Online Games

The dataset used in this study has been deposited in the Harvard Dataverse repository (doi:10.7910/DVN/B0GRWX) and is available at the following URL:

Access the League of Legends Dataset

The code used to obtain the results presented in the paper is openly available in four Jupyter notebooks:

Performance Modeling: RQ1-RQ3
Performance Prediction: RQ4-Model1 RQ4-Model2 RQ4-Model3

Please cite: Sapienza, A., Zeng, Y., Bessi, A., Lerman, K., & Ferrara, E. (2018). Individual performance in team-based online games. Royal Society Open Science, 5(6), 180329.

Extremist Propaganda Dataset

This dataset is associated with the paper “Contagion dynamics of extremist propaganda in social networks” (Information Sciences). PDF

Instagram Dataset

Source: Public media and user information from Instagram.com (through the Instagram API).
Crawling period: Jan 20 – Feb 17, 2014.
Description: Each record in the media dataset contains the anonymized media ID, the anonymized ID of the user who created the media, the timestamp of media creation, the set of tags assigned to the media, and the number of likes and comments it received. The anonymized user network contains asymmetric relations (A follows B); each edge is associated with the number of likes (by A on media created by B), the number of comments, and the list of the comments' timestamps.
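
The record layout described above can be sketched in Python as two small data classes. This is only an illustration of the fields listed in the description: the class names, field names, and types are hypothetical, the timestamp unit (Unix seconds) is an assumption, and the actual file format of the released data is not specified here. The sample values are toy data, not taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class MediaRecord:
    """One record of the media dataset (field names are hypothetical)."""
    media_id: str            # anonymized media ID
    user_id: str             # anonymized ID of the user who created the media
    created_at: int          # creation timestamp; Unix seconds is an assumption
    tags: FrozenSet[str]     # set of tags assigned to the media
    likes: int               # number of likes received
    comments: int            # number of comments received


@dataclass(frozen=True)
class FollowEdge:
    """One directed edge of the user network: follower -> followee."""
    follower: str                        # user A
    followee: str                        # user B
    likes: int                           # likes by A on media created by B
    comments: int                        # number of comments on this edge
    comment_timestamps: Tuple[int, ...]  # timestamps of those comments


def tag_counts(records: Iterable[MediaRecord]) -> Counter:
    """Count how many media each tag is assigned to."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.tags)
    return counts


def in_degree(edges: Iterable[FollowEdge]) -> Counter:
    """Count followers per user (edges are asymmetric: A follows B)."""
    return Counter(edge.followee for edge in edges)


# Toy values for illustration only; they are not taken from the dataset.
media = [
    MediaRecord("m1", "u1", 1390200000, frozenset({"sunset", "beach"}), 12, 3),
    MediaRecord("m2", "u2", 1390300000, frozenset({"beach"}), 5, 0),
]
edges = [
    FollowEdge("u2", "u1", 4, 2, (1390200100, 1390200200)),
    FollowEdge("u3", "u1", 0, 0, ()),
]
```

With records in this shape, `tag_counts(media)` gives per-tag usage and `in_degree(edges)` gives the number of followers per user, two simple starting points for the kind of popularity analysis the dataset supports.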

Size

Media dataset: 1.7M media associated with 2K users, with 9M tags, 1200M likes, and 41M comments.
User network: about 45K vertices and 678K edges.

Request data

Media dataset: about 51MB (200MB uncompressed).
User network: about 21MB (7MB uncompressed).

Please cite:

Emilio Ferrara, Roberto Interdonato, Andrea Tagarelli.
Online Popularity and Topical Interests through the lens of Instagram.
In Proc. 25th ACM Conference on Hypertext and Social Media, September 1–4, 2014, Santiago, Chile.
