COMM 557: Data Science for Communication & Social Networks

ANN L101, Thursdays, 12:30-3:20pm
Prof. Emilio Ferrara
Questions? Join the class’ Slack channel!
[Last Update: Spring 2023]

Mining the Social Web

Johnny Mnemonic (1995) — © TriStar Pictures

Course description and learning objectives

Learn how to unleash the full power and potential of Social Web data for research and business application purposes!

The Social Web pervades all aspects of our lives: we connect and share with friends, search for jobs and opportunities, rate products and write reviews, establish collaborations and projects, all by using online social platforms like Facebook, LinkedIn, Yelp and GitHub. We express our personality and creativity through social platforms for visual discovery, collection and bookmarking like Tumblr and Pinterest. We keep up-to-date, communicate and discuss news and topics of our interest on Twitter and Reddit.

In this course, we will explore the opportunities provided by the wealth of social data available from these platforms. You will learn how to acquire, process, analyze and visualize data related to social networks and media activity, users and their behaviors, trends and information spreading. This journey will take us through the lands of data mining and machine learning methods: supervised and unsupervised learning will be applied to practical problems like social link analysis, opinion mining, and building smart recommender systems. We will explore open-source tools to understand how to extract meaning from human language, use network analysis to study how humans connect, and discover affinities among people’s interests and tastes by building interest graphs.

Taking this course, you should expect to learn about:

Applications of text and document analysis:
- Natural Language Processing and Part-of-speech tagging.
- Sentiment Analysis.
- Topic Modeling.
Networks:
- Statistical descriptors of networks: link analysis, centrality, and prestige.
- Network clustering: modularity and community detection.
- Dynamics of information and epidemics: threshold and information cascade models.
- Network biases and network manipulation: paradoxes, bots, disinformation.
- Network visualization algorithms: spring-like layouts, multidimensional scaling, Gephi.
Supervised learning: Crash course on Data Classification.
- Eager vs. Lazy learning: Decision Trees.
- Ensemble methods: Random Forest.
- Classification performance evaluation: Precision/Recall/F1, Accuracy and ROC Curves.
Unsupervised learning: Crash course on Clustering Data.
- Distance and similarity measures & K-means clustering.
- Hierarchical Clustering and Dendrograms.
- Clustering performance evaluation.

All topics will be explored from an applied, practical, computational perspective. Interested students can then pursue the rigorous theoretical foundations of these methods in other courses offered by USC (for example, CSCI-567 Machine Learning). Throughout the course, we will deliver several “hands-on” sessions with live coding, data analysis, and problem-solving!

Prerequisites

A basic understanding of programming that will allow you to manipulate data and implement basic algorithms, using any programming language, is recommended. Python will be the “official” programming language used during the hands-on sessions and for learning purposes. We will use IPython Notebook as the environment. However, feel free to use the language you prefer for your assignments and class project. A basic understanding of statistics and algebra will help too.

Books and learning material

Required textbooks (total Amazon price [new/used]: $100/$60)

(1) Web Data Mining (2nd Ed.), by Bing Liu (Amazon price [new/used]: $48/$35)
(2) Mining the Social Web (2nd Ed.), by Matthew A. Russell (Amazon price [new/used]: $27/$15)
(3) Programming Collective Intelligence, by Toby Segaran (Amazon price [new/used]: $25/$10)
(4) Network Science Book, by Albert-László Barabási (FREE: http://networksciencebook.com/)
(5) Dive into Python, by Mark Pilgrim (FREE: https://diveintopython3.problemsolving.io/)

Some details: (1) will provide insights on methods and approaches studied throughout the course from a machine learning perspective; (2) and (3) will serve as recipe books to effectively design and make those methods work with Social Web data; (4) and (5) are free resources we will exploit to gather additional material on networks and Python programming.

Technical, recommended (non-required) Python “cookbooks”:

Python Data Visualization Cookbook —by Igor Milovanović (ebook: $14)
Learning IPython for Interactive Computing and Data Visualization —by Cyrille Rossant (ebook: $10)
Learning scikit-learn: Machine Learning in Python —by Raúl Garreta and Guillermo Moncecchi (ebook: $10)

Policy & Grading

Class participation and engagement are essential ingredients for success in your academic career. During class, therefore, turn off cell phones and ringers (no vibrate mode), laptops, and tablets. The only exception is using a laptop to take notes; in this case, please sit in the front rows of the classroom. No email, social media, games, or other distractions will be accepted. Students are expected to do all readings and assignments, and to attend all meetings unless excused, in writing, at least 24 hours prior. This is the (tentative) system that will be employed for grading:

Component	Points	% Grade
Participation	15	15
Midterm exam	35	35
Final exam	50	50

The following will automatically result in a zero for the corresponding component of the grade: (1) failing to attend class on the day of your presentation; (2) failing to turn in the assignments by the expected dates; (3) failing to attend meetings of your group’s Hackathon and/or final presentation; (4) failing to submit your final paper by the expected date. Extenuating circumstances will normally include only serious emergencies or illnesses documented with a doctor’s note.

Assignments

Reaction debriefs: Each student will prepare a “synthesis and reaction” debrief in response to the weekly readings. This will be a brief note that summarizes the gist of the paper in one paragraph and provides comments or input for discussion, including questions, critiques, and/or theoretical and methodological concerns or ideas. These will be used to guide the discussion session of each class. (Reaction debriefs are not graded.)

Readings & discussion

Starting with lecture 2, one student per lecture will give a 10-minute presentation on one of that day’s readings, chosen by the student, and will help moderate a discussion session about it. The list of readings is available at the end of the syllabus.

Mid-Term Hackathon

The mid-term exam takes the form of a collaborative hackathon. The goal is to develop crucial abilities such as:

Intellectual development: leveraging expertise and multidisciplinary backgrounds, sharing ideas and knowledge.
Teamwork skills: effective brainstorming, communication and presentation, and group problem-solving.
Project management skills: ability to set goals, map progress, prototype and deliver, and meet deadlines.

If possible, we suggest that participants form groups of 2 members with the goal of solving a single problem. Students are encouraged to form groups with members from different academic backgrounds when possible. Each group will propose or receive a different problem.

We will propose several problems of interest for the course, and we also welcome your own proposals. These should be agreed upon with the Instructor during the first 4 weeks, in the form of a short one-page proposal clearly stating:

What the problem is.
Why it is deemed relevant.
How the group plans to solve the problem.
Bibliographic references to at least one relevant related paper.

All project proposals will be subject to our approval. Each group will be assigned an approved project, selected either from those proposed by the Instructor or from the group’s own proposal. Each group will receive a 30-minute slot to present their results, in which each member of the group is expected to discuss at least one critical task of the project. Project grades will be based in part on crowd-sourced ratings given by fellow students and submitted anonymously at the end of each presentation day.

Final Paper

A serious final paper is expected. The manuscript must be at least 3,000 and no more than 4,000 words (excluding references), and will include appropriate figures and tables and an unlimited number of references. The work should cover the following points:

Statement of the problem & Why the problem is important.
How the problem was addressed, including a description of methodology and dataset(s).
Discussion of results, findings, and limitations of the study.
Related literature & Final remarks/conclusions.

Ideally, the final paper should be based on the student’s mid-term hackathon project. Text cannot be shared with other group members; figures and tables can be shared when appropriate, with proper credit attribution. Grading will be based on soundness (both quality and quantity of original work). Groups of 2 students will be allowed to turn in a single joint-authored manuscript, in the format of a submission to an appropriate peer-reviewed journal or conference. Each author must contribute sufficient material to justify their “equal contribution” to the work. Both authors will receive the same grade for such a manuscript.

Final remarks

We would like to hear from anyone who has a disability or other issues that may require some modifications or class adjustments to be made. The offices of Disability Services and Psychological Services are available for assistance to students. Please see the instructor after class or during office hours.

We welcome feedback on the class organization, material, lectures, assignments, and exams. You can provide us with constructive criticism. Please share your comments and suggestions so that we can improve the class.

On the cover: Johnny Mnemonic

Johnny Mnemonic is a cyberpunk sci-fi cult movie from the mid-nineties, adapted from the short story of the same name by William Gibson. It tells the story of Johnny, a data courier, nicely played by Keanu Reeves early in his career, struggling with a huge load of data stored in his head. I find it a nice metaphor for our current data-overloaded society and, coincidentally, it is one of the defining movies I watched as a teenager that brought me to love Computer Science.

[version 0.5: February 10, 2023] – Illustrations by Midjourney

Syllabus

Part 1—Networks

Week One

Introduction of the course
Crash introduction to Networks—Statistical descriptors of networks.

Week Two

Network clustering. Modularity and community detection.
Readings: Papers [27], [31], [22] and [4]
Recommended Chapters: NSB:1 and NSB:2; NSB:9 and WDM:7.5

Week Three

Dynamics of information and epidemics spreading.
Readings: Papers [26], [5], [6], [14]
Recommended Chapters: NSB:10.1–10.3[pp.11–29]
Hands-on session: mining Twitter.
Readings: Papers [12]
Recommended Chapters: MtSW:1[pp.5-26]
Documentation: Twitter API (https://dev.twitter.com/)

Week Four

Networks and manipulation: bots, disinformation, emotional contagion
Readings: Papers [30], [39], [18], and [20]

Week Five

Guest speaker: Prof. Kristina Lerman
Bias in networks: friendship paradoxes & perception bias – network structures bias perception.
Readings: Papers [33], [23], [24]
Hands-on session: tutorial on Gephi.
Readings: Papers [3], [1], and [28]
Recommended Chapters: NSB:10.4–10.7[pp.30–58]
Documentation: Gephi Wiki (https://wiki.gephi.org/index.php/Main_Page)

Part 2—Text and Documents

Week Six

Crash intro to Natural Language Processing: Part-of-Speech Tagging.
Readings: Papers [16] and [40]
Recommended Chapters: WDM:6.5 and MtSW:5.3–5.5[pp.190–222]
Hands-on session: Tutorial on NLP

Week Seven

Sentiment Analysis
Readings: Papers [15] and [29]
Recommended Chapters: MtSW:4[pp.135–180]
Topic modeling
Readings: Papers [2] and [10]
Recommended Chapters: WDM:6.7
Hands-on session: Tutorial on Sentiment Analysis and Topic Modeling

Week Eight

Large Language Models
Readings: Papers [42] to [46]
Hands-on session: Tutorial on LLMs

Week Nine: Mid-term Hackathon week

Mid-term Hackathon presentations

Week Ten

No Classes

Part 3—Supervised Learning

Week Eleven

Crash intro to Supervised learning.
Readings: Papers [17] and [11] — Chapters: WDM:3.1
Eager vs. Lazy learning—Decision Trees
Readings: Papers [21] — Chapters: WDM:3.2 and WDM:3.9

Week Twelve

Ensemble methods, bagging and boosting & Classification performance evaluation.
Readings: Papers [9] — Chapters: WDM:3.3 and WDM:3.10

Part 4—Unsupervised Learning

Week Thirteen

Crash introduction to Unsupervised learning—Distance measures & K-means clustering.
Readings: Papers [38] and [37] — Chapters: WDM:4.1–4.3[pp.133–147]

Week Fourteen

Hierarchical clustering & Dendrograms.
Readings: Papers [24] — WDM:4.3–4.5[pp.147–155]

Finals Week

Project presentations
Final paper submissions

Reading list

[1] S. Aral and D. Walker. Identifying influential and susceptible members of social networks. Science, 337(6092):337–341, 2012.

[2] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[3] R. M. Bond, C. J. Fariss, J. J. Jones, A. D. Kramer, C. Marlow, J. E. Settle, and J. H. Fowler. A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415):295–298, 2012.

[4] S. P. Borgatti, A. Mehra, D. J. Brass, and G. Labianca. Network analysis in the social sciences. Science, 323(5916):892–895, 2009.

[5] D. Centola. The spread of behavior in an online social network experiment. Science, 329(5996):1194–1197, 2010.

[6] D. Centola. An experimental study of homophily in the adoption of health behavior. Science, 334(6060):1269–1272, 2011.

[7] A. Cho. Ourselves and our interactions: the ultimate physics problem? Science, 325(5939):406, 2009.

[8] D. J. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher, and J. Kleinberg. Inferring social ties from geographic coincidences. Proceedings of the National Academy of Sciences, 107(52):22436–22441, 2010.

[9] V. Dhar. Data science and prediction. Communications of the ACM, 56(12):64–73, 2013.

[10] P. S. Dodds, R. Muhamad, and D. J. Watts. An experimental study of search in global social networks. Science, 301(5634):827–829, 2003.

[11] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012.

[12] W. Fan and M. D. Gordon. The power of social media analytics. Communications of the ACM, 57(6):74–81, 2014.

[13] H. Garcia-Molina, G. Koutrika, and A. Parameswaran. Information seeking: convergence of search, recommendations, and advertising. Communications of the ACM, 54(11):121–130, 2011.

[14] P. Lorenz-Spreen, B. M. Mønsted, P. Hövel, and S. Lehmann. Accelerating dynamics of collective attention. Nature Communications, 10(1):1759, 2019.

[15] S. A. Golder and M. W. Macy. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science, 333(6051):1878–1881, 2011.

[16] P. DiMaggio. Adapting computational text analysis to social science (and vice versa). Big Data & Society, 2(2):2053951715602908, 2015.

[17] N. Jones. Computer science: The learning machines. Nature, 505(7482):146, 2014.

[18] S. Vosoughi, D. Roy, and S. Aral. The spread of true and false news online. Science, 359(6380):1146–1151, 2018.

[19] M. Kosinski, D. Stillwell, and T. Graepel. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15):5802–5805, 2013.

[20] A. D. Kramer, J. E. Guillory, and J. T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, page 201320040, 2014.

[21] D. Lazer, R. Kennedy, G. King, and A. Vespignani. Big data. the parable of google flu: traps in big data analysis. Science, 343(6176):1203, 2014.

[22] D. Lazer, A. S. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, et al. Life in the network: the coming age of computational social science. Science, 323(5915):721, 2009.

[23] E. Lee, F. Karimi, C. Wagner, H.-H. Jo, M. Strohmaier, and M. Galesic. Homophily and minority-group size explain perception biases in social networks. Nature Human Behaviour, 3(10):1078–1087, 2019.

[24] F. Kooti, N. O. Hodas, and K. Lerman. Network weirdness: Exploring the origins of network paradoxes. In Eighth International AAAI Conference on Weblogs and Social Media, 2014.

[25] D. Liben-Nowell, J. Novak, R. Kumar, P. Raghavan, and A. Tomkins. Geographic routing in social networks. Proceedings of the National Academy of Sciences of the United States of America, 102(33):11623–11628, 2005.

[26] P. T. Metaxas and E. Mustafaraj. Social media and the elections. Science, 338(6106):472–473, 2012.

[27] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876–878, 2010.

[28] L. Muchnik, S. Aral, and S. J. Taylor. Social influence bias: A randomized experiment. Science, 341(6146):647–651, 2013.

[29] M. Stella, E. Ferrara, and M. De Domenico. Bots increase exposure to negative and inflammatory content in online social systems. Proceedings of the National Academy of Sciences, 115(49):12435–12440, 2018.

[30] E. Ferrara, O. Varol, C. Davis, F. Menczer, and A. Flammini. The rise of social bots. Communications of the ACM, 59(7):96–104, 2016.

[31] M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[32] M. J. Salganik, P. S. Dodds, and D. J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006.

[33] S. L. Feld. Why your friends have more friends than you do. American Journal of Sociology, 96(6):1464–1477, 1991.

[34] M. Schich, C. Song, Y.-Y. Ahn, A. Mirsky, M. Martino, A.-L. Barabási, and D. Helbing. A network framework of cultural history. Science, 345(6196):558–562, 2014.

[35] C. Staff. Recommendation algorithms, online privacy, and more. Communications of the ACM, 52(5):10–11, 2009.

[36] G. Szabo and B. A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8):80–88, 2010.

[37] A. Vespignani. Modelling dynamical processes in complex socio-technical systems. Nature Physics, 8(1):32–39, 2012.

[38] A. Vespignani. Predicting the behavior of techno-social systems. Science, 325(5939):425, 2009.

[39] A. Bessi and E. Ferrara. Social bots distort the 2016 US Presidential election online discussion. First Monday, 21(11-7), 2016.

[40] S. C. Lewis, R. Zamith, and A. Hermida. Content analysis in an era of big data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1):34–52, 2013.

[41] H. Wallach. Computational social science ≠ computer science + social data. Communications of the ACM, 61(3):42–44, 2018.

[42] H. Li. Language models: past, present, and future. Communications of the ACM, 65(7):56–63, 2022.

[43] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.

[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, …, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[45] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? 🦜 In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.

[46] S. Greengard. Computational linguistics finds its voice. Communications of the ACM, 66(2):18–20, 2023.
