I decided to create the dataset used for the project myself. I used the Reddit API through its Python wrapper library, PRAW.
The data I wanted to collect was posts from different subreddits. My methodology for collecting the data was as follows:
- Take the first subreddit from the queue and download its top NUM_POSTS_FROM_SUB posts.
- Keep only the users that authored at least MIN_TIMES_POSTED posts among those NUM_POSTS_FROM_SUB posts.
- For each of those users, download the top NUM_POSTS_OF_USER posts from their profile and add the subreddits they were posted in to the queue if they weren't already in there.
- Repeat until the queue is empty or the subreddits are MAX_DEPTH steps away from the starting subreddit.

In order for the following code to work, you need to create a file called reddit_secrets.py in the same directory as this notebook. The file should contain the secrets you got from Reddit when you created your Reddit app. It should look like this:
CLIENT_ID = "XXXXXXXXXXXXXXXXXXX"
CLIENT_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"
USER_AGENT = "your_app_name"
import praw
import os
import pickle
import csv
import random
import math
import imageio
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from tabulate import tabulate
from collections import defaultdict
from copy import deepcopy
from IPython.display import clear_output
# Import the secrets
from reddit_secrets import CLIENT_ID, CLIENT_SECRET, USER_AGENT
# Directory for raw data
DATA_DIRECTORY = 'data'
# Directory for processed data used in Cytoscape
NETWORKS_DIRECTORY = 'networks'
DATA_PATH = os.path.join(os.getcwd(), DATA_DIRECTORY)
NETWORKS_PATH = os.path.join(os.getcwd(), NETWORKS_DIRECTORY)
if not os.path.exists(DATA_PATH):
os.mkdir(DATA_PATH)
if not os.path.exists(NETWORKS_PATH):
os.mkdir(NETWORKS_PATH)
The script below was running for about 8 hours on my machine before it encountered an internal server error from Reddit. Unfortunately, it didn't even manage to process all subreddits at depth 2. I decided to add a mechanism that allows me to resume the process from the last subreddit that was processed. That also allows me to stop and resume the process on demand.
The script at the start tries to load the state from the pickle file. If it fails, it means that the file doesn't exist and the script starts from scratch. If it succeeds, it means that the file exists and the script resumes from the last subreddit that was processed.
SCRIPT_SAVE_PATH = os.path.join(os.getcwd(), 'script_save.pkl')
script_save = None
try:
with open(SCRIPT_SAVE_PATH, 'rb') as f:
script_save = pickle.load(f)
print("Loaded script save. Resuming...")
print("NUM_POSTS_FROM_SUB:", script_save["NUM_POSTS_FROM_SUB"])
print("NUM_POSTS_OF_USER:", script_save["NUM_POSTS_OF_USER"])
print("MIN_TIMES_POSTED:", script_save["MIN_TIMES_POSTED"])
print("MAX_DEPTH:", script_save["MAX_DEPTH"])
print("Number of subreddits in queue:", len(script_save["sub_q"]))
print("Number of posts saved so far:", script_save["num_posts_saved"])
except:
print("No script save found. Starting from scratch...")
NUM_POSTS_FROM_SUB = 500 if script_save is None else script_save["NUM_POSTS_FROM_SUB"]
NUM_POSTS_OF_USER = 5 if script_save is None else script_save["NUM_POSTS_OF_USER"]
MIN_TIMES_POSTED = 2 if script_save is None else script_save["MIN_TIMES_POSTED"]
MAX_DEPTH = 5 if script_save is None else script_save["MAX_DEPTH"]
sub_q = ["programming"] if script_save is None else script_save["sub_q"]
sub_depths = {sub_q[0]: 0} if script_save is None else script_save["sub_depths"]
skipped_subs = [] if script_save is None else script_save["skipped_subs"]
reddit = praw.Reddit(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, user_agent=USER_AGENT)
num_posts_saved = 0 if script_save is None else script_save["num_posts_saved"]
# BFS; take the first subreddit from the queue.
while len(sub_q) > 0 and (sub:=sub_q.pop(0)):
print("=========================================")
print(f"Processing '{sub}' on depth {sub_depths[sub]}")
print(f"Queue size: {len(sub_q)}")
print(f"Num posts saved so far: {num_posts_saved}")
posts = None
try:
# Download posts from the subreddit
posts = list(reddit.subreddit(sub).top(limit=NUM_POSTS_FROM_SUB, time_filter="all"))
except:
print(f"ERROR: Cannot access '{sub}'")
skipped_subs.append(sub)
if posts is not None:
if len(posts) < NUM_POSTS_FROM_SUB:
print(f"Only {len(posts)} posts found")
data_df = pd.DataFrame(
[[post.title, post.score, post.id, post.url, post.num_comments, post.created, post.author, post.upvote_ratio, post.permalink, post.subreddit, post.subreddit_subscribers, sub_depths[sub]] for post in posts],
columns=["title", "score", "id", "url", "num_comments", "created", "author", "upvote_ratio", "permalink", "subreddit", "subreddit_subscribers", "depth"],
)
# Filter out posts made by deleted users
data_df = data_df[data_df["author"].notna()]
# Keep only authors that posted at least MIN_TIMES_POSTED times
data_df["author_name"] = data_df["author"].apply(lambda x: x.name)
data_df = data_df.groupby("author_name").filter(lambda x: len(x) >= MIN_TIMES_POSTED)
data_df = data_df.drop(columns=["author_name"])
authors = data_df["author"].unique()
num_posts = len(data_df)
num_posts_saved += num_posts
print(f"Num posts after filtering out: {num_posts} from {len(authors)} authors")
# Check if we reached the max depth
if sub_depths[sub] >= MAX_DEPTH:
print("Max depth reached")
else:
for author in authors:
try:
# Try to get submissions of the author
user_submissions = list(author.submissions.top(limit=NUM_POSTS_OF_USER, time_filter="all"))
# Extract subreddits from top user submissions and add them to the queue
for submission in user_submissions:
sub_name = submission.subreddit.display_name
if sub_name not in sub_depths:
sub_q.append(sub_name)
sub_depths[sub_name] = sub_depths[sub] + 1
except:
print(f"ERROR: User submissions are private for '{author}'")
# Save dataframe to csv
data_df.to_csv(f"{DATA_PATH}/posts_{sub}.csv", index=False)
# Save the script state to be able to resume in case of an error
script_save = {
"NUM_POSTS_FROM_SUB": NUM_POSTS_FROM_SUB,
"NUM_POSTS_OF_USER": NUM_POSTS_OF_USER,
"MIN_TIMES_POSTED": MIN_TIMES_POSTED,
"MAX_DEPTH": MAX_DEPTH,
"sub_q": sub_q,
"sub_depths": sub_depths,
"num_posts_saved": num_posts_saved,
"skipped_subs": skipped_subs,
}
with open(SCRIPT_SAVE_PATH, 'wb') as f:
pickle.dump(script_save, f)
Loaded script save. Resuming... NUM_POSTS_FROM_SUB: 500 NUM_POSTS_OF_USER: 5 MIN_TIMES_POSTED: 2 MAX_DEPTH: 5 Number of subreddits in queue: 7314 Number of posts saved so far: 149894
I decided to stop the script manually after it collected around 150k posts even though it didn't reach the maximum depth. It took about 20h to collect that amount of data. I think that's enough for the project.
# Create the main dataframe
posts_df = pd.DataFrame()
# Load all the raw csv files into the main dataframe
for _root, _dirs, files in os.walk(DATA_PATH):
for file in files:
if file.endswith(".csv"):
posts_df = pd.concat([posts_df, pd.read_csv(os.path.join(DATA_PATH, file))], ignore_index=True)
display(posts_df.info())
display(posts_df.sample(5))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 149894 entries, 0 to 149893 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 149894 non-null object 1 score 149894 non-null object 2 id 149894 non-null object 3 url 149894 non-null object 4 num_comments 149894 non-null object 5 created 149894 non-null float64 6 author 149894 non-null object 7 upvote_ratio 149894 non-null float64 8 permalink 149894 non-null object 9 subreddit 149894 non-null object 10 subreddit_subscribers 149894 non-null object 11 depth 149894 non-null object dtypes: float64(2), object(10) memory usage: 13.7+ MB
None
| | title | score | id | url | num_comments | created | author | upvote_ratio | permalink | subreddit | subreddit_subscribers | depth |
---|---|---|---|---|---|---|---|---|---|---|---|---|
88318 | This looks like plastic, feels like plastic, b... | 117971 | kg5yxj | https://v.redd.it/9oq4dntgl4661 | 2390 | 1.608376e+09 | mohiemen | 0.91 | /r/nextfuckinglevel/comments/kg5yxj/this_looks... | nextfuckinglevel | 7785630 | 2 |
38468 | Three Free EE Textbooks | 106 | 9bggew | https://www.reddit.com/r/ECE/comments/9bggew/t... | 16 | 1.535603e+09 | itstimeforanexitplan | 0.99 | /r/ECE/comments/9bggew/three_free_ee_textbooks/ | ECE | 154880 | 3 |
115676 | Failed Attempt by a Security Guard to Stop a F... | 61127 | 97jkm2 | https://i.imgur.com/SLs41rI.gifv | 1106 | 1.534350e+09 | BunyipPouch | 0.92 | /r/sports/comments/97jkm2/failed_attempt_by_a_... | sports | 20617425 | 2 |
99413 | McConnell blocks House bill to reopen governme... | 85236 | agabcf | https://thehill.com/homenews/senate/425414-mcc... | 7360 | 1.547570e+09 | emitremmus27 | 0.85 | /r/politics/comments/agabcf/mcconnell_blocks_h... | politics | 8294111 | 1 |
96225 | PsBattle: A sculpture of a woman made out of w... | 42731 | bueyhn | https://i.redd.it/r72ugyjao5131.jpg | 501 | 1.559138e+09 | fjordfjord | 0.88 | /r/photoshopbattles/comments/bueyhn/psbattle_a... | photoshopbattles | 18270824 | 2 |
The data I decided to collect has the following columns:
- title - title of the post
- score - score of the post
- id - id of the post
- url - url to the content shared in the post (image, video, etc.)
- num_comments - number of comments on the post
- created - timestamp of the post creation
- author - author of the post
- upvote_ratio - ratio of upvotes to downvotes on the post
- permalink - url to the post
- subreddit - subreddit the post was posted in
- subreddit_subscribers - number of subscribers to the subreddit the post was posted in
- depth - depth of the subreddit from the starting subreddit

While creating the networks presented in the next sections of the project I discovered some anomalies. After several hours of investigation I realized that some subreddit names and usernames are identical, which leads to problems with node identification in the network. I decided to add a prefix to subreddit names and usernames to avoid those problems: r/ for subreddits and u/ for usernames. This solves the problem because / is not allowed in subreddit names or usernames.
posts_df["subreddit"] = posts_df["subreddit"].apply(lambda x: f"r/{x}")
posts_df["author"] = posts_df["author"].apply(lambda x: f"u/{x}")
posts_df.sample(5)
| | title | score | id | url | num_comments | created | author | upvote_ratio | permalink | subreddit | subreddit_subscribers | depth |
---|---|---|---|---|---|---|---|---|---|---|---|---|
42861 | Ferrari World looks like a virus infecting the... | 21742 | cqx8gj | https://i.imgur.com/bolY368.jpg | 412 | 1.565909e+09 | u/Ayo-Glam | 0.92 | /r/evilbuildings/comments/cqx8gj/ferrari_world... | r/evilbuildings | 1084637 | 3 |
113295 | Knot (by More and More) | 14219 | 9h55p7 | https://gfycat.com/DefinitiveTepidGalapagosmoc... | 284 | 1.537364e+09 | u/KevlarYarmulke | 0.98 | /r/Simulated/comments/9h55p7/knot_by_more_and_... | r/Simulated | 1233460 | 2 |
101604 | It's already been a year since Neil Peart pass... | 149 | ks67ex | https://youtu.be/EsBNzf5JlZA | 4 | 1.609996e+09 | u/juanp2350 | 0.99 | /r/progrockmusic/comments/ks67ex/its_already_b... | r/progrockmusic | 51318 | 2 |
73178 | I spyk ze engliš very gud. | 758 | flqvzz | https://i.redd.it/6hdtigua5sn41.jpg | 70 | 1.584689e+09 | u/KyouHarisen | 0.98 | /r/lithuania/comments/flqvzz/i_spyk_ze_engliš_... | r/lithuania | 90712 | 2 |
83625 | Extra horsepower won't do any harm - GG | 816 | ic53kb | https://i.redd.it/ulkwj6btmsh51.jpg | 54 | 1.597771e+09 | u/DontKillUncleBen | 0.98 | /r/motogp/comments/ic53kb/extra_horsepower_won... | r/motogp | 289502 | 2 |
num_subreddits = len(posts_df["subreddit"].unique())
num_authors = len(posts_df["author"].unique())
num_posts = len(posts_df)
print(f"Collected {num_posts} posts from {num_subreddits} subreddits and {num_authors} authors")
Collected 149894 posts from 1030 subreddits and 32311 authors
Let's analyze the number of posts collected from each subreddit.
num_posts_per_sub = posts_df.groupby("subreddit").size().reset_index(name="num_posts")
display(num_posts_per_sub.sample(10))
# Plot the density of the number of posts per subreddit
sns.histplot(num_posts_per_sub["num_posts"], stat="density", bins=100, kde=True)
plt.title("Density of the number of posts per subreddit")
plt.xlabel("Number of posts")
plt.ylabel("Density")
plt.show()
| | subreddit | num_posts |
---|---|---|
904 | r/trance | 181 |
450 | r/assassinscreed | 167 |
315 | r/PornhubComments | 47 |
728 | r/netsec | 129 |
590 | r/france | 128 |
729 | r/nevertellmetheodds | 84 |
322 | r/ProgrammingLanguages | 237 |
860 | r/submechanophobia | 74 |
910 | r/trees | 65 |
627 | r/hiphopheads | 114 |
Because we decided to collect 500 posts from each subreddit (NUM_POSTS_FROM_SUB) and then discarded posts from users that have fewer than 2 posts in that subreddit (MIN_TIMES_POSTED), the number of posts per subreddit will always be 500 or less. If a subreddit keeps close to 500 posts after filtering, almost all of its top posts were made by a small group of repeat posters. That would be highly unlikely for popular subreddits visited by many different people, so the subreddits from which we collected close to 500 posts are probably less popular. Let's check that.
In order to do that, let's create a dataframe with the number of subscribers of each subreddit. We take the mean and then round the result because the number of subscribers could have changed during the data collection process.
num_subscribers_per_sub = posts_df.groupby("subreddit").agg("subreddit_subscribers").mean().round().astype(int).reset_index(name="subscribers")
display(num_subscribers_per_sub.sample(10))
| | subreddit | subscribers |
---|---|---|
902 | r/totalwar | 385442 |
632 | r/holdmyjuicebox | 745395 |
763 | r/perfectloops | 667931 |
331 | r/Repsneakers | 754961 |
291 | r/PHP | 156263 |
838 | r/skyrim | 1445471 |
211 | r/Jokes | 25595814 |
226 | r/LateStageCapitalism | 837508 |
863 | r/suspiciouslyspecific | 1257476 |
485 | r/brooklynninenine | 709603 |
Let's plot the number of posts collected from each subreddit against the number of subscribers to that subreddit. We have to use a log scale for the x-axis because the number of subscribers is very skewed. As we cannot have a value of 0 on a log scale, we set the subscriber count to 1 whenever it is 0.
num_posts_and_subscribers_per_sub = num_posts_per_sub.merge(num_subscribers_per_sub, on="subreddit")
num_posts_and_subscribers_per_sub["subscribers"] = num_posts_and_subscribers_per_sub["subscribers"].apply(lambda x: 1 if x == 0 else x)
plt.figure(figsize=(20, 10))
sns.scatterplot(data=num_posts_and_subscribers_per_sub, x="subscribers", y="num_posts")
plt.xscale("log")
plt.title("Number of posts vs number of subscribers for each subreddit")
plt.xlabel("Number of subscribers")
plt.ylabel("Number of posts")
# Set the xticks, taking into account the trick of changing the 0 to 1
plt.xticks([10**i for i in range(9)], ["0", "10", "100", "1k", "10k", "100k", "1M", "10M", '100M'])
# Plot the line to mark the tendency of the data
plt.plot([10**i for i in range(3, 8)], [500/4*(4-i) for i in range(5)], color="red", linestyle="--")
plt.show()
We can see that the hypothesis is somewhat correct. Points in the densest part of the graph tend to align with the red dashed line.
Let's plot the distribution of the number of subscribers to subreddits from which we collected posts. We have already created the dataframe for the previous plot, so we can just use it. The plot below shows why we had to use log scale for the previous plot.
sns.histplot(num_subscribers_per_sub["subscribers"], log_scale=(False, True), stat="density", bins=100)
plt.title("Distribution of the number of subscribers per subreddit")
plt.xlabel("Number of subscribers")
plt.ylabel("Density")
plt.show()
Let's see the distribution of the number of subreddits each user posted in. This is important because it equals the number of edges that will be created for the node representing that user (its degree) in the bipartite network presented in the next section.
num_subs_per_user = posts_df.groupby(["author", "subreddit"]).size().groupby("author").size().sort_values(ascending=False).reset_index(name="num_subs")
display(num_subs_per_user.head(10))
# Plot the density of the number of subreddits per author in log scale
sns.histplot(num_subs_per_user["num_subs"], discrete=True, stat="density", log_scale=(False, True))
plt.xlabel("Number of subreddits")
plt.title("Density of the number of subreddits per user")
plt.show()
| | author | num_subs |
---|---|---|
0 | u/My_Memes_Will_Cure_U | 63 |
1 | u/Master1718 | 60 |
2 | u/memezzer | 49 |
3 | u/KevlarYarmulke | 47 |
4 | u/5_Frog_Margin | 45 |
5 | u/GallowBoob | 40 |
6 | u/Scaulbylausis | 36 |
7 | u/kevinowdziej | 33 |
8 | u/icant-chooseone | 29 |
9 | u/AristonD | 28 |
We can see that user u/My_Memes_Will_Cure_U
is the user that posted in the most subreddits. That means that the node representing that user will have the highest degree among all nodes representing users in the bipartite network.
Let's see the distribution of number of posts collected for each user.
# Plot the distribution of the number of posts per author
num_posts_per_user = posts_df.groupby("author").size().sort_values(ascending=False).reset_index(name="num_posts")
display(num_posts_per_user.head(10))
sns.histplot(num_posts_per_user["num_posts"], stat="density", bins=100)
plt.yscale("log")
plt.title("Density of the number of posts per author")
plt.xlabel("Number of posts")
plt.ylabel("Density")
plt.show()
| | author | num_posts |
---|---|---|
0 | u/SrGrafo | 1077 |
1 | u/GallowBoob | 1069 |
2 | u/Andromeda321 | 775 |
3 | u/Yellyvi | 763 |
4 | u/My_Memes_Will_Cure_U | 725 |
5 | u/Unicornglitteryblood | 516 |
6 | u/pdwp90 | 506 |
7 | u/ZadocPaet | 485 |
8 | u/mvea | 450 |
9 | u/flovringreen | 430 |
We can see that the user u/My_Memes_Will_Cure_U is quite high up on the list. Another interesting observation is the almost empty stretch of the x-axis from around 500 to 1000 posts, followed by a sudden spike. That means that most users are somewhat active, a few users are extremely active on Reddit, and there aren't many users in between.
We can also plot the number of posts each user posted against the number of subreddits each user posted in. In that way we can see if there is any correlation between those two values. We will also plot the line y = x/2
as each user had to post at least 2 (MIN_TIMES_POSTED
) times in a given subreddit to be included in the dataset.
# Plot the number of posts of each user against the number of subreddits they posted in
num_posts_and_subs_per_user = num_posts_per_user.merge(num_subs_per_user, on="author")
plt.figure(figsize=(20, 10))
sns.scatterplot(data=num_posts_and_subs_per_user, x="num_posts", y="num_subs", alpha=0.3)
plt.xscale("log")
plt.title("Number of subreddits vs number of posts for each user")
plt.xlabel("Number of posts")
plt.ylabel("Number of subreddits")
# Plot the reference line y = x/2
x = [i for i in range(num_subs_per_user["num_subs"].max() * 2)]
plt.plot(x, [i/2 for i in x], color="red", linestyle="--")
plt.show()
The number of users that posted in each subreddit will also be important for the bipartite network, as it determines the number of edges that will be created for the node representing that subreddit (its degree). Let's see its distribution. Remember that we discarded users that posted fewer than 2 (MIN_TIMES_POSTED) times in a subreddit and that we collected 500 (NUM_POSTS_FROM_SUB) posts from each subreddit, so the maximum number of users per subreddit is 250 (that would mean each user created exactly 2 posts among those 500 top posts).
num_users_per_sub = posts_df.groupby(["subreddit", "author"]).size().groupby("subreddit").size().sort_values(ascending=False).reset_index(name="num_users")
display(num_users_per_sub.head(10))
sns.histplot(num_users_per_sub["num_users"], stat="density", bins=100)
plt.title("Density of the number of users per subreddit")
plt.xlabel("Number of users")
plt.ylabel("Density")
plt.show()
| | subreddit | num_users |
---|---|---|
0 | r/generative | 93 |
1 | r/Unity2D | 92 |
2 | r/avatartrading | 81 |
3 | r/Cinema4D | 81 |
4 | r/dalmatians | 80 |
5 | r/KTMDuke | 80 |
6 | r/ukraine | 80 |
7 | r/turning | 78 |
8 | r/animation | 78 |
9 | r/Simulated | 78 |
We can see that the subreddit r/generative
is the subreddit from which we collected posts from the highest number of users. That means that the node representing that subreddit will have the highest degree among all nodes representing subreddits in the bipartite network.
We can also notice that the distribution follows a normal distribution quite well, with a mean of around 40. There is, however, a huge spike at the value of 1. That means there are a lot of subreddits from which we collected posts from only one user. That makes sense, as some subreddits have restricted posting permissions and are usually used as the private board of a single user.
It is also worth noting that on big subreddits it is hard to get one's post into the top 500; it is quite common that each of those 500 posts was made by a different user. Because I decided to discard posts of users that posted fewer than 2 times (MIN_TIMES_POSTED) in the subreddit, it is possible that often only one user managed to get 2 of their posts into the top 500. If we had removed that restriction, the distribution would probably have been much heavier on the right side (in the area close to 500).
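As a quick sanity check of the 250-user bound discussed above (a sketch using the constants and the num_users_per_sub dataframe defined earlier):
max_users_per_sub = NUM_POSTS_FROM_SUB // MIN_TIMES_POSTED   # 500 // 2 == 250
assert num_users_per_sub["num_users"].max() <= max_users_per_sub
print(max_users_per_sub, num_users_per_sub["num_users"].max())  # the observed maximum (93) is well below the bound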
The first network I decided to create is a bipartite network of subreddits and users. There will be an edge between a subreddit and a user if there are at least MIN_TIMES_POSTED posts from that user in that subreddit. There is also an edge attribute num_posts that stores the number of posts from that user in that subreddit; it can later be used to calculate the weight of the edge if needed.
# Create a dataframe with all the author-subreddit pairs
user_sub_pairs = posts_df.groupby(["author", "subreddit"]).size().reset_index(name="num_posts")
display(user_sub_pairs.head(10))
# Save the author-subreddit pairs to a csv file that could be imported to Cytoscape
user_sub_pairs.to_csv(f"{NETWORKS_PATH}/bipartite.csv", index=False)
| | author | subreddit | num_posts |
---|---|---|---|
0 | u/--5- | r/india | 2 |
1 | u/--CreativeUsername | r/Physics | 2 |
2 | u/--Fatal-- | r/homelab | 2 |
3 | u/--MVH-- | r/Netherlands | 4 |
4 | u/--Speed-- | r/logodesign | 2 |
5 | u/--UNFLAIRED-- | r/carscirclejerk | 2 |
6 | u/--Yami_Marik-- | r/WatchPeopleDieInside | 3 |
7 | u/--Yami_Marik-- | r/holdmycosmo | 2 |
8 | u/-AMARYANA- | r/Awwducational | 2 |
9 | u/-AMARYANA- | r/Buddhism | 7 |
The next thing I decided to do was to create csv files containing the node attributes that can be used to style the network in Cytoscape. The first file contains the attributes of the subreddits:
- id - id of the node
- is_user - boolean value indicating whether the node is a subreddit or a user; in this case it is always false
- subscribers - number of subscribers of the subreddit
- num_posts - number of posts collected from that subreddit
- num_users - number of users that posted in that subreddit

Because we have already created a dataframe with the number of subscribers of each subreddit, we can just use it to create the csv file.
sub_data = num_subscribers_per_sub.copy()
sub_data = sub_data.merge(num_posts_per_sub, on="subreddit")
sub_data = sub_data.merge(num_users_per_sub, on="subreddit")
sub_data = sub_data.rename(columns={"subreddit": "id"})
sub_data = sub_data.sort_values(by="subscribers", ascending=False)
# Add column `is_user` with value `False` to indicate that the nodes are subreddits
sub_data["is_user"] = False
display(sub_data.head(10))
# Save the dataframe to a csv file
sub_data.to_csv(f"{NETWORKS_PATH}/bipartite_sub_data.csv", index=False)
| | id | subscribers | num_posts | num_users | is_user |
---|---|---|---|---|---|
442 | r/announcements | 202719824 | 138 | 21 | False |
596 | r/funny | 48108476 | 60 | 17 | False |
42 | r/AskReddit | 40249936 | 13 | 4 | False |
604 | r/gaming | 36492322 | 75 | 22 | False |
461 | r/aww | 33655974 | 112 | 31 | False |
273 | r/Music | 32043294 | 145 | 37 | False |
1020 | r/worldnews | 31254656 | 133 | 37 | False |
899 | r/todayilearned | 31041441 | 114 | 39 | False |
720 | r/movies | 30617572 | 252 | 35 | False |
772 | r/pics | 29880182 | 72 | 24 | False |
The second file contains the attributes of the users:
- id - id of the node
- is_user - boolean value indicating whether the node is a subreddit or a user; in this case it is always true
- total_score - total score of all posts of the user
- num_posts - total number of posts of the user

# Create a dataframe with author data
user_data = posts_df.groupby("author")["score"].sum().sort_values(ascending=False).reset_index()
user_data = user_data.merge(num_posts_per_user, on="author")
user_data = user_data.rename(columns={"score": "total_score", "author": "id"})
user_data["is_user"] = True
display(user_data.head(10))
# Save the dataframe to a csv file that could be imported to Cytoscape
user_data.to_csv(f"{NETWORKS_PATH}/bipartite_user_data.csv", index=False)
| | id | total_score | num_posts | is_user |
---|---|---|---|---|
0 | u/My_Memes_Will_Cure_U | 28764321 | 725 | True |
1 | u/beerbellybegone | 24308427 | 260 | True |
2 | u/mvea | 20158958 | 450 | True |
3 | u/GallowBoob | 18798098 | 1069 | True |
4 | u/SrGrafo | 16470408 | 1077 | True |
5 | u/Master1718 | 15458144 | 403 | True |
6 | u/DaFunkJunkie | 15101854 | 202 | True |
7 | u/memezzer | 11689024 | 283 | True |
8 | u/unnaturalorder | 10315964 | 158 | True |
9 | u/kevinowdziej | 9681202 | 187 | True |
I imported the user_sub_pairs dataframe from the previous section into Cytoscape, but because the number of nodes was too big, Cytoscape was not able to calculate even the default (initial) layout; all the nodes were stacked on top of each other. I tried different layouts, but the only one that managed to finish was the circular layout, so I decided to use it for the visualization of the network.
Furthermore, because of the problems with visualization, I decided to create a smaller network that could be visualized with other layouts. After filtering out posts from subreddits at depth 3 and above and posts in subreddits with fewer than 500000 subscribers, we are left with 57800 posts. Cytoscape is able to process the number of nodes created from those posts.
posts_df_filtered = posts_df[posts_df["depth"] <= 2]
print("Num of posts after 'depth <= 2':", len(posts_df_filtered))
posts_df_filtered = posts_df_filtered[posts_df_filtered["subreddit_subscribers"] >= 500000]
print("Num of posts after 'subreddit_subscribers >= 500000':", len(posts_df_filtered))
user_sub_pairs_filtered = posts_df_filtered.groupby(["author", "subreddit"]).size().reset_index(name="num_posts")
display(user_sub_pairs_filtered.head(10))
# Save the author-subreddit pairs to a csv file that could be imported to Cytoscape
user_sub_pairs_filtered.to_csv(f"{NETWORKS_PATH}/bipartite_filtered.csv", index=False)
Num of posts after 'depth <= 2': 121426 Num of posts after 'subreddit_subscribers >= 500000': 57800
| | author | subreddit | num_posts |
---|---|---|---|
0 | u/--5- | r/india | 2 |
1 | u/--CreativeUsername | r/Physics | 2 |
2 | u/--Fatal-- | r/homelab | 2 |
3 | u/--Yami_Marik-- | r/WatchPeopleDieInside | 3 |
4 | u/--Yami_Marik-- | r/holdmycosmo | 2 |
5 | u/-AMARYANA- | r/Awwducational | 2 |
6 | u/-AMARYANA- | r/Buddhism | 7 |
7 | u/-AMARYANA- | r/Futurology | 3 |
8 | u/-AMARYANA- | r/spaceporn | 2 |
9 | u/-ARIDA- | r/photography | 2 |
Below we can see a visualization of the full bipartite network created in Cytoscape. Styles used in the visualization:
- is_user - discrete mapping to colors: blue for True and red for False (red indicates that the node is a subreddit).
- subscribers - continuous mapping to the sizes of the nodes.
- num_posts - continuous mapping to the widths of the edges and their colors (darker and thicker edges indicate a higher number of posts from a user in a subreddit).
- total_score - no mapping. I would like to map total_score to the size of a node when it is a user and subscribers to the size of a node when it is a subreddit, but unfortunately I didn't find a way to do that.

As stated above, the circular layout is the only layout that managed to finish. Nodes were sorted according to their type. We can see that all the edges from blue nodes (users) point to the red ones (subreddits).
Below we can see a visualization of the smaller bipartite network created in Cytoscape. Styles used are the same as in the previous visualization.
Data summary of both networks:
| | | # components | # nodes | # users | # subs | # edges |
|---|---|---|---|---|---|---|
| Unfiltered network | | 51 | 33 341 | 32 311 | 1 030 | 38 054 |
| - | Largest component | 1 | 31 928 | 30 950 | 978 | 36 691 |
| Filtered network | | 38 | 13 166 | 12 707 | 459 | 15 672 |
| - | Largest component | 1 | 12 507 | 12 088 | 419 | 15 050 |
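A minimal sketch of how the figures in this table could be recomputed with networkx from the user_sub_pairs and user_sub_pairs_filtered edge lists defined above (the node prefixes let us tell users from subreddits):
def bipartite_summary(pairs_df):
    # Build the bipartite graph from the author-subreddit edge list
    G = nx.Graph()
    G.add_edges_from(zip(pairs_df["author"], pairs_df["subreddit"]))
    largest = G.subgraph(max(nx.connected_components(G), key=len))
    def counts(H):
        num_users = sum(1 for n in H.nodes() if n.startswith("u/"))
        return H.number_of_nodes(), num_users, H.number_of_nodes() - num_users, H.number_of_edges()
    return nx.number_connected_components(G), counts(G), counts(largest)
print(bipartite_summary(user_sub_pairs))           # expected: (51, (33341, 32311, 1030, 38054), ...)
print(bipartite_summary(user_sub_pairs_filtered))  # expected: (38, (13166, 12707, 459, 15672), ...)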
We can notice that the number of edges is lower than Metcalfe's law for social networks would predict: the number of edges grows roughly linearly with the number of nodes (~N) rather than like N*log(N).
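To make the comparison concrete, a quick back-of-the-envelope check using the unfiltered network figures from the table above:
N, E = 33341, 38054
print(E / N)              # ~1.14, so edges grow roughly linearly with the number of nodes
print(N * math.log(N))    # ~3.5e5, already an order of magnitude more than the observed edges
print(N * (N - 1) // 2)   # ~5.6e8 possible edges in a complete graph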
From the visualization of the filtered network we can see that it is mainly disassortative by degree (a lot of star-like structures).
Other metrics relevant for the bipartite network (maximum degree of each partition, degree distributions across users and subreddits, etc.) were calculated in the previous section.
Metrics such as average degree, average clustering coefficient, average path length, etc. don't make much sense for bipartite networks and won't be analyzed here. I'll focus on them in the next sections, where we analyze the users and subreddits projections.
Let's create the networkx graph of the full network.
G_bipartite = nx.Graph()
# Add nodes to the graph marking their partitions
for row in user_data.iterrows():
G_bipartite.add_node(
row[1]["id"],
bipartite="user",
total_score=row[1]["total_score"],
num_posts=row[1]["num_posts"],
)
for row in sub_data.iterrows():
G_bipartite.add_node(
row[1]["id"],
bipartite="sub",
subscribers=row[1]["subscribers"],
num_posts=row[1]["num_posts"],
num_users=row[1]["num_users"],
)
# Add edges to the graph
for row in user_sub_pairs.iterrows():
G_bipartite.add_edge(
row[1]["author"], row[1]["subreddit"], num_posts=row[1]["num_posts"]
)
# Add degree as a node attribute
for node in G_bipartite.nodes():
G_bipartite.nodes[node]["degree"] = G_bipartite.degree[node]
# Check if the graph is indeed bipartite
print(nx.is_bipartite(G_bipartite))
True
Let's create the users projection of the network. The projection is a graph where nodes are users and an edge is created between two users if they have at least one subreddit in common. That way we get a network with nodes of a single type, which allows us to analyze additional metrics and compare against model networks.
To create a projection we could use the built-in networkx function projected_graph, but it doesn't record the number of common subreddits between users. Instead we will use the weighted_projected_graph function, which stores that count as the edge weight.
# Create a projection
users_nodes = [node for node in G_bipartite.nodes() if G_bipartite.nodes[node]["bipartite"] == "user"]
G_users = nx.bipartite.weighted_projected_graph(G_bipartite, users_nodes, ratio=False)
print("Number of nodes:", G_users.number_of_nodes())
print("Number of edges:", G_users.number_of_edges())
print("\nSample node:")
print(list(G_users.nodes(data=True))[0])
print("\nSample edge:")
print(list(G_users.edges(data=True))[0])
Number of nodes: 32311 Number of edges: 870230 Sample node: ('u/My_Memes_Will_Cure_U', {'bipartite': 'user', 'total_score': 28764321, 'num_posts': 725, 'degree': 63}) Sample edge: ('u/My_Memes_Will_Cure_U', 'u/Rredite', {'weight': 1})
# Rename the edge attribute 'weight' to 'num_common_subs'
for edge in list(G_users.edges()):
G_users.edges[edge]["num_common_subs"] = G_users.edges[edge].pop("weight")
num_common_subs = [G_users.edges[edge]["num_common_subs"] for edge in G_users.edges()]
# Plot the distribution of the number of common subs
sns.histplot(num_common_subs, stat="density", discrete=True)
plt.yscale("log")
plt.title("Distribution of the number of common subreddits between users")
plt.xlabel("Number of common subreddits")
plt.ylabel("Density")
plt.minorticks_on()
plt.show()
We can see that the distribution of the number of common subreddits is very 1-heavy: the vast majority of connected user pairs share only one subreddit. Because of that, this property will not be useful for styling the edges of the network.
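To quantify that claim, a one-line check (a sketch using the num_common_subs list computed above):
print(f"{np.mean(np.array(num_common_subs) == 1):.1%} of edges have exactly one common subreddit")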
# Save edgelist to csv file
with open(f"{NETWORKS_PATH}/users.csv", "w") as f:
writer = csv.writer(f, delimiter=",", lineterminator="\n")
writer.writerow(["source", "target", "num_common_subs"])
for edge in G_users.edges(data=True):
writer.writerow([edge[0], edge[1], edge[2]["num_common_subs"]])
This network has a comparable number of nodes to the bipartite network, but the number of edges is an order of magnitude larger. The projection is much denser, and because of that Cytoscape had even more issues dealing with it.
That is why I tried a different tool called Gephi. It is a tool for network visualization and analysis similar to Cytoscape, but according to the documentation and other sources found online it can handle much larger networks. Despite that, I still had performance issues and couldn't work with or style the network in the desired way. After spending some time looking for solutions I found out that the issue was not with the tool itself (it should handle a network of this size) but with the available resources on my computer. Maybe in the future I will try to visualize the whole network on a more powerful machine.
Below is the only visualization I was able to create with Gephi. It uses the OpenOrd layout algorithm, which was created specifically for visualizing large networks.
In order to be able to see the individual links, the color of the nodes was mapped continuously to the number of posts collected from each user, and then the color of each edge was set to the color of its source node. The styling doesn't convey much information, but at least it lets us see some of the connections.
Note that the clearly visible dots at the edges of the network are not individual nodes but clusters of many of them. Below you can see a zoomed-in view of some nodes at the edge of the graph.
Below you can see a zoomed-in view of some nodes at the center of the graph.
# Count the number of connected components in the graph
users_components = list(nx.connected_components(G_users))
print("Number of connected components:", len(users_components))
Number of connected components: 51
# Identify the largest connected component
users_components_sorted = sorted(users_components, key=len, reverse=True)
G_users_lcc = G_users.subgraph(users_components_sorted[0])
G_users_2nd_lcc = G_users.subgraph(users_components_sorted[1])
num_edges_complete_graph = G_users.number_of_nodes() * (G_users.number_of_nodes() - 1) / 2
components = [G_users, G_users_lcc, G_users_2nd_lcc]
components_data = []
for component in components:
components_data.append(
[
component.number_of_nodes(),
component.number_of_edges(),
round(component.number_of_nodes() / G_users.number_of_nodes() * 100, 4),
round(component.number_of_edges() / G_users.number_of_edges() * 100, 4),
round(component.number_of_edges() / num_edges_complete_graph * 100, 4),
]
)
table = [
[
"",
"# nodes",
"# edges",
f"node % of\nthe network",
f"edge % of\nthe network",
f"edge % of the\ncomplete graph"
],
["Network", *components_data[0]],
["LC", *components_data[1]],
["2nd LC", *components_data[2]],
]
print(tabulate(table, headers="firstrow", tablefmt="fancy_grid"))
╒═════════╤═══════════╤═══════════╤═══════════════╤═══════════════╤══════════════════╕ │ │ # nodes │ # edges │ node % of │ edge % of │ edge % of the │ │ │ │ │ the network │ the network │ complete graph │ ╞═════════╪═══════════╪═══════════╪═══════════════╪═══════════════╪══════════════════╡ │ Network │ 32311 │ 870230 │ 100 │ 100 │ 0.1667 │ ├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤ │ LC │ 30950 │ 843076 │ 95.7878 │ 96.8797 │ 0.1615 │ ├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤ │ 2nd LC │ 76 │ 2850 │ 0.2352 │ 0.3275 │ 0.0005 │ ╘═════════╧═══════════╧═══════════╧═══════════════╧═══════════════╧══════════════════╛
We can see that the largest connected component contains the vast majority of nodes and is a good representation of the whole network, so analysis will be focused only on it. This will also allow us to compute statistics that are only defined for connected graphs.
Let's create a function that will allow us to easily calculate the degree densities for a given graph. It is needed because histograms are hard to interpret when comparing degree distributions.
def calculate_degree_densities(G: nx.Graph) -> pd.DataFrame:
degrees = [G.degree[node] for node in G.nodes()]
# Count the number of nodes with each degree
degree_counts = defaultdict(int)
for degree in degrees:
degree_counts[degree] += 1
# Create a dataframe with the degree and the number of nodes with that degree
df = pd.DataFrame.from_dict(degree_counts, orient="index", columns=["count"]).reset_index()
df = df.rename(columns={"index": "degree"})
# Calculate the density of each degree
df["density"] = df["count"] / G.number_of_nodes()
return df
# Calculate the average degree of the largest connected component
G_users_lcc_avg_degree = sum([G_users_lcc.degree[node] for node in G_users_lcc.nodes()]) / G_users_lcc.number_of_nodes()
G_users_lcc_degrees = calculate_degree_densities(G_users_lcc)
# Plot the degree distribution on a scatter plot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of the largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")
# Plot the average degree as a vertical line
plt.axvline(x=G_users_lcc_avg_degree, color="red", linestyle="--", label=f"Average degree: {round(G_users_lcc_avg_degree, 2)}")
plt.legend()
plt.show()
The distribution doesn't look like any of the distributions we have seen in class. Let's also see how it looks when we plot it on a linear scale.
# Plot the degree distribution on a scatter plot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density")
plt.title("Degree distribution of the largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.show()
That's really interesting. It looks like the distribution is a mixture of two distributions: one is a power law and the other is some distribution with a single peak.
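One way to back up the power-law impression would be to fit the tail explicitly, e.g. with the third-party powerlaw package (not used elsewhere in this notebook, so this is only a sketch):
import powerlaw
degrees = [d for _, d in G_users_lcc.degree()]
fit = powerlaw.Fit(degrees, discrete=True)        # estimates xmin and the tail exponent alpha
print(fit.power_law.alpha, fit.power_law.xmin)
R, p = fit.distribution_compare("power_law", "lognormal")
print(R, p)                                       # R > 0 favours the power law over a lognormal fit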
Let's see how it compares to the Erdos-Renyi random graph with the same number of nodes and edges.
# Create Erdos-Renyi random graph with the same number of nodes and edges as the largest connected component
num_edges_complete_graph = G_users_lcc.number_of_nodes() * (G_users_lcc.number_of_nodes() - 1) / 2
G_users_lcc_ER = nx.erdos_renyi_graph(G_users_lcc.number_of_nodes(), G_users_lcc.number_of_edges() / num_edges_complete_graph, seed=42)
G_users_lcc_ER_degrees = calculate_degree_densities(G_users_lcc_ER)
# Plot both degree distributions on a scatter plot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density", label="Users largest component")
sns.scatterplot(data=G_users_lcc_ER_degrees, x="degree", y="density", label="ER random graph")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of users largest connected component vs Erdos-Renyi random graph")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.legend()
plt.show()
We can see that the degree distribution of the users projection is very different from the one of the Erdos-Renyi random graph.
Even though I suspect that the Watts-Strogatz model will be even more different, I will still plot it for comparison.
Let's see how it compares to the WS model with the same number of nodes and edges. To achieve that, each node should initially be connected to its k nearest neighbors, where k is the average degree of the users projection. I will plot degree distributions for several values of p (the probability of rewiring each edge). Note that the higher the value of p, the more edges get rewired and the more the network resembles the Erdos-Renyi random graph.
# Create a Watts-Strogatz small-world graph with the same number of nodes and edges as the largest connected component
values_of_p = [0.01, 0.1, 0.5]
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density", label="Users largest component")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of users largest connected component vs Watts-Strogatz small-world graph")
plt.xlabel("Degree")
plt.ylabel("Density")
for i, p in enumerate(values_of_p):
G_users_lcc_WS = nx.watts_strogatz_graph(G_users_lcc.number_of_nodes(), k=round(G_users_lcc_avg_degree), p=p, seed=42)
G_users_lcc_WS_degrees = calculate_degree_densities(G_users_lcc_WS)
# Decided to use a line plot instead of a scatter plot to make it easier to see the difference between graphs
sns.lineplot(data=G_users_lcc_WS_degrees, x="degree", y="density", label=f"WS random graph (p={p})", color=f"C{i+1}")
plt.legend()
plt.show()
As suspected, the degree distribution of the users projection is even more different from the one of the Watts-Strogatz model.
Let's see how it compares to the Barabasi-Albert model. The Barabasi-Albert model is a model of a scale-free network with preferential attachment: new nodes are added to the network one by one and each new node is connected to m existing nodes. In order to achieve a number of edges similar to the users projection, the parameter m should be set to avg_degree/2.
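A quick check of that choice (a sketch; networkx's BA graph with n nodes and parameter m ends up with m*(n - m) edges):
n = G_users_lcc.number_of_nodes()
m = round(G_users_lcc_avg_degree / 2)
print(m * (n - m), G_users_lcc.number_of_edges())  # ~835k expected edges vs 843076 edges in the real LCC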
# Create a Barabasi-Albert scale-free graph with the same number of nodes and edges as the largest connected component
G_users_lcc_BA = nx.barabasi_albert_graph(G_users_lcc.number_of_nodes(), m=round(G_users_lcc_avg_degree/2), seed=42)
G_users_lcc_BA_degrees = calculate_degree_densities(G_users_lcc_BA)
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density", alpha=0.5, label="Users largest component")
sns.scatterplot(data=G_users_lcc_BA_degrees, x="degree", y="density", alpha=0.7, label="BA random graph", marker="x")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of users largest connected component vs Barabasi-Albert scale-free graph")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.legend()
plt.show()
Finally we can notice some similarities! The degree distribution for nodes with degree higher than 100 is very similar to that of the Barabasi-Albert graph.
I suspect that if we had run the data collection process for longer and managed to reach subreddits at much greater depths, the degree distribution of the whole network would approach that of the Barabasi-Albert model.
I have tried to calculate other metrics (such as node betweenness, average clustering coefficient, and average path length) for the network and the created models, but they were taking too long to compute for such large networks. Because of that, I decided to skip them for the users projection and focus on them in the next section, where we will analyze the subreddits projection.
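If these metrics were needed for the users projection anyway, sampling-based approximations would be one option (a sketch, not used for any results in this notebook; it still takes a few minutes):
# Approximate betweenness with 500 pivot nodes instead of all ~31k
approx_betweenness = nx.betweenness_centrality(G_users_lcc, k=500, seed=42)
# Approximate clustering and path length over a random sample of source nodes
sample = random.sample(list(G_users_lcc.nodes()), 500)
approx_clustering = nx.average_clustering(G_users_lcc, nodes=sample)
approx_path_length = np.mean([
    np.mean([d for d in nx.single_source_shortest_path_length(G_users_lcc, node).values() if d > 0])
    for node in sample
])
print(approx_clustering, approx_path_length)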
Let's create the subreddits projection of the network. The projection is a graph where nodes are subreddits and an edge is created between two subreddits if they have at least one user in common. That way we get a network with nodes of a single type, which allows us to analyze additional metrics and compare against model networks.
# Create a projection
subreddits_nodes = [node for node in G_bipartite.nodes() if G_bipartite.nodes[node]["bipartite"] == "sub"]
G_subreddits = nx.bipartite.weighted_projected_graph(G_bipartite, subreddits_nodes, ratio=False)
print(f"Number of nodes: {G_subreddits.number_of_nodes()}")
print(f"Number of edges: {G_subreddits.number_of_edges()}")
print("\nSample node:")
print(list(G_subreddits.nodes(data=True))[0])
print("\nSample edge:")
print(list(G_subreddits.edges(data=True))[0])
Number of nodes: 1030 Number of edges: 14920 Sample node: ('r/announcements', {'bipartite': 'sub', 'subscribers': 202719824, 'num_posts': 138, 'num_users': 21, 'degree': 21}) Sample edge: ('r/announcements', 'r/ModSupport', {'weight': 4})
# Rename edge attributes 'weight' to 'common_users'
for edge in G_subreddits.edges():
G_subreddits.edges[edge]["common_users"] = G_subreddits.edges[edge].pop("weight")
num_common_users = [G_subreddits.edges[edge]["common_users"] for edge in G_subreddits.edges()]
# Plot the distribution of the number of common users
sns.histplot(num_common_users, stat="density", discrete=True)
plt.yscale("log")
plt.title("Distribution of the number of common users")
plt.xlabel("Number of common users")
plt.ylabel("Density")
plt.minorticks_on()
plt.show()
Compared to the users projection, the distribution of the number of common users is less skewed.
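The same quick check as for the users projection (a sketch using the num_common_users list above):
print(f"{np.mean(np.array(num_common_users) == 1):.1%} of edges have exactly one common user")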
# Save edgelist to a csv file
with open(f"{NETWORKS_PATH}/subreddits.csv", "w") as f:
writer = csv.writer(f, delimiter=",", lineterminator="\n")
writer.writerow(["source", "target", "common_users"])
for edge in G_subreddits.edges(data=True):
writer.writerow([edge[0], edge[1], edge[2]["common_users"]])
This network is much smaller and will be much easier to visualize. Nevertheless, I will use Gephi instead of Cytoscape, as I think it provides more interesting layouts.
Below is the visualization of the subreddits projection. It uses the Fruchterman-Reingold algorithm for calculating the layout.
Styles used for the visualization:
- subscribers - continuous mapping to the sizes of nodes.
- num_users - continuous mapping to the colors of nodes (white -> low, red -> high).
- common_users - continuous mapping to the colors of edges (yellow -> low, red -> high).
- common_users - continuous mapping to the widths of edges.

We can see that there is a dense core of subreddits with a lot of connections between them. The core mainly consists of subreddits with a high number of users. That makes sense: the more users a subreddit has, the more likely it is that one of them also posted in another subreddit.
We can also notice that there are some subreddits with a high number of users that aren't that well connected. It could be because I stopped the data collection process at an early stage, and the poorly connected subreddits are the ones whose users I haven't collected data for. Let's check that by styling the nodes according to their depth.
sub_depths_df = pd.DataFrame.from_dict(sub_depths, orient="index", columns=["depth"]).reset_index().sort_values("depth", ascending=True)
sub_depths_df = sub_depths_df.rename(columns={"index": "id"})
# Add 'r/' to the beginning of the subreddit names
sub_depths_df["id"] = sub_depths_df["id"].apply(lambda x: f"r/{x}")
# Remove subreddits that are not nodes of G_subreddits
sub_depths_df = sub_depths_df[sub_depths_df["id"].isin(G_subreddits.nodes())]
display(sub_depths_df.head())
sub_depths_df.to_csv(f"{NETWORKS_PATH}/sub_depths.csv", index=False)
| | id | depth |
---|---|---|
0 | r/programming | 0 |
25 | r/pics | 1 |
26 | r/gifs | 1 |
27 | r/funny | 1 |
28 | r/WeatherGifs | 1 |
Below we can see the same visualization as above, but with the depth of subreddits mapped to the colors:
I also decided to set the size of nodes to a constant value, so it is easier to see the differences in the colors.
subs_components = list(nx.connected_components(G_subreddits))
subs_components.sort(key=len, reverse=True)
print(f"Number of connected components: {len(subs_components)}")
Number of connected components: 51
As expected, the number of connected components is the same as for the users projection. That's because both projections are created from the same bipartite graph.
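A quick consistency check (a sketch): the bipartite graph itself should report the same number of components, since each of its components gives rise to exactly one component in each projection.
print(nx.number_connected_components(G_bipartite))  # expected: 51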
# Identify the largest connected component
G_subs_lcc = G_subreddits.subgraph(subs_components[0])
G_subs_2nd_lcc = G_subreddits.subgraph(subs_components[1])
num_edges_complete_graph = G_subreddits.number_of_nodes() * (G_subreddits.number_of_nodes() - 1) / 2
components = [G_subreddits, G_subs_lcc, G_subs_2nd_lcc]
components_data = []
for component in components:
components_data.append(
[
component.number_of_nodes(),
component.number_of_edges(),
round(component.number_of_nodes() / G_subreddits.number_of_nodes() * 100, 4),
round(component.number_of_edges() / G_subreddits.number_of_edges() * 100, 4),
round(component.number_of_edges() / num_edges_complete_graph * 100, 4),
]
)
table = [
[
"",
"# nodes",
"# edges",
f"node % of\nthe network",
f"edge % of\nthe network",
f"edge % of the\ncomplete graph"
],
["Network", *components_data[0]],
["LC", *components_data[1]],
["2nd LC", *components_data[2]],
]
print(tabulate(table, headers="firstrow", tablefmt="fancy_grid"))
╒═════════╤═══════════╤═══════════╤═══════════════╤═══════════════╤══════════════════╕ │ │ # nodes │ # edges │ node % of │ edge % of │ edge % of the │ │ │ │ │ the network │ the network │ complete graph │ ╞═════════╪═══════════╪═══════════╪═══════════════╪═══════════════╪══════════════════╡ │ Network │ 1030 │ 14920 │ 100 │ 100 │ 2.8154 │ ├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤ │ LC │ 978 │ 14918 │ 94.9515 │ 99.9866 │ 2.8151 │ ├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤ │ 2nd LC │ 2 │ 1 │ 0.1942 │ 0.0067 │ 0.0002 │ ╘═════════╧═══════════╧═══════════╧═══════════════╧═══════════════╧══════════════════╛
As with the users projection, I will focus only on the largest component, as it is a good representation of the whole network (~95% of nodes).
G_subs_lcc_avg_degree = sum([G_subs_lcc.degree(node) for node in G_subs_lcc.nodes()]) / G_subs_lcc.number_of_nodes()
G_subs_lcc_degrees = calculate_degree_densities(G_subs_lcc)
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_subs_lcc_degrees, x="degree", y="density")
plt.title("Degree distribution of subreddits largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.axvline(x=G_subs_lcc_avg_degree, color="red", linestyle="--", label=f"Average degree: {round(G_subs_lcc_avg_degree, 2)}")
plt.legend()
plt.show()
The graph looks promising for the Barabasi-Albert model as it seems to follow the power law distribution. Let's plot it in log-log scale to confirm that.
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_subs_lcc_degrees, x="degree", y="density")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of subreddits largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.axvline(x=G_subs_lcc_avg_degree, color="red", linestyle="--", label=f"Average degree: {round(G_subs_lcc_avg_degree, 2)}")
plt.legend()
plt.show()
It is not very clear. Let's try comparing it with the distribution of a random Barabasi-Albert graph with the same number of nodes and edges. The choice of the value of the parameter m is explained in the same way as for the users projection.
G_subs_lcc_BA = nx.barabasi_albert_graph(G_subs_lcc.number_of_nodes(), m=round(G_subs_lcc_avg_degree / 2), seed=42)
G_subs_lcc_BA_degrees = calculate_degree_densities(G_subs_lcc_BA)
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_subs_lcc_degrees, x="degree", y="density", label="Subreddits largest component")
sns.scatterplot(data=G_subs_lcc_BA_degrees, x="degree", y="density", label="BA random graph", marker="x")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of subreddits largest connected component vs Barabasi-Albert scale-free graph")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.legend()
plt.show()
Plotting the distributions refutes my suspicion. So the subreddits projection, similarly to the users projection, doesn't follow any of the models we have seen in class.
Let's see how the clustering coefficient and the average path length of the projection compares to other models.
# Create remaining models
G_subs_lcc_ER = nx.erdos_renyi_graph(
G_subs_lcc.number_of_nodes(),
G_subs_lcc.number_of_edges() / (G_subs_lcc.number_of_nodes() * (G_subs_lcc.number_of_nodes() - 1) / 2),
seed=42
)
G_subs_lcc_WS_01 = nx.watts_strogatz_graph(
G_subs_lcc.number_of_nodes(),
round(G_subs_lcc_avg_degree),
0.1,
seed=42
)
G_subs_lcc_WS_05 = nx.watts_strogatz_graph(
G_subs_lcc.number_of_nodes(),
round(G_subs_lcc_avg_degree),
0.5,
seed=42
)
G_sub_lcc_models = {
"Subreddits LCC": G_subs_lcc,
"Erdos-Renyi": G_subs_lcc_ER,
"Barabasi-Albert": G_subs_lcc_BA,
"Watts-Strogatz (p=0.1)": G_subs_lcc_WS_01,
"Watts-Strogatz (p=0.5)": G_subs_lcc_WS_05,
}
# Calculate clustering and average shortest path for each model
models_data = []
for model_name, model in G_sub_lcc_models.items():
models_data.append(
{
"model": model_name,
"clustering": nx.average_clustering(model),
"avg_shortest_path": nx.average_shortest_path_length(model),
}
)
models_df = pd.DataFrame(models_data)
display(models_df)
| | model | clustering | avg_shortest_path |
---|---|---|---|
0 | Subreddits LCC | 0.395263 | 3.057360 |
1 | Erdos-Renyi | 0.030906 | 2.349040 |
2 | Barabasi-Albert | 0.081325 | 2.317469 |
3 | Watts-Strogatz (p=0.1) | 0.533901 | 2.792797 |
4 | Watts-Strogatz (p=0.5) | 0.111868 | 2.430817 |
Let's make sure that my assumption about the model parameters is correct and that I indeed created networks with a similar number of edges.
# Print num edges for each model
for model_name, model in G_sub_lcc_models.items():
print(f"{model_name}: {model.number_of_edges()}")
Subreddits LCC: 14918 Erdos-Renyi: 14794 Barabasi-Albert: 14445 Watts-Strogatz (p=0.1): 14670 Watts-Strogatz (p=0.5): 14670
# Plot clustering and average shortest path
plt.figure(figsize=(12,10))
sns.barplot(data=models_df, x="model", y="clustering")
plt.title("Clustering coefficient of the models")
plt.xlabel("Model")
plt.ylabel("Clustering coefficient")
plt.grid(axis="y", alpha=0.5)
plt.show()
plt.figure(figsize=(12,10))
sns.barplot(data=models_df, x="model", y="avg_shortest_path")
plt.title("Average shortest path of the models")
plt.xlabel("Model")
plt.ylabel("Average shortest path")
plt.grid(axis="y", alpha=0.5)
plt.show()
When it comes to the clustering coefficient and the average path length, the profile of the subreddits projection is most similar to that of the Watts-Strogatz (p=0.1) model.
The network shows a relatively high clustering coefficient, but its average path length is the highest of all the compared networks. That makes the subreddits projection the weakest example of a small-world network among them.
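To put those two numbers together, we can compute a rough small-world coefficient sigma = (C/C_rand) / (L/L_rand) from the table above, using the Erdos-Renyi model as the random reference (a sketch; values come from models_df):
C, L = models_df.loc[0, ["clustering", "avg_shortest_path"]]
C_rand, L_rand = models_df.loc[1, ["clustering", "avg_shortest_path"]]
print(round((C / C_rand) / (L / L_rand), 2))  # well above 1: clustering is ~13x the ER baseline while paths are only ~1.3x longer
So despite having the longest average paths among the compared graphs, the projection still shows clear small-world characteristics thanks to its high clustering.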
Let's find the most central subreddits in the network. I will use the following metrics:
- Degree Centrality - the number of neighbors of a node.
- Betweenness Centrality - the number of shortest paths that pass through a node.
- Closeness Centrality - the inverse of the average distance to all other nodes.
- Eigenvector Centrality - the sum of the centrality scores of the neighbors of a node.

# Calculate node centralities
G_subs_lcc_centrality = {
"degree": nx.degree_centrality(G_subs_lcc),
"closeness": nx.closeness_centrality(G_subs_lcc),
"betweenness": nx.betweenness_centrality(G_subs_lcc),
"eigenvector": nx.eigenvector_centrality(G_subs_lcc),
}
# Convert to dataframe
G_subs_lcc_centrality_df = pd.DataFrame(G_subs_lcc_centrality)
# Add average column
G_subs_lcc_centrality_df["average"] = G_subs_lcc_centrality_df.mean(axis=1)
# Change index to column and rename to 'subreddit'
G_subs_lcc_centrality_df.reset_index(inplace=True)
G_subs_lcc_centrality_df.rename(columns={"index": "subreddit"}, inplace=True)
display(G_subs_lcc_centrality_df.head())
| | subreddit | degree | closeness | betweenness | eigenvector | average |
---|---|---|---|---|---|---|
0 | r/announcements | 0.012282 | 0.323189 | 0.000304 | 0.000765 | 0.084135 |
1 | r/funny | 0.054248 | 0.392685 | 0.012252 | 0.026350 | 0.121384 |
2 | r/AskReddit | 0.002047 | 0.293041 | 0.000029 | 0.000286 | 0.073851 |
3 | r/gaming | 0.088025 | 0.421484 | 0.023730 | 0.048522 | 0.145440 |
4 | r/aww | 0.133060 | 0.435189 | 0.007469 | 0.078944 | 0.163666 |
# Display top 5 subreddits for each centrality
for centrality in G_subs_lcc_centrality_df.columns[1:]:
print(f"Top 5 subreddits by {centrality} centrality:")
display(G_subs_lcc_centrality_df.sort_values(by=centrality, ascending=False).head(5))
Top 5 subreddits by degree centrality:
| | subreddit | degree | closeness | betweenness | eigenvector | average |
---|---|---|---|---|---|---|
358 | r/wholesomegifs | 0.216991 | 0.486312 | 0.013632 | 0.116376 | 0.208328 |
139 | r/BetterEveryLoop | 0.209826 | 0.480098 | 0.014909 | 0.114699 | 0.204883 |
71 | r/BeAmazed | 0.204708 | 0.475889 | 0.014100 | 0.112649 | 0.201837 |
301 | r/gifsthatkeepongiving | 0.201638 | 0.469260 | 0.008975 | 0.114427 | 0.198575 |
471 | r/blackpeoplegifs | 0.201638 | 0.469035 | 0.007927 | 0.113781 | 0.198095 |
Top 5 subreddits by closeness centrality:
subreddit | degree | closeness | betweenness | eigenvector | average | |
---|---|---|---|---|---|---|
358 | r/wholesomegifs | 0.216991 | 0.486312 | 0.013632 | 0.116376 | 0.208328 |
139 | r/BetterEveryLoop | 0.209826 | 0.480098 | 0.014909 | 0.114699 | 0.204883 |
71 | r/BeAmazed | 0.204708 | 0.475889 | 0.014100 | 0.112649 | 0.201837 |
110 | r/Eyebleach | 0.194473 | 0.472894 | 0.007956 | 0.110669 | 0.196498 |
152 | r/educationalgifs | 0.189355 | 0.470843 | 0.010917 | 0.107165 | 0.194570 |
Top 5 subreddits by betweenness centrality:
subreddit | degree | closeness | betweenness | eigenvector | average | |
---|---|---|---|---|---|---|
3 | r/gaming | 0.088025 | 0.421484 | 0.023730 | 0.048522 | 0.145440 |
43 | r/technology | 0.100307 | 0.434415 | 0.023059 | 0.022403 | 0.145046 |
14 | r/memes | 0.042989 | 0.379270 | 0.019108 | 0.013435 | 0.113700 |
460 | r/technews | 0.089048 | 0.425894 | 0.017507 | 0.021620 | 0.138517 |
846 | r/redesign | 0.035824 | 0.373043 | 0.015943 | 0.006659 | 0.107867 |
Top 5 subreddits by eigenvector centrality:
subreddit | degree | closeness | betweenness | eigenvector | average | |
---|---|---|---|---|---|---|
358 | r/wholesomegifs | 0.216991 | 0.486312 | 0.013632 | 0.116376 | 0.208328 |
139 | r/BetterEveryLoop | 0.209826 | 0.480098 | 0.014909 | 0.114699 | 0.204883 |
301 | r/gifsthatkeepongiving | 0.201638 | 0.469260 | 0.008975 | 0.114427 | 0.198575 |
471 | r/blackpeoplegifs | 0.201638 | 0.469035 | 0.007927 | 0.113781 | 0.198095 |
71 | r/BeAmazed | 0.204708 | 0.475889 | 0.014100 | 0.112649 | 0.201837 |
Top 5 subreddits by average centrality:
subreddit | degree | closeness | betweenness | eigenvector | average | |
---|---|---|---|---|---|---|
358 | r/wholesomegifs | 0.216991 | 0.486312 | 0.013632 | 0.116376 | 0.208328 |
139 | r/BetterEveryLoop | 0.209826 | 0.480098 | 0.014909 | 0.114699 | 0.204883 |
71 | r/BeAmazed | 0.204708 | 0.475889 | 0.014100 | 0.112649 | 0.201837 |
301 | r/gifsthatkeepongiving | 0.201638 | 0.469260 | 0.008975 | 0.114427 | 0.198575 |
471 | r/blackpeoplegifs | 0.201638 | 0.469035 | 0.007927 | 0.113781 | 0.198095 |
Out of the 4 centrality metrics, r/wholesomegifs is the most central node in 3 of them.
At the beginning, I suspected that the most central subreddits would be the ones with the highest number of users. Let's check if that's the case.
G_subs_lcc_centrality_users = G_subs_lcc_centrality_df.merge(num_users_per_sub, on="subreddit")
# Calculate the mean centrality for each number of users
average_centrailties_per_num_of_users = G_subs_lcc_centrality_users.groupby("num_users").mean(numeric_only=True)
# Plot centrality vs number of users
fig, axes = plt.subplots(nrows=5, ncols=1, figsize=(12, 20))
fig.tight_layout(pad=3.0)
scatter_params = {
"alpha": 0.3,
"x": "num_users",
"data": G_subs_lcc_centrality_users,
}
line_params = {
"x": "num_users",
"label": "Average",
"data": average_centrailties_per_num_of_users,
"color": "C1"
}
sns.scatterplot(y="degree", ax=axes[0], **scatter_params)
sns.lineplot(y="degree", ax=axes[0], **line_params)
axes[0].set_title("Degree centrality")
axes[0].set_ylabel("Centrality")
sns.scatterplot(y="closeness", ax=axes[1], **scatter_params)
sns.lineplot(y="closeness", ax=axes[1], **line_params)
axes[1].set_title("Closeness centrality")
axes[1].set_ylabel("Centrality")
sns.scatterplot(y="betweenness", ax=axes[2], **scatter_params)
sns.lineplot(y="betweenness", ax=axes[2], **line_params)
axes[2].set_title("Betweenness centrality")
axes[2].set_ylabel("Centrality")
sns.scatterplot(y="eigenvector", ax=axes[3], **scatter_params)
sns.lineplot(y="eigenvector", ax=axes[3], **line_params)
axes[3].set_title("Eigenvector centrality")
axes[3].set_ylabel("Centrality")
sns.scatterplot(y="average", ax=axes[4], **scatter_params)
sns.lineplot(y="average", ax=axes[4], **line_params)
axes[4].set_title("Average centrality")
axes[4].set_ylabel("Centrality")
plt.show()
We can see that none of the centralities are correlated with the number of users. That might seem surprising at first, but if we look once again at the plot of the distribution of the number of subreddits per user:
we can notice that the vast majority of users are connected to only one subreddit, and the number of subreddits a user is connected to determines how many new connections that user contributes to a subreddit in the subreddits projection. So, in the context of centrality, it is more important for a subreddit to have a few users who have posted in many subreddits than to have many users who have posted in only one (as the latter does not create new connections for the subreddit in the projection).
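As an optional sanity check of that claim (my addition, not in the original analysis), one could compute rank correlations between each centrality and the number of users, using the merged dataframe from above:
# Spearman rank correlation between each centrality and the number of users
# (a sketch; relies on the G_subs_lcc_centrality_users dataframe defined above).
for centrality in ["degree", "closeness", "betweenness", "eigenvector", "average"]:
    rho = G_subs_lcc_centrality_users[centrality].corr(G_subs_lcc_centrality_users["num_users"], method="spearman")
    print(f"Spearman correlation ({centrality} vs num_users): {rho:.3f}")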
Just to be sure, let's check how r/wholesomegifs ranks among other subreddits in terms of the number of users, number of subscribers, and number of posts collected.
# Save values of statistics of r/wholesomegifs
wholsomegifs_nums = {
"num_users": num_users_per_sub[num_users_per_sub["subreddit"] == "r/wholesomegifs"]["num_users"].values[0],
"num_posts": num_posts_per_sub[num_posts_per_sub["subreddit"] == "r/wholesomegifs"]["num_posts"].values[0],
"subscribers": num_subscribers_per_sub[num_subscribers_per_sub["subreddit"] == "r/wholesomegifs"]["subscribers"].values[0]
}
# Sort values by number of users, posts, subscribers
num_users_rank = num_users_per_sub.sort_values(by="num_users", ascending=False)
num_posts_rank = num_posts_per_sub.sort_values(by="num_posts", ascending=False)
num_subscribers_rank = num_subscribers_per_sub.sort_values(by="subscribers", ascending=False)
# Keep only rows with unique number of users, posts, subscribers in order to exclude ties
num_users_rank = num_users_rank.drop_duplicates(subset="num_users").reset_index(drop=True)
num_posts_rank = num_posts_rank.drop_duplicates(subset="num_posts").reset_index(drop=True)
num_subscribers_rank = num_subscribers_rank.drop_duplicates(subset="subscribers").reset_index(drop=True)
# Find index of first rows containing the number of users, posts, subscribers of r/wholesomegifs
wholsomegifs_rank = {
"num_users": num_users_rank[num_users_rank["num_users"] == wholsomegifs_nums["num_users"]].index[0] + 1,
"num_posts": num_posts_rank[num_posts_rank["num_posts"] == wholsomegifs_nums["num_posts"]].index[0] + 1,
"subscribers": num_subscribers_rank[num_subscribers_rank["subscribers"] == wholsomegifs_nums["subscribers"]].index[0] + 1,
}
# Print rank of r/wholesomegifs
print(f"r/wholesomegifs is ranked {wholsomegifs_rank['num_users']}/{len(num_users_rank)} by number of users")
print(f"r/wholesomegifs is ranked {wholsomegifs_rank['num_posts']}/{len(num_posts_rank)} by number of posts")
print(f"r/wholesomegifs is ranked {wholsomegifs_rank['subscribers']}/{len(num_subscribers_rank)} by number of subscribers")
r/wholesomegifs is ranked 35/81 by number of users
r/wholesomegifs is ranked 80/343 by number of posts
r/wholesomegifs is ranked 379/958 by number of subscribers
My theory seems to be correct, as r/wholesomegifs does not take the lead in any of those metrics.
What is also worth noting is the fact that the most central subreddits in terms of degree, closeness and eigenvector centrality are mainly subreddits focused on entertainment and humor (memes, gifs, videos, etc.) without any specific topic in mind, such as:
- r/wholesomegifs - gifs that are supposed to make you feel good,
- r/BetterEveryLoop - gifs that are supposed to get better every time you watch them,
- r/BeAmazed - gifs and videos that are supposed to amaze you,
- r/gifsthatkeepongiving - gifs that are supposed to be funny every time you watch them,
- r/Eyebleach - gifs that are supposed to be calming and cute,

but when it comes to betweenness centrality, the most central subreddits are focused on more specific topics and news, such as:
- r/gaming - gaming news and discussions,
- r/technology - technology news and novelties,
- r/technews - technology news,
- r/redesign - ideas and submissions for the Reddit platform redesign.

That could mean that subreddits focused on humor and entertainment create well-defined communities that gather many people with similar interests and are visited regularly, while subreddits focused on specific topics and news are more likely to be visited by people who look for specific information or post a question about a specific topic when they need to, and then leave the subreddit. If that theory is correct, the subreddits from the second group can act as bridges connecting people from different communities, which is why they are more central in terms of betweenness centrality.
Community detection is the process of finding groups of nodes that are more densely connected to each other than to the rest of the network. It is a very useful tool for analyzing networks, as it can help us understand the structure of the network and find the most important nodes.
The community analysis will be performed only on the largest connected component of the subreddits projection, as the size of the users projection unfortunately makes the analysis very time-consuming.
The analysis of the subreddits projection will be done using the Louvain method, which is based on modularity optimization. The modularity of a network is a measure of how well the network is partitioned into communities. I have chosen this method as it is one of the most efficient ones for large networks, with a complexity of about O(n*log(n)).
subs_lcc_lovain = list(nx.algorithms.community.louvain_communities(G_subs_lcc, seed=42))
print(f"Number of communities: {len(subs_lcc_lovain)}")
Number of communities: 11
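As a quick addition of mine (not in the original notebook), we can also report the modularity score of the detected partition, which is the quantity the Louvain method optimizes:
# Modularity of the Louvain partition (higher values indicate a cleaner community structure).
modularity = nx.algorithms.community.modularity(G_subs_lcc, subs_lcc_lovain)
print(f"Modularity of the partition: {modularity:.3f}")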
Let's create a function that maps each node to a number representing its community.
def get_community_number(node, communities: list[set]) -> int:
for i, community in enumerate(communities):
if node in community:
return i
return -1
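A minor aside (my addition): an equivalent but faster mapping builds a node-to-community dictionary once instead of scanning every community for every node; community_lookup is a hypothetical name of mine.
# Equivalent lookup built once; community_lookup.get(node, -1) matches
# get_community_number(node, subs_lcc_lovain) for any node.
community_lookup = {node: i for i, community in enumerate(subs_lcc_lovain) for node in community}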
subs_lcc_data = pd.DataFrame(columns=["id", "community"])
subs_lcc_data["id"] = [node for node in G_subs_lcc.nodes()]
subs_lcc_data["community"] = subs_lcc_data["id"].apply(lambda x: get_community_number(x, subs_lcc_lovain))
display(subs_lcc_data.head(10))
id | community | |
---|---|---|
0 | r/announcements | 3 |
1 | r/funny | 1 |
2 | r/AskReddit | 3 |
3 | r/gaming | 1 |
4 | r/aww | 7 |
5 | r/Music | 8 |
6 | r/worldnews | 8 |
7 | r/todayilearned | 3 |
8 | r/movies | 1 |
9 | r/pics | 7 |
Let's join the communities dataframe with the subreddits dataframe to get more information about the communities and to make the data import to Gephi easier.
subs_lcc_data = subs_lcc_data.merge(sub_data, on="id")
display(subs_lcc_data.head(10))
id | community | subscribers | num_posts | num_users | is_user | |
---|---|---|---|---|---|---|
0 | r/announcements | 3 | 202719824 | 138 | 21 | False |
1 | r/funny | 1 | 48108476 | 60 | 17 | False |
2 | r/AskReddit | 3 | 40249936 | 13 | 4 | False |
3 | r/gaming | 1 | 36492322 | 75 | 22 | False |
4 | r/aww | 7 | 33655974 | 112 | 31 | False |
5 | r/Music | 8 | 32043294 | 145 | 37 | False |
6 | r/worldnews | 8 | 31254656 | 133 | 37 | False |
7 | r/todayilearned | 3 | 31041441 | 114 | 39 | False |
8 | r/movies | 1 | 30617572 | 252 | 35 | False |
9 | r/pics | 7 | 29880182 | 72 | 24 | False |
Save edgelist and node data of the subreddits projection largest connected component to files.
with open(os.path.join(NETWORKS_PATH, "subreddits_lcc.csv"), "w") as f:
writer = csv.writer(f, delimiter=",", lineterminator="\n")
writer.writerow(["source", "target", "common_users"])
for edge in G_subs_lcc.edges(data=True):
writer.writerow([edge[0], edge[1], edge[2]["common_users"]])
subs_lcc_data.to_csv(os.path.join(NETWORKS_PATH, "subreddits_lcc_data.csv"), index=False)
I will use the circular layout, with node colors and ordering assigned according to their communities.
Another layout that may help visualize the size differences between communities is the Group Attributes Layout.
subs_lcc_data = subs_lcc_data.rename(columns={"id": "subreddit"})
subs_lcc_data = subs_lcc_data.merge(G_subs_lcc_centrality_df, on="subreddit")
subs_lcc_data.head()
subreddit | community | subscribers | num_posts | num_users | is_user | degree | closeness | betweenness | eigenvector | average | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | r/announcements | 3 | 202719824 | 138 | 21 | False | 0.012282 | 0.323189 | 0.000304 | 0.000765 | 0.084135 |
1 | r/funny | 1 | 48108476 | 60 | 17 | False | 0.054248 | 0.392685 | 0.012252 | 0.026350 | 0.121384 |
2 | r/AskReddit | 3 | 40249936 | 13 | 4 | False | 0.002047 | 0.293041 | 0.000029 | 0.000286 | 0.073851 |
3 | r/gaming | 1 | 36492322 | 75 | 22 | False | 0.088025 | 0.421484 | 0.023730 | 0.048522 | 0.145440 |
4 | r/aww | 7 | 33655974 | 112 | 31 | False | 0.133060 | 0.435189 | 0.007469 | 0.078944 | 0.163666 |
communities = subs_lcc_data["community"].unique().tolist()
communities.sort()
# Plot the barplot with the number of subreddits in each community. Display the number of subreddits on top of each bar.
num_subreddits_per_community = {k: v for k, v in subs_lcc_data["community"].value_counts().items()}
plt.figure(figsize=(10, 5))
sns.barplot(x=communities, y=[num_subreddits_per_community[community] for community in communities], color="C0")
plt.title("Number of subreddits in each community")
plt.xlabel("Community")
plt.ylabel("Number of subreddits")
for i, community in enumerate(communities):
plt.text(i, num_subreddits_per_community[community], num_subreddits_per_community[community], ha="center", va="bottom")
plt.show()
We can see that the communities differ a lot in size, but they can be roughly grouped into four size tiers.
Let's try to analyze what the nodes in the communities have in common. For that reason I will plot the distributions of various metrics and properties for nodes in each community.
I will compare the distributions of the following values: number of subscribers, number of posts, number of users, and the degree, closeness, betweenness, and eigenvector centralities.
Note that the centrality measures are calculated for the whole largest connected component of the subreddits projection.
attributes_to_compare = ["subscribers", "num_posts", "num_users", "degree", "closeness", "betweenness", "eigenvector"]
labels = ["Number of subscribers", "Number of posts", "Number of users", "Degree centrality", "Closeness centrality", "Betweenness centrality", "Eigenvector centrality"]
for attribute, x_label in zip(attributes_to_compare, labels):
plt.figure(figsize=(10, 5))
sns.boxplot(x="community", y=attribute, data=subs_lcc_data, order=communities, color="C0")
plt.title(f"{x_label} in each community")
plt.xlabel("Community")
plt.ylabel(x_label)
plt.show()
The plot of the number of subscribers is not readable because of outliers. Let's filter them out.
num_subscribers_threshold = 10000000
plt.figure(figsize=(10, 5))
sns.boxplot(x="community", y="subscribers", data=subs_lcc_data[subs_lcc_data["subscribers"] < num_subscribers_threshold], order=communities, color="C0")
plt.title("Number of subscribers in each community")
plt.xlabel("Community")
plt.ylabel("Number of subscribers")
plt.show()
We can see that for every metric/property there are some communities with similar distributions of values. That means that none of those properties directly defines the structure of the communities or the splits in their formation. This makes sense, as the Louvain method is based on connectivity, and we would not expect subreddits with, e.g., a low number of subscribers or low centrality values to be more strongly connected to each other than to nodes with higher values of those metrics.
That said, it does not mean that those properties have no influence at all. We can clearly notice that communities 5 and 7 have the highest distributions of every centrality metric. That means that those communities were formed of nodes with high centralities and that those nodes are densely connected to each other. Those communities are the most central (important) in the network.
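A quick numeric check of that impression (my addition), using the centrality columns already merged into subs_lcc_data:
# Mean centrality values per community; communities 5 and 7 should stand out.
display(subs_lcc_data.groupby("community")[["degree", "closeness", "betweenness", "eigenvector"]].mean())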
Now let's try to analyze the centrality measures within the communities themselves.
G_communities = {}
# Create the subgraphs of each community
for community in communities:
nodes = subs_lcc_data[subs_lcc_data["community"] == community]["subreddit"].tolist()
G_communities[community] = G_subs_lcc.subgraph(nodes)
# Calculate the node centralities in the subgraphs of each community
G_communities_centralities = {}
for community, network in G_communities.items():
centralities = {
"degree": nx.degree_centrality(network),
"closeness": nx.closeness_centrality(network),
"betweenness": nx.betweenness_centrality(network),
"eigenvector": nx.eigenvector_centrality(network),
}
df = pd.DataFrame(centralities)
df.reset_index(inplace=True)
df = df.rename(columns={"index": "subreddit"})
# Calculate the average centralities for each community
G_communities_centralities[community] = {
"df": df,
"avg_degree": df["degree"].mean(),
"avg_closeness": df["closeness"].mean(),
"avg_betweenness": df["betweenness"].mean(),
"avg_eigenvector": df["eigenvector"].mean(),
}
It is expected that centrality measures in the identified communities will be strongly affected by their sizes, so in order to easily compare the results, I will arrange the bars in the plots according to the community sizes and display the number of nodes in each community once again.
# Plot the average centralities for each community
avg_centralities = ["avg_degree", "avg_closeness", "avg_betweenness", "avg_eigenvector"]
labels = ["Average degree centrality", "Average closeness centrality", "Average betweenness centrality", "Average eigenvector centrality"]
communities_by_sizes = sorted(communities, key=lambda x: num_subreddits_per_community[x], reverse=True)
for avg_centrality, label in zip(avg_centralities, labels):
plt.figure(figsize=(10, 5))
plt.bar(x=[f"{community}" for community in communities_by_sizes], height=[G_communities_centralities[community][avg_centrality] for community in communities_by_sizes])
plt.title(f"{label} for each community")
plt.xlabel("Community")
plt.ylabel(label)
plt.xticks(communities_by_sizes)
# Display the number of nodes in each community on top of its bar
for i, community in enumerate(communities_by_sizes):
plt.text(i, G_communities_centralities[community][avg_centrality], f"{num_subreddits_per_community[community]} nodes", ha="center", va="bottom", fontsize=8)
We can see that the average degree centrality tends to be higher in smaller communities and reaches 1 in communities 10, 0, and 4. Those communities have only 2 or 3 nodes, so it makes sense, as every node in them is connected to every other node. However, we can still notice that communities 5 and 7 have a higher average degree centrality than expected from their sizes.
The same goes for the average closeness centrality.
Betweenness centrality looks a little bit different. It also tends to be higher in smaller communities, but for the fully connected communities 10, 0, and 4 it is equal to 0. That makes sense, as in fully connected communities there are no nodes that lie in the middle of the shortest paths between other nodes. This time, communities 5 and 7 have a lower average betweenness centrality than expected from their sizes. Based on the earlier observations, this is expected: nodes in those communities are densely connected to each other (relatively high degree and closeness centrality), so there are fewer nodes that act as bridges on the shortest paths between other nodes.
When it comes to the eigenvector centrality, there are no particularly interesting observations about communities 5 and 7 to be made.
Let's identify the most central nodes in the communities and compare them to the results of the centrality analysis of the whole largest connected component of the subreddits projection. I'll do it only for the communities with more than 120 nodes, as the other ones are too small to draw any conclusions from.
for community, stats in G_communities_centralities.items():
if num_subreddits_per_community[community] < 120:
continue
print("=" * 50)
print(f"Community {community}")
df = stats["df"]
for centrality in df.columns[1:]:
print(f"Top 5 subreddits by {centrality} centrality:")
display(df.sort_values(by=centrality, ascending=False).head(5))
==================================================
Community 1
Top 5 subreddits by degree centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
23 | r/Games | 0.207101 | 0.414216 | 0.061387 | 0.262288 |
149 | r/anime | 0.207101 | 0.411192 | 0.066271 | 0.249170 |
13 | r/PS4 | 0.177515 | 0.376392 | 0.049306 | 0.230098 |
47 | r/NintendoSwitch | 0.171598 | 0.376392 | 0.011490 | 0.246041 |
32 | r/manga | 0.165680 | 0.414216 | 0.069636 | 0.212319 |
Top 5 subreddits by closeness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
32 | r/manga | 0.165680 | 0.414216 | 0.069636 | 0.212319 |
23 | r/Games | 0.207101 | 0.414216 | 0.061387 | 0.262288 |
149 | r/anime | 0.207101 | 0.411192 | 0.066271 | 0.249170 |
31 | r/Animemes | 0.112426 | 0.399527 | 0.102044 | 0.037119 |
77 | r/xboxone | 0.165680 | 0.396714 | 0.025082 | 0.231493 |
Top 5 subreddits by betweenness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
166 | r/memes | 0.130178 | 0.380631 | 0.122991 | 0.014298 |
6 | r/whenthe | 0.082840 | 0.359574 | 0.120841 | 0.007869 |
31 | r/Animemes | 0.112426 | 0.399527 | 0.102044 | 0.037119 |
9 | r/dankmemes | 0.112426 | 0.384966 | 0.087028 | 0.016052 |
122 | r/sciencememes | 0.065089 | 0.283557 | 0.076580 | 0.000511 |
Top 5 subreddits by eigenvector centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
23 | r/Games | 0.207101 | 0.414216 | 0.061387 | 0.262288 |
149 | r/anime | 0.207101 | 0.411192 | 0.066271 | 0.249170 |
47 | r/NintendoSwitch | 0.171598 | 0.376392 | 0.011490 | 0.246041 |
77 | r/xboxone | 0.165680 | 0.396714 | 0.025082 | 0.231493 |
13 | r/PS4 | 0.177515 | 0.376392 | 0.049306 | 0.230098 |
==================================================
Community 5
Top 5 subreddits by degree centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
100 | r/trippinthroughtime | 0.705036 | 0.759563 | 0.018687 | 0.127546 |
135 | r/MadeMeSmile | 0.690647 | 0.751351 | 0.048629 | 0.126192 |
30 | r/youseeingthisshit | 0.690647 | 0.751351 | 0.012096 | 0.127460 |
64 | r/instant_regret | 0.690647 | 0.751351 | 0.029476 | 0.119631 |
62 | r/toptalent | 0.690647 | 0.751351 | 0.025570 | 0.128177 |
Top 5 subreddits by closeness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
100 | r/trippinthroughtime | 0.705036 | 0.759563 | 0.018687 | 0.127546 |
75 | r/blackmagicfuckery | 0.683453 | 0.751351 | 0.026652 | 0.127284 |
135 | r/MadeMeSmile | 0.690647 | 0.751351 | 0.048629 | 0.126192 |
30 | r/youseeingthisshit | 0.690647 | 0.751351 | 0.012096 | 0.127460 |
62 | r/toptalent | 0.690647 | 0.751351 | 0.025570 | 0.128177 |
Top 5 subreddits by betweenness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
79 | r/Damnthatsinteresting | 0.654676 | 0.735450 | 0.049689 | 0.120235 |
135 | r/MadeMeSmile | 0.690647 | 0.751351 | 0.048629 | 0.126192 |
108 | r/holdmycosmo | 0.553957 | 0.681373 | 0.038343 | 0.110630 |
55 | r/nextfuckinglevel | 0.525180 | 0.668269 | 0.029867 | 0.100198 |
64 | r/instant_regret | 0.690647 | 0.751351 | 0.029476 | 0.119631 |
Top 5 subreddits by eigenvector centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
62 | r/toptalent | 0.690647 | 0.751351 | 0.025570 | 0.128177 |
100 | r/trippinthroughtime | 0.705036 | 0.759563 | 0.018687 | 0.127546 |
30 | r/youseeingthisshit | 0.690647 | 0.751351 | 0.012096 | 0.127460 |
133 | r/BeAmazed | 0.676259 | 0.743316 | 0.009630 | 0.127296 |
75 | r/blackmagicfuckery | 0.683453 | 0.751351 | 0.026652 | 0.127284 |
==================================================
Community 7
Top 5 subreddits by degree centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
108 | r/blackpeoplegifs | 0.731092 | 0.777778 | 0.040147 | 0.159521 |
2 | r/mechanical_gifs | 0.722689 | 0.772727 | 0.029777 | 0.162109 |
101 | r/wholesomegifs | 0.705882 | 0.762821 | 0.020877 | 0.161621 |
5 | r/BetterEveryLoop | 0.689076 | 0.753165 | 0.027479 | 0.157184 |
9 | r/whitepeoplegifs | 0.663866 | 0.739130 | 0.026280 | 0.153699 |
Top 5 subreddits by closeness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
108 | r/blackpeoplegifs | 0.731092 | 0.777778 | 0.040147 | 0.159521 |
2 | r/mechanical_gifs | 0.722689 | 0.772727 | 0.029777 | 0.162109 |
101 | r/wholesomegifs | 0.705882 | 0.762821 | 0.020877 | 0.161621 |
5 | r/BetterEveryLoop | 0.689076 | 0.753165 | 0.027479 | 0.157184 |
9 | r/whitepeoplegifs | 0.663866 | 0.739130 | 0.026280 | 0.153699 |
Top 5 subreddits by betweenness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
111 | r/aww | 0.445378 | 0.619792 | 0.049672 | 0.102025 |
102 | r/gifs | 0.495798 | 0.632979 | 0.042218 | 0.115448 |
108 | r/blackpeoplegifs | 0.731092 | 0.777778 | 0.040147 | 0.159521 |
26 | r/interestingasfuck | 0.647059 | 0.725610 | 0.036692 | 0.140872 |
32 | r/PraiseTheCameraMan | 0.470588 | 0.632979 | 0.031756 | 0.120913 |
Top 5 subreddits by eigenvector centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
2 | r/mechanical_gifs | 0.722689 | 0.772727 | 0.029777 | 0.162109 |
101 | r/wholesomegifs | 0.705882 | 0.762821 | 0.020877 | 0.161621 |
108 | r/blackpeoplegifs | 0.731092 | 0.777778 | 0.040147 | 0.159521 |
5 | r/BetterEveryLoop | 0.689076 | 0.753165 | 0.027479 | 0.157184 |
87 | r/gifsthatkeepongiving | 0.655462 | 0.730061 | 0.016129 | 0.154141 |
==================================================
Community 8
Top 5 subreddits by degree centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
181 | r/technology | 0.20625 | 0.483384 | 0.072949 | 0.210070 |
77 | r/technews | 0.20000 | 0.480480 | 0.070091 | 0.204190 |
252 | r/Economics | 0.19375 | 0.474777 | 0.054601 | 0.203605 |
316 | r/environment | 0.19375 | 0.471976 | 0.046050 | 0.209977 |
190 | r/Coronavirus | 0.18750 | 0.455840 | 0.050940 | 0.197164 |
Top 5 subreddits by closeness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
181 | r/technology | 0.20625 | 0.483384 | 0.072949 | 0.210070 |
77 | r/technews | 0.20000 | 0.480480 | 0.070091 | 0.204190 |
252 | r/Economics | 0.19375 | 0.474777 | 0.054601 | 0.203605 |
316 | r/environment | 0.19375 | 0.471976 | 0.046050 | 0.209977 |
190 | r/Coronavirus | 0.18750 | 0.455840 | 0.050940 | 0.197164 |
Top 5 subreddits by betweenness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
181 | r/technology | 0.206250 | 0.483384 | 0.072949 | 0.210070 |
77 | r/technews | 0.200000 | 0.480480 | 0.070091 | 0.204190 |
268 | r/opensource | 0.159375 | 0.450704 | 0.062287 | 0.163320 |
252 | r/Economics | 0.193750 | 0.474777 | 0.054601 | 0.203605 |
190 | r/Coronavirus | 0.187500 | 0.455840 | 0.050940 | 0.197164 |
Top 5 subreddits by eigenvector centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
181 | r/technology | 0.20625 | 0.483384 | 0.072949 | 0.210070 |
316 | r/environment | 0.19375 | 0.471976 | 0.046050 | 0.209977 |
77 | r/technews | 0.20000 | 0.480480 | 0.070091 | 0.204190 |
252 | r/Economics | 0.19375 | 0.474777 | 0.054601 | 0.203605 |
190 | r/Coronavirus | 0.18750 | 0.455840 | 0.050940 | 0.197164 |
==================================================
Community 9
Top 5 subreddits by degree centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
73 | r/ArchitecturePorn | 0.140000 | 0.366748 | 0.233477 | 0.373842 |
63 | r/spaceporn | 0.113333 | 0.352113 | 0.183196 | 0.295997 |
55 | r/Design | 0.093333 | 0.342466 | 0.172616 | 0.139031 |
79 | r/space | 0.086667 | 0.305499 | 0.040154 | 0.182302 |
62 | r/CatastrophicFailure | 0.080000 | 0.317797 | 0.059784 | 0.212146 |
Top 5 subreddits by closeness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
73 | r/ArchitecturePorn | 0.140000 | 0.366748 | 0.233477 | 0.373842 |
63 | r/spaceporn | 0.113333 | 0.352113 | 0.183196 | 0.295997 |
55 | r/Design | 0.093333 | 0.342466 | 0.172616 | 0.139031 |
91 | r/carporn | 0.073333 | 0.325380 | 0.115234 | 0.228131 |
18 | r/ImaginaryLandscapes | 0.073333 | 0.321888 | 0.044280 | 0.252624 |
Top 5 subreddits by betweenness centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
73 | r/ArchitecturePorn | 0.140000 | 0.366748 | 0.233477 | 0.373842 |
63 | r/spaceporn | 0.113333 | 0.352113 | 0.183196 | 0.295997 |
55 | r/Design | 0.093333 | 0.342466 | 0.172616 | 0.139031 |
92 | r/F1Technical | 0.060000 | 0.268817 | 0.123168 | 0.017416 |
91 | r/carporn | 0.073333 | 0.325380 | 0.115234 | 0.228131 |
Top 5 subreddits by eigenvector centrality:
subreddit | degree | closeness | betweenness | eigenvector | |
---|---|---|---|---|---|
73 | r/ArchitecturePorn | 0.140000 | 0.366748 | 0.233477 | 0.373842 |
63 | r/spaceporn | 0.113333 | 0.352113 | 0.183196 | 0.295997 |
90 | r/RoomPorn | 0.080000 | 0.317125 | 0.023597 | 0.273666 |
18 | r/ImaginaryLandscapes | 0.073333 | 0.321888 | 0.044280 | 0.252624 |
91 | r/carporn | 0.073333 | 0.325380 | 0.115234 | 0.228131 |
We can see that most of the subreddits leading in the communities (when it comes to degree centrality) are the same as the ones leading in the whole largest connected component. However, it is worth noting that the subreddit that was previously the most central one in 3 out of 4 metrics (r/wholesomegifs) is not the top one in any of the metrics within community 7. That makes sense, as the most central nodes in the whole network lose relatively the most upon the split into communities, which leads to a balancing of the centrality values in the subnetworks.
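To make that concrete (a small addition of mine), we can check where r/wholesomegifs ranks within its own community for each centrality measure, reusing the per-community dataframes computed above:
# Rank of r/wholesomegifs within community 7 for each centrality measure.
df_community_7 = G_communities_centralities[7]["df"]
for centrality in df_community_7.columns[1:]:
    ranked = df_community_7.sort_values(by=centrality, ascending=False).reset_index(drop=True)
    rank = ranked[ranked["subreddit"] == "r/wholesomegifs"].index[0] + 1
    print(f"r/wholesomegifs is ranked {rank}/{len(ranked)} by {centrality} centrality")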
There is, though, a much more interesting observation to be made. Just by looking at the top-central subreddits in each community, we can notice that the communities seem to be formed around specific topics. Let's investigate that.
In order to look into those topics, let's display, for each community, all the subreddits that are in the top NUM_TOP_SUBREDDITS of any of the centrality measures.
NUM_TOP_SUBREDDITS = 10
top_centrality_subreddits_per_community = {}
for community in communities:
df = G_communities_centralities[community]["df"]
subreddit_set = set()
for centrality in df.columns[1:]:
top_subreddits = df.sort_values(by=centrality, ascending=False).head(NUM_TOP_SUBREDDITS)["subreddit"].tolist()
subreddit_set.update(top_subreddits)
top_centrality_subreddits_per_community[community] = subreddit_set
display(communities)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
display(top_centrality_subreddits_per_community[0])
{'r/intermittentfasting', 'r/keto'}
We can see that community 0 has two subreddits that are both related to the topic of diet:
- r/intermittentfasting - a subreddit about a diet that focuses on when you eat rather than what you eat,
- r/keto - a subreddit about a diet that focuses on eating low-carb and high-fat food.
display(top_centrality_subreddits_per_community[1])
{'r/Animemes', 'r/Games', 'r/Konosuba', 'r/NintendoSwitch', 'r/PS4', 'r/PS5', 'r/PoliticalCompassMemes', 'r/ShingekiNoKyojin', 'r/XboxSeriesX', 'r/anime', 'r/dankmemes', 'r/horizon', 'r/manga', 'r/memes', 'r/movies', 'r/nintendo', 'r/sciencememes', 'r/television', 'r/totalwar', 'r/whenthe', 'r/xboxone'}
Community 1 has subreddits that are related mainly to video games and anime. For example:
- r/Animemes - a subreddit about anime memes,
- r/Games - a subreddit about video games,
- r/Konosuba - a subreddit about an anime series,
- r/NintendoSwitch - a subreddit about the Nintendo Switch video game console,
- r/PS4 - a subreddit about video games on the PlayStation 4 console,
- r/PS5 - a subreddit about video games on the PlayStation 5 console,
- r/ShingekiNoKyojin - a subreddit about an anime series,
- r/XboxSeriesX - a subreddit about video games on the Xbox Series X console,
- r/anime - a subreddit about anime in general,
- r/horizon - a subreddit about the video game Horizon Zero Dawn,
- r/manga - a subreddit about manga in general,
- r/nintendo - a subreddit about video games on Nintendo consoles,
- r/totalwar - a subreddit about the Total War video game series,
- r/xboxone - a subreddit about video games on the Xbox One console.
display(top_centrality_subreddits_per_community[2])
{'r/Deltarune', 'r/DetroitBecomeHuman', 'r/Fallout', 'r/SampleSize', 'r/fo76', 'r/personalfinance', 'r/skyrim'}
In community 2, 3 out of its 7 subreddits are related to popular games developed by Bethesda Softworks:
- r/Fallout - a subreddit about the Fallout video game series,
- r/fo76 - a subreddit about the Fallout 76 video game,
- r/skyrim - a subreddit about the video game The Elder Scrolls V: Skyrim.
display(top_centrality_subreddits_per_community[3])
{'r/AskHistorians', 'r/DIY', 'r/Imposter', 'r/MovieDetails', 'r/SubredditAdoption', 'r/announcements', 'r/blog', 'r/changelog', 'r/modnews', 'r/place', 'r/redditsecurity', 'r/redesign', 'r/self', 'r/thebutton', 'r/todayilearned', 'r/woodworking'}
This community covers a wider range of topics, but we can still distinguish some similarities.
Subreddits related to Reddit itself (its features, rules, updates, etc.):
- r/announcements - a subreddit for Reddit announcements, supplied by Reddit developers to inform the community about the most important changes and updates made to the platform,
- r/blog - a subreddit with the official blog posts from Reddit Inc.,
- r/changelog - a subreddit with the changelog of the Reddit platform,
- r/modnews - a subreddit with news for Reddit moderators,
- r/redesign - a subreddit about changes to the design of Reddit,
- r/redditsecurity - a subreddit with security updates for Reddit users.

Atypical subreddits that engage the community in some activities:
- r/Imposter - a subreddit where users can play a game similar in concept to the Among Us video game,
- r/place - a subreddit where users can place a pixel on a big canvas every 5 minutes,
- r/thebutton - a subreddit where users can press a button that resets every 60 seconds.

Knowledge-sharing subreddits:
- r/AskHistorians - a subreddit where users can ask questions about history and get answers from verified historians,
- r/DIY - a subreddit where users can share their do-it-yourself projects,
- r/MovieDetails - a subreddit where users can share interesting details about movies,
- r/todayilearned - a subreddit where users can share interesting facts that they learned today.
display(top_centrality_subreddits_per_community[4])
{'r/Warthunder', 'r/WorldOfWarships'}
This community contains 2 subreddits related to war video games:
- r/Warthunder - a subreddit about the War Thunder video game,
- r/WorldOfWarships - a subreddit about the World of Warships video game.
display(top_centrality_subreddits_per_community[5])
{'r/BeAmazed', 'r/Damnthatsinteresting', 'r/MEOW_IRL', 'r/MadeMeSmile', 'r/NatureIsFuckingLit', 'r/SweatyPalms', 'r/Whatcouldgowrong', 'r/WhitePeopleTwitter', 'r/blackmagicfuckery', 'r/holdmycosmo', 'r/instant_regret', 'r/maybemaybemaybe', 'r/nextfuckinglevel', 'r/toptalent', 'r/trippinthroughtime', 'r/youseeingthisshit'}
Community 5 contains mostly subreddits focused on astonishing or impressive content:
- r/BeAmazed - a subreddit with content that is supposed to amaze the users,
- r/Damnthatsinteresting - a subreddit with content that is supposed to be interesting,
- r/NatureIsFuckingLit - a subreddit in which people post about various interesting and cool facts related to nature,
- r/nextfuckinglevel - a subreddit dedicated to showcasing content that exemplifies exceptional skills, extraordinary achievements, or remarkable advancements in various fields,
- r/toptalent - a subreddit dedicated to showcasing exceptional talents and skills demonstrated by individuals,
- r/youseeingthisshit - a subreddit that focuses on sharing and discussing extraordinary or unbelievable moments captured in videos or images,
- r/humansaremetal - a subreddit that celebrates and showcases the incredible capabilities, resilience, and accomplishments of human beings.
display(top_centrality_subreddits_per_community[6])
{'r/AmateurRoomPorn', 'r/Catswhoyell', 'r/Gameboy', 'r/Honda', 'r/Idiotswithguns', 'r/JDM', 'r/Lexus', 'r/Mid_Century', 'r/OneSecondBeforeDisast', 'r/ThriftStoreHauls', 'r/fuckcars', 'r/funnyvideos', 'r/gamecollecting', 'r/gtaonline', 'r/politecats', 'r/rally', 'r/tiktokcringemoment'}
In this community we can find subreddits related to more niche interests and hobbies. I don't know the majority of them, so it is hard for me to assess what factor connects them topically.
display(top_centrality_subreddits_per_community[7])
{'r/BetterEveryLoop', 'r/PraiseTheCameraMan', 'r/WatchPeopleDieInside', 'r/WeatherGifs', 'r/aww', 'r/blackpeoplegifs', 'r/chemicalreactiongifs', 'r/gifs', 'r/gifsthatkeepongiving', 'r/interestingasfuck', 'r/lifehacks', 'r/mechanical_gifs', 'r/southpark', 'r/whitepeoplegifs', 'r/wholesomegifs', 'r/woahdude'}
The most common similarity among the majority of the listed subreddits is that they focus on sharing and discussing various types of animated content, particularly gifs:
- r/BetterEveryLoop - a subreddit dedicated to sharing gifs or short videos that loop seamlessly and improve with each repetition, showcasing satisfying or impressive content,
- r/WeatherGifs - a subreddit dedicated to gifs that showcase weather phenomena,
- r/blackpeoplegifs - a subreddit dedicated to sharing gifs featuring black individuals in various situations, often with a humorous or relatable context,
- r/chemicalreactiongifs - a subreddit dedicated to gifs that showcase chemical reactions,
- r/gifs - a subreddit dedicated to sharing gifs in general,
- r/gifsthatkeepongiving - a subreddit all about gifs that have an unexpected or continuous loop, creating humorous or mesmerizing effects,
- r/mechanical_gifs - a subreddit dedicated to gifs that showcase mechanical processes,
- r/whitepeoplegifs - a subreddit dedicated to sharing gifs featuring white individuals in various situations, often with a humorous or relatable context,
- r/wholesomegifs - a subreddit focused on sharing heartwarming and uplifting gifs or videos that evoke positive emotions.
display(top_centrality_subreddits_per_community[8])
{'r/Coronavirus', 'r/Economics', 'r/EverythingScience', 'r/Futurology', 'r/UpliftingNews', 'r/environment', 'r/opensource', 'r/politics', 'r/privacy', 'r/technews', 'r/technology'}
Those subreddits focus on topics related to science, technology, economics, and societal issues:
- r/Coronavirus - a subreddit dedicated to the 2019 coronavirus (COVID-19) outbreak,
- r/Economics - a subreddit dedicated to the science of economics and the discussion of issues and news related to it,
- r/EverythingScience - a subreddit dedicated to the discussion of science and scientific phenomena,
- r/Futurology - a subreddit dedicated to the discussion of future developments in science and technology,
- r/environment - a subreddit dedicated to the discussion of environmental issues,
- r/politics - a subreddit dedicated to the discussion of political issues,
- r/technews - a subreddit dedicated to the discussion of technology news,
- r/technology - a subreddit dedicated to the discussion of technology.
display(top_centrality_subreddits_per_community[9])
{'r/ArchitecturePorn', 'r/Breath_of_the_Wild', 'r/CatastrophicFailure', 'r/Cyberpunk', 'r/Design', 'r/DiWHY', 'r/F1Technical', 'r/ImaginaryLandscapes', 'r/MacroPorn', 'r/Outdoors', 'r/RoomPorn', 'r/TechnicalDeathMetal', 'r/arduino', 'r/astrophotography', 'r/carporn', 'r/europe', 'r/nostalgia', 'r/photography', 'r/risa', 'r/space', 'r/spaceporn', 'r/wow'}
The majority of subreddits in this community focus on design, visual content, aesthetics, and photography.
Architecture and design:
- r/ArchitecturePorn - a subreddit dedicated to sharing images of interesting architecture,
- r/Design - a subreddit dedicated to the discussion of design in general,
- r/RoomPorn - a subreddit dedicated to sharing images of interior design and aesthetically pleasing rooms.

Landscapes and outdoors:
- r/Outdoors - a subreddit dedicated to pleasing outdoor pictures,
- r/ImaginaryLandscapes - a subreddit dedicated to sharing images of dreamlike landscapes,
- r/astrophotography - a subreddit dedicated to sharing images of space and celestial bodies,
- r/spaceporn - a subreddit dedicated to sharing images of space and celestial bodies,
- r/space - a subreddit dedicated to the discussion of space and space exploration.
display(top_centrality_subreddits_per_community[10])
{'r/BiggerThanYouThought', 'r/bigtiddygothgf', 'r/u_nicolebun'}
Those subreddits are related to porn or explicit content.
We can see that the subreddits were split into communities according to the topics around which they are centered. The communities are not perfectly separated, but we can still see clear patterns.
This outcome is not theoretically surprising, as the subreddits in the projection are connected when they have a common user who posted in both of them, and people tend to post in subreddits related to their interests. Therefore, subreddits related to the same topic are more likely to be connected in the projection. What is surprising to me is that this topic similarity is so clear and that the subreddits are so well separated into communities.
Robustness is a measure of how well the network can withstand the removal of nodes. It is a very important property of networks as it can help us to understand how the network will behave in case of failure of some of its nodes.
The bipartite network that we have created can show us how information could flow between users and subreddits. Let's imagine that one user has posted something to a subreddit. Other users on that subreddit could see the post and decide to pass the information to other subreddits they tend to visit. That way the information could spread through the network.
In the context of the subreddits projection, it is highly unlikely that the subreddits themselves would be removed from the network, as communities are rarely closed or banned on Reddit. It is much more likely that some of the users would be removed from the network. That could happen if they were banned from Reddit or if they deleted their accounts.
So in order to give sense to the question of robustness, I will analyze how the subreddits projection would behave if some of the users were removed from the network. That approach is a mix of node robustness analysis (removing users from the underlying bipartite network) and link robustness analysis (projection edges disappearing once no common users remain), considered in the context of the effect on the subreddits projection.
In order to perform the analysis, I need to create a dataset that contains all the common users between each pair of subreddits. I will use the previously created user_sub_pairs dataframe.
display(user_sub_pairs.shape)
display(user_sub_pairs.head())
(38054, 3)
author | subreddit | num_posts | |
---|---|---|---|
0 | u/--5- | r/india | 2 |
1 | u/--CreativeUsername | r/Physics | 2 |
2 | u/--Fatal-- | r/homelab | 2 |
3 | u/--MVH-- | r/Netherlands | 4 |
4 | u/--Speed-- | r/logodesign | 2 |
Let's keep only the rows with subreddits that are in the largest connected component
user_sub_pairs_subs_lcc = user_sub_pairs[user_sub_pairs["subreddit"].isin(subs_lcc_data["subreddit"].tolist())]
display(user_sub_pairs_subs_lcc.shape)
display(user_sub_pairs_subs_lcc.head())
(36691, 3)
author | subreddit | num_posts | |
---|---|---|---|
0 | u/--5- | r/india | 2 |
1 | u/--CreativeUsername | r/Physics | 2 |
2 | u/--Fatal-- | r/homelab | 2 |
3 | u/--MVH-- | r/Netherlands | 4 |
4 | u/--Speed-- | r/logodesign | 2 |
Let's create the list of users for each subreddit
users_per_sub_subs_lcc = user_sub_pairs_subs_lcc.groupby("subreddit")["author"].apply(list).reset_index()
display(users_per_sub_subs_lcc.head())
subreddit | author | |
---|---|---|
0 | r/13or30 | [u/Balls-over-dick-man-, u/FormerFruit, u/TheS... |
1 | r/196 | [u/1milionand6times, u/Alex9586, u/Anormalredd... |
2 | r/2020PoliceBrutality | [u/ApartheidReddit, u/ApartheidUSA, u/BiafraMa... |
3 | r/2meirl4meirl | [u/-wao, u/9w_lf9, u/ArticckK, u/BlueBerryChar... |
4 | r/3Dprinting | [u/3demonster, u/Antique_Steel, u/Bigbore_729,... |
Let's create the sub-sub dataframe with the list of common users for each pair of subreddits
frames = []
for i, row_1 in enumerate(users_per_sub_subs_lcc.itertuples()):
for j, row_2 in enumerate(users_per_sub_subs_lcc[i+1:].itertuples()):
common_users = set(row_1.author).intersection(row_2.author)
if len(common_users) > 0:
frames.append([row_1.subreddit, row_2.subreddit, common_users])
sub_sub_common_users_subs_lcc = pd.DataFrame(frames, columns=["subreddit_1", "subreddit_2", "common_users"])
display(sub_sub_common_users_subs_lcc.head())
subreddit_1 | subreddit_2 | common_users | |
---|---|---|---|
0 | r/13or30 | r/AbsoluteUnits | {u/FormerFruit} |
1 | r/13or30 | r/MrRobot | {u/FormerFruit} |
2 | r/13or30 | r/extremelyinfuriating | {u/FormerFruit} |
3 | r/13or30 | r/foxes | {u/FormerFruit} |
4 | r/13or30 | r/interestingasfuck | {u/prolelol} |
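As a side note (my addition, not the notebook's original approach), an equivalent construction iterates over users and records each pair of subreddits they posted in, which avoids intersecting user lists for unrelated subreddit pairs; sub_sub_common_users_alt is a hypothetical name.
# Equivalent construction of the common-users dataframe (a sketch).
from itertools import combinations

pair_users = defaultdict(set)
for author, subs in user_sub_pairs_subs_lcc.groupby("author")["subreddit"]:
    for sub_1, sub_2 in combinations(sorted(subs.unique()), 2):
        pair_users[(sub_1, sub_2)].add(author)
sub_sub_common_users_alt = pd.DataFrame(
    [(s1, s2, users) for (s1, s2), users in pair_users.items()],
    columns=["subreddit_1", "subreddit_2", "common_users"],
)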
Another helpful dataframe is the one with the number of subreddits each user has posted in. We will use the previously created num_subs_per_user dataframe.
num_subs_per_user_subs_lcc = num_subs_per_user[num_subs_per_user["author"].isin(user_sub_pairs_subs_lcc["author"].tolist())]
display(num_subs_per_user_subs_lcc.head())
author | num_subs | |
---|---|---|
0 | u/My_Memes_Will_Cure_U | 63 |
1 | u/Master1718 | 60 |
2 | u/memezzer | 49 |
3 | u/KevlarYarmulke | 47 |
4 | u/5_Frog_Margin | 45 |
The analysis will be performed in the following way: I will remove users one by one (first at random, then in order of decreasing activity), remove every projection edge whose set of common users becomes empty, and track the global efficiency of the projection after each removal.
G_subs_lcc = nx.from_pandas_edgelist(sub_sub_common_users_subs_lcc, source="subreddit_1", target="subreddit_2", edge_attr="common_users")
Let's see what the network looks like.
nx.draw(G_subs_lcc, pos=nx.spring_layout(G_subs_lcc, seed=42), node_size=10, width=0.1)
# In order to speed up the calculation, let's create a dictionary
# that maps each user to the edges that contain it
user_edges_subs_lcc = defaultdict(list)
for edge in G_subs_lcc.edges(data=True):
for user in edge[2]["common_users"]:
user_edges_subs_lcc[user].append((edge[0], edge[1]))
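The metric I will track during the removals is networkx's global efficiency. For reference, here is a minimal sketch of my own showing what it computes (the average inverse shortest-path length over all ordered node pairs, with unreachable pairs contributing 0); global_efficiency_sketch is a hypothetical helper, not used below.
# Reference sketch of global efficiency (mirrors what nx.global_efficiency computes).
def global_efficiency_sketch(G: nx.Graph) -> float:
    n = G.number_of_nodes()
    if n < 2:
        return 0.0
    total = 0.0
    for source, lengths in nx.all_pairs_shortest_path_length(G):
        for target, distance in lengths.items():
            if target != source:
                total += 1 / distance
    return total / (n * (n - 1))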
First let's analyze the situation in which we remove the users at random (random failure scenario).
np.random.seed(42)
users_subs_lcc = num_subs_per_user_subs_lcc["author"].tolist()
users_subs_lcc = np.random.permutation(users_subs_lcc)
# We have to use deepcopy because we will modify the sets of users
G_subs_lcc_random = deepcopy(G_subs_lcc)
subs_lcc_efficiencies_random = []
subs_lcc_efficiencies_random.append(nx.global_efficiency(G_subs_lcc_random))
subs_lcc_efficiencies_random_num_edges_removed = [0]
for user_to_remove in tqdm(users_subs_lcc):
edges_with_user = user_edges_subs_lcc[user_to_remove]
edges_to_remove = []
for edge_name in edges_with_user:
edge = G_subs_lcc_random.edges[edge_name]
edge["common_users"].remove(user_to_remove)
if len(edge["common_users"]) == 0:
edges_to_remove.append(edge_name)
G_subs_lcc_random.remove_edges_from(edges_to_remove)
subs_lcc_efficiencies_random_num_edges_removed.append(len(edges_to_remove))
# To speed up the calculation, I will only calculate the global efficiency
# if I removed at least one edge
# otherwise, we will just append the last value
if len(edges_to_remove) > 0:
subs_lcc_efficiencies_random.append(nx.global_efficiency(G_subs_lcc_random))
else:
subs_lcc_efficiencies_random.append(subs_lcc_efficiencies_random[-1])
if G_subs_lcc_random.number_of_edges() == 0:
break
100%|██████████| 30950/30950 [1:02:02<00:00, 8.31it/s]
Now, let's analyze the situation in which we remove the users with the highest number of subreddits they have posted in (targeted attack scenario). This approach makes less sense, as it is less likely that very active users would stop using Reddit, but it will be interesting to see how the network behaves in such a scenario.
users_subs_lcc = num_subs_per_user_subs_lcc.sort_values(by="num_subs", ascending=False)["author"].tolist()
G_subs_lcc_targeted = deepcopy(G_subs_lcc)
subs_lcc_efficiencies_targeted = []
subs_lcc_efficiencies_targeted.append(nx.global_efficiency(G_subs_lcc_targeted))
subs_lcc_efficiencies_targeted_num_edges_removed = [0]
for user_to_remove in tqdm(users_subs_lcc):
edges_with_user = user_edges_subs_lcc[user_to_remove]
edges_to_remove = []
for edge_name in edges_with_user:
edge = G_subs_lcc_targeted.edges[edge_name]
edge["common_users"].remove(user_to_remove)
if len(edge["common_users"]) == 0:
edges_to_remove.append(edge_name)
G_subs_lcc_targeted.remove_edges_from(edges_to_remove)
subs_lcc_efficiencies_targeted_num_edges_removed.append(len(edges_to_remove))
if len(edges_to_remove) > 0:
subs_lcc_efficiencies_targeted.append(nx.global_efficiency(G_subs_lcc_targeted))
else:
subs_lcc_efficiencies_targeted.append(subs_lcc_efficiencies_targeted[-1])
if G_subs_lcc_targeted.number_of_edges() == 0:
break
100%|██████████| 30950/30950 [1:01:32<00:00, 8.38it/s]
Let's plot the global efficiency of the network in both scenarios.
plt.figure(figsize=(13, 10))
plt.plot(subs_lcc_efficiencies_random, label="Random")
plt.plot(subs_lcc_efficiencies_targeted, label="Targeted")
plt.title("Global efficiency of the subreddits LCC")
plt.xlabel("Number of users removed")
plt.ylabel("Global efficiency")
plt.legend()
plt.show()
We can see that the network is relatively resilient to random failures. However, that is not the case for targeted attacks: when the most active users are removed from the network, the global efficiency drops significantly. This is expected, as it is quite common for social networks to exhibit Zipf-like behavior; in this case it means that a small number of users is responsible for the majority of the network's connections. Keep in mind that (as stated before) it is quite unlikely that the most active users would stop using Reddit, so we can consider the network to be quite robust.
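A rough check of that claim (my addition): the fraction of projection edges that involve at least one of the most active 1% of users.
# Share of edges in the subreddits LCC touched by the top 1% most active users.
sorted_users = num_subs_per_user_subs_lcc.sort_values(by="num_subs", ascending=False)
top_1_percent = sorted_users.head(max(1, len(sorted_users) // 100))["author"].tolist()
edges_touched = set()
for user in top_1_percent:
    edges_touched.update(user_edges_subs_lcc[user])
print(f"Edges touched by the top 1% most active users: {len(edges_touched) / G_subs_lcc.number_of_edges():.2%}")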
It would be interesting to compare the network to other random network models in the context of robustness. However, that would not be easy, as I do not analyze typical percolation or robustness scenarios, but rather a mix of them with a somewhat edge-weighted approach.
Diffusion analysis (similarly to the previous sections) will be performed only on the largest connected component of the subreddits projection, as the size of the users projection unfortunately makes the analysis very time-consuming.
Diffusion in the context of the subreddits projection can be understood as a process of spreading information through the network. The spreading of information can be modeled as a process of spreading of a virus. In this case the virus would be the information and the nodes would be the subreddits connected in the projection by the users that posted in them.
I will model the phenomenon of diffusion similarly to a contact SI model, but with a twist. In the SI model, the nodes can be in one of two states: susceptible or infected. Similarly, I will define two states for the subreddits:
- S - the information can be spread to the subreddit, but it cannot be spread from it,
- I - the information can spread from it to other subreddits.

The difference between the SI model and the model I will use is that in the SI model the probability of spreading the virus is constant, while in my approach the probability of spreading the information decreases with time, as it is more likely that a user would see a new post than an old one. The probability will also depend on the number of common users between the subreddits. I have also assumed that the "infection" of a subreddit that has already been exposed to the information can be "refreshed" (the information can be re-spread to it).
Note that sometimes I will refer to the information as a virus, and to the nodes to which the information was spread as "infected" but it is only for the sake of simplicity.
# Create the network
G_subs_lcc = nx.from_pandas_edgelist(sub_sub_common_users_subs_lcc, source="subreddit_1", target="subreddit_2", edge_attr="common_users")
# Change common_users attribute to the number of common users
for edge in G_subs_lcc.edges(data=True):
edge[2]["common_users"] = len(edge[2]["common_users"])
To visualize the network, I exported it from Cytoscape and extracted the node positions to a CSV file. That way we can use a good-looking layout with the nodes grouped according to their communities.
node_pos = pd.read_csv(os.path.join(NETWORKS_PATH, "subreddits_lcc_communities_node_collection_from_cyto.csv"))
node_pos = node_pos.set_index("subreddit")
# Convert the dataframe to a dictionary and to a format that can be used by networkx
node_pos_dict = node_pos.to_dict(orient="index")
node_pos_dict = {k: (v["x"], v["y"]) for k, v in node_pos_dict.items()}
Let's draw the network to see if the positions are correct.
plt.figure(figsize=(15, 10))
plt.gca().set_aspect("equal", adjustable="box")
nx.draw(
G_subs_lcc,
pos=node_pos_dict,
node_size=10,
width=0.1,
with_labels=False,
)
plt.show()
Let's define the function that calculates the probability of spreading the information between two subreddits.
The function that I chose is an exponential decay function, which is often used to model the decay of a quantity over time. In this case, the quantity is the probability of spreading the information between two subreddits, and the time is the time since the information was posted to the subreddit.
I don't know if it is the best function for this purpose, but it seems to be a good fit.
MAX_PROBABILITY_OF_SPREAD = 0.5
MAX_COMMON_USERS = subs_lcc_data["num_users"].max()
def probability_of_spread(num_common_users: int, time_from_post: int,):
return MAX_PROBABILITY_OF_SPREAD * math.exp(-time_from_post / 20) * (num_common_users / MAX_COMMON_USERS)
nums_users = [1, 5, 10, 20, 50, 90]
plt.figure(figsize=(12, 8))
for num_users in nums_users:
plt.plot(
[probability_of_spread(num_users, time_from_post) for time_from_post in range(50)],
label=f"Number of common users: {num_users}"
)
plt.legend()
plt.title("Probability of spread for different number of common users")
plt.xlabel("Time from post")
plt.ylabel("Probability of spread")
plt.show()
ANIMATION_PATH = os.path.join(os.getcwd(), "diffusion_animation")
Let's define the function that assigns colors to the nodes according to their states. Blue nodes are the ones the information has not reached yet, and red nodes are the ones it has reached. The red nodes get lighter with time, as the probability of spreading the information from them decreases.
MAX_TIME_FROM_POST = 50
def get_node_color(frame, t_spread):
if t_spread == None:
return "blue"
time_from_post = frame - t_spread
time_from_post = time_from_post / MAX_TIME_FROM_POST
return plt.cm.autumn(time_from_post)
# Plot color legend
plt.figure(figsize=(7, 1))
for i in range(100):
plt.scatter(i, 0, color=get_node_color(i, 0))
plt.gca().axes.get_yaxis().set_visible(False)
plt.title("Node color with respect to time from the last spread to it")
plt.xlabel("Time from spread")
plt.show()
Let's run the simulation for 1000 iterations.
NUM_FRAMES = 1000
# None of the nodes have been infected yet
for node in G_subs_lcc.nodes():
G_subs_lcc.nodes[node]["t_spread"] = None
# Choose a random node to start the spread
starting_node = np.random.choice(list(G_subs_lcc.nodes()))
G_subs_lcc.nodes[starting_node]["t_spread"] = 0
num_nodes_spread = []
for frame in tqdm(range(NUM_FRAMES)):
for node in G_subs_lcc.nodes():
t_spread = G_subs_lcc.nodes[node]["t_spread"]
# If the node hasn't been infected, do nothing
if t_spread is not None:
neighbors = G_subs_lcc.neighbors(node)
for neighbor in neighbors:
common_users = G_subs_lcc.edges[node, neighbor]["common_users"]
p_spread = probability_of_spread(common_users, frame - t_spread)
if np.random.random() < p_spread:
# Spread the infection
G_subs_lcc.nodes[neighbor]["t_spread"] = frame
# Save the number of nodes that have been infected
num_nodes_spread.append(len([node for node in G_subs_lcc.nodes() if G_subs_lcc.nodes[node]["t_spread"] is not None]))
# Plot the network
plt.figure(figsize=(10, 10))
plt.gca().set_aspect("equal", adjustable="box")
nx.draw(
G_subs_lcc,
pos=node_pos_dict,
node_size=10,
width=0.1,
with_labels=False,
node_color=[get_node_color(frame, G_subs_lcc.nodes[node]["t_spread"]) for node in G_subs_lcc.nodes()],
)
# Save the plot
plt.title(f"Frame {frame}")
plt.savefig(os.path.join(ANIMATION_PATH, f"frame_{frame}.png"))
# Show the plot every 10 frames
if frame % 10 == 0:
clear_output(wait=True)
plt.show()
plt.clf()
plt.close()
100%|██████████| 1000/1000 [35:34<00:00, 2.13s/it]
Let's combine the frames into a video.
IMG_PATH = os.path.join(os.getcwd(), "img")
filenames = []
for filename in os.listdir(ANIMATION_PATH):
filenames.append(os.path.join(ANIMATION_PATH, filename))
# Sort the frames numerically by their frame index so they are appended in chronological order
filenames.sort(key=lambda x: int(x.split("_")[-1].split(".")[0]))
mp4_writer = imageio.get_writer(os.path.join(IMG_PATH, "diffusion_animation.mp4"), fps=15)
for filename in tqdm(filenames):
mp4_writer.append_data(imageio.imread(filename))
mp4_writer.close()
0%| | 0/1000 [00:00<?, ?it/s]
C:\Users\steci\AppData\Local\Temp\ipykernel_17996\3936832062.py:10: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
mp4_writer.append_data(imageio.imread(filename))
IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (1000, 1000) to (1008, 1008) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).
100%|██████████| 1000/1000 [00:54<00:00, 18.50it/s]
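The warnings above also suggest how they could be avoided: importing the v2 API explicitly silences the DeprecationWarning, and passing macro_block_size=1 keeps the frames at their original size instead of resizing them to 1008x1008. A possible variant of the writer setup (a sketch based on the warning messages, not part of the original run):
import imageio.v2 as iio_v2  # explicit v2 API, as recommended by the DeprecationWarning

# Same frame-combining step, but without the deprecation warning and the automatic resize
mp4_writer = iio_v2.get_writer(
    os.path.join(IMG_PATH, "diffusion_animation.mp4"),
    fps=15,
    macro_block_size=1,  # keep the 1000x1000 frames (may reduce compatibility with some players)
)
for filename in tqdm(filenames):
    mp4_writer.append_data(iio_v2.imread(filename))
mp4_writer.close()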
num_nodes_spread_np = np.array(num_nodes_spread)
# Normalize to the fraction of the network that has been reached
num_nodes_spread_np = num_nodes_spread_np / G_subs_lcc.number_of_nodes()
plt.figure(figsize=(12, 8))
plt.plot(num_nodes_spread_np, label="Infected nodes")
plt.plot(1 - num_nodes_spread_np, label="Uninfected nodes")
plt.title("Fraction of the network to which the information has spread over time")
plt.xlabel("Time")
plt.ylabel("Fraction of the network")
plt.legend()
plt.show()
print(f"Fraction of the network infected after {NUM_FRAMES} frames: {round(num_nodes_spread_np[-1], 4)}")
Fraction of the network infected after 1000 frames: 0.9826
We can see that the results somewhat resemble those obtained with classic SI models. 1000 iterations were not enough to spread the information to the whole network with the parameters I chose. We can also see that the information needed quite some time to start spreading, but once it did, it spread very quickly. This probably depends heavily on the starting node that was chosen.
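To check how strongly the outcome depends on the starting node, the same spreading rule could be re-run (without the per-frame plotting) from several randomly chosen starting nodes and the resulting curves compared. Below is a minimal sketch of that idea; the helper run_spread and the choice of 10 starting nodes and 300 frames are illustrative assumptions, not part of the original analysis:
def run_spread(G, start_node, num_frames):
    # Re-runs the same update rule as the simulation above and returns
    # the fraction of nodes reached after each frame
    t_spread = {node: None for node in G.nodes()}
    t_spread[start_node] = 0
    fractions = []
    for frame in range(num_frames):
        for node in G.nodes():
            if t_spread[node] is None:
                continue
            for neighbor in G.neighbors(node):
                common_users = G.edges[node, neighbor]["common_users"]
                if np.random.random() < probability_of_spread(common_users, frame - t_spread[node]):
                    t_spread[neighbor] = frame
        fractions.append(sum(t is not None for t in t_spread.values()) / G.number_of_nodes())
    return fractions

plt.figure(figsize=(12, 8))
for start_node in np.random.choice(list(G_subs_lcc.nodes()), size=10, replace=False):
    plt.plot(run_spread(G_subs_lcc, start_node, 300), label=f"Start: {start_node}")
plt.title("Fraction of the network reached over time for different starting nodes")
plt.xlabel("Time")
plt.ylabel("Fraction of the network")
plt.legend()
plt.show()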
Other analyses that could be performed in the future:
In conclusion, this project delved into the interactions between users and subreddits within a bipartite Reddit network. Through the analysis, various aspects of this network were explored.
By examining the node degree distributions, we gained insights into the patterns of user-subreddit connections, identifying hubs and peripheral nodes within the network.
Centrality measures helped us understand the importance and influence of individual nodes, highlighting key users and subreddits that play significant roles in information dissemination and interaction dynamics.
The identification of communities within the network shed light on the thematic clusters and groupings of subreddits, enabling a deeper understanding of the social dynamics and shared interests present in the Reddit community.
Additionally, the assessment of network robustness provided valuable insights into the network's resilience to node or link removal, informing strategies for network optimization and stability.
Exploring diffusion processes within the network allowed us to investigate how information or influence spreads among users and subreddits.
Moreover, comparing the network with various random network models provided a benchmark for evaluating its structural characteristics, highlighting its uniqueness and uncovering specific properties that differentiate it from random structures.
Overall, this project presents a comprehensive analysis of a user-subreddit bipartite network, uncovering valuable insights into its structure and dynamics. The findings contribute to the growing field of social network analysis, offering a deeper understanding of online communities, information flow, and social interactions within the Reddit platform.
I hereby declare that all of my code, text, and figures were produced by myself.