SNA Project - Adam Steciuk¶

Part 1¶

Data collection¶

I decided to create the dataset used for the project myself, using the Reddit API through its Python wrapper library, PRAW.

The data I wanted to collect was posts from different subreddits. My methodology for collecting the data was as follows:

  1. Start with a queue with one subreddit.
  2. Take the first subreddit from the queue.
  3. Collect the top NUM_POSTS_FROM_SUB posts from the subreddit.
  4. Discard posts from users that have fewer than MIN_TIMES_POSTED posts among those NUM_POSTS_FROM_SUB posts.
  5. For each remaining user, collect the top NUM_POSTS_OF_USER posts from their profile and add the subreddits they posted in to the queue if they aren't already there.
  6. Repeat steps 2-5 until the queue is empty or the subreddits being processed are further than MAX_DEPTH from the starting subreddit. (A condensed sketch of this loop is shown right below; the full implementation follows later in this section.)
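
The sketch below condenses this procedure into a breadth-first crawl over subreddits. It is only an outline, with no error handling and no saving of intermediate results (the full implementation later in this section adds both); reddit stands for an authenticated praw.Reddit instance and the parameters correspond to the constants defined below.

from collections import deque

def crawl(reddit, start_sub, max_depth=5, num_posts_from_sub=500,
          num_posts_of_user=5, min_times_posted=2):
    queue, depths, posts_by_sub = deque([start_sub]), {start_sub: 0}, {}
    while queue:
        sub = queue.popleft()  # step 2: take the next subreddit from the queue
        posts = list(reddit.subreddit(sub).top(limit=num_posts_from_sub, time_filter="all"))  # step 3

        # Step 4: keep only posts whose author appears at least min_times_posted times in this batch
        counts = {}
        for p in posts:
            if p.author is not None:
                counts[p.author.name] = counts.get(p.author.name, 0) + 1
        kept = [p for p in posts if p.author is not None and counts[p.author.name] >= min_times_posted]
        posts_by_sub[sub] = kept

        # Step 5: enqueue unseen subreddits that the remaining authors also posted in
        if depths[sub] < max_depth:  # step 6: do not expand past MAX_DEPTH
            authors = {p.author.name: p.author for p in kept}
            for author in authors.values():
                for s in author.submissions.top(limit=num_posts_of_user, time_filter="all"):
                    name = s.subreddit.display_name
                    if name not in depths:
                        depths[name] = depths[sub] + 1
                        queue.append(name)
    return posts_by_sub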

In order for the following code to work, you need to create a file called reddit_secrets.py in the same directory as this notebook. The file should contain the secrets you got from Reddit when you created your Reddit app, and should look like this:

CLIENT_ID = "XXXXXXXXXXXXXXXXXXX"
CLIENT_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"
USER_AGENT = "your_app_name"
In [2]:
import praw
import os
import pickle
import csv
import random
import math
import imageio

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm
from tabulate import tabulate
from collections import defaultdict
from copy import deepcopy
from IPython.display import clear_output

# Import the secrets
from reddit_secrets import CLIENT_ID, CLIENT_SECRET, USER_AGENT
In [3]:
# Directory for raw data
DATA_DIRECTORY = 'data'
# Directory for processed data used in Cytoscape
NETWORKS_DIRECTORY = 'networks'

DATA_PATH = os.path.join(os.getcwd(), DATA_DIRECTORY)
NETWORKS_PATH = os.path.join(os.getcwd(), NETWORKS_DIRECTORY)

if not os.path.exists(DATA_PATH):
    os.mkdir(DATA_PATH)

if not os.path.exists(NETWORKS_PATH):
    os.mkdir(NETWORKS_PATH)

The script below was running for about 8h on my machine before it encountered an internal server error from Reddit. Unfortunately, it didn't even manage to process all subreddits at depth 2. I therefore added a mechanism that allows me to resume the process from the last subreddit that was processed, which also lets me stop and resume the process on demand.

At the start, the script tries to load the saved state from the pickle file. If that fails, the file doesn't exist and the script starts from scratch; if it succeeds, the script resumes from the last subreddit that was processed.

In [4]:
SCRIPT_SAVE_PATH = os.path.join(os.getcwd(), 'script_save.pkl')
script_save = None
try:
    with open(SCRIPT_SAVE_PATH, 'rb') as f:
        script_save = pickle.load(f)

    print("Loaded script save. Resuming...")
    print("NUM_POSTS_FROM_SUB:", script_save["NUM_POSTS_FROM_SUB"])
    print("NUM_POSTS_OF_USER:", script_save["NUM_POSTS_OF_USER"])
    print("MIN_TIMES_POSTED:", script_save["MIN_TIMES_POSTED"])
    print("MAX_DEPTH:", script_save["MAX_DEPTH"])
    print("Number of subreddits in queue:", len(script_save["sub_q"]))
    print("Number of posts saved so far:", script_save["num_posts_saved"])
except:
    print("No script save found. Starting from scratch...")
    

NUM_POSTS_FROM_SUB = 500 if script_save is None else script_save["NUM_POSTS_FROM_SUB"]
NUM_POSTS_OF_USER = 5 if script_save is None else script_save["NUM_POSTS_OF_USER"]
MIN_TIMES_POSTED = 2 if script_save is None else script_save["MIN_TIMES_POSTED"]
MAX_DEPTH = 5 if script_save is None else script_save["MAX_DEPTH"]

sub_q = ["programming"] if script_save is None else script_save["sub_q"]
sub_depths = {sub_q[0]: 0} if script_save is None else script_save["sub_depths"]
skipped_subs = [] if script_save is None else script_save["skipped_subs"]

reddit = praw.Reddit(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, user_agent=USER_AGENT)

num_posts_saved = 0 if script_save is None else script_save["num_posts_saved"]

# BFS; take the first subreddit from the queue.
while len(sub_q) > 0 and (sub:=sub_q.pop(0)):
    print("=========================================")
    print(f"Processing '{sub}' on depth {sub_depths[sub]}")
    print(f"Queue size: {len(sub_q)}")
    print(f"Num posts saved so far: {num_posts_saved}")

    posts = None
    try:
        # Download posts from the subreddit
        posts = list(reddit.subreddit(sub).top(limit=NUM_POSTS_FROM_SUB, time_filter="all"))
    except:
        print(f"ERROR: Cannot access '{sub}'")
        skipped_subs.append(sub)

    if posts is not None:
        if len(posts) < NUM_POSTS_FROM_SUB:
            print(f"Only {len(posts)} posts found")

        data_df = pd.DataFrame(
            [[post.title, post.score, post.id, post.url, post.num_comments, post.created, post.author, post.upvote_ratio, post.permalink, post.subreddit, post.subreddit_subscribers, sub_depths[sub]] for post in posts],
            columns=["title", "score", "id", "url", "num_comments", "created", "author", "upvote_ratio", "permalink", "subreddit", "subreddit_subscribers", "depth"],
        )

        # Filter out posts made by deleted users
        data_df = data_df[data_df["author"].notna()]

        # Keep only authors that posted at least MIN_TIMES_POSTED times
        data_df["author_name"] = data_df["author"].apply(lambda x: x.name)
        data_df = data_df.groupby("author_name").filter(lambda x: len(x) >= MIN_TIMES_POSTED)
        data_df = data_df.drop(columns=["author_name"])

        authors = data_df["author"].unique()
        num_posts = len(data_df)
        num_posts_saved += num_posts
        print(f"Num posts after filtering out: {num_posts} from {len(authors)} authors")

        # Check if we reached the max depth
        if sub_depths[sub] >= MAX_DEPTH:
            print("Max depth reached")
        else:
            for author in authors:
                try:
                    # Try to get submissions of the author
                    user_submissions = list(author.submissions.top(limit=NUM_POSTS_OF_USER, time_filter="all"))

                    # Extract subreddits from top user submissions and add them to the queue
                    for submission in user_submissions:
                        sub_name = submission.subreddit.display_name
                        if sub_name not in sub_depths:
                            sub_q.append(sub_name)
                            sub_depths[sub_name] = sub_depths[sub] + 1
                except:
                    print(f"ERROR: User submissions are private for '{author}'")

        # Save dataframe to csv
        data_df.to_csv(f"{DATA_PATH}/posts_{sub}.csv", index=False)

    # Save the script state to be able to resume in case of an error
    script_save = {
        "NUM_POSTS_FROM_SUB": NUM_POSTS_FROM_SUB,
        "NUM_POSTS_OF_USER": NUM_POSTS_OF_USER,
        "MIN_TIMES_POSTED": MIN_TIMES_POSTED,
        "MAX_DEPTH": MAX_DEPTH,
        "sub_q": sub_q,
        "sub_depths": sub_depths,
        "num_posts_saved": num_posts_saved,
        "skipped_subs": skipped_subs,
    }

    with open(SCRIPT_SAVE_PATH, 'wb') as f:
        pickle.dump(script_save, f)
Loaded script save. Resuming...
NUM_POSTS_FROM_SUB: 500
NUM_POSTS_OF_USER: 5
MIN_TIMES_POSTED: 2
MAX_DEPTH: 5
Number of subreddits in queue: 7314
Number of posts saved so far: 149894

I decided to stop the script manually after it collected around 150k posts even though it didn't reach the maximum depth. It took about 20h to collect that amount of data. I think that's enough for the project.

Data processing and analysis¶

In [13]:
# Create the main dataframe
posts_df = pd.DataFrame()

# Load all the raw csv files into the main dataframe
for _root, _dirs, files in os.walk(DATA_PATH):
    for file in files:
        if file.endswith(".csv"):
            posts_df = pd.concat([posts_df, pd.read_csv(os.path.join(DATA_PATH, file))], ignore_index=True)

display(posts_df.info())
display(posts_df.sample(5))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149894 entries, 0 to 149893
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   title                  149894 non-null  object 
 1   score                  149894 non-null  object 
 2   id                     149894 non-null  object 
 3   url                    149894 non-null  object 
 4   num_comments           149894 non-null  object 
 5   created                149894 non-null  float64
 6   author                 149894 non-null  object 
 7   upvote_ratio           149894 non-null  float64
 8   permalink              149894 non-null  object 
 9   subreddit              149894 non-null  object 
 10  subreddit_subscribers  149894 non-null  object 
 11  depth                  149894 non-null  object 
dtypes: float64(2), object(10)
memory usage: 13.7+ MB
None
title score id url num_comments created author upvote_ratio permalink subreddit subreddit_subscribers depth
88318 This looks like plastic, feels like plastic, b... 117971 kg5yxj https://v.redd.it/9oq4dntgl4661 2390 1.608376e+09 mohiemen 0.91 /r/nextfuckinglevel/comments/kg5yxj/this_looks... nextfuckinglevel 7785630 2
38468 Three Free EE Textbooks 106 9bggew https://www.reddit.com/r/ECE/comments/9bggew/t... 16 1.535603e+09 itstimeforanexitplan 0.99 /r/ECE/comments/9bggew/three_free_ee_textbooks/ ECE 154880 3
115676 Failed Attempt by a Security Guard to Stop a F... 61127 97jkm2 https://i.imgur.com/SLs41rI.gifv 1106 1.534350e+09 BunyipPouch 0.92 /r/sports/comments/97jkm2/failed_attempt_by_a_... sports 20617425 2
99413 McConnell blocks House bill to reopen governme... 85236 agabcf https://thehill.com/homenews/senate/425414-mcc... 7360 1.547570e+09 emitremmus27 0.85 /r/politics/comments/agabcf/mcconnell_blocks_h... politics 8294111 1
96225 PsBattle: A sculpture of a woman made out of w... 42731 bueyhn https://i.redd.it/r72ugyjao5131.jpg 501 1.559138e+09 fjordfjord 0.88 /r/photoshopbattles/comments/bueyhn/psbattle_a... photoshopbattles 18270824 2

The data I decided to collect has the following columns:

  • title - title of the post
  • score - score of the post
  • id - id of the post
  • url - url to content shared in the post (image, video, etc.)
  • num_comments - number of comments on the post
  • created - timestamp of the post creation
  • author - author of the post
  • upvote_ratio - fraction of upvotes among all votes on the post
  • permalink - url to the post
  • subreddit - subreddit the post was posted in
  • subreddit_subscribers - number of subscribers to the subreddit the post was posted in
  • depth - depth of the subreddit from the starting subreddit

While creating the networks presented in the next sections of the project I discovered some anomalies. After several hours of investigation I realized that some subreddit names and usernames are identical, which leads to problems with node identification in the network. I decided to add a prefix to subreddit names and usernames to avoid those problems: r/ for subreddits and u/ for usernames. This solves the problem, as / is not allowed in subreddit names or usernames.

In [14]:
posts_df["subreddit"] = posts_df["subreddit"].apply(lambda x: f"r/{x}")
posts_df["author"] = posts_df["author"].apply(lambda x: f"u/{x}")
posts_df.sample(5)
Out[14]:
title score id url num_comments created author upvote_ratio permalink subreddit subreddit_subscribers depth
42861 Ferrari World looks like a virus infecting the... 21742 cqx8gj https://i.imgur.com/bolY368.jpg 412 1.565909e+09 u/Ayo-Glam 0.92 /r/evilbuildings/comments/cqx8gj/ferrari_world... r/evilbuildings 1084637 3
113295 Knot (by More and More) 14219 9h55p7 https://gfycat.com/DefinitiveTepidGalapagosmoc... 284 1.537364e+09 u/KevlarYarmulke 0.98 /r/Simulated/comments/9h55p7/knot_by_more_and_... r/Simulated 1233460 2
101604 It's already been a year since Neil Peart pass... 149 ks67ex https://youtu.be/EsBNzf5JlZA 4 1.609996e+09 u/juanp2350 0.99 /r/progrockmusic/comments/ks67ex/its_already_b... r/progrockmusic 51318 2
73178 I spyk ze engliš very gud. 758 flqvzz https://i.redd.it/6hdtigua5sn41.jpg 70 1.584689e+09 u/KyouHarisen 0.98 /r/lithuania/comments/flqvzz/i_spyk_ze_engliš_... r/lithuania 90712 2
83625 Extra horsepower won't do any harm - GG 816 ic53kb https://i.redd.it/ulkwj6btmsh51.jpg 54 1.597771e+09 u/DontKillUncleBen 0.98 /r/motogp/comments/ic53kb/extra_horsepower_won... r/motogp 289502 2
In [8]:
num_subredits = len(posts_df["subreddit"].unique())
num_authors = len(posts_df["author"].unique())
num_posts = len(posts_df)

print(f"Collected {num_posts} posts from {num_subredits} subreddits and {num_authors} authors")
Collected 149894 posts from 1030 subreddits and 32311 authors

Let's analyze the number of posts collected from each subreddit.

In [9]:
num_posts_per_sub = posts_df.groupby("subreddit").size().reset_index(name="num_posts")
display(num_posts_per_sub.sample(10))

# Plot the density of the number of posts per subreddit
sns.histplot(num_posts_per_sub["num_posts"], stat="density", bins=100, kde=True)
plt.title("Density of the number of posts per subreddit")
plt.xlabel("Number of posts")
plt.ylabel("Density")
plt.show()
subreddit num_posts
904 r/trance 181
450 r/assassinscreed 167
315 r/PornhubComments 47
728 r/netsec 129
590 r/france 128
729 r/nevertellmetheodds 84
322 r/ProgrammingLanguages 237
860 r/submechanophobia 74
910 r/trees 65
627 r/hiphopheads 114

Because we decided to collect 500 posts from each subreddit (NUM_POSTS_FROM_SUB) and then discarded posts from users that have fewer than 2 posts among them (MIN_TIMES_POSTED), the number of posts per subreddit will always be 500 or less. If close to all 500 posts survive the filter, almost every top post was made by an author who appears at least twice among those 500. That would be highly unlikely for popular subreddits visited by many different people, so the subreddits from which we collected close to 500 posts are probably less popular. Let's check that.

In order to do that, let's create a dataframe with the number of subscribers of each subreddit. We have to take the mean and then round the result, because the number of subscribers could have changed during the data collection process.

In [10]:
num_subscribers_per_sub = posts_df.groupby("subreddit")["subreddit_subscribers"].mean().round().astype(int).reset_index(name="subscribers")
display(num_subscribers_per_sub.sample(10))
subreddit subscribers
902 r/totalwar 385442
632 r/holdmyjuicebox 745395
763 r/perfectloops 667931
331 r/Repsneakers 754961
291 r/PHP 156263
838 r/skyrim 1445471
211 r/Jokes 25595814
226 r/LateStageCapitalism 837508
863 r/suspiciouslyspecific 1257476
485 r/brooklynninenine 709603

Let's plot the number of posts collected from each subreddit against the number of subscribers to that subreddit. We have to use a log scale for the x axis because the number of subscribers is heavily skewed. As we cannot have a value of 0 on a log scale, we replace a subscriber count of 0 with 1.

In [11]:
num_posts_and_subscribers_per_sub = num_posts_per_sub.merge(num_subscribers_per_sub, on="subreddit")
num_posts_and_subscribers_per_sub["subscribers"] = num_posts_and_subscribers_per_sub["subscribers"].apply(lambda x: 1 if x == 0 else x)

plt.figure(figsize=(20, 10))
sns.scatterplot(data=num_posts_and_subscribers_per_sub, x="subscribers", y="num_posts")
plt.xscale("log")
plt.title("Number of posts vs number of subscribers for each subreddit")
plt.xlabel("Number of subscribers")
plt.ylabel("Number of posts")
# Set the xticks, taking into account the trick of changing the 0 to 1
plt.xticks([10**i for i in range(9)], ["0", "10", "100", "1k", "10k", "100k", "1M", "10M", '100M'])

# Plot the line to mark the tendency of the data
plt.plot([10**i for i in range(3, 8)], [500/4*(4-i) for i in range(5)], color="red", linestyle="--")

plt.show()

We can see that the hypothesis is somewhat correct: points in the densest part of the plot tend to align with the red dashed line.

Let's plot the distribution of the number of subscribers of the subreddits from which we collected posts. We have already created the dataframe for the previous plot, so we can just reuse it. The plot below shows why we had to use a log scale for the previous plot.

In [12]:
sns.histplot(num_subscribers_per_sub["subscribers"], log_scale=(False, True), stat="density", bins=100)
plt.title("Distribution of the number of subscribers per subreddit")
plt.xlabel("Number of subscribers")
plt.ylabel("Density")
plt.show()

Let's see the distribution of the number of subreddits each user posted in. That is important, as the number of subreddits a user posted in is the degree of the node representing that user in the bipartite network presented in the next section.

In [15]:
num_subs_per_user = posts_df.groupby(["author", "subreddit"]).size().groupby("author").size().sort_values(ascending=False).reset_index(name="num_subs")
display(num_subs_per_user.head(10))

# Plot the density of the number of subreddits per author in log scale
sns.histplot(num_subs_per_user["num_subs"], discrete=True, stat="density", log_scale=(False, True))
plt.xlabel("Number of subreddits")
plt.title("Density of the number of subreddits per user")
plt.show()
author num_subs
0 u/My_Memes_Will_Cure_U 63
1 u/Master1718 60
2 u/memezzer 49
3 u/KevlarYarmulke 47
4 u/5_Frog_Margin 45
5 u/GallowBoob 40
6 u/Scaulbylausis 36
7 u/kevinowdziej 33
8 u/icant-chooseone 29
9 u/AristonD 28

We can see that user u/My_Memes_Will_Cure_U is the user that posted in the most subreddits. That means that the node representing that user will have the highest degree among all nodes representing users in the bipartite network.

Let's see the distribution of the number of posts collected for each user.

In [14]:
# Plot the distribution of the number of posts per author
num_posts_per_user = posts_df.groupby("author").size().sort_values(ascending=False).reset_index(name="num_posts")
display(num_posts_per_user.head(10))

sns.histplot(num_posts_per_user["num_posts"], stat="density", bins=100)
plt.yscale("log")
plt.title("Density of the number of posts per author")
plt.xlabel("Number of posts")
plt.ylabel("Density")
plt.show()
author num_posts
0 u/SrGrafo 1077
1 u/GallowBoob 1069
2 u/Andromeda321 775
3 u/Yellyvi 763
4 u/My_Memes_Will_Cure_U 725
5 u/Unicornglitteryblood 516
6 u/pdwp90 506
7 u/ZadocPaet 485
8 u/mvea 450
9 u/flovringreen 430

We can see that the user u/My_Memes_Will_Cure_U is quite high up on the list. Another interesting observation is an almost empty region on the x axis between roughly 500 and 1000 posts, followed by a sudden spike. That means most users are somewhat active, a few users are extremely active on Reddit, and there aren't many users in between.

We can also plot the number of posts each user posted against the number of subreddits each user posted in. In that way we can see if there is any correlation between those two values. We will also plot the line y = x/2 as each user had to post at least 2 (MIN_TIMES_POSTED) times in a given subreddit to be included in the dataset.

In [15]:
# Plot the number of posts of each user against the number of subreddits they posted in
num_posts_and_subs_per_user = num_posts_per_user.merge(num_subs_per_user, on="author")

plt.figure(figsize=(20, 10))
sns.scatterplot(data=num_posts_and_subs_per_user, x="num_posts", y="num_subs", alpha=0.3)
plt.xscale("log")
plt.title("Number of subreddits vs number of posts for each user")
plt.xlabel("Number of posts")
plt.ylabel("Number of subreddits")

# Plot y = x/2
x = [i for i in range(num_subs_per_user["num_subs"].max() * 2)]
plt.plot(x, [i/2 for i in x], color="red", linestyle="--")

plt.show()

The number of users that posted in each subreddit will also be important for the bipartite network, as it determines the degree of the node representing that subreddit. Let's see its distribution. Remember that we discarded users that posted fewer than 2 (MIN_TIMES_POSTED) times in a subreddit and collected 500 (NUM_POSTS_FROM_SUB) posts from each subreddit, so the maximum number of users per subreddit is 250 (which would mean every kept user created exactly 2 posts among those top 500).

In [16]:
num_users_per_sub = posts_df.groupby(["subreddit", "author"]).size().groupby("subreddit").size().sort_values(ascending=False).reset_index(name="num_users")
display(num_users_per_sub.head(10))

sns.histplot(num_users_per_sub["num_users"], stat="density", bins=100)
plt.title("Density of the number of users per subreddit")
plt.xlabel("Number of users")
plt.ylabel("Density")
plt.show()
subreddit num_users
0 r/generative 93
1 r/Unity2D 92
2 r/avatartrading 81
3 r/Cinema4D 81
4 r/dalmatians 80
5 r/KTMDuke 80
6 r/ukraine 80
7 r/turning 78
8 r/animation 78
9 r/Simulated 78

We can see that the subreddit r/generative is the subreddit from which we collected posts from the highest number of users. That means that the node representing that subreddit will have the highest degree among all nodes representing subreddits in the bipartite network.

We can also notice that the distribution resembles a normal distribution quite well, with an expected value of around 40. There is, however, a huge spike at the value of 1, meaning there are a lot of subreddits from which we collected posts from only one user. That makes sense, as some subreddits have restricted posting permissions and are often used as a private board of a single user.

It is also worth noting that on big subreddits it is hard to get a post into the top 500, and it is quite common that each of those 500 posts was made by a different user. Because I decided to discard posts of users that posted fewer than 2 (MIN_TIMES_POSTED) times in a subreddit, it is possible that often only one user managed to get 2 of their posts into the top 500. If that restriction were removed, the distribution would probably be much heavier on the right side (in the area close to 500).

Bipartite network of subreddits and users¶

Creation¶

The first network I decided to create is a bipartite network of subreddits and users. There is an edge between a subreddit and a user if there are at least MIN_TIMES_POSTED posts from that user in that subreddit. The edge attribute num_posts stores the number of posts from that user in that subreddit; it can later be used as the edge weight if needed.

In [17]:
# Create a dataframe with all the author-subreddit pairs
user_sub_pairs = posts_df.groupby(["author", "subreddit"]).size().reset_index(name="num_posts")
display(user_sub_pairs.head(10))

# Save the author-subreddit pairs to a csv file that could be imported to Cytoscape
user_sub_pairs.to_csv(f"{NETWORKS_PATH}/bipartite.csv", index=False)
author subreddit num_posts
0 u/--5- r/india 2
1 u/--CreativeUsername r/Physics 2
2 u/--Fatal-- r/homelab 2
3 u/--MVH-- r/Netherlands 4
4 u/--Speed-- r/logodesign 2
5 u/--UNFLAIRED-- r/carscirclejerk 2
6 u/--Yami_Marik-- r/WatchPeopleDieInside 3
7 u/--Yami_Marik-- r/holdmycosmo 2
8 u/-AMARYANA- r/Awwducational 2
9 u/-AMARYANA- r/Buddhism 7

The next step was to create CSV files containing the nodes' attributes, which can be used to style the network in Cytoscape. The first file contains the attributes of the subreddits:

  • id - id of the node
  • is_user - boolean value indicating if the node is a subreddit or a user, in this case it is always false
  • subscribers - number of subscribers of a subreddit
  • num_posts - number of posts collected from that subreddit
  • num_users - number of users that posted in that subreddit

Because we have already created a dataframe with the number of subscribers of each subreddit, we can reuse it to create the CSV file.

In [18]:
sub_data = num_subscribers_per_sub.copy()
sub_data = sub_data.merge(num_posts_per_sub, on="subreddit")
sub_data = sub_data.merge(num_users_per_sub, on="subreddit")
sub_data = sub_data.rename(columns={"subreddit": "id"})
sub_data = sub_data.sort_values(by="subscribers", ascending=False)
# Add column `is_user` with value `False` to indicate that the nodes are subreddits
sub_data["is_user"] = False

display(sub_data.head(10))

# Save the dataframe to a csv file
sub_data.to_csv(f"{NETWORKS_PATH}/bipartite_sub_data.csv", index=False)
id subscribers num_posts num_users is_user
442 r/announcements 202719824 138 21 False
596 r/funny 48108476 60 17 False
42 r/AskReddit 40249936 13 4 False
604 r/gaming 36492322 75 22 False
461 r/aww 33655974 112 31 False
273 r/Music 32043294 145 37 False
1020 r/worldnews 31254656 133 37 False
899 r/todayilearned 31041441 114 39 False
720 r/movies 30617572 252 35 False
772 r/pics 29880182 72 24 False

The second file contains the attributes of the users:

  • id - id of the node
  • is_user - boolean value indicating if the node is a subreddit or a user, in this case it is always true
  • total_score - total score of all posts of a user
  • num_posts - total number of posts of a user
In [19]:
# Create a dataframe with author data
user_data = posts_df.groupby("author")["score"].sum().sort_values(ascending=False).reset_index()
user_data = user_data.merge(num_posts_per_user, on="author")
user_data = user_data.rename(columns={"score": "total_score", "author": "id"})

user_data["is_user"] = True
display(user_data.head(10))

# Save the dataframe to a csv file that could be imported to Cytoscape
user_data.to_csv(f"{NETWORKS_PATH}/bipartite_user_data.csv", index=False)
id total_score num_posts is_user
0 u/My_Memes_Will_Cure_U 28764321 725 True
1 u/beerbellybegone 24308427 260 True
2 u/mvea 20158958 450 True
3 u/GallowBoob 18798098 1069 True
4 u/SrGrafo 16470408 1077 True
5 u/Master1718 15458144 403 True
6 u/DaFunkJunkie 15101854 202 True
7 u/memezzer 11689024 283 True
8 u/unnaturalorder 10315964 158 True
9 u/kevinowdziej 9681202 187 True

I imported the user_sub_pairs dataframe from the previous section into Cytoscape, but because the number of nodes was too large, Cytoscape was not able to compute even the default (initial) layout: all the nodes were stacked on top of each other. I tried different layouts, but the only one that finished computing was the circular layout, so I used it for the visualization of the network.

Because of the visualization problems, I also created a smaller network that can be laid out with other algorithms. After filtering out posts from subreddits at depth 3 and above, and posts in subreddits with fewer than 500,000 subscribers, we are left with 57,800 posts. Cytoscape is able to process the number of nodes created from those posts.

In [20]:
posts_df_filtered = posts_df[posts_df["depth"] <= 2]
print("Num of posts after 'depth <= 2':", len(posts_df_filtered))
posts_df_filtered = posts_df_filtered[posts_df_filtered["subreddit_subscribers"] >= 500000]
print("Num of posts after 'subreddit_subscribers >= 500000':", len(posts_df_filtered))

user_sub_pairs_filtered = posts_df_filtered.groupby(["author", "subreddit"]).size().reset_index(name="num_posts")
display(user_sub_pairs_filtered.head(10))

# Save the author-subreddit pairs to a csv file that could be imported to Cytoscape
user_sub_pairs_filtered.to_csv(f"{NETWORKS_PATH}/bipartite_filtered.csv", index=False)
Num of posts after 'depth <= 2': 121426
Num of posts after 'subreddit_subscribers >= 500000': 57800
author subreddit num_posts
0 u/--5- r/india 2
1 u/--CreativeUsername r/Physics 2
2 u/--Fatal-- r/homelab 2
3 u/--Yami_Marik-- r/WatchPeopleDieInside 3
4 u/--Yami_Marik-- r/holdmycosmo 2
5 u/-AMARYANA- r/Awwducational 2
6 u/-AMARYANA- r/Buddhism 7
7 u/-AMARYANA- r/Futurology 3
8 u/-AMARYANA- r/spaceporn 2
9 u/-ARIDA- r/photography 2

Cytoscape visualization¶

Below we can see a visualization of the full bipartite network created in Cytoscape. Styles used in the visualization:

  • is_user - discrete mapping to colors: blue for True and red for False. (Red indicates that the node is a subreddit)
  • subscribers - continuous mapping to sizes of the nodes.
  • num_posts - continuous mapping to widths of the edges and their colors. (Darker and thicker edges indicate higher number of posts from a user in a subreddit)
  • total_score - no mapping. I would like to map total_score to node size for users and subscribers to node size for subreddits, but unfortunately I didn't find a way to do that in Cytoscape (a possible data-side workaround is sketched right after this list).
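
One possible data-side workaround, which I did not apply in the project, would be to pre-compute a single size_value column in both attribute files and let Cytoscape map that one column continuously to node size. The column name size_value and the rank-based normalization below are my own assumptions, just a sketch of the idea.

# Hypothetical workaround (not used in the project): a common size_value column
# so that one continuous size mapping covers both node types in Cytoscape.
user_sized = user_data.copy()
sub_sized = sub_data.copy()

# Rank-normalize each attribute to [0, 1] so users and subreddits are on a comparable scale
user_sized["size_value"] = user_sized["total_score"].rank(pct=True)
sub_sized["size_value"] = sub_sized["subscribers"].rank(pct=True)

user_sized.to_csv(f"{NETWORKS_PATH}/bipartite_user_data_sized.csv", index=False)
sub_sized.to_csv(f"{NETWORKS_PATH}/bipartite_sub_data_sized.csv", index=False)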

Bipartite network

As stated above, the circular layout is the only layout that finished computing. Nodes were sorted according to their type. We can see that all edges connect blue nodes (users) to red ones (subreddits).

Below we can see a visualization of the smaller bipartite network created in Cytoscape. Styles used are the same as in the previous visualization.

Bipartite filtered network

Data summary of both networks:

                       # components    # nodes    # users    # subs    # edges
Unfiltered network               51      33 341     32 311     1 030     38 054
 - Largest component              1      31 928     30 950       978     36 691
Filtered network                 38      13 166     12 707       459     15 672
 - Largest component              1      12 507     12 088       419     15 050

We can notice that the number of edges is lower than Metcalfe's law (or its N·log(N) refinement for social networks) would predict: the edge count grows roughly linearly with the number of nodes (~N) rather than as N·log(N).
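
A quick back-of-the-envelope check of that claim, using the numbers from the table above (the natural logarithm here is my choice; the conclusion doesn't depend on the base):

import math

# Edge counts from the table above vs. the ~N and ~N*log(N) reference scalings
for name, n, e in [("Unfiltered", 33341, 38054), ("Filtered", 13166, 15672)]:
    print(f"{name}: E = {e}, E/N = {e / n:.2f}, E/(N*ln N) = {e / (n * math.log(n)):.3f}")

In both cases E/N is close to 1 while E/(N·ln N) is an order of magnitude smaller, consistent with roughly linear growth of the edge count.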

From the visualization of the filtered network we can see that it is mainly disassortative by degree (a lot of star-like structures).

Other metrics relevant for the bipartite network (maximum degree of each partition, degree distributions across users and subreddits, etc.) were calculated in the previous section.

Metrics such as the average degree, average clustering coefficient, and average path length don't make much sense for bipartite networks and won't be analyzed here. I'll focus on them in the following sections, where we analyze the users and subreddits projections.

Let's create the networkx graph of the full network.

In [21]:
G_bipartite = nx.Graph()
# Add nodes to the graph marking their partitions
for row in user_data.iterrows():
    G_bipartite.add_node(
        row[1]["id"],
        bipartite="user",
        total_score=row[1]["total_score"],
        num_posts=row[1]["num_posts"],
    )

for row in sub_data.iterrows():
    G_bipartite.add_node(
        row[1]["id"],
        bipartite="sub",
        subscribers=row[1]["subscribers"],
        num_posts=row[1]["num_posts"],
        num_users=row[1]["num_users"],
    )

# Add edges to the graph
for row in user_sub_pairs.iterrows():
    G_bipartite.add_edge(
        row[1]["author"], row[1]["subreddit"], num_posts=row[1]["num_posts"]
    )

# Add degree as a node attribute
for node in G_bipartite.nodes():
    G_bipartite.nodes[node]["degree"] = G_bipartite.degree[node]

# Check if the graph is indeed bipartite
print(nx.is_bipartite(G_bipartite))
True

Users projection¶

Creation¶

Let's create the users projection of the network. The projection is a graph where the nodes are users and an edge is created between two users if they have at least one subreddit in common. That way we get a network with nodes of a single type, which allows us to analyze additional metrics and compare the network with model networks.

To create the projection we could use the built-in networkx function projected_graph, but it doesn't record the number of common subreddits between two users. Instead we will use the weighted_projected_graph function, which does.

In [22]:
# Create a projection
users_nodes = [node for node in G_bipartite.nodes() if G_bipartite.nodes[node]["bipartite"] == "user"]
G_users = nx.bipartite.weighted_projected_graph(G_bipartite, users_nodes, ratio=False)
In [23]:
print("Number of nodes:", G_users.number_of_nodes())
print("Number of edges:", G_users.number_of_edges())

print("\nSample node:")
print(list(G_users.nodes(data=True))[0])

print("\nSample edge:")
print(list(G_users.edges(data=True))[0])
Number of nodes: 32311
Number of edges: 870230

Sample node:
('u/My_Memes_Will_Cure_U', {'bipartite': 'user', 'total_score': 28764321, 'num_posts': 725, 'degree': 63})

Sample edge:
('u/My_Memes_Will_Cure_U', 'u/Rredite', {'weight': 1})
In [24]:
# Rename the edge attribute 'weight' to 'num_common_subs'
for edge in list(G_users.edges()):
    G_users.edges[edge]["num_common_subs"] = G_users.edges[edge].pop("weight")
In [25]:
num_common_subs = [G_users.edges[edge]["num_common_subs"] for edge in G_users.edges()]
# Plot the distribution of the number of common subs
sns.histplot(num_common_subs, stat="density", discrete=True)
plt.yscale("log")
plt.title("Distribution of the number of common subreddits between users")
plt.xlabel("Number of common subreddits")
plt.ylabel("Density")
plt.minorticks_on()
plt.show()

We can see that the distribution of the number of common subreddits is very 1-heavy, meaning that the vast majority of connected user pairs share only one subreddit. Because of that, this property will not be useful for styling the edges of the network.

In [26]:
# Save edgelist to csv file
with open(f"{NETWORKS_PATH}/users.csv", "w") as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    writer.writerow(["source", "target", "num_common_subs"])
    for edge in G_users.edges(data=True):
        writer.writerow([edge[0], edge[1], edge[2]["num_common_subs"]])

Visualization¶

This network has a number of nodes comparable to the bipartite network, but the number of edges is an order of magnitude larger. The projection is much denser, and because of that Cytoscape had even more issues dealing with it.

Because of that I tried a different tool called Gephi. It is a network visualization and analysis tool similar to Cytoscape, but according to the documentation and other sources found online it can handle much larger networks. Despite that, I still had performance issues and couldn't work with or style the network in the desired way. After spending some time looking for solutions, I found out that the issue was not with the tool itself (it should handle a network of that size) but with the resources available on my computer. Maybe in the future I will try to visualize the whole network on a more powerful machine.

Below is the only visualization I was able to create with Gephi. It uses the OpenOrd layout algorithm, which was created specifically for visualizing large networks.

In order to be able to see individual links, the color of nodes was mapped continuously to the number of posts collected from each user, and then the color of each edge was set to the color of its source node. The styling doesn't convey much information, but it at least lets us see some of the connections.

Gephi visualization of the users projection

Note that the clearly visible dots at the edges of the network are not individual nodes but clusters of many of them. Below you can see a zoomed-in view of some nodes at the edge of the graph.

Zoomed in version of the users projection (edge)

Below you can see a zoomed in version of some nodes at the center of the graph.

Zoomed in version of the users projection (center)

Analysis¶

In [27]:
# Count the number of connected components in the graph
users_components = list(nx.connected_components(G_users))
print("Number of connected components:", len(users_components))
Number of connected components: 51
In [28]:
# Identify the largest connected component
users_components_sorted = sorted(users_components, key=len, reverse=True)
G_users_lcc = G_users.subgraph(users_components_sorted[0])
G_users_2nd_lcc = G_users.subgraph(users_components_sorted[1])

num_edges_complete_graph = G_users.number_of_nodes() * (G_users.number_of_nodes() - 1) / 2

components = [G_users, G_users_lcc, G_users_2nd_lcc]
components_data = []
for component in components:
    components_data.append(
        [
            component.number_of_nodes(),
            component.number_of_edges(),
            round(component.number_of_nodes() / G_users.number_of_nodes() * 100, 4),
            round(component.number_of_edges() / G_users.number_of_edges() * 100, 4),
            round(component.number_of_edges() / num_edges_complete_graph * 100, 4),
        ]
    )

table = [
    [
        "",
        "# nodes",
        "# edges",
        f"node % of\nthe network",
        f"edge % of\nthe network",
        f"edge % of the\ncomplete graph"
    ],
    ["Network", *components_data[0]], 
    ["LC", *components_data[1]],
    ["2nd LC", *components_data[2]],
]

print(tabulate(table, headers="firstrow", tablefmt="fancy_grid"))
╒═════════╤═══════════╤═══════════╤═══════════════╤═══════════════╤══════════════════╕
│         │   # nodes │   # edges │     node % of │     edge % of │    edge % of the │
│         │           │           │   the network │   the network │   complete graph │
╞═════════╪═══════════╪═══════════╪═══════════════╪═══════════════╪══════════════════╡
│ Network │     32311 │    870230 │      100      │      100      │           0.1667 │
├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤
│ LC      │     30950 │    843076 │       95.7878 │       96.8797 │           0.1615 │
├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤
│ 2nd LC  │        76 │      2850 │        0.2352 │        0.3275 │           0.0005 │
╘═════════╧═══════════╧═══════════╧═══════════════╧═══════════════╧══════════════════╛

We can see that the largest connected component contains the vast majority of nodes and is a good representation of the whole network, so the analysis will focus only on it. This also allows us to compute statistics that are only defined for connected graphs.

Degree distribution¶

Let's create a function that calculates the degree densities for a given graph. It is needed because histograms are hard to read when comparing several degree distributions on one plot.

In [29]:
def calculate_degree_densities(G: nx.Graph) -> pd.DataFrame:
    degrees = [G.degree[node] for node in G.nodes()]

    # Count the number of nodes with each degree
    degree_counts = defaultdict(int)
    for degree in degrees:
        degree_counts[degree] += 1

    # Create a dataframe with the degree and the number of nodes with that degree
    df = pd.DataFrame.from_dict(degree_counts, orient="index", columns=["count"]).reset_index()
    df = df.rename(columns={"index": "degree"})

    # Calculate the density of each degree
    df["density"] = df["count"] / G.number_of_nodes()
    return df
In [30]:
# Calculate the average degree of the largest connected component
G_users_lcc_avg_degree = sum([G_users_lcc.degree[node] for node in G_users_lcc.nodes()]) / G_users_lcc.number_of_nodes()

G_users_lcc_degrees = calculate_degree_densities(G_users_lcc)

# Plot the degree distribution on a scatter plot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of the largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")

# Plot the average degree as a vertical line
plt.axvline(x=G_users_lcc_avg_degree, color="red", linestyle="--", label=f"Average degree: {round(G_users_lcc_avg_degree, 2)}")
plt.legend()
plt.show()

The distribution doesn't look like any of the distributions we have seen in class. Let's also see how it looks when we plot it on a linear scale.

In [31]:
# Plot the degree distribution on a scatter plot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density")
plt.title("Degree distribution of the largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.show()

That's really interesting. It looks like the distribution is a mixture of two distributions. One of them is a power law and the other is some distribution with a single peak.

Let's see how it compares to the Erdos-Renyi random graph with the same number of nodes and edges.

In [32]:
# Create Erdos-Renyi random graph with the same number of nodes and edges as the largest connected component
num_edges_complete_graph = G_users_lcc.number_of_nodes() * (G_users_lcc.number_of_nodes() - 1) / 2
G_users_lcc_ER = nx.erdos_renyi_graph(G_users_lcc.number_of_nodes(), G_users_lcc.number_of_edges() / num_edges_complete_graph, seed=42)

G_users_lcc_ER_degrees = calculate_degree_densities(G_users_lcc_ER)
In [33]:
# Plot both degree distributions on a scatter plot
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density", label="Users largest component")
sns.scatterplot(data=G_users_lcc_ER_degrees, x="degree", y="density", label="ER random graph")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of users largest connected component vs Erdos-Renyi random graph")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.legend()
plt.show()

We can see that the degree distribution of the users projection is very different from the one of the Erdos-Renyi random graph.

Even though I suspect that the Watts-Strogatz model will be even more different, I will still plot it for comparison.

Let's see how it compares to the WS model with the same number of nodes and edges. To achieve that, each node should initially be connected to its k nearest neighbors, where k is the average degree of the users projection. I will plot degree distributions for several values of p (the probability of rewiring each edge). Note that the higher the value of p, the more edges are rewired and the more the network resembles the Erdos-Renyi random graph.

In [34]:
# Create a Watts-Strogatz small-world graph with the same number of nodes and edges as the largest connected component
values_of_p = [0.01, 0.1, 0.5]

plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density", label="Users largest component")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of users largest connected component vs Watts-Strogatz small-world graph")
plt.xlabel("Degree")
plt.ylabel("Density")

for i, p in enumerate(values_of_p):
    G_users_lcc_WS = nx.watts_strogatz_graph(G_users_lcc.number_of_nodes(), k=round(G_users_lcc_avg_degree), p=p, seed=42)
    G_users_lcc_WS_degrees = calculate_degree_densities(G_users_lcc_WS)
    # Decided to use a line plot instead of a scatter plot to make it easier to see the difference between graphs
    sns.lineplot(data=G_users_lcc_WS_degrees, x="degree", y="density", label=f"WS random graph (p={p})", color=f"C{i+1}")

plt.legend()
plt.show()

As suspected, the degree distribution of the users projection is even more different from the one of the Watts-Strogatz model.

Let's see how it compares to the Barabasi-Albert model. The Barabasi-Albert model is a model of a scale-free network with preferential attachment: new nodes are added to the network one by one and each new node is connected to m existing nodes. In order to obtain a number of edges similar to the users projection, the parameter m should be set to avg_degree/2.
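
Spelled out, the reasoning behind that choice of m (with ⟨k⟩ denoting the average degree of the largest component): in the BA model each new node attaches with m edges, so

    E ≈ m·N  and  ⟨k⟩ = 2E/N  ⟹  m ≈ ⟨k⟩/2.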

In [35]:
# Create a Barabasi-Albert scale-free graph with the same number of nodes and edges as the largest connected component
G_users_lcc_BA = nx.barabasi_albert_graph(G_users_lcc.number_of_nodes(), m=round(G_users_lcc_avg_degree/2), seed=42)

G_users_lcc_BA_degrees = calculate_degree_densities(G_users_lcc_BA)

plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_users_lcc_degrees, x="degree", y="density", alpha=0.5, label="Users largest component")
sns.scatterplot(data=G_users_lcc_BA_degrees, x="degree", y="density", alpha=0.7, label="BA random graph", marker="x")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of users largest connected component vs Barabasi-Albert scale-free graph")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.legend()
plt.show()

Finally, we can notice some similarities! The degree distribution for nodes with degree higher than 100 is very similar to that of the Barabasi-Albert graph.

I suspect that if we ran the data collection process for longer and managed to reach subreddits at much greater depths, the degree distribution of the whole network would approach that of the Barabasi-Albert model.

I tried to calculate other metrics (such as node betweenness, the average clustering coefficient, and the average path length) for the network and the generated models, but they were taking too long to compute for such large networks. Because of that, I decided to skip them for the users projection and focus on them in the next section, where we analyze the subreddits projection.
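
For completeness, here is a sketch (not run in this project) of how those metrics could be approximated on the large users projection by sampling instead of exact computation. The sample sizes are arbitrary assumptions; betweenness is estimated from a subset of pivot nodes, the clustering coefficient via networkx's approximation module, and the average path length from BFS distances out of a few random sources.

import random
from networkx.algorithms import approximation

def approximate_metrics(G, num_pivots=200, clustering_trials=10000, num_sources=50, seed=42):
    rng = random.Random(seed)

    # Betweenness estimated from a random subset of pivot nodes
    betweenness = nx.betweenness_centrality(G, k=num_pivots, seed=seed)

    # Average clustering estimated by sampling random nodes and neighbor pairs
    avg_clustering = approximation.average_clustering(G, trials=clustering_trials, seed=seed)

    # Average path length estimated from distances out of a few random source nodes
    sources = rng.sample(list(G.nodes()), num_sources)
    dists = []
    for s in sources:
        lengths = nx.single_source_shortest_path_length(G, s)
        dists.extend(d for node, d in lengths.items() if node != s)
    avg_path_length = sum(dists) / len(dists)

    return betweenness, avg_clustering, avg_path_length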

Subreddits projection¶

Creation¶

Let's create the subreddits projection of the network. The projection is a graph where the nodes are subreddits and an edge is created between two subreddits if they have at least one user in common. That way we get a network with nodes of a single type, which allows us to analyze additional metrics and compare the network with model networks.

In [36]:
# Create a projection
subreddits_nodes = [node for node in G_bipartite.nodes() if G_bipartite.nodes[node]["bipartite"] == "sub"]
G_subreddits = nx.bipartite.weighted_projected_graph(G_bipartite, subreddits_nodes, ratio=False)
In [37]:
print(f"Number of nodes: {G_subreddits.number_of_nodes()}")
print(f"Number of edges: {G_subreddits.number_of_edges()}")

print("\nSample node:")
print(list(G_subreddits.nodes(data=True))[0])

print("\nSample edge:")
print(list(G_subreddits.edges(data=True))[0])
Number of nodes: 1030
Number of edges: 14920

Sample node:
('r/announcements', {'bipartite': 'sub', 'subscribers': 202719824, 'num_posts': 138, 'num_users': 21, 'degree': 21})

Sample edge:
('r/announcements', 'r/ModSupport', {'weight': 4})
In [38]:
# Rename edge attributes 'weight' to 'common_users'
for edge in G_subreddits.edges():
    G_subreddits.edges[edge]["common_users"] = G_subreddits.edges[edge].pop("weight")
In [39]:
num_common_users = [G_subreddits.edges[edge]["common_users"] for edge in G_subreddits.edges()]
# Plot the distribution of the number of common users
sns.histplot(num_common_users, stat="density", discrete=True)
plt.yscale("log")
plt.title("Distribution of the number of common users")
plt.xlabel("Number of common users")
plt.ylabel("Density")
plt.minorticks_on()
plt.show()

Compared to the users projection, the distribution of the number of common users is less skewed.

In [40]:
# Save edgelist to a csv file
with open(f"{NETWORKS_PATH}/subreddits.csv", "w") as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    writer.writerow(["source", "target", "common_users"])
    for edge in G_subreddits.edges(data=True):
        writer.writerow([edge[0], edge[1], edge[2]["common_users"]])

Visualization¶

This network is much smaller and it will be much easier to visualize. Nevertheless, I will use Gephi instead of Cytoscape as I think it provides more interesting layouts.

Below is the visualization of the subreddits projection. It uses Fruchterman-Reingold algorithm for calculating the layout.

Styles used for the visualization:

  • subscribers - continuous mapping to the sizes of nodes.
  • num_users - continuous mapping to the colors of nodes (white -> low, red -> high).
  • common_users - continuous mapping to the colors of edges (yellow -> low, red -> high).
  • common_users - continuous mapping to the widths of edges.

Gephi visualization of the subreddits projection

We can see that there is a dense core of subreddits with a lot of connections between them. The core mainly consists of subreddits with a high number of users. That makes sense: the more users a subreddit has, the more likely it is that one of them also posted in another subreddit.

We can also notice that there are some subreddits with a high number of users that aren't that well connected. It could be because I stopped the data collection process at an early stage, and the poorly connected subreddits are the ones whose users' data I haven't collected yet. Let's check that by styling the nodes according to their depth.

In [41]:
sub_depths_df = pd.DataFrame.from_dict(sub_depths, orient="index", columns=["depth"]).reset_index().sort_values("depth", ascending=True)
sub_depths_df = sub_depths_df.rename(columns={"index": "id"})
# Add 'r/' to the beginning of the subreddit names
sub_depths_df["id"] = sub_depths_df["id"].apply(lambda x: f"r/{x}")
# Remove subreddits that are not nodes of G_subreddits
sub_depths_df = sub_depths_df[sub_depths_df["id"].isin(G_subreddits.nodes())]
display(sub_depths_df.head())

sub_depths_df.to_csv(f"{NETWORKS_PATH}/sub_depths.csv", index=False)
id depth
0 r/programming 0
25 r/pics 1
26 r/gifs 1
27 r/funny 1
28 r/WeatherGifs 1

Below we can see the same visualization as above, but with the depth of subreddits mapped to the colors:

  • Pink - depth 0
  • Green - depth 1
  • Blue - depth 2
  • Orange - depth 3

I also decided to set the size of nodes to a constant value, so it is easier to see the differences in the colors.

Gephi visualization of the subreddits projection (depth)

Analysis¶

In [42]:
subs_components = list(nx.connected_components(G_subreddits))
subs_components.sort(key=len, reverse=True)
print(f"Number of connected components: {len(subs_components)}")
Number of connected components: 51

As expected, the number of connected components is the same as for the users projection. That's because both projections are created from the same bipartite graph.
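
A quick sanity check that the component counts indeed agree (every component of the bipartite graph contains both users and subreddits, so it contributes exactly one component to each projection; all three counts should be 51 here):

# The bipartite graph and both of its projections should have the same number of components
print("Bipartite graph:      ", nx.number_connected_components(G_bipartite))
print("Users projection:     ", nx.number_connected_components(G_users))
print("Subreddits projection:", nx.number_connected_components(G_subreddits))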

In [43]:
# Identify the largest connected component
G_subs_lcc = G_subreddits.subgraph(subs_components[0])
G_subs_2nd_lcc = G_subreddits.subgraph(subs_components[1])

num_edges_complete_graph = G_subreddits.number_of_nodes() * (G_subreddits.number_of_nodes() - 1) / 2

components = [G_subreddits, G_subs_lcc, G_subs_2nd_lcc]
components_data = []
for component in components:
    components_data.append(
        [
            component.number_of_nodes(),
            component.number_of_edges(),
            round(component.number_of_nodes() / G_subreddits.number_of_nodes() * 100, 4),
            round(component.number_of_edges() / G_subreddits.number_of_edges() * 100, 4),
            round(component.number_of_edges() / num_edges_complete_graph * 100, 4),
        ]
    )

table = [
    [
        "",
        "# nodes",
        "# edges",
        f"node % of\nthe network",
        f"edge % of\nthe network",
        f"edge % of the\ncomplete graph"
    ],
    ["Network", *components_data[0]], 
    ["LC", *components_data[1]],
    ["2nd LC", *components_data[2]],
]

print(tabulate(table, headers="firstrow", tablefmt="fancy_grid"))
╒═════════╤═══════════╤═══════════╤═══════════════╤═══════════════╤══════════════════╕
│         │   # nodes │   # edges │     node % of │     edge % of │    edge % of the │
│         │           │           │   the network │   the network │   complete graph │
╞═════════╪═══════════╪═══════════╪═══════════════╪═══════════════╪══════════════════╡
│ Network │      1030 │     14920 │      100      │      100      │           2.8154 │
├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤
│ LC      │       978 │     14918 │       94.9515 │       99.9866 │           2.8151 │
├─────────┼───────────┼───────────┼───────────────┼───────────────┼──────────────────┤
│ 2nd LC  │         2 │         1 │        0.1942 │        0.0067 │           0.0002 │
╘═════════╧═══════════╧═══════════╧═══════════════╧═══════════════╧══════════════════╛

As with the users projection, I will focus only on the largest component, as it is a good representation of the whole network (~95% of the nodes).

Degree distribution¶

In [44]:
G_subs_lcc_avg_degree = sum([G_subs_lcc.degree(node) for node in G_subs_lcc.nodes()]) / G_subs_lcc.number_of_nodes()

G_subs_lcc_degrees = calculate_degree_densities(G_subs_lcc)

plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_subs_lcc_degrees, x="degree", y="density")
plt.title("Degree distribution of subreddits largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")

plt.axvline(x=G_subs_lcc_avg_degree, color="red", linestyle="--", label=f"Average degree: {round(G_subs_lcc_avg_degree, 2)}")
plt.legend()
plt.show()

The plot looks promising for the Barabasi-Albert model, as the distribution seems to follow a power law. Let's plot it on a log-log scale to confirm that.

In [45]:
plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_subs_lcc_degrees, x="degree", y="density")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of subreddits largest connected component")
plt.xlabel("Degree")
plt.ylabel("Density")

plt.axvline(x=G_subs_lcc_avg_degree, color="red", linestyle="--", label=f"Average degree: {round(G_subs_lcc_avg_degree, 2)}")
plt.legend()
plt.show()

It is not very clear. Let's try comparing it with the distribution of a random Barabasi-Albert graph with the same number of nodes and edges. The reasoning behind the choice of the parameter m is the same as for the users projection.

In [46]:
G_subs_lcc_BA = nx.barabasi_albert_graph(G_subs_lcc.number_of_nodes(), m=round(G_subs_lcc_avg_degree / 2), seed=42)

G_subs_lcc_BA_degrees = calculate_degree_densities(G_subs_lcc_BA)

plt.figure(figsize=(15, 10))
sns.scatterplot(data=G_subs_lcc_degrees, x="degree", y="density", label="Subreddits largest component")
sns.scatterplot(data=G_subs_lcc_BA_degrees, x="degree", y="density", label="BA random graph", marker="x")
plt.xscale("log")
plt.yscale("log")
plt.title("Degree distribution of subreddits largest connected component vs Barabasi-Albert scale-free graph")
plt.xlabel("Degree")
plt.ylabel("Density")
plt.legend()
plt.show()

Plotting the distributions refutes my suspicion. So the subreddits projection, similarly to the users projection, doesn't follow any of the models we have seen in class.

Clustering coefficient vs average path length¶

Let's see how the clustering coefficient and the average path length of the projection compare to those of the other models.

In [47]:
# Create remaining models
G_subs_lcc_ER = nx.erdos_renyi_graph(
        G_subs_lcc.number_of_nodes(),
        G_subs_lcc.number_of_edges() / (G_subs_lcc.number_of_nodes() * (G_subs_lcc.number_of_nodes() - 1) / 2),
        seed=42
    )

G_subs_lcc_WS_01 = nx.watts_strogatz_graph(
        G_subs_lcc.number_of_nodes(),
        round(G_subs_lcc_avg_degree),
        0.1,
        seed=42
    )

G_subs_lcc_WS_05 = nx.watts_strogatz_graph(
        G_subs_lcc.number_of_nodes(),
        round(G_subs_lcc_avg_degree),
        0.5,
        seed=42
    )
In [48]:
G_sub_lcc_models = {
    "Subreddits LCC": G_subs_lcc,
    "Erdos-Renyi": G_subs_lcc_ER,
    "Barabasi-Albert": G_subs_lcc_BA,
    "Watts-Strogatz (p=0.1)": G_subs_lcc_WS_01,
    "Watts-Strogatz (p=0.5)": G_subs_lcc_WS_05,
}

# Calculate clustering and average shortest path for each model
models_data = []
for model_name, model in G_sub_lcc_models.items():
    models_data.append(
        {
            "model": model_name,
            "clustering": nx.average_clustering(model),
            "avg_shortest_path": nx.average_shortest_path_length(model),
        }
    )

models_df = pd.DataFrame(models_data)
display(models_df)
model clustering avg_shortest_path
0 Subreddits LCC 0.395263 3.057360
1 Erdos-Renyi 0.030906 2.349040
2 Barabasi-Albert 0.081325 2.317469
3 Watts-Strogatz (p=0.1) 0.533901 2.792797
4 Watts-Strogatz (p=0.5) 0.111868 2.430817

Let's make sure that my assumptions about the model parameters are correct and that the generated networks indeed have a similar number of edges.

In [49]:
# Print num edges for each model
for model_name, model in G_sub_lcc_models.items():
    print(f"{model_name}: {model.number_of_edges()}")
Subreddits LCC: 14918
Erdos-Renyi: 14794
Barabasi-Albert: 14445
Watts-Strogatz (p=0.1): 14670
Watts-Strogatz (p=0.5): 14670
In [50]:
# Plot clustering and average shortest path
plt.figure(figsize=(12,10))
sns.barplot(data=models_df, x="model", y="clustering")
plt.title("Clustering coefficient of the models")
plt.xlabel("Model")
plt.ylabel("Clustering coefficient")
plt.grid(axis="y", alpha=0.5)
plt.show()

plt.figure(figsize=(12,10))
sns.barplot(data=models_df, x="model", y="avg_shortest_path")
plt.title("Average shortest path of the models")
plt.xlabel("Model")
plt.ylabel("Average shortest path")
plt.grid(axis="y", alpha=0.5)
plt.show()

When it comes to the clustering coefficient and the average path length, the profile of the subreddits projection is most similar to that of the Watts-Strogatz (p=0.1) model.

The network shows a relatively high clustering coefficient, while its average path length is the highest of all the compared networks. That makes the subreddits projection the weakest example of a small-world network among them.

Node centrality¶

Let's find the most central subreddits in the network. I will use the following metrics:

  • Degree Centrality - the number of neighbors of a node.
  • Betweenness Centrality - the number of shortest paths that pass through a node.
  • Closeness Centrality - the inverse of average distance to all other nodes.
  • Eigenvector Centrality - the sum of the centrality scores of the neighbors of a node.
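
For reference, the descriptions above correspond to the standard normalized definitions (as computed by networkx on a connected, undirected graph with $n$ nodes):

$$C_D(v) = \frac{\deg(v)}{n-1}, \qquad C_C(v) = \frac{n-1}{\sum_{u \neq v} d(v, u)},$$

$$C_B(v) = \frac{2}{(n-1)(n-2)} \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}, \qquad x_v = \frac{1}{\lambda} \sum_{u \in N(v)} x_u,$$

where $\sigma_{st}$ is the number of shortest paths between $s$ and $t$, $\sigma_{st}(v)$ is the number of those paths passing through $v$, and $\lambda$ is the largest eigenvalue of the adjacency matrix (its leading eigenvector gives the eigenvector centralities).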
In [51]:
# Calculate node centralities
G_subs_lcc_centrality = {
    "degree": nx.degree_centrality(G_subs_lcc),
    "closeness": nx.closeness_centrality(G_subs_lcc),
    "betweenness": nx.betweenness_centrality(G_subs_lcc),
    "eigenvector": nx.eigenvector_centrality(G_subs_lcc),
}
In [52]:
# Convert to dataframe
G_subs_lcc_centrality_df = pd.DataFrame(G_subs_lcc_centrality)
# Add average column
G_subs_lcc_centrality_df["average"] = G_subs_lcc_centrality_df.mean(axis=1)
# Change index to column and rename to 'subreddit'
G_subs_lcc_centrality_df.reset_index(inplace=True)
G_subs_lcc_centrality_df.rename(columns={"index": "subreddit"}, inplace=True)

display(G_subs_lcc_centrality_df.head())
subreddit degree closeness betweenness eigenvector average
0 r/announcements 0.012282 0.323189 0.000304 0.000765 0.084135
1 r/funny 0.054248 0.392685 0.012252 0.026350 0.121384
2 r/AskReddit 0.002047 0.293041 0.000029 0.000286 0.073851
3 r/gaming 0.088025 0.421484 0.023730 0.048522 0.145440
4 r/aww 0.133060 0.435189 0.007469 0.078944 0.163666
In [53]:
# Display top 5 subreddits for each centrality
for centrality in G_subs_lcc_centrality_df.columns[1:]:
    print(f"Top 5 subreddits by {centrality} centrality:")
    display(G_subs_lcc_centrality_df.sort_values(by=centrality, ascending=False).head(5))
Top 5 subreddits by degree centrality:
subreddit degree closeness betweenness eigenvector average
358 r/wholesomegifs 0.216991 0.486312 0.013632 0.116376 0.208328
139 r/BetterEveryLoop 0.209826 0.480098 0.014909 0.114699 0.204883
71 r/BeAmazed 0.204708 0.475889 0.014100 0.112649 0.201837
301 r/gifsthatkeepongiving 0.201638 0.469260 0.008975 0.114427 0.198575
471 r/blackpeoplegifs 0.201638 0.469035 0.007927 0.113781 0.198095
Top 5 subreddits by closeness centrality:
subreddit degree closeness betweenness eigenvector average
358 r/wholesomegifs 0.216991 0.486312 0.013632 0.116376 0.208328
139 r/BetterEveryLoop 0.209826 0.480098 0.014909 0.114699 0.204883
71 r/BeAmazed 0.204708 0.475889 0.014100 0.112649 0.201837
110 r/Eyebleach 0.194473 0.472894 0.007956 0.110669 0.196498
152 r/educationalgifs 0.189355 0.470843 0.010917 0.107165 0.194570
Top 5 subreddits by betweenness centrality:
subreddit degree closeness betweenness eigenvector average
3 r/gaming 0.088025 0.421484 0.023730 0.048522 0.145440
43 r/technology 0.100307 0.434415 0.023059 0.022403 0.145046
14 r/memes 0.042989 0.379270 0.019108 0.013435 0.113700
460 r/technews 0.089048 0.425894 0.017507 0.021620 0.138517
846 r/redesign 0.035824 0.373043 0.015943 0.006659 0.107867
Top 5 subreddits by eigenvector centrality:
subreddit degree closeness betweenness eigenvector average
358 r/wholesomegifs 0.216991 0.486312 0.013632 0.116376 0.208328
139 r/BetterEveryLoop 0.209826 0.480098 0.014909 0.114699 0.204883
301 r/gifsthatkeepongiving 0.201638 0.469260 0.008975 0.114427 0.198575
471 r/blackpeoplegifs 0.201638 0.469035 0.007927 0.113781 0.198095
71 r/BeAmazed 0.204708 0.475889 0.014100 0.112649 0.201837
Top 5 subreddits by average centrality:
subreddit degree closeness betweenness eigenvector average
358 r/wholesomegifs 0.216991 0.486312 0.013632 0.116376 0.208328
139 r/BetterEveryLoop 0.209826 0.480098 0.014909 0.114699 0.204883
71 r/BeAmazed 0.204708 0.475889 0.014100 0.112649 0.201837
301 r/gifsthatkeepongiving 0.201638 0.469260 0.008975 0.114427 0.198575
471 r/blackpeoplegifs 0.201638 0.469035 0.007927 0.113781 0.198095

Out of the 4 centrality metrics, r/wholesomegifs is the most central node in 3 of them.

At the beginning, I suspected that the most central subreddits would be the ones with the highest number of users. Let's check if that's the case.

In [54]:
G_subs_lcc_centrality_users = G_subs_lcc_centrality_df.merge(num_users_per_sub, on="subreddit")

# Calculate the mean centrality for each number of users
average_centrailties_per_num_of_users = G_subs_lcc_centrality_users.groupby("num_users").mean(numeric_only=True)

# Plot centrality vs number of users
fig, axes = plt.subplots(nrows=5, ncols=1, figsize=(12, 20))
fig.tight_layout(pad=3.0)

scatter_params = {
    "alpha": 0.3,
    "x": "num_users",
    "data": G_subs_lcc_centrality_users,
}

line_params = {
    "x": "num_users",
    "label": "Average",
    "data": average_centrailties_per_num_of_users,
    "color": "C1"
}

sns.scatterplot(y="degree", ax=axes[0], **scatter_params)
sns.lineplot(y="degree", ax=axes[0], **line_params)
axes[0].set_title("Degree centrality")
axes[0].set_ylabel("Centrality")

sns.scatterplot(y="closeness", ax=axes[1], **scatter_params)
sns.lineplot(y="closeness", ax=axes[1], **line_params)
axes[1].set_title("Closeness centrality")
axes[1].set_ylabel("Centrality")

sns.scatterplot(y="betweenness", ax=axes[2], **scatter_params)
sns.lineplot(y="betweenness", ax=axes[2], **line_params)
axes[2].set_title("Betweenness centrality")
axes[2].set_ylabel("Centrality")

sns.scatterplot(y="eigenvector", ax=axes[3], **scatter_params)
sns.lineplot(y="eigenvector", ax=axes[3], **line_params)
axes[3].set_title("Eigenvector centrality")
axes[3].set_ylabel("Centrality")

sns.scatterplot(y="average", ax=axes[4], **scatter_params)
sns.lineplot(y="average", ax=axes[4], **line_params)
axes[4].set_title("Average centrality")
axes[4].set_ylabel("Centrality")

plt.show()

We can see that none of the centralities are clearly correlated with the number of users. That might seem surprising at first, but if we look once again at the plot of the distribution of the number of subreddits per user:

Distribution of number of subreddits per user

we can notice that the vast majority of users are connected to only one subreddit, and it is the number of subreddits a user is connected to that determines how many new neighbors that user contributes to each of their subreddits in the projection. So, in the context of centrality, it is more important for a subreddit to have a few users that have posted in many subreddits than to have many users that have posted in only one (as the latter don't create any new connections for the subreddit in the projection).
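
To put a number on this impression, below is a minimal sketch computing the Spearman rank correlation between the number of users and each centrality; values close to 0 would support the lack of a monotonic relationship:

# Spearman rank correlation between a subreddit's number of users
# and each of its centrality values in the LCC
for centrality in ["degree", "closeness", "betweenness", "eigenvector", "average"]:
    rho = G_subs_lcc_centrality_users["num_users"].corr(
        G_subs_lcc_centrality_users[centrality], method="spearman"
    )
    print(f"{centrality}: Spearman rho = {rho:.3f}")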

Just to be sure, let's check how r/wholesomegifs ranks among the other subreddits in terms of the number of users, number of subscribers, and number of posts collected.

In [55]:
# Save values of statistics of r/wholesomegifs
wholsomegifs_nums = {
    "num_users": num_users_per_sub[num_users_per_sub["subreddit"] == "r/wholesomegifs"]["num_users"].values[0],
    "num_posts": num_posts_per_sub[num_posts_per_sub["subreddit"] == "r/wholesomegifs"]["num_posts"].values[0],
    "subscribers": num_subscribers_per_sub[num_subscribers_per_sub["subreddit"] == "r/wholesomegifs"]["subscribers"].values[0]
}

# Sort values by number of users, posts, subscribers
num_users_rank = num_users_per_sub.sort_values(by="num_users", ascending=False)
num_posts_rank = num_posts_per_sub.sort_values(by="num_posts", ascending=False)
num_subscribers_rank = num_subscribers_per_sub.sort_values(by="subscribers", ascending=False)

# Keep only rows with unique number of users, posts, subscribers in order to exclude ties
num_users_rank = num_users_rank.drop_duplicates(subset="num_users").reset_index(drop=True)
num_posts_rank = num_posts_rank.drop_duplicates(subset="num_posts").reset_index(drop=True)
num_subscribers_rank = num_subscribers_rank.drop_duplicates(subset="subscribers").reset_index(drop=True)

# Find index of first rows containing the number of users, posts, subscribers of r/wholesomegifs
wholsomegifs_rank = {
    "num_users": num_users_rank[num_users_rank["num_users"] == wholsomegifs_nums["num_users"]].index[0] + 1,
    "num_posts": num_posts_rank[num_posts_rank["num_posts"] == wholsomegifs_nums["num_posts"]].index[0] + 1,
    "subscribers": num_subscribers_rank[num_subscribers_rank["subscribers"] == wholsomegifs_nums["subscribers"]].index[0] + 1,
}

# Print rank of r/wholesomegifs
print(f"r/wholesomegifs is ranked {wholsomegifs_rank['num_users']}/{len(num_users_rank)} by number of users")
print(f"r/wholesomegifs is ranked {wholsomegifs_rank['num_posts']}/{len(num_posts_rank)} by number of posts")
print(f"r/wholesomegifs is ranked {wholsomegifs_rank['subscribers']}/{len(num_subscribers_rank)} by number of subscribers")
r/wholesomegifs is ranked 35/81 by number of users
r/wholesomegifs is ranked 80/343 by number of posts
r/wholesomegifs is ranked 379/958 by number of subscribers

My theory seems to be correct, as r/wholesomegifs doesn't take the lead in any of these metrics.

What is also worth noting is that the most central subreddits in terms of degree, closeness, and eigenvector centrality are mainly ones focused on entertainment and humor (memes, gifs, videos, etc.) without any specific topic in mind, such as:

  • r/wholesomegifs - gifs that are supposed to make you feel good,
  • r/BetterEveryLoop - gifs that are supposed to be better every time you watch them,
  • r/BeAmazed - gifs and videos that are supposed to amaze you,
  • r/gifsthatkeepongiving - gifs that are supposed to be funny every time you watch them,
  • r/Eyebleach - gifs that are supposed to be calming and cute,

but when it comes to betweenness centrality, the most central subreddits are more focused on specific topics and news, such as:

  • r/gaming - gaming news and discussions,
  • r/technology - technology news and novelties,
  • r/technews - technology news,
  • r/redesign - ideas and submissions for the Reddit platform redesign.

That could mean that subreddits focused on humor and entertainment create well-defined communities that gather a lot of people with similar interests and are visited regularly, while subreddits focused on specific topics and news are more likely to be visited by people who, when the need arises, look for specific information or post a question about a specific topic and then leave. If that theory is correct, the subreddits from the second group act as bridges connecting people from different communities, which is why they are more central in terms of betweenness centrality.

Part 2¶

Community detection¶

Community detection is a process of finding groups of nodes that are more densely connected to each other than to the rest of the network. It is a very useful tool for analyzing networks as it can help us to understand the structure of the network and to find the most important nodes.

The community analysis will be performed only on the largest connected component of the subreddits projection, as the size of the users projection unfortunately makes the analysis very time consuming.

The analysis of the subreddits projection will be done using the Louvain method. The method is based on modularity optimization; the modularity of a partition measures how well the network is split into communities. I have chosen this method as it is one of the most efficient ones for large networks, running in roughly O(n log n).
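
For an undirected graph with adjacency matrix $A$, $m$ edges, node degrees $k_i$ and community assignments $c_i$, modularity is defined as

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),$$

i.e. the fraction of edges falling within communities minus the fraction expected if edges were placed at random while preserving the degree sequence.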

In [56]:
subs_lcc_lovain = list(nx.algorithms.community.louvain_communities(G_subs_lcc, seed=42))

print(f"Number of communities: {len(subs_lcc_lovain)}")
Number of communities: 11
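
Since the Louvain method optimizes modularity, a quick sanity check is to compute the modularity of the resulting partition directly (a minimal sketch; values well above 0 indicate a community structure stronger than random):

# Modularity of the partition found by the Louvain method
subs_lcc_modularity = nx.algorithms.community.modularity(G_subs_lcc, subs_lcc_lovain)
print(f"Modularity of the Louvain partition: {subs_lcc_modularity:.3f}")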

Let's create a function that maps each node to a number representing its community.

In [57]:
def get_community_number(node, communities: list[set]) -> int:
    for i, community in enumerate(communities):
        if node in community:
            return i
    return -1
In [58]:
subs_lcc_data = pd.DataFrame(columns=["id", "community"])

subs_lcc_data["id"] = [node for node in G_subs_lcc.nodes()]
subs_lcc_data["community"] = subs_lcc_data["id"].apply(lambda x: get_community_number(x, subs_lcc_lovain))

display(subs_lcc_data.head(10))
id community
0 r/announcements 3
1 r/funny 1
2 r/AskReddit 3
3 r/gaming 1
4 r/aww 7
5 r/Music 8
6 r/worldnews 8
7 r/todayilearned 3
8 r/movies 1
9 r/pics 7

Let's join the communities dataframe with the subreddits dataframe to get more information about the communities and to make importing the data into Cytoscape easier.

In [59]:
subs_lcc_data = subs_lcc_data.merge(sub_data, on="id")
display(subs_lcc_data.head(10))
id community subscribers num_posts num_users is_user
0 r/announcements 3 202719824 138 21 False
1 r/funny 1 48108476 60 17 False
2 r/AskReddit 3 40249936 13 4 False
3 r/gaming 1 36492322 75 22 False
4 r/aww 7 33655974 112 31 False
5 r/Music 8 32043294 145 37 False
6 r/worldnews 8 31254656 133 37 False
7 r/todayilearned 3 31041441 114 39 False
8 r/movies 1 30617572 252 35 False
9 r/pics 7 29880182 72 24 False

Save edgelist and node data of the subreddits projection largest connected component to files.

In [60]:
with open(os.path.join(NETWORKS_PATH, "subreddits_lcc.csv"), "w") as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    writer.writerow(["source", "target", "common_users"])
    for edge in G_subs_lcc.edges(data=True):
        writer.writerow([edge[0], edge[1], edge[2]["common_users"]])

subs_lcc_data.to_csv(os.path.join(NETWORKS_PATH, "subreddits_lcc_data.csv"), index=False)

Visualization¶

I will use the circular layout, with node colors and ordering assigned according to their communities.

Cytoscape visualization of the subreddits projection communities, circular

Another layout that may help visualize the size differences between communities is the Group Attributes Layout.

Cytoscape visualization of the subreddits projection communities, group attributes

Analysis¶

In [4]:
subs_lcc_data = subs_lcc_data.rename(columns={"id": "subreddit"})
subs_lcc_data = subs_lcc_data.merge(G_subs_lcc_centrality_df, on="subreddit")

subs_lcc_data.head()
Out[4]:
subreddit community subscribers num_posts num_users is_user degree closeness betweenness eigenvector average
0 r/announcements 3 202719824 138 21 False 0.012282 0.323189 0.000304 0.000765 0.084135
1 r/funny 1 48108476 60 17 False 0.054248 0.392685 0.012252 0.026350 0.121384
2 r/AskReddit 3 40249936 13 4 False 0.002047 0.293041 0.000029 0.000286 0.073851
3 r/gaming 1 36492322 75 22 False 0.088025 0.421484 0.023730 0.048522 0.145440
4 r/aww 7 33655974 112 31 False 0.133060 0.435189 0.007469 0.078944 0.163666
In [5]:
communities = subs_lcc_data["community"].unique().tolist()
communities.sort()

# Plot the barplot with the number of subreddits in each community. Display the number of subreddits on top of each bar.
num_subreddits_per_community = {k: v for k, v in subs_lcc_data["community"].value_counts().items()}
plt.figure(figsize=(10, 5))
sns.barplot(x=communities, y=[num_subreddits_per_community[community] for community in communities], color="C0")
plt.title("Number of subreddits in each community")
plt.xlabel("Community")
plt.ylabel("Number of subreddits")

for i, community in enumerate(communities):
    plt.text(i, num_subreddits_per_community[community], num_subreddits_per_community[community], ha="center", va="bottom")

plt.show()

We can see that the communities differ a lot in size, but 4 groups can be distinguished among them:

  • 4 of them have sizes below 10 nodes,
  • 2 of them have sizes between 20 and 40,
  • 4 of them have sizes between 120 and 170,
  • and there is the biggest community with 321 nodes.

Let's try to analyze what the nodes in each community have in common. To that end, I will plot the distributions of various metrics and properties for the nodes in each community.

I will compare the distributions of the following values:

  • number of subscribers,
  • number of posts collected,
  • number of users,
  • degree centrality,
  • closeness centrality,
  • betweenness centrality,
  • eigenvector centrality.

Note that the centrality measures are calculated for the whole largest connected component of the subreddits projection.

In [9]:
attributes_to_compare = ["subscribers", "num_posts", "num_users", "degree", "closeness", "betweenness", "eigenvector"]
labels = ["Number of subscribers", "Number of posts", "Number of users", "Degree centrality", "Closeness centrality", "Betweenness centrality", "Eigenvector centrality"]

for attribute, x_label in zip(attributes_to_compare, labels):
    plt.figure(figsize=(10, 5))
    sns.boxplot(x="community", y=attribute, data=subs_lcc_data, order=communities, color="C0")
    plt.title(f"{x_label} in each community")
    plt.xlabel("Community")
    plt.ylabel(x_label)
    plt.show()

The plot of the number of subscribers is not readable because of the outliers. Let's filter them out.

In [10]:
num_subscribers_threshold = 10000000

plt.figure(figsize=(10, 5))
sns.boxplot(x="community", y="subscribers", data=subs_lcc_data[subs_lcc_data["subscribers"] < num_subscribers_threshold], order=communities, color="C0")
plt.title("Number of subscribers in each community")
plt.xlabel("Community")
plt.ylabel("Number of subscribers")
plt.show()

We can see that for every metric/property there are several communities with similar distributions of values. That means that none of those properties directly defines the community structure or the splits in its formation. This makes sense, as the Louvain method is based on connectivity, and we would not expect subreddits with e.g. a low number of subscribers or low centrality values to be more strongly connected to each other than to nodes with higher values of those metrics.

That said, it does not mean that these properties have no influence at all. We can clearly notice that communities 5 and 7 have the highest distributions for every one of the centrality metrics. That means that those communities were formed from nodes with high centralities and that those nodes are densely connected to each other; those communities are the most central (important) in the network.

Now let's analyze the centrality measures within the communities themselves.

In [11]:
G_communities = {}

# Create the subgraphs of each community
for community in communities:
    nodes = subs_lcc_data[subs_lcc_data["community"] == community]["subreddit"].tolist()
    G_communities[community] = G_subs_lcc.subgraph(nodes)

# Calculate the node centralities in the subgraphs of each community
G_communities_centralities = {}
for community, network in G_communities.items():
    centralities = {
        "degree": nx.degree_centrality(network),
        "closeness": nx.closeness_centrality(network),
        "betweenness": nx.betweenness_centrality(network),
        "eigenvector": nx.eigenvector_centrality(network),
    }

    df = pd.DataFrame(centralities)
    df.reset_index(inplace=True)
    df = df.rename(columns={"index": "subreddit"})

    # Calculate the average centralities for each community
    G_communities_centralities[community] = {
        "df": df,
        "avg_degree": df["degree"].mean(),
        "avg_closeness": df["closeness"].mean(),
        "avg_betweenness": df["betweenness"].mean(),
        "avg_eigenvector": df["eigenvector"].mean(),
    }

The centrality measures within the identified communities are expected to be strongly affected by their sizes, so in order to compare the results easily I will arrange the bars in the plots according to the community sizes and display the number of nodes in each community once again.

In [12]:
# Plot the average centralities for each community
avg_centralities = ["avg_degree", "avg_closeness", "avg_betweenness", "avg_eigenvector"]
labels = ["Average degree centrality", "Average closeness centrality", "Average betweenness centrality", "Average eigenvector centrality"]

communities_by_sizes = sorted(communities, key=lambda x: num_subreddits_per_community[x], reverse=True)

for avg_centrality, label in zip(avg_centralities, labels):
    plt.figure(figsize=(10, 5))
    plt.bar(x=[f"{community}" for community in communities_by_sizes], height=[G_communities_centralities[community][avg_centrality] for community in communities_by_sizes])
    plt.title(f"{label} for each community")
    plt.xlabel("Community")
    plt.ylabel(label)
    plt.xticks(communities_by_sizes)

    # Plot sizes of the communities
    for i, community in enumerate(communities_by_sizes):
        plt.text(i, G_communities_centralities[community][avg_centrality], f"{num_subreddits_per_community[community]} nodes", ha="center", va="bottom", fontsize=8)

We can see that the average degree centrality tends to be higher in smaller communities and reaches 1 in communities 10, 0, and 4. Those communities have only 2 or 3 nodes, so this makes sense, as every node in them is connected to every other node. However, we can still notice that communities 5 and 7 have a higher average degree centrality than expected from their sizes.

The same goes for the average closeness centrality.

Betweenness centrality looks a little different. It also tends to be higher in smaller communities, but for the fully connected communities 10, 0, and 4 it is equal to 0. That makes sense, as in a fully connected community no node lies in the middle of a shortest path between two other nodes. This time communities 5 and 7 have a lower average betweenness centrality than expected from their sizes. Based on the earlier observations, that is expected: nodes in those communities are densely connected to each other (relatively high degree and closeness centrality), so fewer nodes act as bridges on the shortest paths between other nodes.

When it comes to eigenvector centrality, there aren't any interesting observations to be made about communities 5 and 7.

Let's identify the most central nodes in the communities and compare them to the results of the centrality analysis of the whole largest connected component of the subreddits projection. I'll do this only for communities with more than 120 nodes, as the other ones are too small to draw any conclusions from.

In [13]:
for community, stats in G_communities_centralities.items():
    if num_subreddits_per_community[community] < 120:
        continue

    print("=" * 50)
    print(f"Community {community}")
    df = stats["df"]
    for centrality in df.columns[1:]:
        print(f"Top 5 subreddits by {centrality} centrality:")
        display(df.sort_values(by=centrality, ascending=False).head(5))
==================================================
Community 1
Top 5 subreddits by degree centrality:
subreddit degree closeness betweenness eigenvector
23 r/Games 0.207101 0.414216 0.061387 0.262288
149 r/anime 0.207101 0.411192 0.066271 0.249170
13 r/PS4 0.177515 0.376392 0.049306 0.230098
47 r/NintendoSwitch 0.171598 0.376392 0.011490 0.246041
32 r/manga 0.165680 0.414216 0.069636 0.212319
Top 5 subreddits by closeness centrality:
subreddit degree closeness betweenness eigenvector
32 r/manga 0.165680 0.414216 0.069636 0.212319
23 r/Games 0.207101 0.414216 0.061387 0.262288
149 r/anime 0.207101 0.411192 0.066271 0.249170
31 r/Animemes 0.112426 0.399527 0.102044 0.037119
77 r/xboxone 0.165680 0.396714 0.025082 0.231493
Top 5 subreddits by betweenness centrality:
subreddit degree closeness betweenness eigenvector
166 r/memes 0.130178 0.380631 0.122991 0.014298
6 r/whenthe 0.082840 0.359574 0.120841 0.007869
31 r/Animemes 0.112426 0.399527 0.102044 0.037119
9 r/dankmemes 0.112426 0.384966 0.087028 0.016052
122 r/sciencememes 0.065089 0.283557 0.076580 0.000511
Top 5 subreddits by eigenvector centrality:
subreddit degree closeness betweenness eigenvector
23 r/Games 0.207101 0.414216 0.061387 0.262288
149 r/anime 0.207101 0.411192 0.066271 0.249170
47 r/NintendoSwitch 0.171598 0.376392 0.011490 0.246041
77 r/xboxone 0.165680 0.396714 0.025082 0.231493
13 r/PS4 0.177515 0.376392 0.049306 0.230098
==================================================
Community 5
Top 5 subreddits by degree centrality:
subreddit degree closeness betweenness eigenvector
100 r/trippinthroughtime 0.705036 0.759563 0.018687 0.127546
135 r/MadeMeSmile 0.690647 0.751351 0.048629 0.126192
30 r/youseeingthisshit 0.690647 0.751351 0.012096 0.127460
64 r/instant_regret 0.690647 0.751351 0.029476 0.119631
62 r/toptalent 0.690647 0.751351 0.025570 0.128177
Top 5 subreddits by closeness centrality:
subreddit degree closeness betweenness eigenvector
100 r/trippinthroughtime 0.705036 0.759563 0.018687 0.127546
75 r/blackmagicfuckery 0.683453 0.751351 0.026652 0.127284
135 r/MadeMeSmile 0.690647 0.751351 0.048629 0.126192
30 r/youseeingthisshit 0.690647 0.751351 0.012096 0.127460
62 r/toptalent 0.690647 0.751351 0.025570 0.128177
Top 5 subreddits by betweenness centrality:
subreddit degree closeness betweenness eigenvector
79 r/Damnthatsinteresting 0.654676 0.735450 0.049689 0.120235
135 r/MadeMeSmile 0.690647 0.751351 0.048629 0.126192
108 r/holdmycosmo 0.553957 0.681373 0.038343 0.110630
55 r/nextfuckinglevel 0.525180 0.668269 0.029867 0.100198
64 r/instant_regret 0.690647 0.751351 0.029476 0.119631
Top 5 subreddits by eigenvector centrality:
subreddit degree closeness betweenness eigenvector
62 r/toptalent 0.690647 0.751351 0.025570 0.128177
100 r/trippinthroughtime 0.705036 0.759563 0.018687 0.127546
30 r/youseeingthisshit 0.690647 0.751351 0.012096 0.127460
133 r/BeAmazed 0.676259 0.743316 0.009630 0.127296
75 r/blackmagicfuckery 0.683453 0.751351 0.026652 0.127284
==================================================
Community 7
Top 5 subreddits by degree centrality:
subreddit degree closeness betweenness eigenvector
108 r/blackpeoplegifs 0.731092 0.777778 0.040147 0.159521
2 r/mechanical_gifs 0.722689 0.772727 0.029777 0.162109
101 r/wholesomegifs 0.705882 0.762821 0.020877 0.161621
5 r/BetterEveryLoop 0.689076 0.753165 0.027479 0.157184
9 r/whitepeoplegifs 0.663866 0.739130 0.026280 0.153699
Top 5 subreddits by closeness centrality:
subreddit degree closeness betweenness eigenvector
108 r/blackpeoplegifs 0.731092 0.777778 0.040147 0.159521
2 r/mechanical_gifs 0.722689 0.772727 0.029777 0.162109
101 r/wholesomegifs 0.705882 0.762821 0.020877 0.161621
5 r/BetterEveryLoop 0.689076 0.753165 0.027479 0.157184
9 r/whitepeoplegifs 0.663866 0.739130 0.026280 0.153699
Top 5 subreddits by betweenness centrality:
subreddit degree closeness betweenness eigenvector
111 r/aww 0.445378 0.619792 0.049672 0.102025
102 r/gifs 0.495798 0.632979 0.042218 0.115448
108 r/blackpeoplegifs 0.731092 0.777778 0.040147 0.159521
26 r/interestingasfuck 0.647059 0.725610 0.036692 0.140872
32 r/PraiseTheCameraMan 0.470588 0.632979 0.031756 0.120913
Top 5 subreddits by eigenvector centrality:
subreddit degree closeness betweenness eigenvector
2 r/mechanical_gifs 0.722689 0.772727 0.029777 0.162109
101 r/wholesomegifs 0.705882 0.762821 0.020877 0.161621
108 r/blackpeoplegifs 0.731092 0.777778 0.040147 0.159521
5 r/BetterEveryLoop 0.689076 0.753165 0.027479 0.157184
87 r/gifsthatkeepongiving 0.655462 0.730061 0.016129 0.154141
==================================================
Community 8
Top 5 subreddits by degree centrality:
subreddit degree closeness betweenness eigenvector
181 r/technology 0.20625 0.483384 0.072949 0.210070
77 r/technews 0.20000 0.480480 0.070091 0.204190
252 r/Economics 0.19375 0.474777 0.054601 0.203605
316 r/environment 0.19375 0.471976 0.046050 0.209977
190 r/Coronavirus 0.18750 0.455840 0.050940 0.197164
Top 5 subreddits by closeness centrality:
subreddit degree closeness betweenness eigenvector
181 r/technology 0.20625 0.483384 0.072949 0.210070
77 r/technews 0.20000 0.480480 0.070091 0.204190
252 r/Economics 0.19375 0.474777 0.054601 0.203605
316 r/environment 0.19375 0.471976 0.046050 0.209977
190 r/Coronavirus 0.18750 0.455840 0.050940 0.197164
Top 5 subreddits by betweenness centrality:
subreddit degree closeness betweenness eigenvector
181 r/technology 0.206250 0.483384 0.072949 0.210070
77 r/technews 0.200000 0.480480 0.070091 0.204190
268 r/opensource 0.159375 0.450704 0.062287 0.163320
252 r/Economics 0.193750 0.474777 0.054601 0.203605
190 r/Coronavirus 0.187500 0.455840 0.050940 0.197164
Top 5 subreddits by eigenvector centrality:
subreddit degree closeness betweenness eigenvector
181 r/technology 0.20625 0.483384 0.072949 0.210070
316 r/environment 0.19375 0.471976 0.046050 0.209977
77 r/technews 0.20000 0.480480 0.070091 0.204190
252 r/Economics 0.19375 0.474777 0.054601 0.203605
190 r/Coronavirus 0.18750 0.455840 0.050940 0.197164
==================================================
Community 9
Top 5 subreddits by degree centrality:
subreddit degree closeness betweenness eigenvector
73 r/ArchitecturePorn 0.140000 0.366748 0.233477 0.373842
63 r/spaceporn 0.113333 0.352113 0.183196 0.295997
55 r/Design 0.093333 0.342466 0.172616 0.139031
79 r/space 0.086667 0.305499 0.040154 0.182302
62 r/CatastrophicFailure 0.080000 0.317797 0.059784 0.212146
Top 5 subreddits by closeness centrality:
subreddit degree closeness betweenness eigenvector
73 r/ArchitecturePorn 0.140000 0.366748 0.233477 0.373842
63 r/spaceporn 0.113333 0.352113 0.183196 0.295997
55 r/Design 0.093333 0.342466 0.172616 0.139031
91 r/carporn 0.073333 0.325380 0.115234 0.228131
18 r/ImaginaryLandscapes 0.073333 0.321888 0.044280 0.252624
Top 5 subreddits by betweenness centrality:
subreddit degree closeness betweenness eigenvector
73 r/ArchitecturePorn 0.140000 0.366748 0.233477 0.373842
63 r/spaceporn 0.113333 0.352113 0.183196 0.295997
55 r/Design 0.093333 0.342466 0.172616 0.139031
92 r/F1Technical 0.060000 0.268817 0.123168 0.017416
91 r/carporn 0.073333 0.325380 0.115234 0.228131
Top 5 subreddits by eigenvector centrality:
subreddit degree closeness betweenness eigenvector
73 r/ArchitecturePorn 0.140000 0.366748 0.233477 0.373842
63 r/spaceporn 0.113333 0.352113 0.183196 0.295997
90 r/RoomPorn 0.080000 0.317125 0.023597 0.273666
18 r/ImaginaryLandscapes 0.073333 0.321888 0.044280 0.252624
91 r/carporn 0.073333 0.325380 0.115234 0.228131

We can see that most of the subreddits leading within the communities (when it comes to degree centrality) are the same as the ones leading in the whole largest connected component. However, it is worth noting that the subreddit that was previously the most central one in 3 out of 4 metrics (r/wholesomegifs) is not the top one in any of the metrics within community 7. That makes sense, as the most central nodes in the whole network lose relatively the most upon the split into communities, which leads to a balancing of the centrality values within the subnetworks.

There is, however, a much more interesting observation to be made. Just by looking at the most central subreddits in each community, we can notice that the communities seem to be formed around specific topics. Let's investigate that.

Topic distributions in communities¶

In order to look into those topics, let's display, for each community, all the subreddits that are in the top NUM_TOP_SUBREDDITS of any of the centrality measures.

In [14]:
NUM_TOP_SUBREDDITS = 10

top_centrality_subreddits_per_community = {}

for community in communities:
    df = G_communities_centralities[community]["df"]

    subreddit_set = set()
    for centrality in df.columns[1:]:
        top_subreddits = df.sort_values(by=centrality, ascending=False).head(NUM_TOP_SUBREDDITS)["subreddit"].tolist()
        subreddit_set.update(top_subreddits)

    top_centrality_subreddits_per_community[community] = subreddit_set
In [15]:
display(communities)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
In [16]:
display(top_centrality_subreddits_per_community[0])
{'r/intermittentfasting', 'r/keto'}

We can see that community 0 has two subreddits, both related to the topic of diets:

  • r/intermittentfasting - a subreddit about a diet that focuses on when you eat rather than what you eat,
  • r/keto - a subreddit about a diet that focuses on eating low carb and high fat food.
In [17]:
display(top_centrality_subreddits_per_community[1])
{'r/Animemes',
 'r/Games',
 'r/Konosuba',
 'r/NintendoSwitch',
 'r/PS4',
 'r/PS5',
 'r/PoliticalCompassMemes',
 'r/ShingekiNoKyojin',
 'r/XboxSeriesX',
 'r/anime',
 'r/dankmemes',
 'r/horizon',
 'r/manga',
 'r/memes',
 'r/movies',
 'r/nintendo',
 'r/sciencememes',
 'r/television',
 'r/totalwar',
 'r/whenthe',
 'r/xboxone'}

Community 1 has subreddits that are related mainly to video games and anime. For example:

  • r/Animemes - a subreddit about anime memes,
  • r/Games - a subreddit about video games,
  • r/Konosuba - a subreddit about an anime series.
  • r/NintendoSwitch - a subreddit about Nintendo Switch video games console.
  • r/PS4 - a subreddit about video games on PlayStation 4 console.
  • r/PS5 - a subreddit about video games on PlayStation 5 console.
  • r/ShingekiNoKyojin - a subreddit about an anime series.
  • r/XboxSeriesX - a subreddit about video games on Xbox Series X console.
  • r/anime - a subreddit about anime in general.
  • r/horizon - a subreddit about a video game Horizon Zero Dawn.
  • r/manga - a subreddit about manga in general.
  • r/nintendo - a subreddit about video games on Nintendo consoles.
  • r/totalwar - a subreddit about a video game Total War.
  • r/xboxone - a subreddit about video games on Xbox One console.
In [18]:
display(top_centrality_subreddits_per_community[2])
{'r/Deltarune',
 'r/DetroitBecomeHuman',
 'r/Fallout',
 'r/SampleSize',
 'r/fo76',
 'r/personalfinance',
 'r/skyrim'}

In community 2, 3 out of the 7 subreddits are related to popular games developed by Bethesda Softworks:

  • r/Fallout - a subreddit about a Fallout video game series.
  • r/fo76 - a subreddit about a Fallout 76 video game.
  • r/skyrim - a subreddit about the The Elder Scrolls V: Skyrim video game.
In [19]:
display(top_centrality_subreddits_per_community[3])
{'r/AskHistorians',
 'r/DIY',
 'r/Imposter',
 'r/MovieDetails',
 'r/SubredditAdoption',
 'r/announcements',
 'r/blog',
 'r/changelog',
 'r/modnews',
 'r/place',
 'r/redditsecurity',
 'r/redesign',
 'r/self',
 'r/thebutton',
 'r/todayilearned',
 'r/woodworking'}

This community covers a wider range of topics, but we can still distinguish some similarities.

Subreddits that are related to the Reddit itself (its features, rules, updates, etc.):

  • r/announcements - a subreddit for Reddit announcements supplied by Reddit developers to inform the community about the most important changes and updates made to the platform.
  • r/blog - a subreddit with the official blog posts from Reddit Inc.
  • r/changelog - a subreddit with the changelog of the Reddit platform.
  • r/modnews - a subreddit with the news for Reddit moderators.
  • r/redesign - a subreddit about changes to Reddit's design.
  • r/redditsecurity - a subreddit with the security updates for Reddit users.

Atypical subreddits that engage the community in various activities:

  • r/Imposter - a subreddit where users can play a game similar in concept to the Among Us video game.
  • r/place - a subreddit where users can place a pixel on a big canvas every 5 minutes.
  • r/thebutton - a subreddit where users can press a button that resets every 60 seconds.

Knowledge sharing subreddits:

  • r/AskHistorians - a subreddit where users can ask questions about history and get answers from verified historians.
  • r/DIY - a subreddit where users can share their do-it-yourself projects.
  • r/MovieDetails - a subreddit where users can share interesting details about movies.
  • r/todayilearned - a subreddit where users can share interesting facts that they learned today.
In [20]:
display(top_centrality_subreddits_per_community[4])
{'r/Warthunder', 'r/WorldOfWarships'}

This community contains 2 subreddits related to the topic of war video games:

  • r/Warthunder - a subreddit about a War Thunder video game.
  • r/WorldOfWarships - a subreddit about a World of Warships video game.
In [21]:
display(top_centrality_subreddits_per_community[5])
{'r/BeAmazed',
 'r/Damnthatsinteresting',
 'r/MEOW_IRL',
 'r/MadeMeSmile',
 'r/NatureIsFuckingLit',
 'r/SweatyPalms',
 'r/Whatcouldgowrong',
 'r/WhitePeopleTwitter',
 'r/blackmagicfuckery',
 'r/holdmycosmo',
 'r/instant_regret',
 'r/maybemaybemaybe',
 'r/nextfuckinglevel',
 'r/toptalent',
 'r/trippinthroughtime',
 'r/youseeingthisshit'}

Community 5 contains mostly subreddits focused on astonishing or impressive content:

  • r/BeAmazed - a subreddit with content that is supposed to amaze the users.
  • r/Damnthatsinteresting - a subreddit with content that is supposed to be interesting.
  • r/NatureIsFuckingLit - a subreddit in which people post about various interesting and cool facts related to nature.
  • r/nextfuckinglevel - a subreddit dedicated to showcasing content that exemplifies exceptional skills, extraordinary achievements, or remarkable advancements in various fields.
  • r/toptalent - a subreddit dedicated to showcasing exceptional talents and skills demonstrated by individuals.
  • r/youseeingthisshit - a subreddit that focuses on sharing and discussing extraordinary or unbelievable moments captured in videos or images.
  • r/humansaremetal - a subreddit that celebrates and showcases the incredible capabilities, resilience, and accomplishments of human beings.
In [22]:
display(top_centrality_subreddits_per_community[6])
{'r/AmateurRoomPorn',
 'r/Catswhoyell',
 'r/Gameboy',
 'r/Honda',
 'r/Idiotswithguns',
 'r/JDM',
 'r/Lexus',
 'r/Mid_Century',
 'r/OneSecondBeforeDisast',
 'r/ThriftStoreHauls',
 'r/fuckcars',
 'r/funnyvideos',
 'r/gamecollecting',
 'r/gtaonline',
 'r/politecats',
 'r/rally',
 'r/tiktokcringemoment'}

In this community we can find subreddits related to more niche interests and hobbies. I don't know the majority of them, so it is hard for me to assess what factor connects them topically.

In [23]:
display(top_centrality_subreddits_per_community[7])
{'r/BetterEveryLoop',
 'r/PraiseTheCameraMan',
 'r/WatchPeopleDieInside',
 'r/WeatherGifs',
 'r/aww',
 'r/blackpeoplegifs',
 'r/chemicalreactiongifs',
 'r/gifs',
 'r/gifsthatkeepongiving',
 'r/interestingasfuck',
 'r/lifehacks',
 'r/mechanical_gifs',
 'r/southpark',
 'r/whitepeoplegifs',
 'r/wholesomegifs',
 'r/woahdude'}

The most common similarity among the majority of the listed subreddits is that they focus on sharing and discussing various types of animated content, particularly gifs.

  • r/BetterEveryLoop - a subreddit dedicated to sharing gifs or short videos that loop seamlessly and improve with each repetition, showcasing satisfying or impressive content.
  • r/WeatherGifs - a subreddit dedicated to gifs that showcase weather phenomena.
  • r/blackpeoplegifs - a subreddit dedicated to sharing gifs featuring black individuals in various situations, often with a humorous or relatable context.
  • r/chemicalreactiongifs - a subreddit dedicated to gifs that showcase chemical reactions.
  • r/gifs - a subreddit dedicated to sharing gifs in general.
  • r/gifsthatkeepongiving - a subreddit all about gifs that have an unexpected or continuous loop, creating humorous or mesmerizing effects.
  • r/mechanical_gifs - a subreddit dedicated to gifs that showcase mechanical processes.
  • r/whitepeoplegifs - a subreddit dedicated to sharing gifs featuring white individuals in various situations, often with a humorous or relatable context.
  • r/wholesomegifs - a subreddit focused on sharing heartwarming and uplifting gifs or videos that evoke positive emotions.
In [24]:
display(top_centrality_subreddits_per_community[8])
{'r/Coronavirus',
 'r/Economics',
 'r/EverythingScience',
 'r/Futurology',
 'r/UpliftingNews',
 'r/environment',
 'r/opensource',
 'r/politics',
 'r/privacy',
 'r/technews',
 'r/technology'}

Those subreddits focus on topics related to science, technology, economics, and societal issues.

  • r/Coronavirus - a subreddit dedicated to the 2019 coronavirus COVID-19 outbreak.
  • r/Economics - a subreddit dedicated to the science of economics and the discussion of issues and news related to it.
  • r/EverythingScience - a subreddit dedicated to the discussion of science and scientific phenomena.
  • r/Futurology - a subreddit dedicated to the discussion of future developments in science and technology.
  • r/UpliftingNews - a subreddit dedicated to sharing news that is uplifting and positive.
  • r/environment - a subreddit dedicated to the discussion of environmental issues.
  • r/politics - a subreddit dedicated to the discussion of political issues.
  • r/technews - a subreddit dedicated to the discussion of technology news.
  • r/technology - a subreddit dedicated to the discussion of technology.
In [25]:
display(top_centrality_subreddits_per_community[9])
{'r/ArchitecturePorn',
 'r/Breath_of_the_Wild',
 'r/CatastrophicFailure',
 'r/Cyberpunk',
 'r/Design',
 'r/DiWHY',
 'r/F1Technical',
 'r/ImaginaryLandscapes',
 'r/MacroPorn',
 'r/Outdoors',
 'r/RoomPorn',
 'r/TechnicalDeathMetal',
 'r/arduino',
 'r/astrophotography',
 'r/carporn',
 'r/europe',
 'r/nostalgia',
 'r/photography',
 'r/risa',
 'r/space',
 'r/spaceporn',
 'r/wow'}

The majority of subreddits in this community focus on design, visual content, aesthetics, and photography.

Architecture and design:

  • r/ArchitecturePorn - a subreddit dedicated to sharing images of interesting architecture.
  • r/Design - a subreddit dedicated to the discussion of design in general.
  • r/RoomPorn - a subreddit dedicated to sharing images of interior design and aesthetically pleasing rooms.

Landscapes and outdoors:

  • r/Outdoors - a subreddit dedicated to the discussion of pleasing outdoor pictures.
  • r/ImaginaryLandscapes - a subreddit dedicated to sharing images of dreamlike landscapes.
  • r/astrophotography - a subreddit dedicated to sharing images of space and celestial bodies.
  • r/spaceporn - a subreddit dedicated to sharing images of space and celestial bodies.
  • r/space - a subreddit dedicated to the discussion of space and space exploration.
In [26]:
display(top_centrality_subreddits_per_community[10])
{'r/BiggerThanYouThought', 'r/bigtiddygothgf', 'r/u_nicolebun'}

Those subreddits are related to porn or explicit content.

Conclusion¶

We can see that the subreddits were split into communities according to the topics around which they are centered. The communities are not perfectly separated, but we can still see clear patterns.

This outcome is not theoretically surprising, as subreddits in the projection are connected when they share a common user who posted in both, and people tend to post in subreddits related to their interests. Therefore, subreddits related to the same topic are more likely to be connected in the projection. What is surprising to me is how clear this topic similarity is and how well the subreddits are separated into communities.

Robustness & percolation¶

How to give sense to the question of robustness?¶

Robustness is a measure of how well a network can withstand the removal of nodes. It is an important property of networks, as it helps us understand how the network will behave if some of its nodes fail.

The bipartite network that we have created can show us how information could flow between users and subreddits. Let's imagine that one user has posted something to a subreddit. Other users on that subreddit could see the post and decide to pass the information to other subreddits they tend to visit. That way the information could spread through the network.

In the context of the subreddits projection, it is highly unlikely that subreddits themselves would be removed from the network; communities are rarely closed or banned from Reddit. It is much more likely that some users would be removed from the network, either because they were banned from Reddit or because they deleted their accounts.

So in order to give sense to the question of robustness, I will analyze how the subreddits projection would behave if some of the users were to be removed from the network. That approach is a mix of:

  • robustness for the users projection (node removal),
  • percolation for the subreddits projection (edge removal),

in the context of effect on the subreddits projection.

Preparation¶

In order to perform the analysis, I need to create a dataset containing all the common users between each pair of subreddits. I will use the previously created user_sub_pairs dataframe.

In [34]:
display(user_sub_pairs.shape)
display(user_sub_pairs.head())
(38054, 3)
author subreddit num_posts
0 u/--5- r/india 2
1 u/--CreativeUsername r/Physics 2
2 u/--Fatal-- r/homelab 2
3 u/--MVH-- r/Netherlands 4
4 u/--Speed-- r/logodesign 2

Let's keep only the rows with subreddits that are in the largest connected component

In [7]:
user_sub_pairs_subs_lcc = user_sub_pairs[user_sub_pairs["subreddit"].isin(subs_lcc_data["subreddit"].tolist())]
display(user_sub_pairs_subs_lcc.shape)
display(user_sub_pairs_subs_lcc.head())
(36691, 3)
author subreddit num_posts
0 u/--5- r/india 2
1 u/--CreativeUsername r/Physics 2
2 u/--Fatal-- r/homelab 2
3 u/--MVH-- r/Netherlands 4
4 u/--Speed-- r/logodesign 2

Let's create the list of users for each subreddit

In [9]:
users_per_sub_subs_lcc = user_sub_pairs_subs_lcc.groupby("subreddit")["author"].apply(list).reset_index()
display(users_per_sub_subs_lcc.head())
subreddit author
0 r/13or30 [u/Balls-over-dick-man-, u/FormerFruit, u/TheS...
1 r/196 [u/1milionand6times, u/Alex9586, u/Anormalredd...
2 r/2020PoliceBrutality [u/ApartheidReddit, u/ApartheidUSA, u/BiafraMa...
3 r/2meirl4meirl [u/-wao, u/9w_lf9, u/ArticckK, u/BlueBerryChar...
4 r/3Dprinting [u/3demonster, u/Antique_Steel, u/Bigbore_729,...

Let's create the sub-sub dataframe with the list of common users for each pair of subreddits

In [10]:
frames = []
for i, row_1 in enumerate(users_per_sub_subs_lcc.itertuples()):
    for j, row_2 in enumerate(users_per_sub_subs_lcc[i+1:].itertuples()):
        common_users = set(row_1.author).intersection(row_2.author)
        if len(common_users) > 0:
            frames.append([row_1.subreddit, row_2.subreddit, common_users])

sub_sub_common_users_subs_lcc = pd.DataFrame(frames, columns=["subreddit_1", "subreddit_2", "common_users"])
display(sub_sub_common_users_subs_lcc.head())
subreddit_1 subreddit_2 common_users
0 r/13or30 r/AbsoluteUnits {u/FormerFruit}
1 r/13or30 r/MrRobot {u/FormerFruit}
2 r/13or30 r/extremelyinfuriating {u/FormerFruit}
3 r/13or30 r/foxes {u/FormerFruit}
4 r/13or30 r/interestingasfuck {u/prolelol}

Another helpful dataframe is the one with the number of subreddits each user has posted in. We will use the previously created num_subs_per_user dataframe.

In [16]:
num_subs_per_user_subs_lcc = num_subs_per_user[num_subs_per_user["author"].isin(user_sub_pairs_subs_lcc["author"].tolist())]
display(num_subs_per_user_subs_lcc.head())
author num_subs
0 u/My_Memes_Will_Cure_U 63
1 u/Master1718 60
2 u/memezzer 49
3 u/KevlarYarmulke 47
4 u/5_Frog_Margin 45

Analysis¶

The analysis will be performed in the following way:

  1. Create a networkx graph from the sub-sub dataframe (largest connected component in the subreddits projection) with the set of users as the attribute of each edge.
  2. Choose a random user from the set of users in the graph.
  3. Remove the user from all the sets of users of the edges that contain the user.
  4. Remove all the edges that have an empty set of users.
  5. Calculate the global efficiency of the graph.
  6. Repeat steps 2-5 for desired number of iterations.
In [113]:
G_subs_lcc = nx.from_pandas_edgelist(sub_sub_common_users_subs_lcc, source="subreddit_1", target="subreddit_2", edge_attr="common_users")

Let's see what the network looks like.

In [91]:
nx.draw(G_subs_lcc, pos=nx.spring_layout(G_subs_lcc, seed=42), node_size=10, width=0.1)
In [95]:
# In order to speed up the calculation, let's create a dictionary
# that maps each user to the edges that contain it
user_edges_subs_lcc = defaultdict(list)
for edge in G_subs_lcc.edges(data=True):
    for user in edge[2]["common_users"]:
        user_edges_subs_lcc[user].append((edge[0], edge[1]))

First let's analyze the situation in which we remove the users at random (random failure scenario).

In [100]:
np.random.seed(42)
users_subs_lcc = num_subs_per_user_subs_lcc["author"].tolist()
users_subs_lcc = np.random.permutation(users_subs_lcc)
# We have to use deepcopy because we will modify the sets of users
G_subs_lcc_random = deepcopy(G_subs_lcc)

subs_lcc_efficiencies_random = []
subs_lcc_efficiencies_random.append(nx.global_efficiency(G_subs_lcc_random))
subs_lcc_efficiencies_random_num_edges_removed = [0]

for user_to_remove in tqdm(users_subs_lcc):
    edges_with_user = user_edges_subs_lcc[user_to_remove]
    edges_to_remove = []

    for edge_name in edges_with_user:
        edge = G_subs_lcc_random.edges[edge_name]
        edge["common_users"].remove(user_to_remove)
        if len(edge["common_users"]) == 0:
            edges_to_remove.append(edge_name)
    
    G_subs_lcc_random.remove_edges_from(edges_to_remove)
    subs_lcc_efficiencies_random_num_edges_removed.append(len(edges_to_remove))

    # To speed up the calculation, I will only calculate the global efficiency
    # if I removed at least one edge
    # otherwise, we will just append the last value
    if len(edges_to_remove) > 0:
        subs_lcc_efficiencies_random.append(nx.global_efficiency(G_subs_lcc_random))
    else:
        subs_lcc_efficiencies_random.append(subs_lcc_efficiencies_random[-1])

    if G_subs_lcc_random.number_of_edges() == 0:
        break
        
100%|██████████| 30950/30950 [1:02:02<00:00,  8.31it/s] 

Now, let's analyze the situation in which we remove the users with the highest number of subreddits they have posted in (targeted attack scenario). This approach makes less sense, as it is less likely that very active users would stop using Reddit, but it will be interesting to see how the network behaves in such a scenario.

In [117]:
users_subs_lcc = num_subs_per_user_subs_lcc.sort_values(by="num_subs", ascending=False)["author"].tolist()
G_subs_lcc_targeted = deepcopy(G_subs_lcc)

subs_lcc_efficiencies_targeted = []
subs_lcc_efficiencies_targeted.append(nx.global_efficiency(G_subs_lcc_targeted))
subs_lcc_efficiencies_targeted_num_edges_removed = [0]

for user_to_remove in tqdm(users_subs_lcc):
    edges_with_user = user_edges_subs_lcc[user_to_remove]
    edges_to_remove = []

    for edge_name in edges_with_user:
        edge = G_subs_lcc_targeted.edges[edge_name]
        edge["common_users"].remove(user_to_remove)
        if len(edge["common_users"]) == 0:
            edges_to_remove.append(edge_name)

    G_subs_lcc_targeted.remove_edges_from(edges_to_remove)
    subs_lcc_efficiencies_targeted_num_edges_removed.append(len(edges_to_remove))

    if len(edges_to_remove) > 0:
        subs_lcc_efficiencies_targeted.append(nx.global_efficiency(G_subs_lcc_targeted))
    else:
        subs_lcc_efficiencies_targeted.append(subs_lcc_efficiencies_targeted[-1])

    if G_subs_lcc_targeted.number_of_edges() == 0:
        break
100%|██████████| 30950/30950 [1:01:32<00:00,  8.38it/s]  

Let's plot the global efficiency of the network in both scenarios.

In [130]:
plt.figure(figsize=(13, 10))
plt.plot(subs_lcc_efficiencies_random, label="Random")
plt.plot(subs_lcc_efficiencies_targeted, label="Targeted")
plt.title("Global efficiency of the subreddits LCC")
plt.xlabel("Number of users removed")
plt.ylabel("Global efficiency")
plt.legend()
plt.show()

We can see that the network is relatively resilient to random failures. However, that is not the case for targeted attacks: when the most active users are removed from the network, the global efficiency drops significantly. This is expected, as social networks commonly exhibit Zipf's-law-like behavior; in this case it means that a small number of users are responsible for the majority of the network's connections. Keep in mind that (as stated before) it is quite unlikely that the most active users would stop using Reddit, so we can consider the network to be quite robust.

It would be interesting to compare the network to other random network models in the context of robustness. However, that would not be straightforward, as I do not analyze typical percolation or robustness scenarios, but rather a mix of them with a somewhat edge-weighted approach.
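
To make the claim that a small number of users carries most of the connections more concrete, the following minimal sketch (reusing user_edges_subs_lcc and num_subs_per_user_subs_lcc; the cutoff of 100 users is an arbitrary illustrative choice) counts what fraction of the LCC's edges involve at least one of the most active users:

# Fraction of edges in the subreddits LCC that contain at least one of the
# top_k most active users among their common users (top_k chosen arbitrarily)
top_k = 100
top_users = num_subs_per_user_subs_lcc.sort_values(by="num_subs", ascending=False)["author"].head(top_k)

edges_touched = set()
for user in top_users:
    edges_touched.update(user_edges_subs_lcc[user])

total_edges = G_subs_lcc.number_of_edges()
print(f"Top {top_k} users appear on {len(edges_touched)} of {total_edges} edges "
      f"({len(edges_touched) / total_edges:.1%})")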

Diffusion¶

The diffusion analysis (similarly to the previous sections) will be performed only on the largest connected component of the subreddits projection, as the size of the users projection unfortunately makes the analysis very time consuming.

How to give sense to the question of diffusion?¶

Diffusion in the context of the subreddits projection can be understood as the process of spreading information through the network. The spreading of information can be modeled like the spreading of a virus: in this case the virus is the information, and the nodes are the subreddits, connected in the projection by the users that posted in them.

I will model the phenomenon of diffusion similarly to a contact SI model, but with a twist. In the SI model, nodes can be in one of two states: susceptible or infected. Similarly, I will define two states for the subreddits:

  • S - the information can be spread to the subreddit, but it cannot spread from it,
  • I - the information can spread from the subreddit to other subreddits.

The difference between the SI model and the model I will use is that in the SI model the probability of spreading the virus is constant, whereas in my approach the probability of spreading the information decreases with time, as a user is more likely to see a new post than an old one. The probability also depends on the number of common users between the subreddits. I have also assumed that the "infection" of a subreddit that has already been exposed to the information can be "refreshed" (the information can be re-spread to it).

Note that I will sometimes refer to the information as a virus and to the nodes to which the information has spread as "infected", but only for the sake of simplicity.

Preparation¶

In [17]:
# Create the network
G_subs_lcc = nx.from_pandas_edgelist(sub_sub_common_users_subs_lcc, source="subreddit_1", target="subreddit_2", edge_attr="common_users")

# Change common_users attribute to the number of common users
for edge in G_subs_lcc.edges(data=True):
    edge[2]["common_users"] = len(edge[2]["common_users"])

To visualize the network, I exported it from Cytoscape and extracted the node positions to a csv file. That way we can use a good-looking layout with the nodes grouped according to their communities.

In [18]:
node_pos = pd.read_csv(os.path.join(NETWORKS_PATH, "subreddits_lcc_communities_node_collection_from_cyto.csv"))
node_pos = node_pos.set_index("subreddit")

# Convert the dataframe to a dictionary and to a format that can be used by networkx
node_pos_dict = node_pos.to_dict(orient="index")
node_pos_dict = {k: (v["x"], v["y"]) for k, v in node_pos_dict.items()}

Let's draw the network to see if the positions are correct.

In [19]:
plt.figure(figsize=(15, 10))
plt.gca().set_aspect("equal", adjustable="box")

nx.draw(
    G_subs_lcc,
    pos=node_pos_dict,
    node_size=10,
    width=0.1,
    with_labels=False,
)

plt.show()

Let's define the function that calculates the probability of spreading the information between two subreddits.

The function I chose is an exponential decay function, which is often used to model the decay of a quantity over time. In this case the quantity is the probability of spreading the information between two subreddits, and the time is the time since the information was posted to the source subreddit.

I don't know if it is the best function for this purpose, but it seems to be a good fit.
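
Written out, the probability implemented in the cell below is

$$p_{\text{spread}}(u, t) = P_{\max} \cdot e^{-t/20} \cdot \frac{u}{U_{\max}},$$

where $u$ is the number of common users on the edge, $t$ is the number of frames since the information last reached the source subreddit, $P_{\max} = 0.5$ (MAX_PROBABILITY_OF_SPREAD) and $U_{\max}$ is the maximum number of collected users for a single subreddit (MAX_COMMON_USERS). The decay constant of 20 frames is an arbitrary modeling choice.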

In [37]:
MAX_PROBABILITY_OF_SPREAD = 0.5
MAX_COMMON_USERS = subs_lcc_data["num_users"].max()

def probability_of_spread(num_common_users: int, time_from_post: int):
    return MAX_PROBABILITY_OF_SPREAD * math.exp(-time_from_post / 20) * (num_common_users / MAX_COMMON_USERS)


nums_users = [1, 5, 10, 20, 50, 90]
plt.figure(figsize=(12, 8))

for num_users in nums_users:
    plt.plot(
        [probability_of_spread(num_users, time_from_post) for time_from_post in range(50)],
        label=f"Number of common users: {num_users}"
    )

plt.legend()
plt.title("Probability of spread for different number of common users")
plt.xlabel("Time from post")
plt.ylabel("Probability of spread")
plt.show()

Simulation¶

In [23]:
ANIMATION_PATH = os.path.join(os.getcwd(), "diffusion_animation")

Let's define the function that assigns colors to the nodes according to their states. Blue nodes are the ones the information has not reached yet, and red nodes are the ones it has reached. The red nodes get lighter over time, as the probability of spreading the information decreases.

In [81]:
MAX_TIME_FROM_POST = 50

def get_node_color(frame, t_spread):
    # Nodes that the information has not reached yet are drawn in blue
    if t_spread is None:
        return "blue"

    # Infected nodes fade from red towards yellow (the "autumn" colormap)
    # as the time since the last spread to them grows
    time_from_post = (frame - t_spread) / MAX_TIME_FROM_POST
    return plt.cm.autumn(time_from_post)

# Plot color legend
plt.figure(figsize=(7, 1))
for i in range(100):
    plt.scatter(i, 0, color=get_node_color(i, 0))
plt.gca().axes.get_yaxis().set_visible(False)

plt.title("Node color with respect to time from the last spread to it")
plt.xlabel("Time from spread")
plt.show()

Let's run the simulation for 1000 iterations.

In [83]:
NUM_FRAMES = 1000

# None of the nodes have been infected yet
for node in G_subs_lcc.nodes():
    G_subs_lcc.nodes[node]["t_spread"] = None

# Choose a random node to start the spread
starting_node = np.random.choice(list(G_subs_lcc.nodes()))
G_subs_lcc.nodes[starting_node]["t_spread"] = 0

num_nodes_spread = []

for frame in tqdm(range(NUM_FRAMES)):
    for node in G_subs_lcc.nodes():
        t_spread = G_subs_lcc.nodes[node]["t_spread"]

        # Only nodes that already have the information can spread it to their neighbors
        if t_spread is not None:
            neighbors = G_subs_lcc.neighbors(node)
            for neighbor in neighbors:
                common_users = G_subs_lcc.edges[node, neighbor]["common_users"]
                p_spread = probability_of_spread(common_users, frame - t_spread)
                if np.random.random() < p_spread:
                    # Spread the infection
                    G_subs_lcc.nodes[neighbor]["t_spread"] = frame

    # Save the number of nodes that have been infected
    num_nodes_spread.append(len([node for node in G_subs_lcc.nodes() if G_subs_lcc.nodes[node]["t_spread"] is not None]))

    # Plot the network
    plt.figure(figsize=(10, 10))
    plt.gca().set_aspect("equal", adjustable="box")
    nx.draw(
        G_subs_lcc,
        pos=node_pos_dict,
        node_size=10,
        width=0.1,
        with_labels=False,
        node_color=[get_node_color(frame, G_subs_lcc.nodes[node]["t_spread"]) for node in G_subs_lcc.nodes()],
    )

    # Save the plot
    plt.title(f"Frame {frame}")
    plt.savefig(os.path.join(ANIMATION_PATH, f"frame_{frame}.png"))

    # Show the plot every 10 frames
    if frame % 10 == 0:
        clear_output(wait=True)
        plt.show()

    plt.clf()
    plt.close()
                
100%|██████████| 1000/1000 [35:34<00:00,  2.13s/it]

Let's combine the frames into a video.

In [49]:
IMG_PATH = os.path.join(os.getcwd(), "img")
In [84]:
filenames = []

for filename in os.listdir(ANIMATION_PATH):
    filenames.append(os.path.join(ANIMATION_PATH, filename))

filenames.sort(key=lambda x: int(x.split("_")[-1].split(".")[0]))

mp4_writer = imageio.get_writer(os.path.join(IMG_PATH, "diffusion_animation.mp4"), fps=15)
for filename in tqdm(filenames):
    mp4_writer.append_data(imageio.imread(filename))

mp4_writer.close()
  0%|          | 0/1000 [00:00<?, ?it/s]C:\Users\steci\AppData\Local\Temp\ipykernel_17996\3936832062.py:10: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  mp4_writer.append_data(imageio.imread(filename))
IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (1000, 1000) to (1008, 1008) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).
100%|██████████| 1000/1000 [00:54<00:00, 18.50it/s]
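
As a side note, the DeprecationWarning above can be avoided by importing the v2 API explicitly, as the warning itself suggests. A minimal sketch with the same frame-stitching loop:

import imageio.v2 as iio

# Same as above, but using the explicit v2 API so the behaviour stays the same under ImageIO v3
writer = iio.get_writer(os.path.join(IMG_PATH, "diffusion_animation.mp4"), fps=15)
for filename in tqdm(filenames):
    writer.append_data(iio.imread(filename))
writer.close()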

Results¶

The diffusion animation (diffusion_animation.mp4) is available in the img folder.

Let's plot the part of the network to which the information has spread over time.

In [92]:
num_nodes_spread_np = np.array(num_nodes_spread)
num_nodes_spread_np = num_nodes_spread_np / G_subs_lcc.number_of_nodes()

plt.figure(figsize=(12, 8))
plt.plot(num_nodes_spread_np, label="Infected nodes")
plt.plot(1 - num_nodes_spread_np, label="Uninfected nodes")
plt.title("Part of the network to which the information has spread over time")
plt.xlabel("Time")
plt.ylabel("Part of the network")
plt.legend()
plt.show()

print(F"Part of the network infected after {NUM_FRAMES} frames: {round(num_nodes_spread_np[-1], 4)}")
Part of the network infected after 1000 frames: 0.9826

We can see that the results somewhat resemble those obtained with classic SI models. With the parameters I chose, 1000 iterations were not enough to spread the information to the whole network. We can also see that the information needed quite some time to start spreading, but once it did, it spread very quickly. This probably depends heavily on the starting node that was chosen.
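
For comparison, here is a minimal sketch of how a classic SI run with a constant infection probability would look on the same network. BETA is an arbitrary value chosen for illustration and is not used anywhere else in this project.

BETA = 0.01  # constant infection probability per neighbor per step (arbitrary)

# Classic SI: once infected, a node stays infected and infects each
# susceptible neighbor with probability BETA in every step
si_infected = {node: False for node in G_subs_lcc.nodes()}
si_infected[np.random.choice(list(G_subs_lcc.nodes()))] = True

si_fraction = []
for _ in range(NUM_FRAMES):
    for node in G_subs_lcc.nodes():
        if si_infected[node]:
            for neighbor in G_subs_lcc.neighbors(node):
                if not si_infected[neighbor] and np.random.random() < BETA:
                    si_infected[neighbor] = True
    si_fraction.append(sum(si_infected.values()) / G_subs_lcc.number_of_nodes())

# si_fraction can then be plotted alongside num_nodes_spread_np for comparison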

Further improvements¶

Other analyses that could be performed in the future:

  • implement some kind of cooldown for nodes that have already been exposed to the information, which would make the model more realistic (see the sketch after this list),
  • try other functions for calculating the probability of spreading the information,
  • run the simulation for more iterations to see whether the information would spread to the whole network,
  • take the communities of the subreddits into account when calculating the probability of spreading the information.
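
As mentioned in the first bullet, one way the cooldown could be wired into the spread step is sketched below. COOLDOWN and the helper name are hypothetical and not part of the simulation above; the node and edge attributes are the same ones used earlier.

COOLDOWN = 10  # minimum number of frames between two spreads to the same node (arbitrary)

def try_spread_with_cooldown(G, node, neighbor, frame):
    # Skip neighbors whose last exposure happened less than COOLDOWN frames ago
    t_last = G.nodes[neighbor]["t_spread"]
    if t_last is not None and frame - t_last < COOLDOWN:
        return
    common_users = G.edges[node, neighbor]["common_users"]
    time_from_post = frame - G.nodes[node]["t_spread"]
    if np.random.random() < probability_of_spread(common_users, time_from_post):
        G.nodes[neighbor]["t_spread"] = frame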

Conclusions¶

In conclusion, this project delved into the fascinating realm of user-subreddit interactions within a bipartite Reddit network. Through comprehensive analysis, various aspects of this network were explored.

By examining the node degree distributions, we gained insights into the patterns of user-subreddit connections, identifying hubs and peripheral nodes within the network.

Centralities helped us understand the importance and influence of individual nodes, highlighting key users and subreddits that play significant roles in information dissemination and interaction dynamics.

The identification of communities within the network shed light on the thematic clusters and groupings of subreddits, enabling a deeper understanding of the social dynamics and shared interests present in the Reddit community.

Additionally, the assessment of network robustness provided valuable insights into the network's resilience to node or link removal, informing strategies for network optimization and stability.

Exploring diffusion processes within the network allowed us to investigate how information or influence spreads among users and subreddits.

Moreover, comparing the network with various random network models provided a benchmark for evaluating its structural characteristics, highlighting its uniqueness and uncovering specific properties that differentiate it from random structures.

Overall, this project presents a comprehensive analysis of a user-subreddit bipartite network, uncovering valuable insights into its structure and dynamics. The findings contribute to the growing field of social network analysis, offering a deeper understanding of online communities, information flow, and social interactions within the Reddit platform.

I hereby declare that all of my code, text, and figures were produced by myself.