Webscraping a List of Subreddits with Python

The Plan

To dip my toe into data analysis using Python, I decided to analyse Reddit and find answers to some interesting questions, like:

  • What is the average number of upvotes and does it change for each subreddit?
  • If I wanted to get the most upvotes, which subreddit should I post to?
  • Which subreddit is the harshest (largest proportion of posts with no upvotes)?
  • Which subreddit has the most engagement (the highest number of comments per post)?
  • On average, are there more positive or negative words in the titles of posts across reddit?

Locating the Data

First of all, I needed a list of all the subreddits. Luckily, there is a website just for this which lists around 1500 subreddits – all the ones with over 50,000 subscribers… that should be plenty of data to analyse! The website looks like this:

In this picture you can only see the first 10 or so subreddits, but as there are so many, getting a list of them manually is not really an option. Besides, that wouldn’t make for much programming experience…

Using the Google Chrome page inspector (ctrl + shift + i), the html for the page can be seen. It’s a bit more complicated, but you can still see where the subreddits are at the bottom.

Webscraping the Data

The process of extracting data from the html on a website is called webscraping. To webscrape the names of the subreddits, I only actually need two rules:

  1. All of the subreddit names are within anchor tags (<a></a>) with rel = “nofollow”
  2. All of these lie within paragraph tags (<p></p>) in a div tag (<div></div>) with the class “md wiki”

Therefore, my code should search for the div tag with class equal to “md wiki” and for every paragraph tag inside it, find the anchor tags with rel=”nofollow” and extract their text.

A final thing to do is to check whether this text begins with “/r/”. If so, then it’s a subreddit name, which is what we want.
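To make the two rules and the “/r/” check concrete, here is a minimal sketch that applies them to a small, made-up snippet of html. It uses Python's built-in html.parser module purely so that it runs without installing anything. The class name SubredditLinkParser and the sample html are assumptions for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class SubredditLinkParser(HTMLParser):
    """Hypothetical helper: collects the text of <a rel="nofollow"> tags that sit
    inside a <p> inside <div class="md wiki">, keeping only text starting with /r/."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0     # > 0 while inside the "md wiki" div (counts nested divs too)
        self.p_depth = 0       # > 0 while inside a <p> within that div
        self.in_link = False   # True while inside a matching <a> tag
        self.link_text = []    # Pieces of text seen inside the current <a> tag
        self.names = []        # The subreddit names found so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif attrs.get("class") == "md wiki":
                self.div_depth = 1
        elif tag == "p" and self.div_depth > 0:
            self.p_depth += 1
        elif tag == "a" and self.p_depth > 0 and attrs.get("rel") == "nofollow":
            self.in_link = True
            self.link_text = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p" and self.p_depth > 0:
            self.p_depth -= 1
        elif tag == "a" and self.in_link:
            self.in_link = False
            text = "".join(self.link_text)
            if text.startswith("/r/"):      # The final check: is it a subreddit name?
                self.names.append(text)

    def handle_data(self, data):
        if self.in_link:
            self.link_text.append(data)


# Made-up html: two subreddit links, one external link, and one link outside the div
sample_html = """
<div class="md wiki">
  <p><a rel="nofollow" href="/r/gifs">/r/gifs</a> and <a rel="nofollow" href="/r/pics">/r/pics</a></p>
  <p><a rel="nofollow" href="https://example.com">an external link</a></p>
</div>
<p><a rel="nofollow" href="/r/ignored">/r/ignored</a></p>
"""

parser = SubredditLinkParser()
parser.feed(sample_html)
print(parser.names)   # ['/r/gifs', '/r/pics']
```

The external link is dropped by the “/r/” check, and the last link is dropped because it is not inside the “md wiki” div.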

Getting the Webpage

A GET request needs to be made to the webpage containing the list of subreddits (https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits).

A GET request basically just fetches all of the html for the page as a great big block of text which we can process in Python. To make the GET request, a library called Requests is used. The code is fairly standard and, I’ll be honest, I just copied it from a forum because it does exactly what I need! It is as follows:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing

def simple_get(url: object) -> object:
    # Attempts to get the content at `url` by making an HTTP GET request.
    # If the content-type of response is some kind of HTML/XML, return the
    # text content, otherwise return None
    # :rtype: object

    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        print('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    # Returns True if the response seems to be HTML, False otherwise

    # Use .get() so a missing Content-Type header gives None instead of raising a KeyError
    content_type = resp.headers.get('Content-Type')
    return (resp.status_code == 200
            and content_type is not None
            and 'html' in content_type.lower())

 

Now calling the “simple_get” function will return the html of the webpage in Python. Yay!

subreddit_list_page = 'https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits'
raw_html = simple_get(subreddit_list_page)

 

Beautiful Soup

Searching through all this html would be a difficult job if it weren’t for Beautiful Soup – a library which turns every html tag into an easily usable Python object. These objects can be searched, iterated over and it just generally makes everything a lot easier – thanks Beautiful Soup!

We can now use Beautiful Soup to apply the two webscraping rules we need to extract the list of subreddits from the html. The following bit of code takes the html, gives it to Beautiful Soup to work its magic and then iterates over the different html tags, extracting the subreddit names and putting them in a list called “subrds”.

from bs4 import BeautifulSoup   # The library used for webscraping

subreddit_list_page = 'https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits'

subrds = []     # A list to contain all of the subreddits which are scraped from the webpage

raw_html = simple_get(subreddit_list_page)      # Get the raw html
html = BeautifulSoup(raw_html, 'html.parser')   # turn it into Python objects using BeautifulSoup

wiki = html.find("div", {"class": "md wiki"})        # Get all the html within the div with class "md wiki"

for p in wiki.find_all('p'):                         # For every paragraph tag within...
    for a in p.find_all('a', {'rel': 'nofollow'}):   # For every anchor (<a>) tag inside it with rel = "nofollow"...

        tag_text = str(a.text)                       # Get the text of the anchor tag

        if tag_text.startswith('/r/'):               # If it starts with "/r/"....
            print(tag_text)                          # Print it
            subrds.append(tag_text)                  # Add the subreddit name to the list "subrds"

Putting it Together

We now have the two pieces of code we need – the first piece uses a GET request to fetch the raw html from the subreddit list webpage, and the second piece uses Beautiful Soup to extract the names of the subreddits from this raw html. Finally, here is the finished code:

from bs4 import BeautifulSoup   # The library used for webscraping
from requests import get
from requests.exceptions import RequestException
from contextlib import closing

subreddit_list_page = 'https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits'

subrds = []     # A list to contain all of the subreddits which are scraped from the webpage


def simple_get(url: object) -> object:
    # Attempts to get the content at `url` by making an HTTP GET request.
    # If the content-type of response is some kind of HTML/XML, return the
    # text content, otherwise return None
    # :rtype: object

    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        print('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    # Returns True if the response seems to be HTML, False otherwise

    # Use .get() so a missing Content-Type header gives None instead of raising a KeyError
    content_type = resp.headers.get('Content-Type')
    return (resp.status_code == 200
            and content_type is not None
            and 'html' in content_type.lower())


raw_html = simple_get(subreddit_list_page)                  # Get the raw html
html = BeautifulSoup(raw_html, 'html.parser')               # turn it into Python objects using BeautifulSoup

wiki = html.find("div", {"class": "md wiki"})               # Get all the html within the div with class "md wiki"

for p in wiki.find_all('p'):                                # For every paragraph (<p>) tag within...
    for a in p.find_all('a', {'rel': 'nofollow'}):          # If it contains an anchor (<a>) tag with rel = "nofollow"...

        tag_text = str(a.text)                              # Get the text of the anchor tag

        if tag_text.startswith('/r/'):                      # If it starts with "/r/"....
            print(tag_text)                                 # Print it
            subrds.append(tag_text)                         # Add the subreddit name to the list "subrds"

Output 

As the code loops through each anchor tag (<a></a>), the line “print(tag_text)” writes the name of each subreddit to the console.

I then went one step further and got it to write to a file, resulting in the following output. Be warned, it’s a long list!


ListOfSubreddits
gifs
behindthegifs
gif
Cinemagraphs
WastedGifs
educationalgifs
perfectloops
highqualitygifs
gifsound
combinedgifs
retiredgif
michaelbaygifs
gifrecipes
mechanical_gifs
bettereveryloop
gifextra
slygifs
gifsthatkeepongiving
wholesomegifs
noisygifs
blackpeoplegifs
whitepeoplegifs
reactiongifs
shittyreactiongifs
chemicalreactiongifs
physicsgifs
babyelephantgifs
weathergifs
pics
PhotoshopBattles
perfecttiming
itookapicture
Pareidolia
ExpectationVSReality
dogpictures
misleadingthumbnails
FifthWorldPics
TheWayWeWere
pic
nocontextpics
mildlyinteresting
interestingasfuck
damnthatsinteresting
beamazed
gentlemanboners
prettygirls
hardbodies
girlsmirin
thinspo
goddesses
shorthairedhotties
fitandnatural
asiancuties
PhotoshopBattles
ColorizedHistory
reallifedoodles
HybridAnimals
amiugly
roastme
rateme
uglyduckling
wallpapers
wallpaper
Offensive_Wallpapers
videos
youtubehaiku
artisanvideos
DeepIntoYouTube
nottimanderic
ShowerThoughts
DoesAnybodyElse
changemyview
crazyideas
howtonotgiveafuck
tipofmytongue
quotes
casualconversation
relationship_advice
raisedbynarcissists
legaladvice
bestoflegaladvice
advice
IAmA
ExplainlikeIAmA
AMA
casualiama
de_Iama
whowouldwin
wouldyourather
scenesfromahat
AskOuija
whatisthisthing
AskReddit
answers
NoStupidQuestions
amiugly
whatsthisbug
samplesize
AskReddit
ShittyAskScience
TrueAskReddit
AskScienceFiction
AskOuija
AskScience
askhistorians
askculinary
AskSocialScience
askengineers
askphilosophy
askwomen
askmen
askgaybros
askredditafterdark
tifu
self
confession
fatpeoplestories
talesfromtechsupport
talesfromretail
techsupportmacgyver
idontworkherelady
TalesFromYourServer
KitchenConfidential
TalesFromThePizzaGuy
TalesFromTheFrontDesk
pettyrevenge
prorevenge
nosleep
LetsNotMeet
Glitch_in_the_Matrix
shortscarystories
thetruthishere
UnresolvedMysteries
UnsolvedMysteries
depression
SuicideWatch
Anxiety
foreveralone
offmychest
socialanxiety
YouShouldKnow
everymanshouldknow
LearnUselessTalents
changemyview
howto
Foodforthought
educationalgifs
UniversityofReddit
lectures
education
college
GetStudying
teachers
todayilearned
wikipedia
OutOfTheLoop
IWantToLearn
explainlikeimfive
explainlikeIAmA
ExplainLikeImCalvin
anthropology
Art
redditgetsdrawn
heavymind
drawing
graffiti
retrofuturism
sketchdaily
ArtPorn
pixelart
artfundamentals
learnart
gamedev
engineering
ubuntu
cscareerquestions
EngineeringStudents
askengineers
learnprogramming
compsci
java
javascript
coding
machinelearning
howtohack
cpp
python
learnpython
Economics
business
entrepreneur
marketing
business
smallbusiness
stocks
wallstreetbets
stockmarket
environment
historynetwork
history
AskHistorians
ColorizedHistory
badhistory
100yearsago
HistoryPorn
PropagandaPosters
TheWayWeWere
linguistics
languagelearning
learnjapanese
law
math
theydidthemath
medicalschool
psychology
Science
AskScience
cogsci
medicine
everythingscience
Space
SpacePorn
astronomy
astrophotography
spacex
nasa
biology
Awwducational
chemicalreactiongifs
chemistry
physics
entertainment
fantheories
Disney
obscuremedia
anime
manga
anime_irl
awwnime
TsundereSharks
animesuggest
pokemon
onepiece
naruto
dbz
onepunchman
ShingekiNoKyojin
yugioh
Books
WritingPrompts
writing
literature
booksuggestions
lifeofnorman
poetry
screenwriting
freeEbooks
boottoobig
hfy
suggestmeabook
comics
comicbooks
polandball
marvel
webcomics
bertstrips
marvelstudios
harrypotter
batman
calvinandhobbes
explainlikeimcalvin
lotr
xkcd
DCComics
arrow
asoiaf
gameofthrones
freefolk
celebs
onetruegod
EmmaWatson
joerogan
jessicanigri
cosplay
cosplaygirls
magicTCG
lego
boardgames
rpg
chess
poker
DnD
DnDGreentext
DnDBehindTheScreen
dndnext
zombies
cyberpunk
fantasy
scifi
starwars
startrek
asksciencefiction
prequelmemes
empiredidnothingwrong
SequelMemes
sciencefiction
InternetIsBeautiful
facepalm
wikipedia
creepyPMs
web_design
google
darknetmarkets
cynicalbrit
KenM
bannedfromclubpenguin
amazontoprated
savedyouaclick
bestofworldstar
4chan
Classic4chan
greentext
facepalm
oldpeoplefacebook
facebookwins
indianpeoplefacebook
terriblefacebookmemes
insanepeoplefacebook
Tinder
OkCupid
KotakuInAction
wikileaks
twitch
livestreamfail
serialpodcast
podcasts
tumblrinaction
tumblr
blackpeopletwitter
scottishpeopletwitter
WhitePeopleTwitter
wholesomebpt
YoutubeHaiku
youtube
gamegrumps
h3h3productions
CGPGrey
yogscast
jontron
Idubbbz
defranco
roosterteeth
funhaus
rwby
movies
documentaries
fullmoviesonyoutube
truefilm
marvelstudios
bollywoodrealism
moviedetails
moviesinthemaking
fullmoviesonvimeo
continuityporn
starwars
harrypotter
lotr
batman
DC_Cinematic
listentothis
music
listentothis
guitar
WeAreTheMusicMakers
mashups
vinyl
futurebeats
musictheory
guitarlessons
piano
spotify
bass
fakealbumcovers
kanye
radiohead
KendrickLamar
gorillaz
hiphopheads
metal
classicalmusic
jazz
trap
indieheads
gamemusic
outrun
vaporwave
dubstep
electronicmusic
edmproduction
EDM
spop
kpop
sports
running
bicycling
formula1
golf
fishing
skiing
sportsarefun
tennis
rugbyunion
discgolf
nfl
CFB
fantasyfootball
nflstreams
patriots
eagles
greenbaypackers
baseball
mlb
nba
collegebasketball
nbastreams
warriors
skateboarding
snowboarding
longboarding
MMA
squaredcircle
theocho
ufc
boxing
wwe
hockey
nhl
nhlstreams
olympics
apocalympics2016
soccer
worldcup
Bundesliga
futbol
soccerstreams
MLS
gunners
reddevils
LiverpoolFC
chelseafc
Television
marvelstudios
japanesegameshows
shield
cordcutters
GameOfThrones
BreakingBad
thewalkingdead
community
arresteddevelopment
topgear
StarTrek
HIMYM
firefly
IASIP
PandR
Sherlock
DunderMifflin
BetterCallSaul
TrueDetective
houseofcards
MakingaMurderer
FlashTV
trailerparkboys
mrrobot
siliconvalleyhbo
strangerthings
supernatural
thegrandtour
AmericanHorrorStory
rupaulsdragrace
westworld
blackmirror
FilthyFrank
orangeisthenewblack
twinpeaks
bigbrother
Pokemon
AdventureTime
futurama
TheLastAirbender
ArcherFX
southpark
TheSimpsons
mylittlepony
rickandmorty
naruto
stevenuniverse
dbz
onepunchman
BobsBurgers
BoJackHorseman
gravityfalls
doctorwho
gallifrey
seinfeld
redditwritesseinfeld
NetflixBestOf
Netflix
bestofnetflix
DIY
cosplay
woodworking
somethingimade
architecture
CoolGuides
WorldBuilding
aquariums
ifyoulikeblank
DiWHY
knitting
sewing
modelmakers
crochet
ProtectAndServe
RTLSDR
digitalnomad
FastWorkers
accounting
preppers
art
Drawing
crafts
alternativeart
sketchdaily
artporn
glitch_art
coloringcorruptions
restofthefuckingowl
DisneyVacation
Writing
screenwriting
fountainpens
calligraphy
cars
motorcycles
carporn
justrolledintotheshop
Shitty_Car_Mods
autos
roadcam
AutoDetailing
subaru
teslamotors
bmw
jeep
CrappyDesign
web_design
graphic_design
design
designporn
InteriorDesign
ATBGE
dontdeadopeninside
assholedesign
keming
actlikeyoubelong
irlsmurfing
MilitaryPorn
military
combatfootage
militarygfys
guns
gunporn
gundeals
ar15
firearms
Jobs
forhire
cscareerquestions
workonline
guitar
WeAreTheMusicMakers
edmproduction
piano
gardening
urbanexploration
survival
backpacking
camping
homestead
MTB
campingandhiking
hiking
ultralight
photography
itookapicture
Filmmakers
astrophotography
analog
photocritique
aviation
flying
sysadmin
engineering
compsci
webdev
programmerhumor
graphic_design
mechanicalkeyboards
reverseengineering
itsaunixsystem
plex
dailyprogrammer
coding
python
java
cpp
buildapc
buildapcsales
buildapcforme
talesfromtechsupport
techsupportgore
techsupport
softwaregore
iiiiiiitttttttttttt
watches
lockpicking
knives
specializedtools
travel
solotravel
LifeProTips
lifehacks
geek
battlestations
EDC
simpleliving
tinyhouses
rainmeter
vandwellers
UnethicalLifeProTips
malelifestyle
malelivingspace
TheGirlSurvivalGuide
homeimprovement
homelab
homeautomation
teenagers
introvert
ADHD
totallynotrobots
polyamory
teachers
beards
vegan
swoleacceptance
tall
lgbt
gaybros
actuallesbians
gaymers
bisexual
askgaybros
parenting
daddit
babybumps
beer
drunk
homebrewing
scotch
stopdrinking
cocktails
wine
trees
marijuana
microgrowery
eldertrees
see
leaves
drugs
electronic_cigarette
stonerengineering
Nootropics
LSD
vaporents
Vaping
stopsmoking
GetMotivated
health
ZenHabits
Medicine
LucidDreaming
meditation
Psychonaut
Fitness
Fitness
xxfitness
fitmeals
paleo
nutrition
vegetarian
leangains
HealthyFood
keto
ketorecipes
ketogains
bicycling
yoga
skateboarding
climbing
backpacking
bjj
skiing
crossfit
bodybuilding
WeightRoom
powerlifting
running
c25k
loseit
bodyweightfitness
progresspics
gainit
swoleacceptance
flexibility
makeupaddiction
SkincareAddiction
beards
wicked_edge
RedditLaqueristas
AsianBeauty
FancyFollicles
malehairadvice
curlyhair
tattoos
badtattoos
tattoo
malefashionadvice
frugalmalefashion
femalefashionadvice
thriftstorehauls
fashion
streetwear
malefashion
supremeclothing
FashionReps
sneakers
repsneakers
food
FoodPorn
foodhacks
shittyfoodporn
eatsandwiches
nutrition
mealtimevideos
WeWantPlates
cooking
slowcooking
askculinary
baking
mealprepsunday
breadit
cookingforbeginners
EatCheapAndHealthy
fitmeals
budgetfood
ketorecipes
vegan
1200isplenty
Cheap_Meals
HealthyFood
veganrecipes
coffee
tea
recipes
gifrecipes
veganrecipes
pizza
grilledcheese
ramen
bbq
PersonalFinance
investing
Entrepreneur
beermoney
startups
finance
economy
financialindependence
apphookup
millionairemakers
churning
realestate
flipping
frugal
EatCheapAndHealthy
frugalmalefashion
budgetfood
cheap_meals
Frugal_Jerk
shutupandtakemymoney
BuyItForLife
crappyoffbrands
INEEEEDIT
shouldibuythisgame
Anticonsumption
Bitcoin
dogecoin
CryptoCurrency
ethereum
ethtrade
litecoin
btc
Psychonaut
Buddhism
Stoicism
occult
atheism
trueatheism
Christianity
dankchristianmemes
exmormon
philosophy
askphilosophy
relationships
socialskills
relationship_advice
socialengineering
dating_advice
weddingplanning
Parenting
childfree
raisedbynarcissists
incest
daddit
justnomil
Tinder
OKCupid
r4r
dirtyr4r
sex
seduction
nofap
theredpill
deadbedrooms
polyamory
GetMotivated
QuotesPorn
getdisciplined
happy
productivity
DecidingToBeBetter
mademesmile
selfimprovement
iwantout
humansbeingbros
happycrowds
sportsarefun
GetStudying
technology
technology
internetisbeautiful
futurology
pcmasterrace
buildapc
talesfromtechsupport
netsec
gamedev
design
engineering
jailbreak
compsci
tech
hacking
imaginarytechnology
privacy
torrents
networking
infographics
3Dprinting
piracy
EngineeringPorn
cableporn
simulated
onions
unixporn
crackwatch
php
nintendo
spacex
nasa
amd
nvidia
photoshop
Android
AndroidApps
AndroidGaming
AndroidDev
AndroidThemes
oneplus
apple
iphone
mac
ipad
applewatch
gadgets
Android
raspberry_pi
iphone
electronics
arduino
trackers
gopro
Addons4Kodi
blender
kodi
hardware
hardwareswap
google
chromecast
googlepixel
linux
linux_gaming
linux4noobs
Windows10
windows
excel
surface
dataisbeautiful
DataHoarder
Bitcoin
dogecoin
CryptoCurrency
ethereum
ethtrader
btc
litecoin
bitcoinmarkets
programming
learnprogramming
python
java
javascript
learnpython
excel
unity3d
audiophile
headphones
audioengineering
funny
humor
contagiouslaughter
standupcomedy
ProgrammerHumor
prematurecelebration
ChildrenFallingOver
dadreflexes
kenm
politicalhumor
accidentalcomedy
ComedyCemetery
funnyandsad
Jokes
dadjokes
standupshots
punny
antijokes
meanjokes
3amjokes
puns
WordAvalanches
Demotivational
lolcats
supershibe
copypasta
emojipasta
TrollXChromosomes
trollychromosome
starterpacks
AdviceAnimals
memes
trippinthroughtime
BikiniBottomTwitter
dankmemes
madlads
bidenbro
memeeconomy
rarepuppers
wholesomememes
dankchristianmemes
terriblefacebookmemes
prequelmemes
dank_meme
trebuchetmemes
deepfriedmemes
Overwatch_Memes
see
SequelMemes
surrealmemes
bonehurtingjuice
me_irl
meirl
anime_irl
2meirl4meirl
meow_irl
woof_irl
TooMeIrlForMeIrl
fffffffuuuuuuuuuuuu
iiiiiiitttttttttttt
AnimalReddits
AnimalsBeingJerks
AnimalsBeingBros
AnimalPorn
AnimalsBeingDerps
likeus
stoppedworking
hitmanimals
animaltextgifs
BeforeNAfterAdoption
sneks
TsundereSharks
whatsthisbug
HybridAnimals
zoomies
brushybrushy
birdswitharms
superbowl
birbs
babyelephantgifs
sloths
foxes
trashpandas
cats
CatSubs
startledcats
catpictures
catsstandingup
catpranks
meow_irl
holdmycatnip
catslaps
thecatdimension
babybigcatgifs
catloaf
thisismylifemeow
cattaps
dogs
corgi
dogpictures
dogtraining
woof_irl
WhatsWrongWithYourDog
dogberg
conspiracy
skeptic
karmaconspiracy
UFOs
conspiratard
empiredidnothingwrong
scp
cringepics
cringe
instant_regret
blunderyears
facepalm
fatlogic
publicfreakout
cringeanarchy
lewronggeneration
fellowkids
sadcringe
corporatefacepalm
4PanelCringe
amibeingdetained
instantbarbarians
facepalm
quityourbullshit
thathappened
delusionalartists
oopsdidntmeanto
beholdthemasterrace
ihavesex
niceguys
iamverysmart
justneckbeardthings
iamverybadass
mallninjashit
ChoosingBeggars
gatekeeping
aww
cats
animalsbeingjerks
animalsbeingbros
Awwducational
dogs
corgi
thisismylifenow
blep
eyebeach
tippytaps
WTF
DeepIntoYouTube
fifthworldproblems
awwwtf
wellthatsucks
streetfights
yesyesyesyesno
wtfstockphotos
mildlyinfuriating
crappydesign
rage
Bad_Cop_No_Donut
gifsthatendtoosoon
imgoingtohellforthis
toosoon
trashy
awfuleyebrows
nosleep
creepy
morbidreality
whatcouldgowrong
Glitch_in_the_Matrix
Paranormal
nononono
horror
shortscarystories
creepypasta
lastimages
peoplefuckingdying
serialkillers
WhyWereTheyFilming
ImaginaryMonsters
ImaginaryLeviathans
ImaginaryMindscapes
imaginarycharacters
thalassophobia
TheDepthsBelow
submechanophobia
freebies
fullmoviesonyoutube
efreebies
randomactsofgaming
freeEbooks
fullmoviesonvimeo
freegamesonsteam
googleplaydeals
megalinks
opendirectories
Random_Acts_Of_Pizza
coupons
dealsreddit
MaleFashionAdvice
everymanshouldknow
askmen
frugalmalefashion
MensRights
malelifestyle
trollychromosome
malelivingspace
malehairadvice
malefashion
TwoXChromosomes
askwomen
LadyBoners
TrollXChromosomes
femalefashionadvice
xxfitness
TheGirlSurvivalGuide
abrathatfits
badwomensanatomy
LocationReddits
MapPorn
polandball
vexillology
europe
ireland
thenetherlands
france
denmark
italy
norge
polska
de
suomi
romania
belgium
ANormalDayInRussia
youseecomrade
sweden
SWARJE
swedishproblems
intresseklubben
svenskpolitik
spop
Allsvenskan
unitedkingdom
britishproblems
london
ukpolitics
canada
toronto
vancouver
mexico
MURICA
floridaman
nyc
chicago
seattle
portland
boston
atlanta
washingtondc
denver
philadelphia
losangeles
sanfrancisco
bayarea
california
austin
houston
texas
australia
newzealand
melbourne
Philippines
india
kpop
pyongyang
singapore
japan
japanpics
brasil
argentina
OutOfTheLoop
SubredditDrama
nocontext
tldr
modnews
Enhancement
SecretSanta
MuseumOfReddit
theoryofreddit
threadkillers
evenwithcontext
beetlejuicing
announcements
blog
beta
baconreader
alienblue
baconit
redditsync
relayforreddit
Circlejerk
DiWHY
frugal_jerk
ShitRedditSays
karmaconspiracy
undelete
jesuschristreddit
karmacourt
titlegore
ShitAmericansSay
bestof
DepthHub
BestOfReports
bestoflegaladvice
subredditoftheday
wowthissubexists
newreddits
ofcoursethatsathing
findareddit
subredditsimulator
subredditsimmeta
TrueReddit
awesome
TipOfMyTongue
TipOfMyPenis
woahdude
frisson
asmr
VaporwaveAesthetics
earthporn
hardcoreaww
hitmanimals
natureisfuckinglit
heavyseas
marijuanaenthusiasts
succulents
mycology
bonsai
natureismetal
Natureisbrutal
worldnews
news
nottheonion
UpliftingNews
offbeat
gamernews
floridaman
energy
syriancivilwar
uncensorednews
Politics
Politics
worldpolitics
Libertarian
anarchism
socialism
conservative
politicalhumor
neutralpolitics
politicaldiscussion
ukpolitics
latestagecapitalism
geopolitics
the_donald
HillaryForPrison
wikileaks
MensRights
feminism
SandersForPresident
bidenbro
political_revolution
thanksobama
esist
enoughtrumpspam
MarchAgainstTrump
TinyTrumps
TrumpCriticizesTrump
trumpgret
OldSchoolCool
TheWayWeWere
nostalgia
vinyl
forwardsfromgrandma
firstworldanarchists
wheredidthesodago
unexpectedthuglife
youdontsurf
montageparodies
outside
OSHA
hailcorporate
im14andthisisdeep
bollywoodrealism
AccidentalRenaissance
maliciouscompliance
fakehistoryporn
coaxedintoasnafu
sfwpornnetwork
EarthPorn
HistoryPorn
FoodPorn
JusticePorn
AbandonedPorn
SpacePorn
RoomPorn
QuotesPorn
MapPorn
CityPorn
carporn
humanporn
penmanshipporn
militaryporn
DesignPorn
ThingsCutInHalfPorn
ArchitecturePorn
ExposurePorn
futureporn
adrenalineporn
waterporn
machineporn
animalporn
movieposterporn
illusionporn
destructionporn
adporn
artefactporn
gunporn
skyporn
powerwashingporn
ArtPorn
InfrastructurePorn
VillagePorn
shockwaveporn
shittyaskscience
shittyfoodporn
shittyreactiongifs
crappydesign
Shitty_Car_Mods
shittyadvice
shittyrobots
ShittyLifeProTips
shittykickstarters
unexpected
UnexpectedThugLife
misleadingthumbnails
unexpectedjihad
slygifs
blackmagicfuckery
AbandonedPorn
OddlySatisfying
RoomPorn
nonononoyes
minimalism
CityPorn
penmanshipporn
Cinemagraphs
ImaginaryLandscapes
eyebleach
DesignPorn
perfectloops
perfectfit
humansbeingbros
powerwashingporn
nevertellmetheodds
typography
cozyplaces
breathinginformation
desirepath
tiltshift
mostbeautiful
AmateurRoomPorn
slygifs
raining
AccidentalWesAnderson
unstirredpaint
holdmybeer
holdmyjuicebox
holdmyfries
holdmybeaker
holdmycosmo
holdmycatnip
holdmyredbull
fiftyfifty
firstworldproblems
idiotsfightingthings
whatsinthisthing
AskReddit
notinteresting
fifthworldpics
drunkorakid
pussypassdenied
UNBGBBIIVCHIDCTIICBG
Justfuckmyshitup
BestOfStreamingVideo
CatastrophicFailure
evilbuildings
justiceserved
mypeopleneedme
notmyjob
sweatypalms
therewasanattempt
bitchimabus
greendawn
thingsforants
youseeingthisshit
hmmm
hadtohurt
MandelaEffect
mildlypenis
redditdayof
idiotsincars
instantkarma
2healthbars
collapse
slavs_squatting
confusing_perspective
the_pack
reddit.com
thefappening
fatpeoplehate
thebutton
place
TheBestOfAmazon
mindcrack
twitchplayspokemon
battlefield3
punchablefaces
csgobetting
historicalwhatif
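
The list above still contains repeats (AskReddit and Fitness appear more than once) and the same name with different capitalisation (Pokemon and pokemon), presumably because the wiki page lists some subreddits under several categories. Here is a minimal sketch of tidying it up before any analysis, assuming the scraped names are in a Python list. The helper name clean_subreddit_names is hypothetical and not part of the script above, and treating names that differ only by case as the same subreddit is an assumption.

```python
def clean_subreddit_names(names):
    """Strip any leading "/r/" and drop repeats (ignoring case), keeping first-seen order."""
    seen = set()       # Lower-cased names already kept
    cleaned = []
    for name in names:
        name = name.strip()
        if name.startswith("/r/"):
            name = name[len("/r/"):]
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            cleaned.append(name)
    return cleaned


# A few names taken from the output above, with the "/r/" prefix the scraper collects
example = ["/r/AskReddit", "/r/Pokemon", "/r/pokemon", "/r/Fitness", "/r/Fitness", "/r/AskReddit"]
print(clean_subreddit_names(example))   # ['AskReddit', 'Pokemon', 'Fitness']
```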

Conclusion

As intended, the output from this piece of code is a list of all the subreddits, scraped from raw html. I’ve never done anything like this before but it’s very rewarding, if a bit messy.

With webscraping, the possibilities are endless – just remember to be polite and not spam the websites too much. That’s why I saved the list to a file – so I only had to run this script once, and can just pull the data from a file for the rest of the reddit analysis, without disturbing the reddit servers too much.
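
The file-saving step isn’t shown in the code above, so here is a rough sketch of how the “scrape once, then read from a file” idea could look. The function names (save_subreddits, load_subreddits, get_subreddits), the file name and the stand-in fake_scrape function are all assumptions for illustration. In the real script, the scraping code from this post would take the place of fake_scrape.

```python
import os
import tempfile


def save_subreddits(names, path):
    # Write one subreddit name per line
    with open(path, "w", encoding="utf-8") as f:
        for name in names:
            f.write(name + "\n")


def load_subreddits(path):
    # Read the names back, skipping any blank lines
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def get_subreddits(path, scrape):
    # Use the saved file if it exists; otherwise scrape once and save the result
    if os.path.exists(path):
        return load_subreddits(path)
    names = scrape()
    save_subreddits(names, path)
    return names


calls = []   # Records how many times the (pretend) website was hit


def fake_scrape():
    # Stand-in for the real scraping code, so this example makes no web requests
    calls.append(1)
    return ["/r/gifs", "/r/pics"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subreddits.txt")
    first = get_subreddits(path, fake_scrape)    # Scrapes and writes the file
    second = get_subreddits(path, fake_scrape)   # Reads the file, no scraping

print(first == second, len(calls))   # True 1
```

The second call never touches the scraper, which is the whole point: the website only gets hit the first time the script runs.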

I hope you found this post helpful and a friendly introduction to the world of webscraping. I’d be delighted to hear your thoughts on this in the comments – let me know what you think!

Happy scraping,

Robin
