I was trying to download videos from Reddit and found that the available browser extensions were either paid or broken. Thankfully, a few websites did the job just fine. I used those for a while, until opening a website every time I wanted to download something became tedious. So I thought: why not write a Python script that takes the link and downloads the file for me?
Reddit stores its videos in a way that makes them harder to download (but we will anyway). The video stream without the audio lives at one URL and the audio stream at another, and the Reddit player loads and plays both simultaneously. So we will download both and stitch them together with ffmpeg.
# imports
import subprocess
import json
import sys

import requests
from bs4 import BeautifulSoup

# getting a response using the URL
url = sys.argv[1]  # the Reddit post URL passed on the command line
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)

# finding the post id for the Reddit post
post_id = url[url.find('comments/') + 9:]
post_id = f"t3_{post_id[:post_id.find('/')]}"
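To see what the slicing above produces, here is a quick check with a sample post URL (the URL and id are made up for illustration):

```python
url = 'https://www.reddit.com/r/videos/comments/abc123/some_title/'

# grab everything after 'comments/' and cut at the next slash
post_id = url[url.find('comments/') + 9:]
post_id = f"t3_{post_id[:post_id.find('/')]}"

print(post_id)  # t3_abc123 -- the 't3_' prefix marks a Reddit post ("link") object
```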
I googled it and found that a JSON document can be obtained for any Reddit post simply by appending .json to its link, and the video URLs could easily be grabbed from there. But I decided to dig into the original HTML instead and look for a script tag holding the data. And I found it: a script tag with its id attribute set to 'data'. Let's extract that using BeautifulSoup.
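For completeness, the .json route mentioned above is just a matter of rewriting the URL; a minimal sketch (the post URL is made up for illustration):

```python
url = 'https://www.reddit.com/r/videos/comments/abc123/some_title/'

# appending .json to a post URL yields the post data as JSON;
# strip a trailing slash first for a tidy URL
json_url = url.rstrip('/') + '.json'

print(json_url)
# https://www.reddit.com/r/videos/comments/abc123/some_title.json
```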
# processing the response to find the data
if response.status_code != 200:  # stop here if the server didn't respond with OK
    sys.exit('Could not fetch the Reddit post')

soup = BeautifulSoup(response.text, 'lxml')
# I looked up the original code of the Reddit page
# to find where all the data was and it was in a script tag
# with the id set to 'data'
required_js = soup.find('script', id='data')
json_data = json.loads(required_js.text.replace('window.___r = ', '')[:-1])
# 'window.___r = ' and the trailing semicolon were removed
# to get the data as JSON
title = json_data['posts']['models'][post_id]['title']
title = title.replace(' ', '_')
dash_url = json_data['posts']['models'][post_id]['media']['dashUrl']
height = json_data['posts']['models'][post_id]['media']['height']
dash_url = dash_url[:dash_url.find('DASH') + 4]
# the dash URL is the base URL shared by both streams
# height is used to find the best quality of video available
video_url = f'{dash_url}_{height}.mp4'  # this URL will be used to download the video
audio_url = f'{dash_url}_audio.mp4'  # this URL will be used to download the audio part
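To make the URL surgery above concrete, here is what it produces for a made-up dashUrl and height (the values are illustrative, not a real Reddit CDN link):

```python
dash_url = 'https://v.redd.it/xyz789/DASHPlaylist.mpd'
height = 720

# keep everything up to and including 'DASH'
dash_url = dash_url[:dash_url.find('DASH') + 4]

video_url = f'{dash_url}_{height}.mp4'
audio_url = f'{dash_url}_audio.mp4'

print(video_url)  # https://v.redd.it/xyz789/DASH_720.mp4
print(audio_url)  # https://v.redd.it/xyz789/DASH_audio.mp4
```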
# downloading the video and audio files
with open(f'{title}_video.mp4', 'wb') as file:
    print('Downloading Video...', end='', flush=True)
    response = requests.get(video_url, headers=headers)
    if response.status_code == 200:
        file.write(response.content)
        print('\rVideo Downloaded...!')
    else:
        print('\rVideo Download Failed..!')

with open(f'{title}_audio.mp3', 'wb') as file:
    print('Downloading Audio...', end='', flush=True)
    response = requests.get(audio_url, headers=headers)
    if response.status_code == 200:
        file.write(response.content)
        print('\rAudio Downloaded...!')
    else:
        print('\rAudio Download Failed..!')
# using ffmpeg to stitch the video and audio into one
subprocess.call([
    'ffmpeg', '-i', f'{title}_video.mp4', '-i', f'{title}_audio.mp3',
    '-map', '0:v', '-map', '1:a', '-c:v', 'copy', f'{title}.mp4',
])
subprocess.call(['rm', f'{title}_video.mp4', f'{title}_audio.mp3'])
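One small portability note: the `rm` call above only works on Unix-like systems. A sketch of a cross-platform cleanup using the standard library instead (the `title = 'demo'` line and the dummy files just stand in for the real script's state so the example runs on its own):

```python
import os

title = 'demo'  # stands in for the title derived earlier in the script

# create dummy intermediate files so this example is runnable on its own
for name in (f'{title}_video.mp4', f'{title}_audio.mp3'):
    open(name, 'wb').close()

# os.remove works on Windows as well as Unix, unlike shelling out to rm
for name in (f'{title}_video.mp4', f'{title}_audio.mp3'):
    os.remove(name)

print(os.path.exists(f'{title}_video.mp4'))  # False
```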
With that, our video is downloaded and stitched, and the intermediate files are cleaned up. I didn't explain everything in detail, but I hope this article gets you interested in learning web scraping. I'd also recommend learning the ffmpeg tool. Wish you a happy coding journey! 🙂🙂🙂