Paras Sharma - Blog

Web Scraping - Part 3

27 May 2017 • Paras Sharma

In Web Scraping: Part 2 I showed you how to extract a Twitter user's followers data. To do that we used requests and BeautifulSoup. So in this post, I will show you how to extract data using Python's default libraries instead of requests and BeautifulSoup.

To do this I will be using the urllib and re modules, and I will be extracting images from the famous webcomic xkcd.

First, import the modules.

import urllib.request
import urllib.parse
import re
import os

I will be downloading a number of images from the website. To do that I will be using the URL https://c.xkcd.com/random/comic/, which redirects to a random comic every time it is opened.

url = "https://c.xkcd.com/random/comic/"

dir  = "images"

headers = {}

headers['User-Agent']='Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'

Above we declared our variables. The headers dictionary will be used to make the request look like it is coming from a browser.

Now we will send the request using urllib.

req = urllib.request.Request(url,headers=headers)
resp = urllib.request.urlopen(req)
respData = str(resp.read())

Now we have the source of the page. Because str() was called on the raw bytes, escape sequences like \n appear literally in the string, so we clean them up and then extract the image URL with a regular expression.

respData = respData.replace("\\n","\n")
respData = respData.replace("\\t","\t")
respData = respData.replace("</html>\n\n'","</html>")
respData = respData.replace("b'<","<")
image = re.findall(r'<div id="comic">\n<img src="(.*?)"',respData)[0]
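As a side note, those string replacements are only needed because calling str() on raw bytes produces the literal b'…' representation with escaped newlines. Decoding the bytes instead yields a normal string directly. A small sketch, using an illustrative sample of the response bytes (not a real xkcd page):

```python
import re

# raw bytes, shaped like what resp.read() returns (illustrative sample)
raw = b'<div id="comic">\n<img src="//imgs.xkcd.com/comics/example.png"'

# decode() gives a normal str with real newlines, no b'...' wrapper
html = raw.decode('utf-8')

image = re.findall(r'<div id="comic">\n<img src="(.*?)"', html)[0]
print(image)  # //imgs.xkcd.com/comics/example.png
```

With resp.read().decode('utf-8') in place of str(resp.read()), all four replace() calls can be dropped.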

At this point, we have the image URL. To fetch the image we will do the following.

os.makedirs(dir) # create a directory to save our image

imgname = image.split('/')[-1] # get the image file name from the URL
image_data = urllib.request.urlretrieve("https:"+image, dir+"/"+imgname)

This will save the image in the “images” directory.
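Parsing HTML with a regular expression is fragile: it breaks if the page's whitespace or attribute order changes. The standard library also ships html.parser, which can do the same extraction more robustly. A minimal sketch, fed an illustrative HTML snippet rather than a live page:

```python
from html.parser import HTMLParser

class ComicImageParser(HTMLParser):
    """Collect the src of the first <img> inside <div id="comic">."""
    def __init__(self):
        super().__init__()
        self.in_comic = False
        self.image = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "comic":
            self.in_comic = True
        elif tag == "img" and self.in_comic and self.image is None:
            self.image = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_comic = False

parser = ComicImageParser()
parser.feed('<div id="comic">\n<img src="//imgs.xkcd.com/comics/example.png" title="...">\n</div>')
print(parser.image)  # //imgs.xkcd.com/comics/example.png
```

In the scraper, parser.feed(respData) would replace the re.findall() call, and no escape-sequence cleanup would be needed.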

To make this code modular we can wrap it in a function, and to get multiple images we can use a loop.

import urllib.request
import urllib.parse
import re
import os


def get_images(n, dir): #n is number of images, dir is directory name
    url = "https://c.xkcd.com/random/comic/"
    headers = {}
    headers['User-Agent']='Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'

    for _ in range(n):
        req = urllib.request.Request(url,headers=headers)
        resp = urllib.request.urlopen(req)
        respData = str(resp.read())
        respData = respData.replace("\\n","\n")
        respData = respData.replace("\\t","\t")
        respData = respData.replace("</html>\n\n'","</html>")
        respData = respData.replace("b'<","<")
        image = re.findall(r'<div id="comic">\n<img src="(.*?)"',respData)[0]
        if not os.path.exists(dir): # check for directory presence
            os.makedirs(dir)
        imgname = image.split('/')[-1]
        idata = urllib.request.urlretrieve("https:"+image,dir+"/"+imgname)

get_images(10, "comic_images")
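One more simplification worth knowing: os.makedirs() accepts an exist_ok flag, which makes the separate os.path.exists() check unnecessary. A small sketch (the temporary directory here is just for illustration):

```python
import os
import tempfile

# exist_ok=True makes repeated calls safe, so no need to check
# os.path.exists() before creating the directory
target = os.path.join(tempfile.mkdtemp(), "images")
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # second call does not raise
print(os.path.isdir(target))  # True
```

Inside the loop, the two-line check in get_images() could then become a single os.makedirs(dir, exist_ok=True).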

That’s it. We can now download any number of images from xkcd with this Python code.

I wrote this code when I was just learning Python. You can find it in this repository: https://github.com/Parassharmaa/i-scrap. The code is not very modular; if you are interested in improving it, you can contribute to the repo.