Semaphore - ParaZz Blog

Web Scraping-Part 2

25 May 2017 • Paras Sharma

In the last post I gave an introduction to web scraping (WebScraping: Part 1) and showed how you can get started with scraping in Python. In this post I will show how to scrape Twitter data (without the API) and extract the following information about a user:

  • Followers Count
  • Following Count
  • Tweets Count

Let's get started.

import requests
from bs4 import BeautifulSoup
username = "ParaaZz"
base_url = "https://twitter.com/"
data = requests.get(base_url+username)
soup = BeautifulSoup(data.text, "html.parser")

To extract the required info from the page source, we first need to find the tag that holds it, using the dev tools in the browser.

<span class="ProfileNav-value" data-count="209" data-is-compact="false">209</span>

This is a sample of the HTML that we need to extract.

We can identify it by its class, ProfileNav-value.

info_list = soup.find_all('span', {"class":"ProfileNav-value"})

#=> Value of info_list is:
#=> [<span class="ProfileNav-value" data-count="231" data-is-compact="false">231</span>,
#     <span class="ProfileNav-value" data-count="209" data-is-compact="false">209</span>,
#     <span class="ProfileNav-value" data-count="116" data-is-compact="false">116</span>,
#     <span class="ProfileNav-value" data-count="972" data-is-compact="false">972</span>,
#     <span class="ProfileNav-value">More <span class="ProfileNav-dropdownCaret Icon Icon--caretDown"></span></span>]
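To make it clearer what that lookup is matching on, here is a rough standard-library sketch that does a similar job on a small inline HTML string, so it can be tried without a network request. It uses html.parser.HTMLParser instead of BeautifulSoup, and the class name DataCountCollector is made up for this illustration. Unlike find_all, it keeps only the data-count values and skips spans that have no such attribute (such as the "More" entry).

```python
from html.parser import HTMLParser


class DataCountCollector(HTMLParser):
    """Collects data-count from every <span> whose class list has ProfileNav-value."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ProfileNav-value" in classes and "data-count" in attrs:
            self.counts.append(attrs["data-count"])


# A shortened copy of the markup shown above
sample_html = (
    '<span class="ProfileNav-value" data-count="231" data-is-compact="false">231</span>'
    '<span class="ProfileNav-value" data-count="209" data-is-compact="false">209</span>'
    '<span class="ProfileNav-value">More '
    '<span class="ProfileNav-dropdownCaret Icon Icon--caretDown"></span></span>'
)

collector = DataCountCollector()
collector.feed(sample_html)
print(collector.counts)  # ['231', '209']
```

In practice the BeautifulSoup call is shorter and gives back full elements, so this is only meant to show the matching rule: tag name, class, then attribute.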

Now we have a list of elements that contain the data, but we only need the first 3 items of the list (Tweets, Following, Followers).

info_list = info_list[:3]

final_data = {}

final_data['tweets'] = info_list[0]['data-count']
final_data['following'] = info_list[1]['data-count']
final_data['followers'] = info_list[2]['data-count']

In the above code, we read the “data-count” attribute value from each of the three remaining elements. In the end we are left with a dictionary that contains the required data.
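The same mapping can also be written by pairing each label with its element, which avoids typing the indices by hand. This is only a sketch: the helper name build_counts is hypothetical, and plain dicts stand in for the BeautifulSoup elements here (both support the element['data-count'] lookup), so it runs without fetching a page. It assumes the page lists the values in the order Tweets, Following, Followers, as in the output shown earlier.

```python
def build_counts(elements, labels=("tweets", "following", "followers")):
    """Map each label to the data-count of the element at the same position."""
    if len(elements) < len(labels):
        # Fewer matches than expected usually means the page layout differs
        raise ValueError("expected at least %d elements, got %d"
                         % (len(labels), len(elements)))
    # zip stops after the last label, so extra elements are ignored
    return {label: element["data-count"]
            for label, element in zip(labels, elements)}


# Dicts standing in for the <span> elements found on the page
stand_ins = [
    {"data-count": "231"},
    {"data-count": "209"},
    {"data-count": "116"},
    {"data-count": "972"},
]

print(build_counts(stand_ins))
# {'tweets': '231', 'following': '209', 'followers': '116'}
```

The length check means a page with missing values fails with a clear message instead of an IndexError.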

We can make the code reusable by wrapping it in a function.

def twitter_fetch(username):
	base_url = "https://twitter.com/"
	data = requests.get(base_url+username)
	soup = BeautifulSoup(data.text, "html.parser")
	info_list = soup.find_all('span', {"class":"ProfileNav-value"})
	info_list = info_list[:3]

	final_data = {}

	final_data['tweets'] = info_list[0]['data-count']
	final_data['following'] = info_list[1]['data-count']
	final_data['followers'] = info_list[2]['data-count']

	return final_data


twitter_fetch("paraazz")

Final Code:

import requests
from bs4 import BeautifulSoup

def twitter_fetch(username):
	base_url = "https://twitter.com/"
	data = requests.get(base_url+username)
	soup = BeautifulSoup(data.text, "html.parser")
	info_list = soup.find_all('span', {"class":"ProfileNav-value"})
	info_list = info_list[:3]

	final_data = {}

	final_data['tweets'] = info_list[0]['data-count']
	final_data['following'] = info_list[1]['data-count']
	final_data['followers'] = info_list[2]['data-count']

	return final_data


twitter_fetch("paraazz")

#==> Output: {'followers': '116', 'following': '209', 'tweets': '231'}
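
Note that the values in the output are strings, because HTML attribute values are always read as text. If you want to do arithmetic or comparisons on them, a small conversion step helps. This is a minimal sketch; the helper name to_int_counts is hypothetical, and it assumes every value consists only of digits, as in the output above.

```python
def to_int_counts(raw_counts):
    """Return a copy of the dictionary with every value converted to int."""
    return {name: int(value) for name, value in raw_counts.items()}


# The dictionary shown in the output above
raw = {'followers': '116', 'following': '209', 'tweets': '231'}

counts = to_int_counts(raw)
print(counts)
# {'followers': 116, 'following': 209, 'tweets': 231}
print(counts['following'] > counts['followers'])  # True
```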