In this tutorial, you will learn how to build a web scraper using Python. You will scrape Stack Overflow to get questions along with their stats.
Python is a high-level programming language designed to be easy to read and simple to implement. It is open source, which means it is free to use, even for commercial applications.
Web scraping is a technique used to extract data from websites. Most websites display their data only for viewing in a web browser and offer no way to save a copy for personal use. The only option then is to manually copy and paste the data, a tedious job that can take many hours or even days to complete. Web scraping automates this process, so that instead of copying the data from websites by hand, a scraper performs the same task in a fraction of the time.
NB: Before you scrape a site, check its terms and conditions to be sure scraping it isn't prohibited. An example of what can go wrong is the case in which eBay sued Bidder's Edge for scraping, described here.
Python in this piece refers to Python 3.x versions.
You will use two important libraries for web scraping: `requests` and `beautifulsoup`. The `requests` library will make a GET request to a web server, which will download the HTML contents of a web page for us. The `beautifulsoup` library will parse the HTML and extract information from it.
To install these libraries, run:
```
pip install requests bs4
```
There are basically three steps to web scraping: fetching the site's content, extracting the information you need from it, and storing that information somewhere.
Fetching a site's content is straightforward using Python. It is as easy as performing a GET request. For example, look at the code below:
```python
import requests

site = requests.get('https://stackoverflow.com/')
```
In the code above, you imported the `requests` library and used its `get` function to fetch the site to be scraped. The variable called `site` now contains a `Response` object.
To check whether the GET request was successful before performing any actions, you can check the status code:
```python
# Compare with ==, not 'is': identity checks on ints are unreliable
if site.status_code == 200:
    print(site.content)
```
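Alternatively, if you would rather fail loudly than silently skip parsing, `requests` provides `raise_for_status()`, which raises an exception for 4xx and 5xx responses. A minimal sketch:

```python
import requests

site = requests.get('https://stackoverflow.com/')
# raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
# so a failed request stops the script instead of being parsed anyway.
site.raise_for_status()
print(site.content)
```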
The `Response` object exposes many properties, such as `status_code`, `content`, and `headers`, so you can always use the status code as a condition to decide whether or not to parse the response.
To find out more about the various properties exposed by the `Response` object, you can check the official docs here.
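For instance, here is a quick sketch of a few of those properties; the exact values printed will of course depend on what the server returns:

```python
print(site.status_code)                    # e.g. 200
print(site.headers.get('Content-Type'))    # e.g. 'text/html; charset=utf-8'
print(site.encoding)                       # the encoding requests guessed for site.text
print(len(site.content))                   # size of the raw response body in bytes
```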
Now that you have the `requests` library working as it should, it's time to parse the content of the response and extract the information you need from the site.
There are two ways to extract data from the response object: CSS selectors, and the `find` and `find_all` functions.

CSS Selectors
`BeautifulSoup` objects support searching a page via CSS selectors using the `select` method. You can use CSS selectors to find all the questions on the Stack Overflow home page like this:
```python
from bs4 import BeautifulSoup

content = BeautifulSoup(site.content, 'html.parser')
questions = content.select('.question-summary')
```
If you look at the code block above, you will notice that you imported the `BeautifulSoup` library and used it to parse the site's content with the `html.parser`. While there are third-party parsers that can be installed and configured, I will stick to the default HTML parser for this piece.
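For example, if you had the third-party `lxml` parser installed (`pip install lxml`), swapping it in would be a one-argument change; this is just a sketch of the alternative, not something the rest of this piece relies on:

```python
from bs4 import BeautifulSoup

# Same call as before, but with the faster third-party lxml parser
# (requires 'pip install lxml') instead of the built-in html.parser.
content = BeautifulSoup(site.content, 'lxml')
```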
One other thing you must ask now is: where did the class `question-summary` come from?
To use CSS selectors, or even the `find` and `find_all` methods of BeautifulSoup, you have to know the structure of the HTML that holds the element you want to draw information from. A good way to do this is to inspect the element you want in your browser's developer tools and read its class from there. In our case, every question is wrapped in an element with the class `question-summary`.
Next, you will get the topic, URL, views, answers, and votes for each question. Look at the code below:
```python
for question in questions:
    topic = question.select('.question-hyperlink')[0].get_text()
    url = question.select('.question-hyperlink')[0].get('href')
    views = question.select('.views .mini-counts span')[0].get_text()
    answers = question.select('.status .mini-counts span')[0].get_text()
    votes = question.select('.votes .mini-counts span')[0].get_text()
```
If you take a look at the code above, you should notice three main things:

- The `[0]` index applied to the result of the `select` method: this is needed because `select` always returns a list, even when it matches just one element.
- The `get_text()` method: this method returns the text / innerHTML of a single element.
- The `get('href')` method: the `get` method can read any attribute from an HTML element; here, I wanted the `href` attribute.

If you print each topic, URL, views, answers, and votes to the terminal, you will notice that the information printed tallies with the information on the website.
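As a quick check, you could add a line like the following at the end of the loop body above; the layout of the printed line is just an illustrative choice:

```python
# Inside the for-loop above, after votes has been extracted:
print(f'{topic} | votes: {votes} | answers: {answers} | views: {views} | {url}')
```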
Using the find and find_all functions
Another way to easily parse an HTML page is to use the `find` and `find_all` methods. These two methods can also get you any information you want from a webpage, as they allow you to find elements by tag, id, or even class name. Interesting? First, you will need to import the `BeautifulSoup` library and initialize it with the HTML parser:
```python
from bs4 import BeautifulSoup

content = BeautifulSoup(site.content, 'html.parser')
questions = content.find_all(class_='question-summary')
```
If you look at the code block above, you will notice that only the third line differs from the snippet in the CSS selectors section.
In the line where you defined the variable called `questions`, you will notice it is similar to what you did with CSS selectors, except that you called the `find_all` method and passed it the `class_` argument. This argument tells `find_all` to match every element that has the given class; the trailing underscore is there only because `class` is a reserved word in Python. Alternatively, if you want to find all elements with a particular ID, you can pass the `id` argument (no underscore needed) instead.
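To illustrate the difference, here is a short sketch of the three kinds of filters; note that the id `mainbar` is a hypothetical example and not guaranteed to exist in Stack Overflow's markup:

```python
by_class = content.find_all(class_='question-summary')  # by class (underscore needed)
by_id = content.find(id='mainbar')                      # by id ('mainbar' is hypothetical)
by_tag = content.find_all('a')                          # by tag name
```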
Next, you will get the topic, URL, views, answers, and votes for each question. Look at the code below:
```python
for question in questions:
    topic = question.find(class_='question-hyperlink').get_text()
    url = question.find(class_='question-hyperlink').get('href')
    views = question.find(class_='views').find(class_='mini-counts').find('span').get_text()
    answers = question.find(class_='status').find(class_='mini-counts').find('span').get_text()
    votes = question.find(class_='votes').find(class_='mini-counts').find('span').get_text()
```
Looking at the code block above, you will notice that the `get` and `get_text` methods were also used here; they are not peculiar to CSS selectors.

NOTE: While each of the two examples above achieved its aim using a single approach (CSS selectors on the one hand, the `find` and `find_all` methods on the other), you can combine the two, as seen below:
```python
questions = content.select('.question-summary')
for question in questions:
    topic = question.find(class_='question-hyperlink').get_text()
```
After all is said and done, it would be nice to see how each method looks at a full glance. Here is what the CSS selector method looks like:
```python
import requests
from bs4 import BeautifulSoup

site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.select('.question-summary')
    for question in questions:
        topic = question.select('.question-hyperlink')[0].get_text()
        url = question.select('.question-hyperlink')[0].get('href')
        views = question.select('.views .mini-counts span')[0].get_text()
        answers = question.select('.status .mini-counts span')[0].get_text()
        votes = question.select('.votes .mini-counts span')[0].get_text()
```
Here is what the `find` and `find_all` method looks like:
```python
import requests
from bs4 import BeautifulSoup

site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.find_all(class_='question-summary')
    for question in questions:
        topic = question.find(class_='question-hyperlink').get_text()
        url = question.find(class_='question-hyperlink').get('href')
        views = question.find(class_='views').find(class_='mini-counts').find('span').get_text()
        answers = question.find(class_='status').find(class_='mini-counts').find('span').get_text()
        votes = question.find(class_='votes').find(class_='mini-counts').find('span').get_text()
```
The real motive behind scraping any site is to save the information somewhere. It might be a local database such as MySQL, a JSON file, or even a CSV document. Here, you will save the information into a CSV file.
The easiest way to save the parsed data into a CSV file is to create an empty list, append to it as you scrape, and then write the list of data into the CSV file at the end. Take a look at this:
```python
import csv
import requests
from bs4 import BeautifulSoup

data_list = []
site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.select('.question-summary')
    for question in questions:
        topic = question.select('.question-hyperlink')[0].get_text()
        url = question.select('.question-hyperlink')[0].get('href')
        views = question.select('.views .mini-counts span')[0].get_text()
        answers = question.select('.status .mini-counts span')[0].get_text()
        votes = question.select('.votes .mini-counts span')[0].get_text()
        new_data = {"topic": topic, "url": url, "views": views,
                    "answers": answers, "votes": votes}
        data_list.append(new_data)

# newline='' prevents blank lines between rows on Windows
with open('selector.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["topic", "url", "views", "answers", "votes"], delimiter=';')
    writer.writeheader()
    for row in data_list:
        writer.writerow(row)
```
If you look at the code block above, you will notice it is similar to the CSS selector version, except that:

- An empty list called `data_list` was declared at the beginning of our code.
- Each scraped record was appended to `data_list`.
- We used the `DictWriter` function of the `csv` library to create headers for our data.
- Finally, we wrote `data_list` into the CSV file.

Alternatively, here is the end product for the one with the `find` and `find_all` methods:
```python
import csv
import requests
from bs4 import BeautifulSoup

data_list = []
site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.find_all(class_='question-summary')
    for question in questions:
        topic = question.find(class_='question-hyperlink').get_text()
        url = question.find(class_='question-hyperlink').get('href')
        views = question.find(class_='views').find(class_='mini-counts').find('span').get_text()
        answers = question.find(class_='status').find(class_='mini-counts').find('span').get_text()
        votes = question.find(class_='votes').find(class_='mini-counts').find('span').get_text()
        new_data = {"topic": topic, "url": url, "views": views,
                    "answers": answers, "votes": votes}
        data_list.append(new_data)

# newline='' prevents blank lines between rows on Windows
with open('find.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["topic", "url", "views", "answers", "votes"], delimiter=';')
    writer.writeheader()
    for row in data_list:
        writer.writerow(row)
```
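If you preferred a JSON file to a CSV document, the same `data_list` could be written out with the standard library instead; here is a minimal sketch, assuming the scraping loop above has already filled `data_list`:

```python
import json

# Dump the same records to a JSON file instead of a CSV one.
with open('questions.json', 'w') as file:
    json.dump(data_list, file, indent=2)
```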
In this little piece, you found out how to scrape data from a website. You also learned that scraping some sites may be prohibited, and that you should check a site's terms and conditions before scraping it. You learned about CSS selectors as well as the `find` and `find_all` methods of the `BeautifulSoup` library. Finally, you discovered how to save the scraped data into a CSV file.
The codebase for this tutorial is available here.