Web Scraping With Python

Mohammed Ouahman
6 min read · Nov 19, 2021

This is the best place to learn web scraping from 0 => 🦸. (Amazon Case-Study)

If you are a beginner to Python, don’t worry: I will present the concepts in an easy and fun way, and I encourage you to play with the code to get a better grasp of them.

Learn what web scraping is and how it can be achieved with the help of Python’s Beautiful Soup library, using data from the Amazon website.

At a time when the internet is rich with so much data and, apparently, data has become the new oil, web scraping has become even more important and practical to use in various applications. Web scraping deals with extracting or scraping information from a website. It is also sometimes referred to as web harvesting or web data extraction. Copying text from a website and pasting it to your local system is also web scraping; however, it is a manual task. Generally, web scraping deals with extracting data automatically with the help of web crawlers: scripts that connect to the world wide web using the HTTP protocol and allow you to fetch data in an automated manner.

Whether you are a data scientist, an engineer, or anybody who analyzes vast amounts of data, the ability to scrape data from the web is a useful skill to have. If you find data on the web with no direct way to download it, web scraping with Python is a skill you can use to extract that data into a useful form that can then be imported and analyzed in various ways.

Some of the practical applications of web scraping could be:

  • Gathering resumes of candidates with a specific skill,
  • Extracting tweets from Twitter with specific hashtags,
  • Lead generation in marketing,
  • Scraping product details and reviews from e-commerce websites.

Apart from the above use-cases, web scraping is widely used in natural language processing for extracting text from websites to train deep learning models.

Potential Challenges of Web Scraping

  • One of the challenges you will come across while scraping information from websites is the varying structure of websites: each site’s templates are unique, so generalizing one scraper across websites can be difficult.
  • Another challenge is longevity. Since web developers keep updating their websites, you cannot rely on one scraper for too long. Even though the modifications might be minor, they can still hinder you while fetching the data.

Hence, to address the above challenges, there are various possible solutions. One is to follow continuous integration and continuous delivery (CI/CD) with constant maintenance, since website modifications are dynamic.

Another, more realistic approach is to use the Application Programming Interfaces (APIs) offered by various websites and platforms. For example, Facebook and Twitter provide APIs specially designed for developers who want to experiment with their data or would like to extract information about, say, all friends and mutual friends and draw a connection graph from it. The format of the data when using APIs is also different from usual web scraping, i.e., JSON or XML, while in standard web scraping you mainly deal with data in HTML format.

What is Beautiful Soup?

Beautiful Soup is a pure Python library for extracting structured data from a website. It allows you to parse data from HTML and XML files, acting as a helper module that lets you interact with HTML much as you would with a web page through the browser’s developer tools.

  • It usually saves programmers hours or days of work, since it works with your favorite parsers like lxml and html5lib to provide idiomatic Python ways of navigating, searching, and modifying the parse tree.
  • Another powerful and useful feature of Beautiful Soup is its intelligence in converting incoming documents to Unicode and outgoing documents to UTF-8. As a developer, you do not have to take care of that unless the document doesn’t specify an encoding or Beautiful Soup is unable to detect one.
  • It is also considered faster than many other general parsing or scraping techniques, particularly when paired with a fast parser such as lxml.
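
As a quick, self-contained taste of that navigation and searching (the tiny HTML snippet below is made up for illustration):

from bs4 import BeautifulSoup

# A made-up HTML fragment, just to demonstrate navigating and searching the tree.
html = "<html><head><title>Books</title></head><body><p class='title'>Best Sellers</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                           # <title>Books</title>
print(soup.title.string)                    # Books
print(soup.find("p", class_="title").text)  # Best Sellers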

Types of Parsers
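
Beautiful Soup sits on top of a parser of your choice: Python’s built-in html.parser needs no extra dependency, lxml is very fast but must be installed separately, and html5lib is the most lenient, parsing pages the same way a browser does, at the cost of speed. You pick one by name when constructing the soup:

from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b></p>"

soup = BeautifulSoup(html, "html.parser")  # built-in, no extra install
soup = BeautifulSoup(html, "lxml")         # fastest; requires pip install lxml
soup = BeautifulSoup(html, "html5lib")     # browser-like, most lenient; slowest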

Enough of theory, right? So, let’s install Beautiful Soup and start learning about its features and capabilities using Python.

As a first step, you need to install the Beautiful Soup library from your terminal or Jupyter Lab. The best way to install Beautiful Soup is via pip, so make sure you have the pip module already installed.

!pip3 install beautifulsoup4

Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/site-packages (4.7.1)
Requirement already satisfied: soupsieve>=1.2 in /usr/local/lib/python3.7/site-packages (from beautifulsoup4) (1.9.5)

Importing necessary libraries

Let’s import the required packages which you will use to scrape the data from the website and visualize it with the help of seaborn, matplotlib, and bokeh.
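
A plausible import cell for this (the exact selection is an assumption; adapt it to your needs):

import requests                  # fetch web pages over HTTP
from bs4 import BeautifulSoup    # parse the fetched HTML

import numpy as np               # numerical helpers
import pandas as pd              # tabular data handling

import matplotlib.pyplot as plt  # visualization
import seaborn as sns
from bokeh.plotting import figure, show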

Scraping the Amazon Best Selling Books

The URL that you are going to scrape is the following: https://www.amazon.in/gp/bestsellers/books/. The page argument can be modified to access the data for each page. Hence, to access all the pages you will need to loop through them to get the necessary dataset; but first, you need to find out the number of pages from the website.

To connect to the URL and fetch the HTML content, the following things are required (a sketch follows the list):

  • Define a get_data function which takes the page number as an argument,
  • Define a user-agent, which will help in bypassing detection as a scraper,
  • Pass the URL to requests.get along with the user-agent header,
  • Extract the content from the response,
  • Parse the specified page and assign it to a soup variable.
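
A minimal sketch of those steps; the user-agent string is just an example, and the pg query parameter for paging is an assumption you should verify against the live site:

import requests
from bs4 import BeautifulSoup

def get_data(pageNo):
    # An example browser-like user-agent; it helps avoid being flagged as a bot.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/96.0 Safari/537.36"
    }
    # The pg query parameter selects which page of the best-sellers list to fetch.
    url = f"https://www.amazon.in/gp/bestsellers/books/?ie=UTF8&pg={pageNo}"
    response = requests.get(url, headers=headers)
    # Parse the raw HTML so it can be searched with Beautiful Soup.
    soup = BeautifulSoup(response.content, "html.parser")
    # The field extraction shown later in this section completes this function.
    return soup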

Next, the important step is to identify the parent tag under which all the data you need resides. The data that you are going to extract is:

  • Book Name
  • Author
  • Rating
  • Customers Rated
  • Price

In the browser’s inspector you can see where the parent tag is located; when you hover over it, all the required elements are highlighted.

Similar to the parent tag, you need to find the attributes for the book name, author, rating, customers rated, and price. Go to the webpage you would like to scrape, right-click the element of interest, and select Inspect. This will help you find the specific information fields you need to extract from the raw HTML of the web page.

Note that some author names are not registered with Amazon, so you need to apply an extra find for those authors. In the code below, you will find nested if-else conditions for extracting the author/publication names.
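
A sketch of that extraction, meant to sit inside get_data before its return. Every class name here (zg-item-immersion, p13n-sc-truncate, and so on) is an illustrative assumption; Amazon's markup changes often, so copy the real ones from the inspector:

books = []
for d in soup.find_all("div", attrs={"class": "zg-item-immersion"}):  # assumed parent tag
    name = d.find("div", attrs={"class": "p13n-sc-truncate"})
    rating = d.find("span", attrs={"class": "a-icon-alt"})
    users_rated = d.find("a", attrs={"class": "a-size-small a-link-normal"})
    price = d.find("span", attrs={"class": "p13n-sc-price"})

    # Nested if-else for the author: registered authors appear as a link,
    # unregistered ones as a plain span; fall back to "unknown" otherwise.
    author = d.find("a", attrs={"class": "a-size-small a-link-child"})
    if author is not None:
        author_name = author.text.strip()
    else:
        author = d.find("span", attrs={"class": "a-size-small a-color-base"})
        if author is not None:
            author_name = author.text.strip()
        else:
            author_name = "unknown"

    books.append([
        name.text.strip() if name else "",
        author_name,
        rating.text.strip() if rating else "",
        users_rated.text.strip() if users_rated else "",
        price.text.strip() if price else "",
    ])

With this block in place, get_data should return books instead of the bare soup.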

The code cell below performs the following steps:

  • Call the get_data function inside a for loop,
  • The for loop iterates over the pages, from 1 up to the number of pages,
  • Since the output is a nested list, first flatten the list and then pass it to the DataFrame,
  • Finally, save the DataFrame as a CSV file.
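
A sketch of that driver cell, assuming get_data now returns the list of rows for one page; the page count of 2 is a placeholder, since the real value comes from the site's pagination:

no_pages = 2  # placeholder; read the real count off the site's pagination

results = []
for i in range(1, no_pages + 1):
    results.append(get_data(i))

# get_data returns one list of rows per page, so flatten the nested list first.
flat_results = [row for page in results for row in page]

df = pd.DataFrame(flat_results,
                  columns=["Book Name", "Author", "Rating", "Customers Rated", "Price"])
df.to_csv("amazon_products.csv", index=False, encoding="utf-8")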

Reading CSV File

Now let’s load the CSV file you created and saved in the cell above. Again, this is an optional step; you could even use the DataFrame df directly and skip the step below.
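
A minimal version of that step (the file name matches the one passed to to_csv above):

df = pd.read_csv("amazon_products.csv")
print(df.shape)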

(100, 5)

The shape of the dataframe reveals that there are 100 rows and 5 columns in your CSV file.

Let’s print the first 5 rows of the dataset.
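
With pandas this is a one-liner:

df.head()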

Coming soon… stay tuned!
