Let’s say you are trying to extract a table from a webpage, convert it to JSON, and use that JSON to build some internal tools. With the help of web scraping, you can extract the data you want by targeting the specific elements in a webpage. Web scraping with Python is a very popular choice, as Python provides multiple libraries like BeautifulSoup and Scrapy to extract data effectively. Being able to extract data efficiently is also an important skill for any developer or data scientist. This article will help you understand how to scrape a website effectively and get the content you need so you can manipulate it according to your needs. For this tutorial, we’ll be using BeautifulSoup, a popular package for scraping data in Python.
Why use Python for Web Scraping?
Python is the first choice for many developers when building web scrapers. There are many reasons for this, but let’s discuss the top three reasons why Python is used for data scraping.
Library and community support: There are several great libraries, like BeautifulSoup, Scrapy, and Selenium, that provide excellent functions for scraping web pages effectively. Python has built an excellent ecosystem for web scraping, and because so many developers worldwide already use Python, you can quickly get help when you are stuck.
Automation: Python is famous for its automation capabilities, and a complex tool that relies on scraping usually needs more than the scraping itself. For example, if you want to build a tool that tracks the price of items in an online store, you’ll need some automation capability so that it can track the prices daily and add them to your database. Python gives you the ability to automate such processes with ease.
Data visualization: Web scraping is heavily used by data scientists, who often need to extract data from web pages. With libraries like Pandas, Python makes it simpler to manipulate and visualize the raw data you scrape.
Libraries for Web Scraping in Python
Several Python libraries make web scraping simpler. Let’s discuss the three most popular ones here.
#1. BeautifulSoup
BeautifulSoup is one of the most popular libraries for web scraping, and it has been helping developers scrape web pages since 2004. It provides simple methods to navigate, search, and modify the parse tree, and it also handles the encoding of incoming and outgoing data for you. It is well-maintained and has a great community.
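To give a flavor of the API, here is a minimal sketch that parses a made-up HTML snippet and pulls out two elements (the markup and the intro class name are purely illustrative):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document for illustration
html = "<html><body><h1>Hello</h1><p class='intro'>Web scraping with Python</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate and search the parse tree
print(soup.h1.text)                          # Hello
print(soup.find("p", class_="intro").text)   # Web scraping with Python
```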
#2. Scrapy
Scrapy is another popular framework for data extraction, with more than 43,000 stars on GitHub. It can also be used to scrape data from APIs, and it ships with some interesting built-in features, such as support for sending emails.
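For a sense of what a Scrapy spider looks like, here is a minimal sketch that crawls quotes.toscrape.com, a public demo site commonly used for scraping practice (the selectors below are specific to that site):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract each quote block on the page using CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

A standalone spider like this can be run with scrapy runspider quotes_spider.py -o quotes.json, which writes the yielded items to a JSON file.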
#3. Selenium
Selenium is not primarily a web scraping library; it is a browser automation package. But its functionality can easily be extended to scraping webpages, which is especially useful for pages that render their content with JavaScript. Selenium uses the WebDriver protocol to control different browsers, and it has been around for almost 20 years now.
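A minimal sketch of driving a browser for scraping might look like this (it assumes Selenium 4 or newer, which locates a matching browser driver automatically; the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4+ resolves the driver via Selenium Manager
try:
    driver.get("https://example.com")  # placeholder URL
    # Grab the page heading once the browser has rendered the page
    heading = driver.find_element(By.TAG_NAME, "h1")
    print(heading.text)
finally:
    driver.quit()  # always close the browser, even on errors
```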
Challenges with Python Web Scraping
One can face many challenges when trying to scrape data from websites: slow networks, anti-scraping tools, IP-based blocking, captchas, and so on. These issues can cause massive problems for a scraper, but you can bypass most of them effectively. For example, a website usually blocks an IP address when more than a certain number of requests arrive within a specific time interval. To avoid IP blocking, you’ll need to code your scraper so that it cools down between requests; a simple sketch of this appears at the end of this section. Developers also tend to set honeypot traps for scrapers. These traps are usually invisible to the human eye but will be crawled by a scraper, so if you are scraping a website that sets such traps, you’ll need to code your scraper accordingly. Captchas are another severe issue for scrapers: most websites nowadays use a captcha to block bot access to their pages, and in such cases you might need to use a captcha solver.
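Here is a minimal sketch of the cool-down idea, using a randomized delay between requests so they don’t arrive at a fixed rate (the URLs and the delay range are placeholders):

```python
import random
import time

import requests

# Placeholder URLs standing in for the pages you want to scrape
urls = ["https://example.com/page/1", "https://example.com/page/2"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Cool down for a randomized interval between requests
    time.sleep(random.uniform(2, 5))
```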
Scraping a Website with Python
As we discussed, we’ll be using BeautifulSoup to scrape a website. In this tutorial, we will scrape the historical data of Ethereum from CoinGecko and save the table data as a JSON file. Let’s move on to building the scraper.
The first step is to install BeautifulSoup and Requests. For this tutorial, I’ll be using Pipenv, a virtual environment manager for Python. You can also use Venv if you want, but I prefer Pipenv. Discussing Pipenv is beyond the scope of this tutorial, but if you want to learn how Pipenv can be used, follow this guide. Or, if you want to understand Python virtual environments, follow this guide.
Launch the Pipenv shell in your project directory by running the command pipenv shell. It will launch a subshell in your virtual environment. Now install BeautifulSoup by running pipenv install beautifulsoup4, and install Requests the same way with pipenv install requests.
Once the installation is complete, import the necessary packages into the main file: create a file called main.py and import requests along with BeautifulSoup from bs4. The next step is to get the contents of the historical data page and parse them using the HTML parser available in BeautifulSoup. The page is fetched using the get method available in the requests library, and the parsed content is stored in a variable called soup.
The actual scraping part starts now. First, you’ll need to identify the table correctly in the DOM. If you open the page and inspect it using the developer tools available in the browser, you’ll see that the table has the classes table table-striped text-sm text-lg-normal. To target this table, you can use the find method: the table is found using soup.find, and then the find_all method searches for all tr elements inside the table. These tr elements are stored in a variable called table_data.
The table has a few th elements for the titles, so a new variable called table_headings is initialized for keeping the titles in a list. A for loop is then run over the first row of the table: all th elements in this row are searched, and their text value is appended to the table_headings list. The text is extracted using the text attribute. If you print the table_headings variable now, you’ll see the column titles of the table.
The next step is to scrape the rest of the elements, generate a dictionary for each row, and append the rows to a list. This is the essential part of the code. For each tr in the table_data variable, the th elements are searched first; these hold the date shown in the table and are stored in a variable called th. Similarly, all the td elements are stored in a variable called td. An empty dictionary called data is initialized. After the initialization, we loop through the range of the td elements. For each row, we first update the date field of the dictionary: the code table_headings[0]: th[0].text assigns a key-value pair of the date heading and the first th element. The other elements are then assigned using data.update({table_headings[i+1]: td[i].text.replace('\n', '')}). Here, each td element’s text is first extracted using the text attribute, and then every \n is removed using the replace method. The value is assigned to the i+1th element of the table_headings list because the first heading, the date, is already assigned. Then, if the length of the data dictionary exceeds zero, we append the dictionary to the table_details list. You can print the table_details list to check. A complete sketch of the scraper up to this point is shown below.
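Putting the walkthrough together, here is a sketch of the scraper up to this point. The URL and the table classes are based on the description above, but CoinGecko’s markup may have changed since this article was written, so treat both as assumptions to verify with your browser’s developer tools:

```python
import requests
from bs4 import BeautifulSoup

# Ethereum's historical data page on CoinGecko (assumed URL; verify in your browser)
url = "https://www.coingecko.com/en/coins/ethereum/historical_data"

# Fetch the page and parse it with BeautifulSoup's HTML parser
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Target the table by the classes seen in the developer tools
table = soup.find("table", class_="table table-striped text-sm text-lg-normal")
table_data = table.find_all("tr")

# Collect the column titles from the th elements in the first row
table_headings = []
for th in table_data[0].find_all("th"):
    table_headings.append(th.text)

# Build one dictionary per row and append it to table_details
table_details = []
for tr in table_data:
    th = tr.find_all("th")  # the date cell of the row
    td = tr.find_all("td")  # the remaining data cells
    data = {}
    for i in range(len(td)):
        # First field: the date heading paired with the row's th element
        data.update({table_headings[0]: th[0].text})
        # Remaining fields: headings i+1 onward, with newlines stripped
        data.update({table_headings[i + 1]: td[i].text.replace("\n", "")})
    if len(data) > 0:
        table_details.append(data)

print(table_details)
```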
But we’ll be writing the values to a JSON file. We use the json.dump method to write the values into a JSON file called table.json, and once the writing is complete, we print Data saved to JSON file… to the console. Now run the file with the command python main.py. After some time, you’ll see the Data saved to JSON file… text in the console, and a new file called table.json will appear in the working directory, containing the scraped rows as a list of JSON objects. You have successfully implemented a web scraper using Python. To view the complete code, you can visit this GitHub repo. For reference, the JSON-writing step looks like the sketch below.
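Here is a minimal sketch of that final step, continuing from the table_details list built in the previous sketch:

```python
import json

# Write the scraped rows to table.json as a list of JSON objects
with open("table.json", "w") as f:
    json.dump(table_details, f, indent=2)

print("Data saved to JSON file...")
```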
Conclusion
This article discussed how you can implement a simple Python scraper. We saw how BeautifulSoup can be used to scrape data quickly from a website, discussed other available libraries, and covered why Python is the first choice of many developers for scraping websites. You may also look at these web scraping frameworks.