This is a small implementation of web scraping data from a real website using BeautifulSoup and then structuring it with Pandas. I use a Jupyter Notebook instance to execute this code.
Site used:
List of largest companies in the United States by revenue
To begin, I import the libraries and set the variables that I’ll be using to get the data.
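The setup step might look like the following sketch, assuming the standard requests + bs4 workflow; the variable names are illustrative.

```python
# Sketch of the setup step; variable names are illustrative.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"
page = requests.get(url, timeout=30)            # download the page HTML
soup = BeautifulSoup(page.text, "html.parser")  # parse it for searching
```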

I inspect the elements on the webpage to find the table class:

I use the find method to locate the ‘table’ elements in the page HTML
- I first use find and find_all to locate all the generic tables
- Then I look specifically for the first occurrence of the table with the properly named class_. This is investigative work that varies from page to page.

- I ultimately settle on assigning the table value to a variable after finding the first occurrence
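A sketch of the find/find_all step, using a tiny inline HTML snippet as a stand-in for the live page; the class name “wikitable sortable” is an assumption based on Wikipedia’s sortable tables:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page HTML; the real page holds many tables.
html = """
<table class="wikitable"><tr><td>other</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # every table in the document
# First occurrence of the class found while inspecting the page
table = soup.find("table", class_="wikitable sortable")
```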

- Then, I look for the <th> tags that contain the headers for this table and add them to a variable

- Then I extract the text using a loop that takes only the text out of those elements and clean up the data using strip()

- Troubleshooting: do not use find_all on the soup. Use it on the newly created table variable
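The header steps above might look like this sketch, again with an inline stand-in for the scraped table:

```python
from bs4 import BeautifulSoup

html = "<table class='wikitable sortable'><tr><th> Rank </th><th> Name </th></tr></table>"
table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable sortable")

# Calling find_all on the table (not on the soup) avoids picking up
# <th> tags from other tables on the page.
world_titles = table.find_all("th")
headers = [title.text.strip() for title in world_titles]  # strip() cleans the whitespace
# headers -> ['Rank', 'Name']
```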


- To operate on the data structures and create new data frames, I use pandas:

- I create the data frame from the saved headers using the DataFrame() method and assign it to a variable
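A minimal sketch of that step; the header names are illustrative:

```python
import pandas as pd

# Headers scraped in the previous step (illustrative values)
headers = ["Rank", "Name", "Revenue (USD millions)"]
df = pd.DataFrame(columns=headers)  # empty frame with the scraped headers
```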

- Back at the list of largest companies, I notice that <tr> represents rows and <td> represents data
- The first <td> in each row holds the rank number, e.g. 1 for the first row, 2 for the second, etc.

- For fetching the data, I ultimately use a for loop that finds all the <td> tags and extracts the text from each occurrence, combined with strip() to clean it; the result is stored in a variable.
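A sketch of that loop against an inline stand-in table; because the variable is reassigned on every pass, only the last row survives the loop:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>1</td><td>Walmart</td></tr>
  <tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

for row in table.find_all("tr"):
    row_data = row.find_all("td")                                # the <td> cells
    individual_row_data = [data.text.strip() for data in row_data]
# individual_row_data now holds only the last row: ['2', 'Amazon']
```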

However, the data being stored is only the last row’s data; the loop is not saving the full list. Pandas can be used to circumvent this:
- I take the length of the data frame and store each row at that position as I go through the loop. Because the first row contains an empty value (it holds only headers), it won’t fit the current structure, so I have to include “[1:]” for the loop to start at position 1 of the list instead of 0:
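The corrected loop might look like this sketch (inline stand-in table; the first <tr> holds only headers, hence the [1:]):

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
  <tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

df = pd.DataFrame(columns=[th.text.strip() for th in table.find_all("th")])

# [1:] skips the header row, whose lack of <td> cells would not fit the columns
for row in table.find_all("tr")[1:]:
    individual_row_data = [data.text.strip() for data in row.find_all("td")]
    length = len(df)                      # next empty position in the frame
    df.loc[length] = individual_row_data  # store this row instead of overwriting
```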


- The data is ready to be used in Pandas:

- To export a .csv to an output folder, use the to_csv() method. index=False removes the index column to make it look cleaner from the start.
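A sketch of the export step; the folder and file names are illustrative, and since to_csv() does not create missing folders, the output folder is created first:

```python
import os
import pandas as pd

df = pd.DataFrame({"Rank": ["1", "2"], "Name": ["Walmart", "Amazon"]})  # sample frame

os.makedirs("output", exist_ok=True)  # to_csv() will not create the folder itself
# index=False drops the numeric index column from the file
df.to_csv(os.path.join("output", "companies.csv"), index=False)
```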

The final result (Beautiful!)

Uploaded to Github
https://github.com/ArturNakauchi/DataAnalysisPortfolio/blob/main/pycode/Web Scraping Exercise.ipynb
This was a simple and lightweight execution that can be very useful for extracting huge chunks of data and making them ready to be analyzed. From here I can create visualizations and manipulate the data in creative ways.
Having said that, for this project I decided to focus only on BeautifulSoup. Future projects will use this library in more detail as well as combine it with other libraries.
Some improvements/branching ideas for this code:
- Do it again, but with something I enjoy as a hobby (Gunpla, Warhammer, Final Fantasy, etc)
- Extract different types of data on a schedule, e.g. make it save to a document every day.