This is a small implementation of web scraping data from a real website using BeautifulSoup and then structuring it with Pandas. I use a Jupyter Notebook instance to execute this code.
Site used:
List of largest companies in the United States by revenue
To begin, I import the libraries and set the variables that I’ll be using to get the data.
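The setup step might look like the following sketch, assuming the standard requests + bs4 workflow; the variable names are illustrative.

```python
# Sketch of the setup step; variable names are illustrative.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"
page = requests.get(url, timeout=30)            # download the page HTML
soup = BeautifulSoup(page.text, "html.parser")  # parse it for searching
```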

I inspect the elements on the webpage to find the table class:

I use the find method to locate the ‘table’ elements in the page HTML
- I first use find and find_all to locate all the generic tables
- Then I look specifically for the first occurrence of the table with the properly named class_. This is investigative work that varies from page to page.

- I ultimately settle on assigning the table value to a variable after finding the first occurrence
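A sketch of the find/find_all step, using a tiny inline HTML snippet as a stand-in for the live page; the class name “wikitable sortable” is an assumption based on Wikipedia’s sortable tables:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page HTML; the real page holds many tables.
html = """
<table class="wikitable"><tr><td>other</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # every table in the document
# First occurrence of the class found while inspecting the page
table = soup.find("table", class_="wikitable sortable")
```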

- Then, I look for the <th> tags that contain the headers for this table and add them to a variable

- Then I extract the text using a loop that takes only the text out of those elements and clean up the data using strip()

- Troubleshooting: do not use find_all on the soup. Use it on the newly created table variable
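The header steps above might look like this sketch, again with an inline stand-in for the scraped table:

```python
from bs4 import BeautifulSoup

html = "<table class='wikitable sortable'><tr><th> Rank </th><th> Name </th></tr></table>"
table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable sortable")

# Calling find_all on the table (not on the soup) avoids picking up
# <th> tags from other tables on the page.
world_titles = table.find_all("th")
headers = [title.text.strip() for title in world_titles]  # strip() cleans the whitespace
# headers -> ['Rank', 'Name']
```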


- To operate on the data structures and create new data frames, I use pandas:

- I create the data frame from the saved headers using the DataFrame() method and assign it to a variable
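A minimal sketch of that step; the header names are illustrative:

```python
import pandas as pd

# Headers scraped in the previous step (illustrative values)
headers = ["Rank", "Name", "Revenue (USD millions)"]
df = pd.DataFrame(columns=headers)  # empty frame with the scraped headers
```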

- Back at the list of largest companies, I notice that <tr> represents rows and <td> represents data
- The first <td> in each row holds the rank number, e.g. 1 for the first row, 2 for the second, etc.

- For fetching the data, I ultimately use a for loop that finds all the <td> tags and extracts the text from each occurrence, combined with strip() to clean it; the result is stored in a variable.
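A sketch of that loop against an inline stand-in table; because the variable is reassigned on every pass, only the last row survives the loop:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>1</td><td>Walmart</td></tr>
  <tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

for row in table.find_all("tr"):
    row_data = row.find_all("td")                                # the <td> cells
    individual_row_data = [data.text.strip() for data in row_data]
# individual_row_data now holds only the last row: ['2', 'Amazon']
```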

However, the data being stored is only the last row’s data; the loop is not saving the full list. Pandas can be used to circumvent this:
- I take the length of the data frame and store each row at that position as I go through the loop. Because the first row contains an empty value (it holds only headers), it won’t fit the current structure, so I have to include “[1:]” for the loop to start at position 1 of the list instead of 0:
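The corrected loop might look like this sketch (inline stand-in table; the first <tr> holds only headers, hence the [1:]):

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
  <tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

df = pd.DataFrame(columns=[th.text.strip() for th in table.find_all("th")])

# [1:] skips the header row, whose lack of <td> cells would not fit the columns
for row in table.find_all("tr")[1:]:
    individual_row_data = [data.text.strip() for data in row.find_all("td")]
    length = len(df)                      # next empty position in the frame
    df.loc[length] = individual_row_data  # store this row instead of overwriting
```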


- The data is ready to be used in Pandas:

- To export a .csv to an output folder, use the to_csv() method. index=False removes the index column to make it look cleaner from the start.
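A sketch of the export step; the folder and file names are illustrative, and since to_csv() does not create missing folders, the output folder is created first:

```python
import os
import pandas as pd

df = pd.DataFrame({"Rank": ["1", "2"], "Name": ["Walmart", "Amazon"]})  # sample frame

os.makedirs("output", exist_ok=True)  # to_csv() will not create the folder itself
# index=False drops the numeric index column from the file
df.to_csv(os.path.join("output", "companies.csv"), index=False)
```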

The final result (Beautiful!)

Uploaded to Github
https://github.com/ArturNakauchi/DataAnalysisPortfolio/blob/main/pycode/Web Scraping Exercise.ipynb
This was a simple and lightweight execution that can be very useful for extracting huge chunks of data and making them ready to be analyzed. From here I can create visualizations and manipulate the data in creative ways.
Having said that, for this project I decided to focus only on BeautifulSoup. Future projects will use this library in more detail as well as combine it with other libraries.
Some improvements/branching ideas for this code:
- Do it again, but with something I enjoy as a hobby (Gunpla, Warhammer, Final Fantasy, etc)
- Extract different types of data on a schedule, e.g. make it save to a document every day.