December 24, 2018

Python Web Scraping

The Project

This post covers my experience creating a simple web scraping script with Python. I use Selenium to scrape company data from the inc5000 website, which lists the 5,000 fastest-growing companies each year.

Why this project?

My general interest in automation began at my day job as an Accountant. I work at a growing firm, and as a result, there is never a shortage of work and we always seem to be doing “more with less.” In fact, I’ve noticed an increasing number of tasks appearing on my own to-do list. It’s to the point where our firm either needs more manpower or we need to automate tasks wherever possible. I think it’s fairly obvious that the latter option is the direction our world is headed.

When I started working at my current job, we already had a good amount of automation in place, including some really useful Excel macros that I use on a daily basis and that save me many hours per week. However, many of our clients change requirements or ask for new things altogether, so I felt it was necessary to learn some VBA programming, as I didn’t want to keep spending 60 hours per week in the office like I did during my first year on the job.

So, a few years ago, I picked up a VBA book (Excel 2010 Power Programming with VBA by John Walkenbach) and started reading through it. I generally learn well from textbooks, but I struggled with this one. It was massive and, quite frankly, really boring. Fortunately, I had a large number of existing macros that I could reference. I loved stepping through the code line by line on one monitor and viewing the real-time results on the other.

By far, I learned the most by reviewing the code in our existing macros and taking the time to understand what they did. I think a major reason for this was that the macros were relevant to my everyday tasks and had meaning to me. Excel is such a vast program, and most of the examples in the textbook had no significance to me. When I started really reviewing how these macros worked under the hood, I realized that many components of one macro were similar to things I needed to do in another. With a little copying, pasting, and changing a few lines of code, I had a couple of working macros before I knew it!

Google is also an incredible tool to assist you in writing macros for Excel. It is obviously useful for other things as well, but my point is that VBA has a large community of users and it’s virtually guaranteed that someone has already come across the issue you are currently having. “Googling” is an essential skill for VBA programming, but you do need to know a bit about the language first to understand what’s possible. It takes practice and experience to know what to ask, and then it takes time to figure out how to adapt the answers into your own project, since most of the time they are not verbatim solutions.

Having a little bit of success with VBA was really motivating, and the urge to keep improving became addictive. It was fascinating to be able to eliminate menial tasks from my schedule. For example, sometimes I would need to combine 50 workbooks into one to do some data analysis or reconciliation. Manually opening all of the spreadsheets one at a time and copying and pasting them into a “master” spreadsheet takes a lot of time! It was so cool to write a script that accomplished a task like this in just a few seconds when it previously took an hour. Needless to say, I saw a lot of opportunity there.

After some time and practice, I looked around on various job sites to see what opportunities were out there for VBA developers. To my disappointment, there was a very underwhelming number of openings, especially in my area. I considered freelancing and trying to get businesses to let me develop macros for them, but the problem I found with that is that most people who really need automation in their jobs have no idea it’s even possible. It would have been a major uphill battle to attract clients, and it wouldn’t have been all that profitable to begin with. So, I quickly put VBA on the back burner and started focusing my efforts on other, more popular programming languages and what they have to offer. Enter Python.

I first became aware of Python when I was searching for technology-related podcasts that I could listen to during work to help pass the time. I came across the show “Talk Python to Me” hosted by Michael Kennedy and was instantly intrigued by what people were doing with Python. The episode titled “Automating the Web with Selenium and InstaPy” is the one that inspired me to do this project. Additionally, “Automate the Boring Stuff with Python” by Al Sweigart is an extremely popular book that helps people understand the power and necessity of automation. I was able to relate many of the tasks I do at work to the examples in that book. Those are two great resources for anyone interested in getting started with Python. The podcast is a little more high-level and not quite as beginner-friendly; however, it gives a great overview of some practical uses of Python in the real world. The book is more beginner-friendly, but personally I would still complement it with a more standard textbook, some YouTube tutorials, or even a Udemy course on Python.

My main objective was to get some practice with creating a useful script that went beyond basic tutorials. I identified Python as a good and easy tool to begin doing this with. One disclaimer I have to mention is that this is my first experience using Python and it’s the first project I have made with the language. Most of the programming experience I have is with Java, but fortunately I was able to put together a working solution relatively easily.

I chose the inc5000 website mainly because small business is of interest to me. The site has the companies ranked, with the relevant company data contained on each company’s page, but I thought it might be interesting to compile it all so that it could be better used for analysis instead of just being viewed on the website itself. So, I wrote a script that visited each of the 5,000 company pages listed on the website for 2018, collected some data points, and then wrote them to a CSV file.

Why Python and Selenium?

As I mentioned in the previous section, I chose Python for this project mainly due to its ease of use and minimal setup required to get up and going. Other programming languages, such as Java, would also work for this project. Maybe next time.

I had to do a little bit of research on what tool to use to scrape the data and I’ll try and simplify my thought process.

The best options for web scraping with Python according to the overall consensus of the Internet are:

  1. Scrapy – robust library but overkill for this simple task in my opinion
  2. Urllib – sort of complicated and doesn’t handle pages with JavaScript well (we will come back to this shortly)
  3. Selenium
    • Beginner Friendly (Well, I’m a beginner…)
    • You get a real browser to see what’s going on (Cool, that’s something I liked about VBA as well.)
    • Good with pages that deal with JavaScript.

There are also some others (Beautiful Soup, LXML, etc.), but Selenium seemed like the best option for this project. When selecting a tool to use for any project, it’s important to have an idea of what kind of data, website, etc. you are going to be working with. Forks are really great tools to eat with; however, if the project were to paint a wall, you would be better served by other options, such as a paintbrush.

In order to pick the best tool, I also had to analyze the inc5000 website and take a look at its structure. I noticed that it relied heavily on JavaScript, and there were buttons that needed to be clicked in order to view more companies or display extra data. Based on my research, Selenium seemed like the best tool for handling websites with JavaScript, and that was probably the biggest factor in choosing it for the project.

Python and Selenium worked great for the project, and I was able to get exactly what I wanted in only a day’s work. The one downside I found is that Selenium is incredibly slow. The script took a few hours to run before successfully visiting all 5,000 pages and extracting the data I wanted. In a way, though, it was cool, as I was able to go to the store, do some laundry, and complete other tasks all while the program was working for me.

The Code

The full source code of the program can be found here: Source Code

The first 17 lines of the program are just for some initial setup. (A rough sketch of what this setup might look like follows the list below.)

  1. Web Driver- this is the package that allows us to browse the web, navigate to web pages, and interact with JavaScript (line 1). I used Chrome in this example, but there are other options such as Firefox and Safari. You can download the Chrome Driver here. I create the web driver object on line 13; the Chrome Driver executable sits in the same directory as the Python script.
  2. Explicit Wait- Lines 2 through 4 import packages that allow us to explicitly wait for the web page to load before extracting the data. I found these necessary because the page sometimes didn’t load fast enough, which caused certain data points not to be grabbed properly or made the program crash altogether. The explicit wait statements made sure that the program didn’t try to extract data before the page was fully loaded.
  3. No Such Element Exception- This import statement is used to handle any situations where web elements don’t exist on a particular page. During my testing of the program, I found that some of the company profile pages had a slightly different HTML structure where one of the categories, say revenue for example, was missing. If I didn’t handle this exception, the program would crash.
  4. CSV- this package allows us to read and write CSV files. We create a file on line 10 called CompanyData2018.csv and store it in the root directory next to the Chrome Driver and the Python script.
  5. Time- this package allows us to pause the script. It has other uses, but temporarily pausing the script is what I used it for in this project. I found it useful when interacting with JavaScript: giving the page a second to render new content helped ensure that all the proper elements were ready to be extracted.
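
Here is a rough sketch of what that setup might look like. It is a loose reconstruction based on the description above, not the exact source, so the ordering and variable names may differ from the real script.

    # Loose reconstruction of the setup described above (not the exact source).
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    import csv
    import time

    # CSV file that will hold the scraped data, created next to the script.
    output_file = open('CompanyData2018.csv', 'w', newline='')
    writer = csv.writer(output_file)

    # Chrome web driver; the chromedriver executable sits in the same directory.
    driver = webdriver.Chrome('./chromedriver')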

Here is the starting URL that I began the script from:

Those categories were interesting, but there is more information when you click on the company name. Below is what you see when you click on each company link. At the top you will see the company name, a quick description of what the company does, and a few other bits of information that are not in the picture. Below that, you will see the HTML code; our goal is to use Selenium to select each of the desired elements.

Below are the data points that I decided to collect (these were all extracted from the individual company pages; a small sketch of writing them out to CSV follows the list):

  1. Rank
  2. Name
  3. Revenue
  4. 3 Year Growth Rate
  5. Industry
  6. Location
  7. Year Founded
  8. Number of Employees
  9. Website
  10. Description
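
To make the CSV side concrete, here is a minimal, hedged sketch of how a header row and a single company row might be written with the csv package. Only the field names come from the list above; the example values are made up.

    import csv

    # Header row matching the data points listed above.
    FIELDS = ['Rank', 'Name', 'Revenue', '3 Year Growth Rate', 'Industry',
              'Location', 'Year Founded', 'Number of Employees', 'Website',
              'Description']

    with open('CompanyData2018.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)

        # In the real script, one row like this is written per company;
        # these values are placeholders, not actual Inc. 5000 data.
        writer.writerow(['1', 'Example Co', '$10.5M', '5,000%', 'Software',
                         'Austin, TX', '2014', '42', 'example.com',
                         'A made-up one-line description.'])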

After some trial and error, I was able to get the program working properly. If you plan on trying Selenium on your own, you’ll have to spend some time playing around with the find_element_by methods in order to select the correct elements. Sometimes it can be tricky depending on how the HTML of the page is structured, but it’s important to have patience and realize that it does get easier after a while. This is the resource I used to help me locate elements: Locating Elements.
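
As a small illustration of those methods, here is a hedged sketch that combines an explicit wait with a few locator strategies. The URL, class names, and XPath are stand-ins for this example, not the real selectors from the inc5000 pages.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome('./chromedriver')
    driver.get('https://www.example.com/company-profile')  # placeholder URL

    # Explicitly wait up to 10 seconds for a (hypothetical) element to appear
    # before touching anything on the page.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'company-name'))
    )

    # A few of the locator strategies; all of these selectors are placeholders.
    name = driver.find_element_by_class_name('company-name').text
    rank = driver.find_element_by_xpath('//span[@class="rank"]').text
    website = driver.find_element_by_css_selector('a.company-website').get_attribute('href')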

Another important thing to understand when writing a script like this is that it isn’t all that sustainable in the long term. If the owner of the website decides to completely change the page structure, for instance, the program we created will become useless, or at the very least require lots of maintenance and frequent tweaks. For example, one of the company pages I came across was “unavailable.” There were obviously no elements that could be selected in that situation. That’s a good reason to have exception handlers like the “NoSuchElementException.” That way, we can properly deal with situations like that and the program continues to run instead of crashing at the first sign of adversity.
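
Here is a small, hedged sketch of that kind of guard. The selector is a placeholder, and the driver is assumed to be the one created in the earlier setup sketch.

    from selenium.common.exceptions import NoSuchElementException

    # If a field (revenue, for example) is missing from a company page,
    # fall back to a default value instead of letting the script crash.
    try:
        revenue = driver.find_element_by_class_name('revenue').text  # placeholder selector
    except NoSuchElementException:
        revenue = 'N/A'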

Again, you may take a look at the full source code on my GitHub. It’s pretty basic Python, and I just used a few nested for loops to get the job done. I’ve made comments in the code so that you can see the specific function that each part of the program performs. Feel free to try it out on your own, but keep in mind that Selenium is really slow and visiting the 5,000 links will take several hours. On line 29, you can change the 99 to a 1 to grab just the first 100 companies. That way, you still do the part where you click the button and go to the second page. Once you have that aspect of the program down, the rest is just repeated action. Below is a picture of the button you need to click on to view the next 50 companies:
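
For readers who don’t want to open the source right away, here is a rough, hedged sketch of the overall loop shape: click through the list pages to collect company links, then visit each link and extract the data points. The selectors are placeholders, and the real script’s structure may differ in the details.

    import time

    NUM_CLICKS = 99  # change to 1 to collect only the first 100 companies

    # driver and writer come from the earlier setup sketch; selectors are placeholders.
    company_links = []

    for i in range(NUM_CLICKS + 1):
        # Give the JavaScript a second to render the newly loaded rows.
        time.sleep(1)

        # Remember the links to the company pages currently shown.
        for anchor in driver.find_elements_by_css_selector('a.company-link'):
            company_links.append(anchor.get_attribute('href'))

        # Click the button that loads the next 50 companies (except on the last pass).
        if i < NUM_CLICKS:
            driver.find_element_by_class_name('next-button').click()

    # Visit every company page and pull out the ten data points.
    for link in company_links:
        driver.get(link)
        # ... find_element_by calls and writer.writerow([...]) go here ...

    driver.quit()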

Conclusion

All in all, I had fun making this program and learned quite a bit. It felt good to end up with a working solution. I’m sure there are more elegant ways to do certain things, but I’m not going to be too hard on myself because I had only spent a few hours reading up on Python before starting the project. Python is definitely a language I want to continue learning, as it’s very popular and the job outlook looks promising. It was definitely a fun and easy language to use.

As far as the actual data I extracted goes, I’m not sure what I want to do with it, if anything. Maybe I’ll take a look at a few companies I find interesting and share some thoughts on them. This was more of a programming exercise, but it’s possible I’ll end up finding some use for the data. Anyway, if you’ve made it this far, thanks so much for taking the time to read through my post. I sincerely hope you enjoyed it and found it useful!