Every website is built differently, so there is no single magic script that can scrape them all; the art of scraping lies in deconstructing a page's HTML elements into separate objects from which we can extract, or "scrape", the information. Of course, there are scraping tools out there that automate these scripts based on your preferences (literally google "free web scraping tools"; there are a bunch of them). I have tried several of these tools, and I always end up frustrated and going back to writing the code myself, because a) they always have a pricing tier despite claiming to be free, and b) the process is overcomplicated, and some websites are simply not built intuitively enough for these tools to automate easily. This tutorial covers the steps needed to scrape data from Jobstreet Malaysia, but the process of crawling and scraping boils down to a few basic steps that can be applied to almost any website.
Web Scraping vs. Web Crawling
These two terms get thrown around a lot when it comes to automating data acquisition from the internet. They are sometimes used interchangeably, but they are actually different methods that are used together to achieve the same goal: extracting information from a website that is presented in a way that follows a pattern. This pattern is the key to writing code that can iterate from one page to another. For example, let's look at Jobstreet's search page.
There are several patterns here that we can leverage. At this point you have to ask yourself: how do I get all these links? If you weren't aiming to crawl and scrape, how would you do it manually? In this example, Jobstreet provides a paginated results page. There is a fixed number of links on each page, and they all sit in similar-looking boxes, which suggests the styling uses the same class or ID. This matters for scraping later, but for now, in order to crawl and collect all the relevant links, you need to determine how the data is populated.
Another example is Facebook, which does not use pagination; instead it uses infinite scrolling, so the method of crawling is completely different. Even when a website has pagination, we need to determine how it processes user input and moves to the next page. Most commonly, websites implement URL parameters, meaning the page number appears in the URL, like
www.website.com/2/. In these cases, going from one page to another is pretty straightforward. Websites that use things like servlets and do not take parameters in the URL are trickier to crawl, but still possible: you can imitate button clicks or imitate scrolling, for instance. Anything is possible.
Search keyword is appended in the url
Because Jobstreet uses URL parameters, it's just a matter of modifying the URL in each iteration. Here we can also see that the search keyword is appended to the URL, as in
www.jobstreet.com.my/en/job-search/data-jobs, which allows for querying multiple keywords.
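As a sketch, building those URLs is plain string formatting. The template below is an assumption based on the patterns described above; the real site's path and parameter positions may differ:

```python
# Hypothetical URL template inferred from the patterns above;
# the real site's path and parameters may differ.
BASE_URL = "https://www.jobstreet.com.my/en/job-search/{keyword}-jobs/{page}/"

def build_url(keyword: str, page: int) -> str:
    # Multi-word keywords are slugified with hyphens,
    # e.g. "data science" -> "data-science"
    slug = keyword.strip().lower().replace(" ", "-")
    return BASE_URL.format(keyword=slug, page=page)

print(build_url("data", 2))
# https://www.jobstreet.com.my/en/job-search/data-jobs/2/
```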
There are 1832 job postings, with 30 jobs listed on each page
This means we only need some math to determine how many iterations it takes to go through all the pages. We determine the class name of the text 1832 jobs by inspecting the element, and from there we can simply do the following.
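A minimal sketch of that calculation, using the result count from the screenshot above and 30 jobs per page:

```python
import math

total_jobs = 1832     # scraped from the "1832 jobs" text
jobs_per_page = 30    # jobs listed on each results page

# Round up so a partially filled last page is still crawled
num_pages = math.ceil(total_jobs / jobs_per_page)
print(num_pages)  # 62
```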
So now we have the number of pages, which allows us to crawl every results page and scrape the relevant links.
Is it Legal?
The short answer is a resounding yes! We are not hacking into anyone's private account. We are simply automating what one person could do manually by clicking through a website 200 times and copying and pasting information, turning it into a program that we can leave running for hours or even days. It ultimately saves our time and has no damaging consequences for the website owners. Ethically, though, what we do with the information is another issue, and scraping as a tool for morally ambiguous activity is up for debate: how Mark Zuckerberg scraped images of women at Harvard to create a website comparing which woman is hotter is unanimously frowned upon, but it was 100% legal.
To run the script, simply
- Clone this Github repository
- Create a virtual python environment (optional). Run the following script in terminal
python -m venv venv
source venv/bin/activate
- Once the environment is activated, install the requirements
pip install -r requirements.txt
- Download chromedriver based on your Chrome version; you can determine your version and the matching driver here
- Run python jobstreet_scraper.py in terminal and you should see the scraping process start.
Breakdown of functions
The script consists of a few key steps for successfully crawling and scraping:
1. Configure driver to browse web pages & BeautifulSoup4 to scrape html codes
This part is very important because it is what automates the process. The driver's purpose is to visit the pages we tell it to. In this code snippet, we have defined a URL template that accepts parameters and generates a fully functional URL, because we have identified its pattern. From there, we tell the driver to go to that URL, and then we use BeautifulSoup to identify the HTML elements that exist on the page. This is what makes scraping possible: every website's user interface is built from HTML and CSS, which means there are HTML tags like
<span>, among many others, that wrap around the information to produce a readable user interface. These elements normally have classes or IDs for specific styling, and the same style is reused by applying the same class, so what we need to do is identify that class name or ID, and then we can deconstruct the code down to its barest element, which is text.
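A minimal sketch of this setup, assuming Selenium with chromedriver for the browsing part; here the page source is stubbed with a literal HTML string (hypothetical markup) so the BeautifulSoup side is visible on its own:

```python
from bs4 import BeautifulSoup

# In the actual script, a Selenium driver fetches the page first, e.g.:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(url)
#   page_source = driver.page_source

# Stand-in for driver.page_source (hypothetical markup)
page_source = """
<div class="job-card">
  <a href="/en/job/data-analyst-123">Data Analyst</a>
  <span class="location">Kuala Lumpur</span>
</div>
"""

soup = BeautifulSoup(page_source, "html.parser")
title = soup.find("a").get_text()                     # text inside the <a> tag
location = soup.find("span", class_="location").get_text()
print(title, location)
```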
Now that we know how to go from page to page, let's start back at the beginning, where we have certain keywords to query.
2. Iterate through keywords
If you have only one keyword to query, you might not need the following code snippet, but in most cases you will want to scrape several keywords. Essentially, you iterate through the keywords one by one, and once you have the resulting dataframe for each keyword, you combine all the dataframes into one.
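A sketch of that loop, with a dummy stand-in for the real per-keyword crawl; the combining step uses pandas' `pd.concat`:

```python
import pandas as pd

def scrape_keyword(keyword: str) -> pd.DataFrame:
    # Stand-in for the real crawl; returns a one-row dataframe per keyword
    return pd.DataFrame({"keyword": [keyword], "job_title": [f"{keyword} engineer"]})

keywords = ["data", "python", "machine learning"]

# One dataframe per keyword, combined into a single dataframe at the end
frames = [scrape_keyword(kw) for kw in keywords]
all_jobs = pd.concat(frames, ignore_index=True)
print(len(all_jobs))  # 3
```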
3. Get page number
As mentioned briefly earlier, we need to know how many times to iterate. Alternatively, you could use a while loop that keeps going to the next page until there are no more results. That only works if the website does not redirect you to another page once the results run out, which unfortunately is what Jobstreet does, so in this case we determine the number of pages by looking at the stated number of results. In order to scrape that particular text, we inspect the element by going to
More Tools > Developer Tools, which gives us the element picker tool to find the HTML code for that element. For Jobstreet, this is the code related to the results page.
As you can see, the text is wrapped in a span tag with the class name
"FYwKg _2Bz3E C6ZIU_0 _1_nER_0 _2DNlq_0 _29m7__0 _1PM5y_0". It is quite common for an element to have more than one CSS class, which is fine in our case because we want a single value.
Scraping text wrapped in a span tag like that looks like the following code. After the text is scraped, our job is not done, because we want to perform calculations on the number, which means it needs to be converted from string to integer. That is not entirely straightforward, because the whole sentence contains actual words, so what we need to do is split the string into a list of strings, which lets us choose the string with just the number, the third element from the right. From there, we convert it to an integer and perform the calculation.
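A sketch of that extraction; the exact wording of the results text is an assumption (here, a string whose count sits third from the right), so adjust the index to whichever token holds the number on your page:

```python
import math

# Hypothetical results text as scraped from the span above
result_text = "1-30 of 1832 jobs found"

tokens = result_text.split()    # ['1-30', 'of', '1832', 'jobs', 'found']
total_jobs = int(tokens[-3])    # third element from the right -> 1832

# 30 jobs per page; round up to include a partially filled last page
num_pages = math.ceil(total_jobs / 30)
```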
4. Iterate through pages
So now that we have the number of pages, we iterate exactly that many times, with each iteration updating the page number parameter in the URL and scraping the links on that page, repeating the earlier steps of inspecting elements and determining the tag and class name of the element we want to scrape. In this case we want links, so normally that is an
<a> tag, and the class name should be consistent because the styling is the same for all results.
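A sketch of that loop, with a stubbed fetch in place of the Selenium driver and a hypothetical URL template:

```python
# Hypothetical URL template; the real path may differ
URL_TEMPLATE = "https://www.jobstreet.com.my/en/job-search/data-jobs/{page}/"

def fetch_links(url: str) -> list[str]:
    # Stand-in for: driver.get(url) followed by
    # soup.find_all("a", class_=...) to collect each result's href
    return [f"{url}job-{i}" for i in range(3)]

num_pages = 8  # computed earlier from the results count
all_links = []
for page in range(1, num_pages + 1):
    page_url = URL_TEMPLATE.format(page=page)
    all_links.extend(fetch_links(page_url))

print(len(all_links))  # 24
```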
5. Time delay to avoid suspicion
Sometimes it's important to delay the process for a certain amount of time, because we might otherwise cause heavy traffic in a short period, which could raise suspicion. For some websites, it is therefore important to add a delay of a second or two at obvious junctions of the program. If the website has no issue with it, you can skip this part.
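As a sketch, a small helper with a randomized pause (a fixed interval looks more robotic than a jittered one):

```python
import random
import time

def polite_pause(min_s: float = 1.0, max_s: float = 2.0) -> None:
    # Sleep a random 1-2 seconds so requests don't arrive in a rigid rhythm
    time.sleep(random.uniform(min_s, max_s))

# Call it between page loads, e.g.:
# for page in range(1, num_pages + 1):
#     driver.get(build_url(page))   # build_url is hypothetical
#     polite_pause()
```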
6. Scrape relevant data
Now that we have the links, the next step is to go to each link and scrape the relevant information. Normally we would repeat the process of inspecting elements. However, if you're lucky, some websites embed a JSON version of the information in the page's script tags, which makes the job easier because the keys are consistent throughout.
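A sketch of pulling such embedded JSON out of the page source with the standard library; the script tag contents and field names here are hypothetical:

```python
import json
import re

# Stand-in for driver.page_source; the JSON fields are hypothetical
page_source = """
<script type="application/ld+json">
{"title": "Data Analyst", "hiringOrganization": {"name": "Acme Sdn Bhd"}}
</script>
"""

# Grab the JSON blob between the script tags, then parse it
match = re.search(
    r'<script type="application/ld\+json">\s*(\{.*\})\s*</script>',
    page_source,
    re.DOTALL,
)
job = json.loads(match.group(1))
print(job["title"])  # Data Analyst
```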
7. Save data
Once the program runs successfully with no errors, it is important to save the dataframe as a CSV to your local root folder.
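As a sketch, with pandas (the filename is arbitrary):

```python
import pandas as pd

# Dummy stand-in for the combined scrape results
all_jobs = pd.DataFrame(
    {"job_title": ["Data Analyst"], "company": ["Acme Sdn Bhd"]}
)

# index=False keeps the row index out of the CSV file
all_jobs.to_csv("jobstreet_jobs.csv", index=False)
```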
And there you have it: a web crawler and scraper that can collect job listing information for you!