There is a lot of information out there that can be helpful for research or personal interests, and the hardest part is dealing with information spread across different sources and keeping track of how far you have got; that is why web scraping is so popular these days. With a few simple lines of code and a bit of structure, all that information can be gathered and filtered in one place.
In this article, I'll walk you through the entire pipeline of scraping web pages, front to back. I'll set up an application that searches for part-time remote jobs, but the same process and tools can be applied to any static website on the web. We could of course craft some beautiful C# code, but there would be one problem with it: it is not Python, and everybody is asking for scraping with Python.
Please note that this article is for educational purposes only. Always consult the website-specific terms and conditions to verify whether web scraping is allowed by the website owner. If scraping is not mentioned in the legal-related pages, another place to check is the robots.txt file.
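Python's standard library even ships a robots.txt parser, so a quick allow-check (using the monster.com URL from this article as an example) can look like this:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
parser = RobotFileParser("https://www.monster.com/robots.txt")
parser.read()

# check whether a generic crawler is allowed to fetch the search page
print(parser.can_fetch("*", "https://www.monster.com/jobs/search/"))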
Research your data source
The first and most important part of web scraping is making sure your data exists in a structured manner. If the piece of information you need is mixed into unstructured flat text, scraping is not the right approach to the problem; however, chances are the data is structured somehow. Take the monster.com index page, for instance: all jobs are inside a div whose id is "SearchResults".
Depending on the document structure, you can choose between CSS selectors and XPath, the former being the faster of the two and the latter being the more powerful. Beyond being a fan of one or the other, you can usually decide based on whether you need to traverse the DOM tree in both directions or one way only. Put your energy into maintainable code and don't let speed drive your choice, because web scraping is in the business of gathering data, not in the business of milliseconds.
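To make the difference concrete, here is a small sketch that targets the same elements both ways: with a CSS selector through BeautifulSoup and with an XPath expression through lxml. The inline HTML fragment is just an illustration, since the real page markup may change:

from bs4 import BeautifulSoup
from lxml import html

page = '<div id="SearchResults"><section class="card-content">Job A</section></div>'

# CSS selector: descendant sections of the results div
soup = BeautifulSoup(page, 'lxml')
css_hits = soup.select('div#SearchResults section.card-content')

# XPath: the same query through lxml (BeautifulSoup itself has no XPath support)
tree = html.fromstring(page)
xpath_hits = tree.xpath('//div[@id="SearchResults"]//section[@class="card-content"]')

print(css_hits[0].text, xpath_hits[0].text)  # Job A Job A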
Study the URL: there is a good chance that just by constructing the URL you can query the website, without having to submit any form programmatically. For example, visit https://www.monster.com/jobs/advanced-search/ and complete the form as follows: Job Title: C#, Location: Remote, Job Type: Part-Time, then hit search. Notice that the URL has become https://www.monster.com/jobs/search/Part-Time_8?q=C__23&where=Remote, which can be split into:
- /Part-Time_8: the job type, added as a route component
- ?q=C__23: C# as an encoded query string
- &where=Remote: the location, also part of the query string
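If you want to build such URLs from code rather than by hand, urllib can assemble the query string for you. Note that monster.com itself displays C# as C__23; whether the standard percent-encoded form below is accepted the same way is worth verifying in the browser:

from urllib.parse import urlencode

base = 'https://www.monster.com/jobs/search/Part-Time_8'
params = {'q': 'C#', 'where': 'Remote'}

# urlencode percent-encodes the values, so '#' becomes %23
url = f'{base}?{urlencode(params)}'
print(url)  # https://www.monster.com/jobs/search/Part-Time_8?q=C%23&where=Remote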
Develop the work plan
Now that we are sure the data is structured, we know scraping will be a breeze, but before jumping on the keyboard let's develop the work plan.
If you look at monster.com or indeed.com, neither of them shows all the job information on the main screen: Title, Location, and Company are already present, but you have to click on a job box to view the description. With this in mind, we know that one request will retrieve all the jobs' metadata, and another request will be made for each individual job, so the plan looks like this:
- Retrieve website data
- Parse job metadata
- For each job, retrieve the description response
- Parse the description response
- Repeat the process for each job board
- Filter gathered data based on title/description keywords that you want or don't want to be present
Implementation
Enter your virtual environment and install the following packages:
- BeautifulSoup4: a Python library for navigating and querying structured documents such as HTML and XML
- lxml: a powerful and fast library for parsing XML and HTML; we will instruct BeautifulSoup to use this package instead of Python's default html.parser
- requests: the package we are going to use for downloading web content
Since we will be doing numerous downloads, let's make downloading a reusable function:
import requests
from requests.exceptions import HTTPError


class HttpHelpers:
    def __init__(self):
        # reuse one session for all downloads so connections can be pooled
        self.session = requests.Session()

    def download_page(self, url):
        try:
            response = self.session.get(url)
            response.raise_for_status()
        except HTTPError as http_err:
            print(f'Http error occurred: {http_err}')
            return None
        except Exception as err:
            print(f'A generic error occurred: {err}')
            return None
        else:
            return response.content
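Using the helper is then a one-liner per page; for example, with the filtered search URL built earlier:

helpers = HttpHelpers()
htmlcontent = helpers.download_page('https://www.monster.com/jobs/search/Part-Time_8?q=C__23&where=Remote')

if htmlcontent is None:
    print('Download failed, nothing to parse')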
Next comes parsing the main result. Notice that every job is inside a section element with a CSS class named card-content. Load the HTML content returned by the previous function into the BeautifulSoup class:
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlcontent, 'lxml')

# get the container with all jobs
jobs_container = soup.find(id='ResultsContainer')

# as you have noticed by now, the result of BeautifulSoup's find function is itself a queryable DOM item.
# Let's get all sections that represent a single job posting
job_items = jobs_container.find_all('section', class_='card-content')
if job_items is None or len(job_items) == 0:
    return []
Now we loop over all the results; for each of them we already have Title and Company, but no description. That information can be retrieved from the link in the job title, and BeautifulSoup offers the get method to read attribute values out of an HTML tag.
all_jobs = []
for job_elem in job_items:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    url_elem = job_elem.find('a')
    if None in (title_elem, company_elem, url_elem):
        continue

    # get the full url of this listing
    href = url_elem.get('href')
    if href is None:
        continue

    item = {
        "title": title_elem.text.strip(),  # use the .text property to retrieve text content
        "company": company_elem.text.strip(),
        "href": href,
        "description": "",
        "description_text": ""
    }
    all_jobs.append(item)
Now that we have the job metadata, it's time to retrieve the full description of each listing. For every job returned, we download the page at its href, parse the content, and update the listing dictionary.
# __parse_details (defined in the full source) returns a pair:
# index 0 fills description_text, index 1 fills description
parsed_details = self.__parse_details(job_content)
job["description_text"] = parsed_details[0]
job["description"] = parsed_details[1]
You can get the full source of this application at https://github.com/ermirbeqiraj/web-scraper. To run it, you will need to update the settings.ini file:
# put the monster.com URL here, filtered as for your case
MONSTER_URL=https://www.monster.com/jobs/search/?q=C__23&where=Remote
# put indeed.com URL in here, filtered as for your case
INDEED_URL=https://www.indeed.com/jobs?as_and=C%%23&jt=all&l=Remote&fromage=1
# keywords you don't want a job description to have
DESCRIPTION_DOESNT_CONTAIN=php,ruby,go
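The DESCRIPTION_DOESNT_CONTAIN keywords drive the final filtering step from the work plan. How the settings are loaded is up to the full source; as a minimal sketch of the filtering itself (the function name and the way the keyword list is obtained here are illustrative, not the repo's actual code):

def filter_jobs(jobs, banned_keywords):
    # keep only jobs whose title and description mention none of the banned keywords
    kept = []
    for job in jobs:
        haystack = f"{job['title']} {job['description_text']}".lower()
        if not any(keyword.strip().lower() in haystack for keyword in banned_keywords):
            kept.append(job)
    return kept

# e.g. with the value of DESCRIPTION_DOESNT_CONTAIN split on commas
filtered_jobs = filter_jobs(all_jobs, "php,ruby,go".split(","))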