Analyzing data scientist growth rate from Naukri website using web scraping
Presented by Madhan Kumar Selvaraj
What you will get from this blog
- Basics of extracting data from a web page using the Scrapy web-scraping framework in Python
- Integrating Python with a MySQL database: loading, fetching and manipulating data in a few lines of code (it looks good, right?)
- PySpark, a big data cluster-computing framework that handles billions of records with in-memory processing
- Basic concepts of Pandas, Seaborn and other visualization tools
Basic flow chart of the project
In this blog, we are going to explore the current trend of the data scientist role using the following technologies:
- Python Scrapy
- MySQL
- PySpark
- Pandas
- Seaborn
This blog is not for those
- Who are new to programming
- Who have no idea about web scraping
- Who are not familiar with big data technologies
Prerequisites
- Python 3.X version
- Refer to my blog, where you will find all the reference links to install Python and other dependencies
- Scrapy
- Install Scrapy by following the link. Later we'll go through the basics of Scrapy
- Anaconda for Jupyter Notebook
- From this site, you will get Conda, Jupyter Notebook, the Spyder IDE and many more
- PyCharm IDE (Optional)
- I personally like this IDE for Python coding. Refer to my previous blog
- MySQL server
- Download MySQL from the official site and select the MySQL Server option while installing
- Java
- PySpark runs on a Java environment, so refer to the official site
- PySpark
- Later in this blog, I'll explain the installation process
Web scraping
Web scraping is the process of extracting data from a web page programmatically. Here we use the Naukri website to extract data scientist job postings with Python Scrapy. I chose this website because it is one of the leading job portals in India, so we will get accurate data from it.
Note - We are not using the Naukri website for any commercial or money-making purpose, and scraping it for educational purposes is legal. Refer to this link for more.
Getting started with Scrapy
I don't want to spend much time on the basics; instead, I'll concentrate on the scraping techniques here. Refer to the link to learn the basics of Scrapy and build a few simple spiders to get familiar with it.
Open the path where you want to create your project and run the command below to create the project; I used the name ds_growth_rate.
scrapy startproject ds_growth_rate

You will get output like below and Scrapy will automatically create the necessary files.
Before starting the coding part, we need to understand how the Naukri website loads its data and how it is processed. Scrapy gives us an option to load the website and test its response. Here I am loading Naukri's data scientist job posting page in the Scrapy shell.
scrapy shell "https://www.naukri.com/data-scientist-jobs"

Scrapy will load your request and you will get output like below.
Now run the view(response) command to open the response fetched by Scrapy in your browser and view the data.
It is clear that the Naukri website is checking whether the user is a human or a robot; that's why it challenges our request with a CAPTCHA. We need to analyze the website further to extract its data.
Here I inspected the page in Firefox by right-clicking on the second page of the Naukri results and selecting the Inspect Element option. Then I opened the Network tab to check the requests and responses handled by the website. I have annotated the image below for reference.
Here I found that the Naukri website sends body data named params to perform a POST request, so I tested it with the Postman application to get a detailed view.
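If you want to replay the same request outside Postman, the short sketch below sends the POST with Python's requests library. The URL and the shape of the body come from the spider code later in this post; the shortened params string and the Content-Type header are assumptions on my part, so substitute the full body data you captured in the Network tab.

# Minimal sketch of replaying the captured POST request (not the original spider code)
import requests

url = "https://www.naukri.com/data-scientist-jobs"
# Shortened placeholder; paste the full params string captured in the browser here
body_data = "qp=data+scientist&ql=bangalore&qs=r&qsb_section=home"

response = requests.post(
    url,
    data=body_data,
    headers={"Content-Type": "application/x-www-form-urlencoded"},  # assumed form-encoded body
)
print(response.status_code)   # 200 means the server accepted the request
print(len(response.text))     # size of the returned HTML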
At last, we have found a way to extract the actual data from the website. Now we need to pull the required fields out of the response HTML using XPath, CSS selectors or regular expressions.
Tip - Nowadays, websites will block your IP address when you send too many requests. I personally prefer to save the HTML file on my local machine, run all my analysis against it, and only then hit the actual website server.
Here I store the response from the website in HTML format by sending the POST request. Save this file under the spiders folder; D:\Project\ds_growth_rate\ds_growth_rate\spiders is the path of the spiders folder on my system.
import scrapy
import json


class QuotesSpider(scrapy.Spider):
    name = "naukri"  # Spider name
    body_data = "qp=data+scientist&ql=bangalore&qe=&qm=&qx=&qi%5B%5D=&qf%5B%5D=&qr%5B%5D=&qs=r&qo=&qjt%5B%5D=&qk%5B%5D=&qwdt=&qsb_section=home&qpremTagLabel=&sid=15810530081808&qwd%5B%5D=&qcf%5B%5D=&qci%5B%5D=&qck%5B%5D=&edu%5B%5D=&qcug%5B%5D=&qcpg%5B%5D=&qctc%5B%5D=&qco%5B%5D=&qcjt%5B%5D=&qcr%5B%5D=&qctags%5B%5D=&qcl%5B%5D=&qrefresh=&xt=adv&qtc%5B%5D=&fpsubmiturl=https%3A%2F%2Fwww.naukri.com%2Fdata-scientist-jobs-in-bangalore&qlcl%5B%5D=&latLong="

    def start_requests(self):
        url = "https://www.naukri.com/data-scientist-jobs"
        yield scrapy.Request(url=url, method='POST', body=self.body_data, callback=self.parse)

    def parse(self, response):
        file_data = open("naukri_file.html", "wb")
        file_data.write(response.body)
        file_data.close()

Here our spider name is "naukri"; you can see it in the name attribute of the class, and "naukri_file.html" is the file where the response is saved. Run the command below to execute the spider.
scrapy crawl naukri

The response will be stored as an HTML file under the spiders folder. To load that file in the Scrapy shell, run the command below, changing the file path according to your machine.
scrapy shell file:///D:/Project/ds_growth_rate/ds_growth_rate/spiders/naukri_file.html

Now we will use XPath to find the posting details such as company name, experience, skills, location and more. Let us extract the company name of the first posting shown in the image below.
Here is the XPath expression to extract the company name.
response.xpath('.//span[@class="org"]/text()').extract_first()

I extracted a few fields using XPath and populated them in the image below. Refer to this link to get familiar with XPath. I used only XPath here; if you want, you can use CSS selectors and regular expressions as well.
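As a quick illustration, a few more expressions tried in the same Scrapy shell session are shown below. The class names (desig, exp, loc, skill) are the ones used in the complete spider later in this post, so treat them as assumptions that may change if Naukri updates its markup.

# Run inside the Scrapy shell loaded with naukri_file.html
response.xpath('.//li[@class="desig"]//text()').extract_first()    # job designation
response.xpath('.//span[@class="exp"]//text()').extract_first()    # experience range, e.g. "2-5 yrs"
response.xpath('.//span[@class="loc"]//text()').extract_first()    # job location
response.xpath('.//span[@class="skill"]//text()').extract_first()  # required skills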
Loading the scraped data into the MySQL database
I hope you have installed the MySQL server on your machine. You also need to install the MySQL connector with the pip command below.
pip install mysql-connector-python

For those who are not familiar with it, walk through the basics of using MySQL from Python first.
I created a MySQL database named "ds_growth" and a table named "ds_tables". Create them by following the above-mentioned tutorial.
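If you would rather not follow a separate tutorial, here is a minimal sketch that creates the same database and table with mysql-connector. The column layout mirrors the CREATE TABLE statement in the ds_growth_db file below; the credentials are placeholders you need to replace.

# Sketch: create the ds_growth database and ds_tables table (replace the placeholder credentials)
import mysql.connector

mydb = mysql.connector.connect(host="localhost", user="your_username", passwd="your_password")
mycursor = mydb.cursor()

mycursor.execute("CREATE DATABASE IF NOT EXISTS ds_growth")
mycursor.execute("USE ds_growth")
mycursor.execute(
    "CREATE TABLE IF NOT EXISTS ds_tables ("
    "designation VARCHAR(255), company VARCHAR(255), skill VARCHAR(300), "
    "salary VARCHAR(255), posted_on VARCHAR(255), location VARCHAR(255), "
    "min_experience VARCHAR(255), max_experience VARCHAR(255))"
)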
Complete code for web scraping and loading the data into the database
I have tried to explain each step with comments in the code. I hope you will find it clear.
from .ds_growth_db import insert_data  # Importing insert_data function to load data into the database
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "dsgrowth"  # Spider name
    minimum_experience = 'NA'  # Default value if there is no value while scraping
    maximum_experience = 'NA'
    next_page = []
    # Body data for the POST method
    body_data = "qp=data+scientist&ql=bangalore&qe=&qm=&qx=&qi%5B%5D=&qf%5B%5D=&qr%5B%5D=&qs=r&qo=&qjt%5B%5D=&qk%5B%5D=&qwdt=&qsb_section=home&qpremTagLabel=&sid=15810530081808&qwd%5B%5D=&qcf%5B%5D=&qci%5B%5D=&qck%5B%5D=&edu%5B%5D=&qcug%5B%5D=&qcpg%5B%5D=&qctc%5B%5D=&qco%5B%5D=&qcjt%5B%5D=&qcr%5B%5D=&qctags%5B%5D=&qcl%5B%5D=&qrefresh=&xt=adv&qtc%5B%5D=&fpsubmiturl=https%3A%2F%2Fwww.naukri.com%2Fdata-scientist-jobs-in-bangalore&qlcl%5B%5D=&latLong="

    # Our program starts in this function
    def start_requests(self):
        url = "https://www.naukri.com/data-scientist-jobs"  # Main URL
        yield scrapy.Request(url=url, method='POST', body=self.body_data, callback=self.parse)

    def parse(self, response):
        # To handle exceptions
        try:
            dict_data = {'key_data': []}  # Used to load data into the database effectively
            for row_data in response.css('div.row '):  # To extract all the jobs from a page
                designation = row_data.xpath('.//li[@class="desig"]//text()').extract_first()
                if designation is not None:  # Checking the value
                    company = row_data.xpath('.//span[@class="org"]/text()').extract_first()
                    salary = row_data.xpath('.//span[@class="salary"]//text()').extract_first()
                    skill = row_data.xpath('.//span[@class="skill"]//text()').extract_first()
                    posted_on = row_data.xpath('.//span[@class="date"]//text()').extract_first()
                    experience = row_data.xpath('.//span[@class="exp"]//text()').extract_first()
                    location = row_data.xpath('.//span[@class="loc"]//text()').extract_first()
                    salary = "NAN" if (salary.strip() == "Not disclosed" or salary.strip() == "") else salary.strip()
                    # Splitting experience value into min and max
                    # Strip used to remove the unwanted spaces
                    if experience is not None:
                        # Wrap single values in a list so the indexing below stays safe
                        split_result = experience.strip('yrs').split('-') if "-" in experience else [experience]
                        if len(split_result) > 1:
                            self.minimum_experience = (split_result[0]).strip()
                            self.maximum_experience = (split_result[1]).strip()
                    # To convert values into a list of tuples
                    extracted_data = (designation.strip(), company.strip(), skill.strip(), salary.strip(),
                                      posted_on.strip(), location.strip(),
                                      self.minimum_experience, self.maximum_experience)
                    dict_data['key_data'].extend([extracted_data])
            # Insert the extracted data into the database
            insert_data(dict_data['key_data'])
            # Iterate to the next page to scrape the data
            self.next_page = response.xpath('//div[@class="pagination"]/a//@href').extract()
            self.next_page = self.next_page[-1]
            if self.next_page is not None:
                yield scrapy.Request(self.next_page, method='POST', body=self.body_data, callback=self.parse)
        except Exception as e:
            print(e)
The file below is used to load data into the MySQL database; save it with the name "ds_growth_db". Use your own database username, password, database name and table name.
import mysql.connector  # MySQL DB connector

mydb = mysql.connector.connect(
    host="localhost",
    user="your_username",
    passwd="your_password",
    database="database_name")
mycursor = mydb.cursor()


# To create database and tables
def create_database():
    mycursor.execute("CREATE DATABASE database_name")
    mycursor.execute("CREATE TABLE table_name (designation VARCHAR(255), company VARCHAR(255), skill VARCHAR(300), salary VARCHAR(255), posted_on VARCHAR(255), location VARCHAR(255), min_experience VARCHAR(255), max_experience VARCHAR(255))")
    # To view the created tables
    mycursor.execute("SHOW TABLES")
    for x in mycursor:
        print(x)


def insert_data(extracted_data):
    sql = """INSERT INTO table_name (designation, company, skill, salary, posted_on, location, min_experience, max_experience)
             VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"""
    mycursor.executemany(sql, extracted_data)
    mydb.commit()
    print(mycursor.rowcount, "was inserted.")


# To check the loaded data
def check():
    mycursor.execute("SELECT * FROM table_name")
    myresult = mycursor.fetchall()
    for x in myresult:
        print(x)


# Call these functions only if it is needed
# create_database()
# insert_data()
# check()

Execute the program by using the below command.
scrapy crawl dsgrowth

I successfully extracted 7,107 job postings from the Naukri website and also printed the top values from the database.
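To confirm the load, a short query like the sketch below prints the row count and a few sample rows. The database and table names (ds_growth, ds_tables) are the ones created earlier; the credentials are placeholders.

# Sketch: verify the scraped rows landed in MySQL (replace the placeholder credentials)
import mysql.connector

mydb = mysql.connector.connect(host="localhost", user="your_username",
                               passwd="your_password", database="ds_growth")
mycursor = mydb.cursor()

mycursor.execute("SELECT COUNT(*) FROM ds_tables")
print("Rows loaded:", mycursor.fetchone()[0])

mycursor.execute("SELECT designation, company, location FROM ds_tables LIMIT 5")
for row in mycursor.fetchall():
    print(row)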
We have completed the scraping part and loaded the data into the database. Now we'll use PySpark to fetch the data and do some basic visualization.
Note - I had planned to apply machine learning algorithms like linear regression and KNN in this project. Unfortunately, all the values in the salary column are NULL, so I'll perform some basic visualization using the skills and location data instead.
PySpark
Install Java and Anaconda on your machine if you haven't installed them yet. Follow the links below to install PySpark.
- https://medium.com/@naomi.fridman/install-pyspark-to-run-on-jupyter-notebook-on-windows-4ec2009de21f
- https://datainsights.de/setup-pyspark-on-windows-laptop-2/
- https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()

You'll get the result below if you installed PySpark correctly; otherwise, follow the previous links to install it properly on your machine.
Now we need to fetch the data from the MySQL database. I spent a lot of time here connecting the database with PySpark. First, download the MySQL connector from this link. After extracting it you will have two files: a bin jar file and another jar file. Copy those files into the spark/jars folder; mine is under the path C:\Users\S\Spark\jars.
Run the commands below to link PySpark with the MySQL database. Use your own database name, username, password and table name; change them in the code below.
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('appName').setMaster('local')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.getOrCreate()
dataframe_mysql = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/database_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="table_name",
    user="your_username",
    password="your_password").load()
dataframe_mysql.show()
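Once the table is available as a Spark DataFrame, a quick aggregation is a handy sanity check before moving on to visualization. This is only a sketch; it assumes the location column created in the earlier table definition.

# Sketch: count postings per location on the Spark DataFrame fetched above
from pyspark.sql import functions as F

dataframe_mysql.groupBy("location").count().orderBy(F.desc("count")).show(10)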
Visualization part
I did some basic manipulation of the data extracted from the database, such as splitting the skills and location fields into the required format. I used a Jupyter notebook for the visualization, so check the link below for the data manipulation using pandas, NumPy and Seaborn. I also included basic graphs to understand the growth of the data scientist role.
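The exact manipulation lives in the linked notebook; the snippet below is only a rough sketch of the idea. It converts the Spark DataFrame to pandas, splits the comma-separated skill strings (an assumption about the stored format), and draws a bar chart of the most frequent skills with Seaborn.

# Sketch: split the skill column and plot the top skills (column names from the earlier table)
import seaborn as sns
import matplotlib.pyplot as plt

jobs_pd = dataframe_mysql.toPandas()      # bring the Spark DataFrame into pandas

skills = (jobs_pd["skill"]
          .str.split(",")                 # assumption: skills are stored comma-separated
          .explode()
          .str.strip()
          .str.lower())
top_skills = skills.value_counts().head(10)

sns.barplot(x=top_skills.values, y=top_skills.index)
plt.xlabel("Number of postings")
plt.ylabel("Skill")
plt.title("Most requested skills in data scientist postings")
plt.show()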
Based on the skills
The bar chart below is calculated from the 7K+ data scientist job postings. It is clear that data science, big data technologies and Python are more in demand than R and Java.
Based on the location
Most of the companies hiring for the data scientist role are in the Bengaluru region, which accounts for 34.2% of all the listed locations.
Complete code
We have come to everyone's favorite part: ready-made code. Use this GitHub link to download it.
Finally, it is clear that most of the companies are in the Bengaluru region and they expect skills in data science, Python and the basics of big data technology.
P.S. - In an upcoming project I'll create an Artificial Intelligence (AI) chatbot along with the Django web framework. I hope everyone is doing fine. Share this post and leave your thoughts in the comments. Almost six million people die from tobacco use and 2.5 million from harmful use of alcohol each year worldwide, the World Health Organization (WHO) reports. Stop consuming alcohol and avoid smoking. Love yourself. Bye.