Analyzing the data scientist growth rate from the Naukri website using web scraping

Presented by Madhan Kumar Selvaraj



What you will get from this blog

  1. Basics of extracting data from a web page using the Scrapy framework in Python
  2. Integrating Python with a MySQL database: loading, fetching and manipulating data in a few lines of code (sounds good, right?)
  3. PySpark, a big data cluster-computing framework that handles billions of records through in-memory processing
  4. Basic concepts of Pandas, Seaborn and other visualization tools

Basic flow chart of the project

In this blog, we are going to explore the current trend of the data scientist role using the following technologies

  • Python Scrapy
  • MySQL
  • PySpark
  • Pandas
  • Seaborn

This blog is not for those who

  1. Are new to programming
  2. Have no idea about web scraping
  3. Are not familiar with big data technologies

Don't worry, I'll refer you to a few links to get familiar with the above concepts.

Prerequisites

  • Python 3.X version
    • Refer to my blog, where you will get all the reference links to install Python and other dependencies
  • Scrapy
    • Install Scrapy by following the link. Later, we'll go through the basics of Scrapy
  • Anaconda for Jupyter notebook
    • From this site, you will get Conda, Jupyter notebook, Spyder IDE and many more
  • Pycharm IDE (Optional)
    • I personally like this IDE for Python coding. Refer to my previous blog
  • MySQL server
    • Download MySQL from the official site and select the MySQL Server option while installing
  • Java
  • PySpark
    • Later in this blog, I'll explain the installation process (a quick sketch of the pip installs follows this list)
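For quick reference, these are the standard pip commands for the Python packages used later in this blog. This is only a minimal sketch assuming a plain pip setup; adjust it if you install through Conda or follow the links above instead.
pip install scrapy
pip install mysql-connector-python
pip install pyspark
pip install findspark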

Web scraping

Web scraping is the process of extracting data from a web page using some technology. Here we use the Naukri website to extract data scientist job posting data with Python Scrapy. I chose this website because it is one of the leading job portals in India, so we will get accurate data from it.
Note - We are not using the Naukri website for any commercial or money-minded purpose, and scraping it for educational purposes is legal. Refer to this link for more.

Getting started with Scrapy

I don't want to explain much about the basics; instead, I'll concentrate on the scraping techniques here. Refer to the link to get some basics about Scrapy, and also create a few basic spiders to get familiar with it.
Open the path where you want to create your project and run the below command to create the project; I used the name ds_growth_rate.
scrapy startproject ds_growth_rate 
You will get output like below, and Scrapy will automatically create the necessary files.
Before doing the coding part, we need to understand how the Naukri website loads its data and how it processes requests. Scrapy gives us an option to load the website and test its response. Now I am loading Naukri's data scientist job posting webpage in the Scrapy shell.
scrapy shell "https://www.naukri.com/data-scientist-jobs"
Scrapy will load your request and you will get output like below

Now run the "view response" command to view the response fetched by the scrapy and in the browser you can view the data


It is clear that the Naukri website is checking whether the user is a human or a robot; that's why it challenges our request with a Captcha. Now we need to analyze the website further to extract its data.
Here I inspected the page in Firefox by right-clicking the second page of the Naukri results and selecting the Inspect Element option. Then I opened the Network tab to check the requests and responses handled by the website. Below I annotated the image for reference.

Here I found that the Naukri website sends body data named params to perform a POST request. So I tested it with the Postman application to get a detailed view.
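If you want to verify the same request from Python rather than Postman, here is a minimal sketch using the requests library; it is only an illustration, since the actual spider below sends the same POST through Scrapy. The body string here is shortened - use the full params value from the spider code further below.
import requests

# Same params body captured from the browser's Network tab (shortened; see the spider code for the full string)
body_data = "qp=data+scientist&ql=bangalore&qs=r&qsb_section=home&xt=adv"

response = requests.post(
    "https://www.naukri.com/data-scientist-jobs",
    data=body_data,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
print(response.status_code)   # 200 means the POST was accepted
print(response.text[:500])    # first part of the returned HTML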
At last, we found a way to extract the actual data from the website. Now we need to extract the required data from the response HTML using XPath, CSS selectors or regular expressions.
Tip - Nowadays, websites will block your IP address when you send too many requests. I personally prefer to save the HTML file on my local machine, run all my analysis against it, and only then hit the actual website server.
Here I am storing the response from the website in HTML format by sending a POST request. Save this file under the spiders folder. D:\Project\ds_growth_rate\ds_growth_rate\spiders is the path of the spiders folder on my system.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "naukri"  # Spider name
    # Body data (params) captured from the browser's Network tab for the POST request
    body_data = "qp=data+scientist&ql=bangalore&qe=&qm=&qx=&qi%5B%5D=&qf%5B%5D=&qr%5B%5D=&qs=r&qo=&qjt%5B%5D=&qk%5B%5D=&qwdt=&qsb_section=home&qpremTagLabel=&sid=15810530081808&qwd%5B%5D=&qcf%5B%5D=&qci%5B%5D=&qck%5B%5D=&edu%5B%5D=&qcug%5B%5D=&qcpg%5B%5D=&qctc%5B%5D=&qco%5B%5D=&qcjt%5B%5D=&qcr%5B%5D=&qctags%5B%5D=&qcl%5B%5D=&qrefresh=&xt=adv&qtc%5B%5D=&fpsubmiturl=https%3A%2F%2Fwww.naukri.com%2Fdata-scientist-jobs-in-bangalore&qlcl%5B%5D=&latLong="

    def start_requests(self):
        url = "https://www.naukri.com/data-scientist-jobs"
        yield scrapy.Request(url=url, method='POST', body=self.body_data, callback=self.parse)

    def parse(self, response):
        # Save the raw response so we can analyze it locally without hitting the server again
        with open("naukri_file.html", "wb") as file_data:
            file_data.write(response.body)
Here our spider name is "naukri", which you can check in the name attribute of the class, and "naukri_file.html" is the name of the file where the response will be saved. Run the below command to execute the spider.

scrapy crawl naukri
The response file will be stored in HTML format under the spiders folder. To load the saved file, run the below command in the Scrapy shell. Change the file path according to your machine.
scrapy shell file:///D:/Project/ds_growth_rate/ds_growth_rate/spiders/naukri_file.html
Now we will use XPath to find job posting details like company name, experience, skills, location and many more. Let's extract the company name of the first posting shown in the image below.

Here is the XPath to extract the company name
response.xpath('.//span[@class="org"]/text()').extract_first()
I extracted a few fields using XPath and populated them in the image below. Refer to this link to get familiar with XPath. I used only XPath here; if you want, you can use CSS selectors and regular expressions as well.
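A few more selectors of the same kind, taken from the complete spider further below, can be tried line by line in the same shell session:
response.xpath('.//li[@class="desig"]//text()').extract_first()   # designation
response.xpath('.//span[@class="exp"]//text()').extract_first()   # experience
response.xpath('.//span[@class="loc"]//text()').extract_first()   # location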

Loading scraped data into the MySQL database

I hope you have installed the MySQL server on your machine. You also need to install the MySQL connector with PIP, using the below command.
pip install mysql-connector-python
Walk through MySQL-in-Python concepts if you are not familiar with them.
I created a MySQL database named "ds_growth" and a table named "ds_tables". Create them by using the above-mentioned tutorial, or see the short sketch below.
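If you prefer doing it straight from Python, here is a minimal sketch (assuming a local server and the same column layout used later in this blog) that creates the ds_growth database and the ds_tables table:
import mysql.connector

# Connect without selecting a database first, so that we can create it
mydb = mysql.connector.connect(host="localhost", user="your_username", passwd="your_password")
mycursor = mydb.cursor()

mycursor.execute("CREATE DATABASE ds_growth")
mycursor.execute("USE ds_growth")
mycursor.execute("CREATE TABLE ds_tables (designation VARCHAR(255), company VARCHAR(255), skill VARCHAR(300), salary VARCHAR(255), posted_on VARCHAR(255), location VARCHAR(255), min_experience VARCHAR(255), max_experience VARCHAR(255))")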

Complete code for web scraping and loading the data into the database

I tried to explain it clearly with comments in the code. I hope you will understand.
from .ds_growth_db import insert_data  # Importing the insert_data function to load data into the database
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "dsgrowth"   # Spider name
    minimum_experience = 'NA'  # Default value if there is no value while scraping
    maximum_experience = 'NA'
    next_page = []
    # Body data for the POST method
    body_data = "qp=data+scientist&ql=bangalore&qe=&qm=&qx=&qi%5B%5D=&qf%5B%5D=&qr%5B%5D=&qs=r&qo=&qjt%5B%5D=&qk%5B%5D=&qwdt=&qsb_section=home&qpremTagLabel=&sid=15810530081808&qwd%5B%5D=&qcf%5B%5D=&qci%5B%5D=&qck%5B%5D=&edu%5B%5D=&qcug%5B%5D=&qcpg%5B%5D=&qctc%5B%5D=&qco%5B%5D=&qcjt%5B%5D=&qcr%5B%5D=&qctags%5B%5D=&qcl%5B%5D=&qrefresh=&xt=adv&qtc%5B%5D=&fpsubmiturl=https%3A%2F%2Fwww.naukri.com%2Fdata-scientist-jobs-in-bangalore&qlcl%5B%5D=&latLong="

    # Our program starts in this function
    def start_requests(self):
        url = "https://www.naukri.com/data-scientist-jobs"  # Main URL
        yield scrapy.Request(url=url, method='POST', body=self.body_data, callback=self.parse)

    def parse(self, response):
        # To handle exceptions
        try:
            dict_data = {'key_data': []}  # Used to load data into the database effectively
            for row_data in response.css('div.row'):  # To extract all the jobs from a page
                designation = row_data.xpath('.//li[@class="desig"]//text()').extract_first()
                if designation is not None:  # Checking the value
                    company = row_data.xpath('.//span[@class="org"]/text()').extract_first()
                    salary = row_data.xpath('.//span[@class="salary"]//text()').extract_first()
                    skill = row_data.xpath('.//span[@class="skill"]//text()').extract_first()
                    posted_on = row_data.xpath('.//span[@class="date"]//text()').extract_first()
                    experience = row_data.xpath('.//span[@class="exp"]//text()').extract_first()
                    location = row_data.xpath('.//span[@class="loc"]//text()').extract_first()
                    salary = "NAN" if (salary is None or salary.strip() == "Not disclosed" or salary.strip() == "") else salary.strip()
                    # Splitting the experience value into min and max
                    # strip() is used to remove the unwanted spaces
                    if experience is not None:
                        split_result = experience.strip('yrs').split('-') if "-" in experience else [experience]
                        if len(split_result) > 1:
                            self.minimum_experience = split_result[0].strip()
                            self.maximum_experience = split_result[1].strip()
                    extracted_data = (designation.strip(), company.strip(), skill.strip(), salary, posted_on.strip(), location.strip(), self.minimum_experience, self.maximum_experience)
                    # Collecting the values as a list of tuples
                    dict_data['key_data'].extend([extracted_data])
            # Insert the extracted data into the database
            insert_data(dict_data['key_data'])
            # Find the link to the next page from the pagination block
            self.next_page = response.xpath('//div[@class="pagination"]/a//@href').extract()
            self.next_page = self.next_page[-1]

            # Iterate to the next page to scrape the data
            if self.next_page is not None:
                yield scrapy.Request(self.next_page, method='POST', body=self.body_data, callback=self.parse)

        except Exception as e:
            print(e)

The below file is used to load data into the MySQL database; save it as "ds_growth_db.py" alongside the spider. Use your own database username, password, database name and table name.
import mysql.connector

# MySQL DB connector
mydb = mysql.connector.connect(
        host="localhost",
        user="your_username",
        passwd="your_password",
        database="database_name")
mycursor = mydb.cursor()


# To create the database and table (run once; when creating the database for the first time,
# connect without the database argument in the connector above)
def create_database():
    mycursor.execute("CREATE DATABASE database_name")
    mycursor.execute("CREATE TABLE table_name (designation VARCHAR(255), company VARCHAR(255), skill VARCHAR(300), salary VARCHAR(255), posted_on VARCHAR(255), location VARCHAR(255), min_experience VARCHAR(255), max_experience VARCHAR(255))")
    # To view the created tables
    mycursor.execute("SHOW TABLES")
    for x in mycursor:
      print(x)

def insert_data(extracted_data):
    sql = """INSERT INTO table_name (designation, company, skill, salary, posted_on, location, min_experience, max_experience) VALUES  (%s, %s, %s, %s, %s, %s, %s, %s)"""
    mycursor.executemany(sql, extracted_data)
    mydb.commit()
    print(mycursor.rowcount, "was inserted.")

# To check the loaded data 
def check():
    mycursor.execute("SELECT * FROM table_name")
    myresult = mycursor.fetchall()
    for x in myresult:
        print(x)


# Call these functions only if it is needed
# create_database()
# insert_data()
# check()
Execute the program by using the below command
scrapy crawl dsgrowth
I successfully extracted 7,107 job postings from the Naukri website and also printed the top values from the database.
We have completed the scraping part and loaded the data into the database. Now we'll use PySpark to fetch the data and do some basic visualization.
Note - I had planned to run machine learning algorithms like linear regression and KNN in this project. Unfortunately, all the values in the salary column are NULL, so I'll perform some basic visualization using the skills and location data instead.

PySpark 

Install Java and Anaconda on your machine if you have not done so yet. Follow the below links to install PySpark.
After linking PySpark with the Jupyter notebook, run the below commands in the notebook to test Spark.
import findspark
findspark.init()   # Make the Spark installation visible to this Python session
import pyspark
findspark.find()   # Returns the Spark installation path if everything is linked correctly
You'll get the below result if you installed PySpark correctly; otherwise, follow the previous links to install it properly on your machine.
Now we need to fetch the data from the MySQL database. I spent a lot of time here connecting the DB with PySpark. First, download the MySQL connector from this link. After extracting the archive you will have two files: a bin jar file and a plain jar file. Copy those files into the spark/jars folder. Mine is under the path C:\Users\S\Spark\jars.
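As a side note, if you prefer not to copy files into the Spark installation, Spark can also be pointed at the connector jar when the session is built. This is only a sketch; the jar path is a placeholder for wherever you extracted the connector, and the read code below stays the same either way.
from pyspark.sql import SparkSession

# Placeholder path to the extracted MySQL connector jar; adjust it for your machine
connector_jar = "C:/Users/S/Downloads/mysql-connector-java.jar"

spark = (SparkSession.builder
         .appName("appName")
         .master("local")
         .config("spark.jars", connector_jar)
         .getOrCreate())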
Run the below commands to link PySpark with the MySQL database. Use your own database name, username, password and table name; change them in the below code.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Create the Spark context and session
conf = SparkConf().setAppName('appName').setMaster('local')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# Read the table from MySQL through the JDBC connector
dataframe_mysql = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/database_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="table_name",
    user="your_username",
    password="your_password").load()

dataframe_mysql.show()

Visualization part

I did some basic manipulation of the data extracted from the database, like splitting the skills and location data into the required format. I used the Jupyter notebook for visualization, so check the below link for data manipulation using Pandas, NumPy and Seaborn. I also included basic graphs to understand the growth of the data scientist role; a short sketch of this kind of manipulation follows below.
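To make the manipulation concrete, here is a minimal sketch of my own (not the notebook from the link above): it pulls the Spark dataframe into Pandas, splits the skill strings, and plots the most requested skills with Seaborn. It assumes the column names from the table created earlier and that skills are comma-separated.
import matplotlib.pyplot as plt
import seaborn as sns

# Bring the Spark dataframe into Pandas for lightweight manipulation
jobs_df = dataframe_mysql.toPandas()

# Split the comma-separated skill strings into individual skills and count the top ten
top_skills = (jobs_df['skill']
              .str.lower()
              .str.split(',')
              .explode()
              .str.strip()
              .value_counts()
              .head(10))

# Bar chart of the ten most requested skills
sns.barplot(x=top_skills.values, y=top_skills.index)
plt.xlabel('Number of job postings')
plt.ylabel('Skill')
plt.show()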

Based on the skills 

The bar chart below was calculated from 7K+ data scientist jobs. It is clear that data science, big data technologies and Python are in higher demand than R and Java.

Based on the location

The companies hiring for the data scientist role are mostly in the Bengaluru region, which accounts for 34.2% of all locations.
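A quick sketch of how such a location breakdown can be produced from the same Pandas dataframe used in the sketch above (the 34.2% figure comes from the full dataset, not from this snippet):
# Share of job postings per location, plotted as a pie chart
location_counts = jobs_df['location'].value_counts().head(10)
location_counts.plot.pie(autopct='%1.1f%%', figsize=(6, 6))
plt.ylabel('')
plt.show()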

Complete code

We have come to everyone's favorite part: ready-made code. Use this GitHub link to download the code.

Finally, it is clear that most of the companies are in the Bengaluru region, and they expect skills in data science, Python and the basics of big data technology.

P.S. - In upcoming projects, I'll create an Artificial Intelligence (AI) chatbot using the Django web framework. I hope everyone is doing fine. Share and comment your thoughts here. Almost six million people die from tobacco use and 2.5 million from the harmful use of alcohol each year worldwide, the World Health Organization (WHO) reports. Stop consuming alcohol and avoid smoking. Love yourself. Bye.

