Smart website category classifier


Presented by Madhan Kumar Selvaraj

According to Worldometer statistics, around 5,361,900 new blogs and websites are created every day. A large share of them are adult websites, and half of those adult sites are spam sites that either serve malware when you download content from them or try to steal your credentials.

Real-world issues

  • A website URL shared on a social media account is not classified as spam or legitimate.
  • I once had to identify job-posting websites among billions of websites, but there is no reliable classifier for this, and the classifiers that do exist are proprietary.
  • Sentiment analysis is commonly applied to the comment sections of websites and blogs, but there is no way to classify the URLs that users post in those comments, which may point to malware or adult content.
To build the website classifier, we are going to use the spaCy library (an alternative to the Natural Language Toolkit, NLTK), machine learning algorithms, and FastAPI, a web framework that is faster than Flask.

The workflow of the project

  • Create a list of website URLs for the different categories
  • Scrape the content from those websites and store it in a file
  • Extract the content from the file and preprocess the text with the spaCy library
  • Train a Multinomial Naive Bayes model on that content
  • If the user adds a new website and category, overwrite the file and retrain the model

Technologies used here

  • Python language
  • spaCy pipeline
  • Machine learning algorithm (Multinomial Naive Bayes)
  • Web scraping
  • FastAPI
  • Heroku cloud deployment

Spacy

spaCy acts as a one-stop shop for the tasks that come up in NLP projects, such as tokenization, lemmatization, part-of-speech (POS) tagging, entity recognition, dependency parsing, sentence recognition, word-to-vector transformations, and other text cleaning and normalization methods. I discussed the technical details in the Artificial Intelligent Chatbot project.
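As a quick, hedged illustration (this is a generic sketch of spaCy preprocessing, not the exact pipeline used later in the project), tokenization, stop-word removal, and lemmatization look like this:

import spacy

# Load the small English model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

def clean_text(raw_text):
    # Keep alphabetic tokens, drop stop words, and lemmatize what remains
    doc = nlp(raw_text)
    tokens = [token.lemma_.lower() for token in doc
              if token.is_alpha and not token.is_stop]
    return " ".join(tokens)

print(clean_text("Astronauts are orbiting the Earth aboard the space station."))
# Expected output (roughly): astronaut orbit earth aboard space station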

Fastapi

FastAPI is a modern, high-performance web framework for building APIs with Python 3.6+ based on standard Python type hints. Its key feature is speed: very high performance, on par with NodeJS and Go (thanks to Starlette and Pydantic), making it one of the fastest Python frameworks available.
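As a minimal sketch (the route name and response below are illustrative, not the project's actual API), a FastAPI application looks like this:

from fastapi import FastAPI

app = FastAPI()

@app.get("/classify/{website_url}")
def classify(website_url: str):
    # Illustrative only: the real project would scrape the URL, run the text
    # through the spaCy pipeline, and predict with the trained model here
    return {"url": website_url, "category": "news_website"}

# Run locally with: uvicorn main:app --reload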

Collecting website URL's and their category

We need to train our model to classify the category of a website, so we have to feed training inputs to the algorithm. Here we are going to use a few categories such as news, adult, space, animals, and job-posting websites. Users can also add their own categories, which helps the model work better.
The website_category module stores the input and output values as a dictionary. If the user adds a new category, the dictionary is updated with the new category's values.
import os
import json
import configuration
news_website = ["https://www.ndtv.com/","https://timesofindia.indiatimes.com/",
        "https://www.indiatoday.in/","https://www.kadaza.in/news",
        "https://www.news18.com/","https://news.google.co.in/",
        "https://www.cnet.com/news/","https://www.timesnownews.com/",
        "https://www.hindustantimes.com/","https://www.bbc.com/news/world"]   
job_website = ["https://www.iimjobs.com/", "https://www.shine.com/",
               "https://www.firstnaukri.com/", "https://www.freshersworld.com/",
               "https://www.linkedin.com/", "https://www.freelancemyway.com/",
               "https://www.indeed.co.in/", "https://www.fresherslive.com/",
               "https://www.jobsarkari.com/", "https://angel.co/jobs" ]           
adult_website = [*****]
technology_website = ["https://techcrunch.com/", "https://gizmodo.com/",
                      "https://www.techhive.com/", "https://www.cbinsights.com/",
                      "https://www.cordcuttersnews.com/", "https://www.makeuseof.com/",
                      "https://lifehacker.com/", "https://www.computerworld.com/in/",
                      "https://www.howtogeek.com/", "https://www.pymnts.com/"]
animal_website = ["http://www.worldanimalnet.org/", "http://animaladay.blogspot.com/",
                  "http://www.animalcorner.co.uk/", "http://www.zooborns.com/",
                  "http://www.arkive.org/education", "http://www.vitalground.org/",
                  "http://www.bestanimalsites.com/", "http://www.kidsplanet.org/",
                  "http://www.wildlifearchives.com/", "http://switchzoo.com/"]             
space_website = ["http://amazing-space.stsci.edu/", "http://www.astroengine.com/",
                 "http://chandra.harvard.edu/", "http://hubblesite.org/", "http://www.spaceweather.com/",
                 "http://www.nineplanets.org/", "http://www.worldwidetelescope.org/",
                 "http://www.kidsastronomy.com/", "http://www.fourmilab.ch/solar/solar.html", "http://www.space.com/"]
site_list_dict = {"news_website": news_website, "job_website": job_website, "adult_website": adult_website,
                  "technology_website": technology_website, "space_website": space_website, "animal_website": animal_website}

def website_category(url_dict):
    # Overwrite the category-mapping file with the latest dictionary
    if os.path.exists(configuration.website_category_path):
        os.remove(configuration.website_category_path)
    try:
        with open(configuration.website_category_path, "w") as file:
            json.dump(url_dict, file)
    except Exception as error:
        print("Error:", error)
    return "Successfully created the website URL dictionary file"
Now we are going to train the model using the URLs and their respective categories with the Multinomial Naive Bayes algorithm. For that we need both inputs and outputs: the inputs are the scraped URL contents and the target values are the categories. The target categories are generated automatically from the input URL data (see the sketch after the list below).
Input and target values are stored in files. Here we perform two operations:
  1. Training the model using the already defined data
    • In this case, all the files are deleted and new ones are created. The content of each URL is scraped using web-scraping techniques, filtered through the spaCy pipeline, and the resulting text is stored in the input file. Similarly, an output file is created to store the target data.
  2. Training the model using user input
    • The input and output files already exist from the previous step. If the user adds a new category, the files are updated and a new dictionary key-value pair is created. If the user uses an already available category, no new entry is added to the category dictionary file.
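Here is a simplified, standalone sketch (using a couple of URLs from the lists above) of how the numeric target labels are derived automatically from the category dictionary; the real logic lives in content_extractor below:

# Simplified illustration of automatic target-label generation
site_list_dict = {"news_website": ["https://www.ndtv.com/", "https://www.bbc.com/news/world"],
                  "job_website": ["https://www.indeed.co.in/", "https://www.shine.com/"]}

target_dict = {}
website, target_data = [], []
for index, (category, urls) in enumerate(site_list_dict.items()):
    target_dict[index + 1] = category              # e.g. {1: "news_website", 2: "job_website"}
    website.extend(urls)                           # flat list of input URLs
    target_data.extend([index + 1] * len(urls))    # one numeric label per URL

print(target_dict)   # {1: 'news_website', 2: 'job_website'}
print(target_data)   # [1, 1, 2, 2]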
These are the Python libraries that we need to import for our program:
import os
import json
import pickle
import nltk
import spacy
import requests
import numpy as np
from bs4 import BeautifulSoup
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('words')
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import configuration
import site_data
The content_extractor module loads the content from the website URLs into the input file and the category data into the output file. The spaCy pipeline is used for text processing.
check_content = 1
def content_extractor(site_url, target):   
    website = []
    target_data = []   
    if target is None:
        if os.path.exists(configuration.input_file_path):
            os.remove(configuration.input_file_path)        
        if os.path.exists(configuration.output_file_path):
            os.remove(configuration.output_file_path)        
        target_dict = {}
        site_dict_keys = list(site_data.site_list_dict.keys())
        for index, value in enumerate(site_dict_keys):
            target_dict[index+1] = value
        print(target_dict)
        print(site_data.website_category(target_dict))     
        for values in site_url:
            website.extend(values)     
        for index, data in enumerate (site_url):
            target_data.extend([index+1] * len(data))          
    else:
        file = open(configuration.website_category_path, "r+")
        output = file.read()
        website_category_dict = json.loads(output)
        print(website_category_dict)        
        file.close()
        
        key_value = None
        for key, value in website_category_dict.items(): 
            if target == value: 
                key_value = key     
        if key_value is None:       
            file = open(configuration.website_category_path, "w+")
            print("output:", output)
            dict_length = len(website_category_dict)+1
            website_category_dict.update({dict_length : target})
            print("website_category_dict:", website_category_dict)
            json.dump(website_category_dict, file)
            file.close()
            target = dict_length
        else:
            target = key_value      
        print("Target value:", target)
        website.append(site_url)
        target_data.append(target)  
    website_target = dict(zip(website, target_data))
    print("Dictionary:",website_target)
    
    check_content = url_extractor(website_target)
    if ((check_content == 1 and len(website) == 1 ) or len(website)> 1):
        file = open(configuration.input_file_path, "r")
        input_data = []
        for content in file:
            content = content.replace("\n","")
            input_data.append(content)
        file.close()   
        file = open(configuration.output_file_path, "r")
        output_list = []
        for content in file:
            content = content.replace("\n", "")
            output_list.append(int(content))
        file.close()
        output = np.array(output_list)   # 1-D target vector for scikit-learn
        xtrain, xtest, ytrain, ytest = train_test_split(input_data, output, test_size = 0.15, random_state = 5)
        vectorizer = TfidfVectorizer()
        vocab_fit = vectorizer.fit(xtrain)
        xtrain_df = vectorizer.transform(xtrain)
        xtest_df = vectorizer.transform(xtest)
        with open(configuration.vocabulary_path, "wb") as vocab_file:
            pickle.dump(vocab_fit, vocab_file)
        model = MultinomialNB()
        model.fit(xtrain_df, ytrain)  
        y_predict = model.predict(xtest_df)
        print(y_predict)
        print(metrics.accuracy_score(ytest, y_predict))
        with open(configuration.classifier_model_path, 'wb') as fid:
            pickle.dump(model, fid)
        if target is None:
            return "Successfully reloaded the model"
        else:
            return "New category added in our database. Thanks for your valuable contribution"
    else:
        return "Can't add your data due to less content in the site"
The url_extractor module extracts the content from each website using the BeautifulSoup Python library with the html5lib parser. Another module, website_classifier, trains the data using the Multinomial Naive Bayes algorithm. I am not including those modules here; you can find them in my GitHub account linked later in this blog.
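Since url_extractor is not reproduced here, the following is only a rough sketch of the idea under the assumptions noted in the comments (the threshold, error handling, and file handling are my own, not the original implementation):

import requests
import configuration
from bs4 import BeautifulSoup

def url_extractor(website_target):
    # Sketch only: scrape each URL, flatten its visible text, and append it to the
    # input/output files that content_extractor reads back afterwards
    check_content = 0
    for url, target in website_target.items():
        try:
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, "html5lib")
            text = " ".join(soup.get_text().split())
        except Exception as error:
            print("Skipping", url, ":", error)
            continue
        # The real module also filters the text through the spaCy pipeline
        if len(text.split()) < 100:   # assumed threshold for "too little content"
            continue
        with open(configuration.input_file_path, "a") as input_file:
            input_file.write(text + "\n")
        with open(configuration.output_file_path, "a") as output_file:
            output_file.write(str(target) + "\n")
        check_content = 1
    return check_content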
I deployed the project both locally and in the cloud; for the cloud deployment I used Heroku.

Sample output

The landing page of the website gives information about the API along with the rules and technical details for using it.

Initially, we train with only two categories: one is news websites and the other is job portals. I created a special API link to reload the model from scratch; we can use this option if we face any issue.



To find the category of the Google News website https://news.google.co.in/, we feed the URL to the API, replacing each slash (/) with an asterisk (*). The API also extracts the top 100 most common words from the website.



Now we are going to add a new category, so I train the model on some adult websites and label them under the adult category.


After reaching some accuracy for the new category, we check it by sending requests with adult-website URLs. The accuracy of the model increases as more website URLs and categories are added.


Coding part

The complete code for this project is available at my GitHub link.

Heroku cloud deployment link

You can run the project by simply clicking the link below.
Finally, we have created our API, and you can access it from another program; that is the specialty of an API. The accuracy of the model improves as new categories and website URLs are fed in, and you can adapt this project to your own needs.
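As a hedged example (the host name and route below are placeholders, not the actual deployment details), another program could call the API with the requests library, remembering to swap slashes for asterisks:

import requests

BASE_URL = "https://<your-heroku-app>.herokuapp.com"   # placeholder host

def classify_website(url):
    # Encode the URL the way the API expects: replace every slash with an asterisk
    encoded = url.replace("/", "*")
    response = requests.get(f"{BASE_URL}/classify/{encoded}")   # route name is illustrative
    return response.json()

print(classify_website("https://news.google.co.in/"))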

P.S. - Don't waste your valuable time during this Corona pandemic. Utilize this great time as much as you can. Do the things you love and don't overthink your past and future. Your day will come soon. Happy coding!
