Review-driven recommendation

 Presented by Madhan Kumar Selvaraj


 After a very long time, I am blogging again, this time about a new project named Review-driven recommendation. In this Artificial Intelligence (AI) era, data is called the new gold. Before buying a product or choosing a restaurant to eat at, it is now common to check other users' reviews first; only once we are satisfied with those reviews do we pick that particular product.

Real-world issue

  • Let's assume we are planning to buy a mobile phone of brand ABC. Reviews for the product are spread across many platforms, such as Google reviews, MouthShut, Yelp, and more.
  • To get the full picture, we would need to read the reviews from all of these platforms, which takes considerable time and effort.

 To fix this issue, we extract user reviews from all of these platforms and apply NLP models to produce a list of the top products, along with some visualizations.

Project roadmap


The workflow of the project

  1. Scrape user reviews from multiple platforms
  2. Combine the extracted data and store it in JSON format
  3. Preprocess the extracted data and apply a spaCy model
  4. Visualize the resulting data to surface the best products

Technologies used here

  • Python
  • Web scraping
  • pandas
  • NLTK
  • spaCy
  • Plotly


Scraping the data

  • To scrape reviews from Google reviews, we need some prerequisites for web scraping, which you can refer to in my previous blogs
  • Scraping has two parts:
    1. Getting the top suggestions from Google Maps
    2. Scraping the review data for each suggestion
  • The code below fetches the top recommendations from Google Maps
def start_requests(self):
    self.params["q"] = f"{self.search_for} at {self.place}"
    response = requests.get('https://www.google.com/maps', params=self.params, headers=self.headers)
    if response.ok and response.content:
        tree = html.fromstring(response.content)
        try:
            # The search results are embedded in the script tag that follows the favicon link
            rows_block = tree.xpath('//link[@rel="shortcut icon"]/following-sibling::script[1]')[0].text
            results_list = re.findall(r'''"SearchResult.TYPE_.*?],\\+"(.*?)\\+"''', rows_block)
            print(results_list)
            if results_list:
                self.review_requests([results_list[1]])
            if len(self.all_reviews) > 0:
                filename = f"{self.output_filename}_{str(datetime.date.today())}.json"
                if not os.path.exists(self.output_folder_name):
                    os.mkdir(self.output_folder_name)
                if os.path.exists(f"{self.output_folder_name}/{filename}"):
                    os.remove(f"{self.output_folder_name}/{filename}")
                # Write one JSON object per line (JSON Lines format)
                with open(f"{self.output_folder_name}/{filename}", "a+") as new_file:
                    for each_row in self.all_reviews:
                        new_file.write(json.dumps(each_row))
                        new_file.write("\n")
                print(f"Total number of reviews fetched across all the links is {len(self.all_reviews)}")
        except Exception as error:
            print(error)
  • Similar to Google reviews, MouthShut reviews are also scraped and stored in JSON format; that code is available on GitHub.
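Since the scraper writes one JSON object per line, the combined file can be loaded back for analysis. Here is a minimal sketch; the function name and file path are my own, and the field names are assumptions based on the columns used later in this post:

```python
import json

def load_reviews(path):
    """Read a JSON Lines file into a list of review dicts."""
    reviews = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                reviews.append(json.loads(line))
    return reviews

# Hypothetical usage: each line is one review record
# reviews = load_reviews("output/google_reviews_2023-01-01.json")
```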

 Text preprocessing

  • Import all the necessary libraries required to preprocess the text
import nltk
import spacy
import os
import pandas as pd
import contractions

nltk.download("words")
from datetime import datetime
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
  • We are not using any built-in dataset to train a model. Instead, we use sets of positive and negative words to score the extracted reviews
  • positive_words={'absolutely','accepted','acclaimed','accomplished','accomplishment','admirable','amazing','authentic','abundant','abundance','achiever','affluent','astonishing','awesome','adorable','beaming','beauty','beautiful','believe','beloved','beneficial','bestfriend','blessed','bliss','brave','breathtaking','bright','brilliant','brilliance','bravo','calm','calming','capable','captivating','caring','celebrate','certain','champ','champion','charitable','charm','charming','comfortable','committed','compassion','compassionate','confident','congratulations','correct','courage','courageous','courteous','creative','creativity','credible','credibility','daring','darling','dazzling','dear','dearest','dedicated','delicious','delight','delightful','....'}
  • negative_words={'abrasive','apathetic','controlling','dishonest','impatient','anxious','betrayed','disappointed','embarrassed','jealous','abysmal','bad','callous','corrosive','damage','despicable','donot','enraged','fail','gawky','haggard','hurt','icky','insane','jealous','lose','malicious','naive','not','objectionable','pain','questionable','reject','rude','sad','sinister','stuck','tense','ugly','unsightly','vice','wary','yell','zero','adverse','banal','cannot','corrupt','damaging','detrimental','dreadful','eroding','faulty','ghastly','hard','hurtful','ignorant','insidious','junky','lousy','mean','nasty','noxious','odious','perturb','....'}
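As a minimal sketch of how these word sets can score a review (tokenization here is a simple regex for illustration; the real pipeline uses NLTK's TreebankWordTokenizer, and only small samples of the word sets are shown):

```python
import re

# Small samples of the full word sets above
positive_words = {'amazing', 'awesome', 'brilliant', 'delicious', 'delightful'}
negative_words = {'bad', 'rude', 'sad', 'lousy', 'nasty'}

def score_review(text):
    """Return the positive and negative words found in a review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = [t for t in tokens if t in positive_words]
    neg = [t for t in tokens if t in negative_words]
    return pos, neg

# Example: score_review("The food was amazing but the staff was rude")
# returns (['amazing'], ['rude'])
```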
  • The preprocessing steps include
  1. Dropping unwanted columns
  2. Labelling ratings greater than 3 as positive and the rest as negative
  3. Expanding contractions, e.g. don't to do not, using the contractions library
df_dropped = dataframe.drop(
    ["user", "review_date", "images", "extraction_date", "review_availability", "address", "url", 'name'], axis=1)
df_filter = df_dropped.copy()
df_filter = df_filter[df_filter.review_text != ""]

# Convert "Rated 4 out of 5," style strings into integer ratings
df_filter["rating"] = df_dropped["rating"].str.replace(" out of 5,", "").str.replace("Rated ", "")
df_filter.rating = df_filter.rating.astype(float).astype(int)
df_filter["rating"].value_counts()
# Binarise the ratings: above 3 is positive, otherwise negative
df_filter["bi_ratings"] = df_filter["rating"].apply(lambda x: "Positive" if x > 3.0 else "Negative")
df_filter.bi_ratings.value_counts()
df_filter = df_filter[df_filter.columns[[3, 1, 2, 0, 4]]]
# Expand contractions such as "don't" -> "do not"
df_filter["contractions_review"] = df_filter["review_text"].apply(lambda x: contractions.fix(x))
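The contractions library handles step 3 above. The idea can be illustrated with a small lookup table; this is a simplified stand-in for illustration, not the library's actual implementation:

```python
import re

# A tiny sample mapping; the real contractions library covers many more forms
CONTRACTIONS = {"don't": "do not", "it'll": "it will", "we'll": "we will", "can't": "cannot"}

def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), flags=re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

# Example: expand_contractions("we'll see, don't worry")
# returns "we will see, do not worry"
```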
  • Using our own approach, we group the positive and negative words extracted from the reviews and draw insights from them.
# hospital_name_groupby is the review dataframe grouped by place name,
# e.g. hospital_name_groupby = df_words.groupby("name")
for hospital_name, group_df in hospital_name_groupby:
    try:
        temp = group_df[group_df['positive_words'].str.len() != 0].copy()
        temp['positive_words'] = temp['positive_words'].apply(lambda x: ",".join(list(filter(None, x))))
        # Collect all positive and negative words per rating level
        words_df = temp.groupby(level='rating')['positive_words'].apply(','.join).reset_index()
        temp = group_df[group_df['negative_words'].str.len() != 0].copy()
        temp['negative_words'] = temp['negative_words'].apply(lambda x: ",".join(list(filter(None, x))))
        words_df["negative_words"] = temp.groupby('rating')['negative_words'].apply(','.join).reset_index()[
            "negative_words"]
        words_df1 = words_df.assign(negative_words=words_df.negative_words.str.split(","))
        words_df1 = words_df1.assign(positive_words=words_df1.positive_words.str.split(","))
        words_df1.set_index("rating", inplace=True)
        # Keep the ten most frequent words for each rating level
        positive_df = pd.DataFrame(words_df1["positive_words"].dropna()).loc[:, "positive_words"].apply(
            lambda x: pd.Index(x).value_counts().nlargest(10))
        negative_df = pd.DataFrame(words_df1["negative_words"].dropna()).loc[:, "negative_words"].apply(
            lambda x: pd.Index(x).value_counts().nlargest(10))
        positive_df.fillna(0, inplace=True)
        negative_df.fillna(0, inplace=True)
        positive_df = positive_df.T
        negative_df = negative_df.T
        # Positive words are taken from 4- and 5-star reviews only
        positive_df = positive_df.loc[:, positive_df.columns.isin([4, 5])]
        positive_df = pd.DataFrame(positive_df.sum(axis=1))
        positive_df.reset_index(inplace=True)
        positive_df.rename(columns={0: "value", "index": "reviews"}, inplace=True)
        positive_df = positive_df[positive_df["value"] > 5].sort_values(by='value', ascending=False)
        positive_df = positive_df[positive_df["reviews"] != ""]
        positive_df.reset_index(drop=True, inplace=True)
        # Negative words are taken from 1- to 3-star reviews only
        negative_df = negative_df.loc[:, negative_df.columns.isin([1, 2, 3])]
        negative_df = pd.DataFrame(negative_df.sum(axis=1))
        negative_df.reset_index(inplace=True)
        negative_df.rename(columns={0: "value", "index": "reviews"}, inplace=True)
        negative_df = negative_df[negative_df["value"] > 5].sort_values(by='value', ascending=False)
        negative_df = negative_df[negative_df["reviews"] != ""]
        negative_df.reset_index(drop=True, inplace=True)
        if not positive_df.empty and positive_df.shape[0] > 3 and not negative_df.empty and negative_df.shape[0] > 0:
            print("\n", hospital_name, "\n")
            # Side-by-side pie charts: positive words on the left, negative on the right
            subplot_fig = make_subplots(rows=1, cols=2, specs=[[{"type": "domain"}, {"type": "domain"}]])
            subplot_fig.add_trace(
                go.Pie(labels=positive_df.reviews, values=positive_df.value, textinfo='label+value+percent',
                       insidetextorientation='radial'), row=1, col=1)
            subplot_fig.add_trace(
                go.Pie(labels=negative_df.reviews, values=negative_df.value, textinfo='label+value+percent',
                       insidetextorientation='radial'), row=1, col=2)
            # subplot_fig.show()
            subplot_fig.write_image(f"{hospital_name}.png")
        elif not positive_df.empty and positive_df.shape[0] > 3:
            fig = go.Figure(
                data=[go.Pie(labels=positive_df.reviews, values=positive_df.value, textinfo='label+value+percent',
                             insidetextorientation='radial', title=f"{hospital_name} -> Positive reviews")])
            # fig.show()
            fig.write_image(f"{hospital_name}.png")
        elif not negative_df.empty and negative_df.shape[0] > 3:
            fig = go.Figure(
                data=[go.Pie(labels=negative_df.reviews, values=negative_df.value, textinfo='label+value+percent',
                             insidetextorientation='radial', title=f"{hospital_name} -> Negative reviews")])
            # fig.show()
            fig.write_image(f"{hospital_name}.png")
    except Exception as error:
        print(error)
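The core counting step in the loop above can be seen in isolation: pooling the words from high-rated reviews and keeping the most frequent ones. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Words collected from all 4- and 5-star reviews of one place (hypothetical)
words = ["amazing", "awesome", "amazing", "delicious", "amazing", "awesome"]

# Count each word's occurrences and keep the two most frequent
top_words = pd.Index(words).value_counts().nlargest(2)
# top_words -> amazing: 3, awesome: 2
```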

Sample visualization report




Ready-made code

The code for this project is available at the link below.
 Finally, we have created our own Review-driven recommendation application. There is still plenty of room for improvement, so use this project as a reference and adapt it to your taste. Happy coding!
