Review-driven recommendation
Presented by Madhan Kumar Selvaraj
After a very long time, I am blogging about a new project named Review-driven recommendation. In this Artificial Intelligence (AI) era, data is called the new gold. It is now very common to check other users' reviews before buying a product or choosing a restaurant. Once we are satisfied with the reviews, we choose that particular product.
Real-world issue
- Let's assume we are planning to buy a mobile phone from the ABC brand. Reviews of the product are spread across many platforms, such as Google reviews, MouthShut, Yelp and many more.
- To get the full picture, we need to read the user reviews on all of these platforms, which takes a lot of time and effort.
To fix this issue, we extract user reviews from all the platforms and use some NLP models to produce a list of the top products, along with some visualizations.
Project roadmap
The workflow of the project:
- Scrape user reviews from multiple platforms
- Combine the extracted data and store it in JSON format
- Preprocess the extracted data and apply the spaCy model
- Visualize the results to find the best products
Technologies used here
- Python
- Web scraping
- pandas
- NLTK
- spaCy
- Plotly
Scraping the data
- To scrape the reviews from Google reviews, we need some prerequisites for web scraping, which you can find in my previous blogs.
- The scraping has two parts:
  - Getting the top suggestions from Google Maps
  - Using the suggestion data to scrape the reviews
- The code below gets the top recommendations from Google Maps:
# Imports needed by this method (html is assumed to be lxml.html)
import datetime
import json
import os
import re

import requests
from lxml import html


def start_requests(self):
    # Search Google Maps for "<search_for> at <place>"
    self.params["q"] = f"{self.search_for} at {self.place}"
    response = requests.get('https://www.google.com/maps', params=self.params, headers=self.headers)
    if response.ok and response.content:
        tree = html.fromstring(response.content)
        try:
            # The search results are embedded in the first script after the favicon link
            rows_block = tree.xpath('//link[@rel="shortcut icon"]/following-sibling::script[1]')[0].text
            results_list = re.findall(r'''"SearchResult.TYPE_.*?],\\+"(.*?)\\+"''', rows_block)
            print(results_list)
            if results_list:
                self.review_requests([results_list[1]])
                if len(self.all_reviews) > 0:
                    filename = f"{self.output_filename}_{str(datetime.date.today())}.json"
                    if not os.path.exists(self.output_folder_name):
                        os.mkdir(self.output_folder_name)
                    # Replace any file already written today
                    if os.path.exists(f"{self.output_folder_name}/{filename}"):
                        os.remove(f"{self.output_folder_name}/{filename}")
                    # Write one JSON object per line
                    with open(f"{self.output_folder_name}/{filename}", "a+") as new_file:
                        for each_row in self.all_reviews:
                            new_file.write(json.dumps(each_row))
                            new_file.write("\n")
                    print(f"Total number of reviews fetched across all the links is {len(self.all_reviews)}")
        except Exception as error:
            print(error)
- Similar to the Google reviews, the MouthShut reviews are also scraped and stored in JSON format. That code is available on GitHub.
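The Google scraper above writes one JSON object per line. Assuming the MouthShut output follows the same layout, the two sources can be merged by reading them line by line. The sketch below is only an illustration: `parse_json_lines`, the extra `platform` field and the sample records are hypothetical and not part of the project's code.

```python
import json


def parse_json_lines(lines, platform):
    """Parse an iterable of JSON Lines strings and tag each record with its source."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record["platform"] = platform  # assumed extra field, not in the scraped data
        records.append(record)
    return records


# Made-up sample lines standing in for the two scraped files.
google_lines = ['{"name": "ABC", "rating": "Rated 4.0 out of 5,", "review_text": "Good"}\n']
mouthshut_lines = ['{"name": "ABC", "rating": "Rated 2.0 out of 5,", "review_text": "Slow"}\n', "\n"]

all_reviews = (parse_json_lines(google_lines, "google")
               + parse_json_lines(mouthshut_lines, "mouthshut"))
```

With real files, an open file handle can be passed in place of the sample lists, and the combined list can then be handed to `pd.DataFrame(all_reviews)` for preprocessing.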
Text preprocessing
- Import all the libraries required to preprocess the text
import os
from datetime import datetime

import contractions
import nltk
import pandas as pd
import plotly.graph_objects as go
import spacy
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer
from plotly.subplots import make_subplots

nltk.download("words")
- We are not using any built-in data to train our model. Instead, we use a set of positive and negative words to process the extracted reviews.
- positive_words={'absolutely','accepted','acclaimed','accomplished','accomplishment','admirable','amazing','authentic','abundant','abundance','achiever','affluent','astonishing','awesome','adorable','beaming','beauty','beautiful','believe','beloved','beneficial','bestfriend','blessed','bliss','brave','breathtaking','bright','brilliant','brilliance','bravo','calm','calming','capable','captivating','caring','celebrate','certain','champ','champion','charitable','charm','charming','comfortable','committed','compassion','compassionate','confident','congratulations','correct','courage','courageous','courteous','creative','creativity','credible','credibility','daring','darling','dazzling','dear','dearest','dedicated','delicious','delight','delightful','....'}
- negative_words={'abrasive','apathetic','controlling','dishonest','impatient','anxious','betrayed','disappointed','embarrassed','jealous','abysmal','bad','callous','corrosive','damage','despicable','donot','enraged','fail','gawky','haggard','hurt','icky','insane','jealous','lose','malicious','naive','not','objectionable','pain','questionable','reject','rude','sad','sinister','stuck','tense','ugly','unsightly','vice','wary','yell','zero','adverse','banal','cannot','corrupt','damaging','detrimental','dreadful','eroding','faulty','ghastly','hard','hurtful','ignorant','insidious','junky','lousy','mean','nasty','noxious','odious','perturb','....'}
- The preprocessing steps include:
  - Dropping unwanted columns
  - Labelling ratings greater than 3 as positive and the rest as negative
  - Expanding contractions such as don't into do not by using the contractions library
# Drop the columns that are not needed for the analysis
df_dropped = dataframe.drop(
    ["user", "review_date", "images", "extraction_date", "review_availability", "address", "url", 'name'], axis=1)
# Keep only the rows that contain review text (copy to avoid chained-assignment warnings)
df_filter = df_dropped[df_dropped.review_text != ""].copy()
# Strip the surrounding text from the rating strings and convert them to integers
df_filter["rating"] = df_filter["rating"].str.replace(" out of 5,", "").str.replace("Rated ", "")
df_filter.rating = df_filter.rating.astype(float).astype(int)
df_filter["rating"].value_counts()
# Ratings above 3 are labelled Positive, the rest Negative
df_filter["bi_ratings"] = df_filter["rating"].apply(lambda x: "Positive" if x > 3.0 else "Negative")
df_filter.bi_ratings.value_counts()
# Reorder the columns
df_filter = df_filter[df_filter.columns[[3, 1, 2, 0, 4]]]
# Expand contractions in the review text
df_filter["contractions_review"] = df_filter["review_text"].apply(lambda x: contractions.fix(x))
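The grouping step that follows expects each review to carry its matched `positive_words` and `negative_words`, and that matching is not shown here. Below is a minimal sketch of one way it could be done, assuming a plain regular-expression tokenizer in place of NLTK or spaCy and two tiny stand-in lexicons. `extract_sentiment_words` is a hypothetical helper name, not the project's code.

```python
import re

# Tiny stand-in lexicons; the project uses much larger sets.
POSITIVE_WORDS = {"amazing", "awesome", "caring", "comfortable", "courteous"}
NEGATIVE_WORDS = {"bad", "rude", "not", "donot", "cannot", "lousy"}


def extract_sentiment_words(review_text, lexicon):
    """Return the lexicon words found in review_text, in order of appearance."""
    # Lowercase the text and keep alphabetic runs only (a simplifying assumption).
    tokens = re.findall(r"[a-z]+", review_text.lower())
    return [token for token in tokens if token in lexicon]


# Made-up review text for illustration.
review = "Awesome doctors and caring staff, but the reception was rude."
positive = extract_sentiment_words(review, POSITIVE_WORDS)
negative = extract_sentiment_words(review, NEGATIVE_WORDS)
print(positive)
print(negative)
```

Applied to a DataFrame column with `apply`, a helper like this would give one list of words per review, which matches the list-valued columns that the `.str.len() != 0` checks in the grouping code assume.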
- Using our own model, we group the positive and negative words from the reviews and look for insights in them.
# hospital_name_groupby is assumed to be the reviews DataFrame grouped by hospital name.
# subplot_fig is assumed to be created earlier, e.g. with
# make_subplots(rows=1, cols=2, specs=[[{"type": "domain"}, {"type": "domain"}]]),
# because pie traces need "domain" subplots.
for hospital_name, group_df in hospital_name_groupby:
    try:
        # Join the positive words of every review, grouped by rating
        temp = group_df[group_df['positive_words'].str.len() != 0].copy()
        temp['positive_words'] = temp['positive_words'].apply(lambda x: ",".join(list(filter(None, x))))
        words_df = temp.groupby('rating')['positive_words'].apply(','.join).reset_index()
        # Do the same for the negative words
        temp = group_df[group_df['negative_words'].str.len() != 0].copy()
        temp['negative_words'] = temp['negative_words'].apply(lambda x: ",".join(list(filter(None, x))))
        words_df["negative_words"] = temp.groupby('rating')['negative_words'].apply(','.join).reset_index()[
            "negative_words"]
        words_df1 = words_df.assign(negative_words=words_df.negative_words.str.split(","))
        words_df1 = words_df1.assign(positive_words=words_df1.positive_words.str.split(","))
        words_df1.set_index("rating", inplace=True)
        # Count the 10 most frequent words for each rating
        positive_df = pd.DataFrame(words_df1["positive_words"].dropna()).loc[:, "positive_words"].apply(
            lambda x: pd.Index(x).value_counts().nlargest(10))
        negative_df = pd.DataFrame(words_df1["negative_words"].dropna()).loc[:, "negative_words"].apply(
            lambda x: pd.Index(x).value_counts().nlargest(10))
        positive_df.fillna(0, inplace=True)
        negative_df.fillna(0, inplace=True)
        positive_df = positive_df.T
        negative_df = negative_df.T
        # Positive words: sum the counts over ratings 4 and 5, keep words seen more than 5 times
        positive_df = positive_df.loc[:, positive_df.columns.isin([4, 5])]
        positive_df = pd.DataFrame(positive_df.sum(axis=1))
        positive_df.reset_index(inplace=True)
        positive_df.rename(columns={0: "value", "index": "reviews"}, inplace=True)
        positive_df = positive_df[positive_df["value"] > 5].sort_values(by='value', ascending=False)
        positive_df = positive_df[positive_df["reviews"] != ""]
        positive_df.reset_index(drop=True, inplace=True)
        # Negative words: sum the counts over ratings 1 to 3, keep words seen more than 5 times
        negative_df = negative_df.loc[:, negative_df.columns.isin([1, 2, 3])]
        negative_df = pd.DataFrame(negative_df.sum(axis=1))
        negative_df.reset_index(inplace=True)
        negative_df.rename(columns={0: "value", "index": "reviews"}, inplace=True)
        negative_df = negative_df[negative_df["value"] > 5].sort_values(by='value', ascending=False)
        negative_df = negative_df[negative_df["reviews"] != ""]
        negative_df.reset_index(drop=True, inplace=True)
        if not positive_df.empty and positive_df.shape[0] > 3 and not negative_df.empty and negative_df.shape[0] > 0:
            # Both sides have data: draw the two pie charts side by side
            print("\n", hospital_name, "\n")
            subplot_fig.add_trace(
                go.Pie(labels=positive_df.reviews, values=positive_df.value, textinfo='label+value+percent',
                       insidetextorientation='radial'), row=1, col=1)
            subplot_fig.add_trace(
                go.Pie(labels=negative_df.reviews, values=negative_df.value, textinfo='label+value+percent',
                       insidetextorientation='radial'), row=1, col=2)
            # subplot_fig.show()
            subplot_fig.write_image(f"{hospital_name}.png")
        elif not positive_df.empty and positive_df.shape[0] > 3:
            fig = go.Figure(
                data=[go.Pie(labels=positive_df.reviews, values=positive_df.value, textinfo='label+value+percent',
                             insidetextorientation='radial', title=f"{hospital_name} -> Positive reviews")])
            # fig.show()
            fig.write_image(f"{hospital_name}.png")
        elif not negative_df.empty and negative_df.shape[0] > 3:
            fig = go.Figure(
                data=[go.Pie(labels=negative_df.reviews, values=negative_df.value, textinfo='label+value+percent',
                             insidetextorientation='radial', title=f"{hospital_name} -> Negative reviews")])
            # fig.show()
            fig.write_image(f"{hospital_name}.png")
    except Exception as error:
        # Report the failure and move on to the next hospital instead of hiding it
        print(f"Skipping {hospital_name}: {error}")
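The counting logic in the loop above is easier to follow on a small example. The sketch below is a simplified, pandas-free illustration rather than the project's code: `top_words` and `build_pie_figure` are hypothetical helper names, the counts are taken across all reviews instead of per rating, and the defaults mirror the `nlargest(10)` and `value > 5` filters used above. The figure helper assumes Plotly is installed and uses `domain`-type subplot cells, which pie traces require.

```python
from collections import Counter


def top_words(word_lists, top_n=10, min_count=5):
    """Count words across reviews and keep the frequent ones.

    word_lists: iterable of lists of words, one list per review.
    Returns (word, count) pairs, most frequent first, limited to the
    top_n words whose count is greater than min_count.
    """
    counts = Counter(word for words in word_lists for word in words if word)
    return [(word, n) for word, n in counts.most_common(top_n) if n > min_count]


def build_pie_figure(positive, negative, title):
    """Draw positive and negative word counts as side-by-side pie charts."""
    # Imported here so the counting helper works without Plotly installed.
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    # Pie traces need "domain" cells rather than the default "xy" cells.
    fig = make_subplots(rows=1, cols=2,
                        specs=[[{"type": "domain"}, {"type": "domain"}]],
                        subplot_titles=["Positive words", "Negative words"])
    for col, pairs in enumerate((positive, negative), start=1):
        labels = [word for word, _ in pairs]
        values = [count for _, count in pairs]
        fig.add_trace(go.Pie(labels=labels, values=values,
                             textinfo="label+value+percent"), row=1, col=col)
    fig.update_layout(title_text=title)
    return fig


# Made-up word lists, one per review.
sample = [["clean", "caring"]] * 6 + [["clean"]] * 2 + [["rude"]] * 3
frequent = top_words(sample)
```

Building a fresh figure per hospital, as `build_pie_figure` does, keeps traces from one hospital out of the next hospital's chart.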
Sample visualization report
Ready-made code
The code for this project is available at the link below.
Finally, we have created our own Review-driven recommendation application. It still needs a lot of improvement. Use this project as a reference and adapt it to your taste. Happy coding!