Creating Environment for the project

  • Creating conda environment for the project.

    conda create -n book python=3.7 -y

  • Activating the conda environment.

    conda activate book

  • Installing required libraries

    pip install pandas

    pip install numpy

    pip install scipy

    pip install scikit-learn

  • Dataset: https://www.kaggle.com/ra4u12/bookrecommendation

Preprocessing the book dataset

import pandas as pd
import numpy as np
import pickle
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
books = pd.read_csv('data/books.csv', sep=";", on_bad_lines='skip', encoding='latin-1')
books.sample(2)
ISBN Book-Title Book-Author Year-Of-Publication Publisher Image-URL-S Image-URL-M Image-URL-L
141793 186508400X Follow the Blue Brigid Lowry 2001 Allen & Unwin (Australia) Pty Ltd http://images.amazon.com/images/P/186508400X.0... http://images.amazon.com/images/P/186508400X.0... http://images.amazon.com/images/P/186508400X.0...
238570 078712043X Cat Who Tailed a Thief (Cat Who... (Audio)) Lillian Jackson Braun 1999 Audio Literature http://images.amazon.com/images/P/078712043X.0... http://images.amazon.com/images/P/078712043X.0... http://images.amazon.com/images/P/078712043X.0...
books.iloc[237]['Image-URL-L']
'http://images.amazon.com/images/P/0671027387.01.LZZZZZZZ.jpg'
books.shape
(271360, 8)
books.columns
Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')
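The effect of skipping malformed rows can be seen on a tiny in-memory file. This is only a sketch: the three-column layout and the rows are made up for illustration, `on_bad_lines="skip"` assumes pandas 1.3 or newer, and `dtype=str` is an extra assumption so that ISBN-like values keep their leading zeros.

```python
import io

import pandas as pd

# Hypothetical miniature of a semicolon-separated file. The second data row
# has one field too many, so it cannot be parsed into three columns.
raw = (
    "ISBN;Book-Title;Year-Of-Publication\n"
    "0001;First Book;1999\n"
    "0002;Broken;Row;2001\n"
    "0003;Third Book;2003\n"
)

# Rows with too many fields are dropped instead of raising a parser error.
toy = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip", dtype=str)
print(toy)
```

Skipped rows are silently lost, so it can be worth comparing the row count against the line count of the raw file.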
  • Keep only the columns we need: the book metadata plus Image-URL-L, the large cover image used as the poster.
books = books[['ISBN','Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher','Image-URL-L']]
books.sample(5)
ISBN Book-Title Book-Author Year-Of-Publication Publisher Image-URL-L
10488 0515085995 The Mick Mickey Mantle 1986 Jove Books http://images.amazon.com/images/P/0515085995.0...
149036 3453189876 So wahr mir Gott helfe. John T. Lescroart 2001 Heyne http://images.amazon.com/images/P/3453189876.0...
66324 0441005829 The Lady in the Loch Elizabeth Ann Scarborough 1998 Ace Books http://images.amazon.com/images/P/0441005829.0...
225330 0395171709 Field Guide to Shells in Atlantic Gulf Coast Percy A. Morris 1973 Houghton Mifflin Company http://images.amazon.com/images/P/0395171709.0...
13227 1551669234 Stonebrook Cottage Carla Neggers 2002 Mira http://images.amazon.com/images/P/1551669234.0...
books.rename(columns={"Book-Title":'title',
                      'Book-Author':'author',
                     "Year-Of-Publication":'year',
                     "Publisher":"publisher",
                     "Image-URL-L":"image_url"},inplace=True)
books.sample(5)
ISBN title author year publisher image_url
99794 0590404261 Teacher's Pet (Couples, No 21) M.E. Cooper 1987 Scholastic Paperbacks (Mm) http://images.amazon.com/images/P/0590404261.0...
246698 0590250825 Revenge (Nightmare Hall, No 26) Diane Hoh 1995 Scholastic Paperbacks http://images.amazon.com/images/P/0590250825.0...
82550 840309289X Cuando Comer es un Infierno: Confesiones de un... Espido Freire 2002 Aguilar http://images.amazon.com/images/P/840309289X.0...
247672 0373115008 Tiger Moon (Harlequin Presents, 1500) Kristy McCallum 1992 Harlequin http://images.amazon.com/images/P/0373115008.0...
182015 0671509500 Chubby Tugboat N. Bradley Omar 1984 Simon & Schuster Merchandise & http://images.amazon.com/images/P/0671509500.0...

Preprocessing the user CSV file

# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
users = pd.read_csv('./data/user.csv', sep=";", on_bad_lines='skip', encoding='latin-1')
users.sample(5)
        User-ID  Location                           Age
48171     48172  rockville, maryland, usa           NaN
84979     84980  somerville, new jersey, usa        NaN
214199   214200  magnolia, texas, usa               NaN
76944     76945  proserpine, queensland, australia  64.0
125599   125600  ferrara, emilia romagna, italy     NaN
users.shape
(278858, 3)
users.rename(columns={"User-ID":'user_id',
                      'Location':'location',
                     "Age":'age'},inplace=True)
users.sample(5)
        user_id  location                            age
236446   236447  victoria, british columbia, canada  NaN
59431     59432  london, england, united kingdom     26.0
73686     73687  san diego, california, vietnam      NaN
188154   188155  la mesa, california, usa            19.0
256098   256099  laguna beach, california, usa       NaN

Preprocessing the bookrating CSV file

# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
ratings = pd.read_csv('./data/bookrating.csv', sep=";", on_bad_lines='skip', encoding='latin-1')
ratings.sample(5)
         User-ID  ISBN        Book-Rating
437318    104636  0515130044            0
635481    153662  0517122057            0
1114788   267354  0671795538            8
43959      11601  0425075826            0
747930    180945  0786867876            0
ratings.shape
(1149780, 3)
ratings.rename(columns={"User-ID":'user_id',
                      'Book-Rating':'rating'},inplace=True)
ratings.sample(5)
         user_id  ISBN        rating
134434     30744  0373835116       0
877603    212835  0451201515       0
1085640   259711  0698119320       0
207867     48025  0743444507       9
346741     82893  006096491X       0
  • We now have three DataFrames:
    • books
    • users
    • ratings
print(books.shape, users.shape, ratings.shape, sep='\n')
(271360, 6)
(278858, 3)
(1149780, 3)
ratings['user_id'].value_counts()
11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
116180        1
116166        1
116154        1
116137        1
276723        1
Name: user_id, Length: 105283, dtype: int64
ratings['user_id'].value_counts().shape
(105283,)
ratings['user_id'].unique().shape
(105283,)
x = ratings['user_id'].value_counts() > 200
x[x].shape
(899,)
y= x[x].index
y
Int64Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352,
            110973, 235105,
            ...
            260183,  73681,  44296, 155916,   9856, 274808,  28634,  59727,
            268622, 188951],
           dtype='int64', length=899)
ratings = ratings[ratings['user_id'].isin(y)]
ratings.head()
      user_id  ISBN        rating
1456   277427  002542730X      10
1457   277427  0026217457       0
1458   277427  003008685X       8
1459   277427  0030615321       0
1460   277427  0060002050       0
ratings.shape
(526356, 3)
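The user filter above (count ratings per user, keep the IDs above a threshold, then select their rows) can be sketched on a toy frame. The data and the threshold of 2 are made up so the result is easy to check by hand; the walkthrough itself keeps users with more than 200 ratings.

```python
import pandas as pd

# Toy ratings: user 1 has three ratings, user 3 has two, user 2 has one.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "ISBN": ["a", "b", "c", "a", "b", "c"],
    "rating": [5, 0, 8, 7, 0, 9],
})

min_ratings = 2  # hypothetical threshold for this toy example

# value_counts() gives the number of ratings per user, indexed by user_id.
counts = toy_ratings["user_id"].value_counts()

# Boolean indexing keeps only the users above the threshold.
active_users = counts[counts > min_ratings].index

# isin() selects every rating row that belongs to one of those users.
filtered = toy_ratings[toy_ratings["user_id"].isin(active_users)]
print(filtered)
```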
ratings_with_books = ratings.merge(books, on='ISBN')
ratings_with_books.head()
user_id ISBN rating title author year publisher image_url
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0...
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0...
2 11676 002542730X 6 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0...
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0...
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0...
ratings_with_books.shape
(487671, 8)
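The row count drops from 526356 to 487671 because `merge` performs an inner join by default, so ratings whose ISBN is missing from `books` are discarded. A small sketch with invented data shows this, and uses `indicator=True` on a left join to list the unmatched rows.

```python
import pandas as pd

# Toy data: the ISBN "zzz" does not exist in the books table.
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ISBN": ["a", "b", "zzz"],
    "rating": [5, 0, 9],
})
toy_books = pd.DataFrame({"ISBN": ["a", "b"], "title": ["Book A", "Book B"]})

# Default merge is an inner join: only ISBNs present in both frames survive.
inner = toy_ratings.merge(toy_books, on="ISBN")

# A left join with indicator=True adds a _merge column that marks
# which rating rows found no matching book.
audit = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(toy_ratings), "ratings ->", len(inner), "after the inner join")
print(unmatched[["user_id", "ISBN"]])
```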
number_rating = ratings_with_books.groupby('title')['rating'].count().reset_index()
number_rating.sample(5)
        title                                              rating
38665   El conocimiento secreto de Jesucristo                   1
22915   Charles & Diana : Prince & Princess Of...               1
6907    After the Cabaret                                       1
106791  Smoke Jumper (Choose Your Own Adventure, No 111)        2
110820  Stress Remedies: Hundreds of Fast-Relief Tips ...       1
number_rating.rename(columns={'rating':'num_of_rating'},inplace=True)
number_rating.head()
   title                                             num_of_rating
0  A Light in the Storm: The Civil War Diary of ...              2
1  Always Have Popsicles                                         1
2  Apple Magic (The Collector's series)                          1
3  Beyond IBM: Leadership Marketing and Finance ...              1
4  Clifford Visita El Hospital (Clifford El Gran...              1
final_rating = ratings_with_books.merge(number_rating, on='title')
final_rating.head()
user_id ISBN rating title author year publisher image_url num_of_rating
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
2 11676 002542730X 6 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
final_rating.shape
(487671, 9)
final_rating = final_rating[final_rating['num_of_rating'] >= 50]
final_rating.head()
user_id ISBN rating title author year publisher image_url num_of_rating
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
2 11676 002542730X 6 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... James Finn Garner 1994 John Wiley & Sons Inc http://images.amazon.com/images/P/002542730X.0... 82
final_rating.shape
(61853, 9)
final_rating.drop_duplicates(['user_id','title'],inplace=True)
final_rating.shape
(59850, 9)
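The count, merge, filter, and de-duplicate steps above can also be written more compactly with `groupby(...).transform("count")`. This is an alternative sketch on invented data, not the code used in this walkthrough; the threshold of 3 stands in for the 50 used above.

```python
import pandas as pd

# Toy ratings: title "A" has four ratings (user 1 rated it twice), "B" has one.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1],
    "title": ["A", "A", "A", "B", "A"],
    "rating": [5, 0, 7, 9, 6],
})

# transform("count") broadcasts the per-title count back onto every row,
# which replaces the separate count + merge steps.
toy["num_of_rating"] = toy.groupby("title")["rating"].transform("count")

min_count = 3  # hypothetical threshold for this toy example
popular = toy[toy["num_of_rating"] >= min_count]

# Keep one rating per (user, title) pair; the first occurrence wins.
popular = popular.drop_duplicates(["user_id", "title"])
print(popular)
```

As in the walkthrough, the count is taken before duplicates are dropped, so `num_of_rating` can be slightly higher than the number of rows that remain for a title.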
book_pivot = final_rating.pivot_table(columns='user_id', index='title', values= 'rating')
book_pivot
user_id 254 2276 2766 2977 3363 3757 4017 4385 6242 6251 ... 274004 274061 274301 274308 274808 275970 277427 277478 277639 278418
title
1984 9.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
1st to Die: A Novel NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2nd Chance NaN 10.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN
4 Blondes NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
84 Charing Cross Road NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 10.0 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Year of Wonders NaN NaN NaN 7.0 NaN NaN NaN NaN 7.0 NaN ... NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
You Belong To Me NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 ... NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
Zoya NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
\O\" Is for Outlaw" NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 8.0 NaN NaN NaN NaN NaN NaN NaN

742 rows × 888 columns

book_pivot.shape
(742, 888)
book_pivot.fillna(0, inplace=True)
book_pivot
user_id 254 2276 2766 2977 3363 3757 4017 4385 6242 6251 ... 274004 274061 274301 274308 274808 275970 277427 277478 277639 278418
title
1984 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1st to Die: A Novel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2nd Chance 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Blondes 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
84 Charing Cross Road 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Year of Wonders 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 7.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
You Belong To Me 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Zoya 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
\O\" Is for Outlaw" 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

742 rows × 888 columns
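A small sketch of the same pivot on invented data makes the shape of the table easier to see: one row per title, one column per user, and `NaN` wherever a user has no rating for a title.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, title) pair.
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "title": ["A", "A", "B", "B"],
    "rating": [9, 0, 7, 10],
})

# Titles become rows and users become columns. Missing pairs are NaN.
pivot = toy.pivot_table(index="title", columns="user_id", values="rating")
print(pivot)

# Replace the missing ratings with 0 so the matrix is fully numeric.
pivot = pivot.fillna(0)
print(pivot)
```

After `fillna(0)` a stored rating of 0 (user 2 for title "A") can no longer be told apart from a missing rating (user 3 for title "A").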

Model training

  • Convert the pivot table to a SciPy csr_matrix. Most entries are zero, so the compressed sparse row format saves memory and gives fast row slicing and matrix-vector products.
book_sparse = csr_matrix(book_pivot)
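To see what the compressed sparse row format stores, here is a minimal sketch on a made-up 3×4 matrix. Only the non-zero values are kept, together with their column indices and a pointer array marking where each row starts.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy rating matrix: mostly zeros, like the pivot table.
dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 7.0, 10.0],
    [0.0, 0.0, 0.0, 0.0],
])

sparse = csr_matrix(dense)

print(sparse.data)     # the non-zero values, row by row
print(sparse.indices)  # the column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Fraction of entries that are actually stored.
density = sparse.nnz / (dense.shape[0] * dense.shape[1])
print(f"density: {density:.2f}")
```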
  • Now create the model. NearestNeighbors is an unsupervised neighbour-search algorithm (not a clustering algorithm); with algorithm='brute' it compares the query against every row.
model = NearestNeighbors(algorithm= 'brute')
model.fit(book_sparse)
NearestNeighbors(algorithm='brute')
  • Compute the distances to, and the row positions of, the 6 nearest neighbours of the book in row 237. The first neighbour is the book itself, at distance 0.
distance, suggestion = model.kneighbors(book_pivot.iloc[237,:].values.reshape(1,-1), n_neighbors=6 )
with open('./model/model.pkl', 'wb') as f:
    pickle.dump(model, f)
distance
array([[ 0.        , 68.78953409, 69.5413546 , 72.64296249, 76.83098333,
        77.28518616]])
suggestion
array([[237, 240, 238, 241, 184, 536]], dtype=int64)
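With its default settings, `NearestNeighbors` uses the Minkowski metric with p=2, which is the Euclidean distance. The brute-force search can therefore be sketched in plain NumPy. The helper name `brute_force_neighbors` and the 4×3 matrix are made up for illustration; this is not scikit-learn's implementation.

```python
import numpy as np

# Toy rating matrix: one row per book, one column per user.
ratings_matrix = np.array([
    [9.0, 0.0, 8.0],   # row 0
    [8.0, 0.0, 9.0],   # row 1
    [0.0, 10.0, 0.0],  # row 2
    [9.0, 1.0, 8.0],   # row 3
])


def brute_force_neighbors(matrix, row, k):
    """Return (distances, row positions) of the k rows closest to matrix[row]."""
    # Euclidean distance from the query row to every row, including itself.
    diffs = matrix - matrix[row]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # Sort by distance and keep the k smallest.
    order = np.argsort(dists, kind="stable")[:k]
    return dists[order], order


dists, idx = brute_force_neighbors(ratings_matrix, row=0, k=3)
print(dists)
print(idx)
```

The query row always comes back first with distance 0, which is why the walkthrough asks for 6 neighbours to obtain 5 suggestions.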
book_pivot.iloc[241,:]
user_id
254       9.0
2276      0.0
2766      0.0
2977      0.0
3363      0.0
         ... 
275970    9.0
277427    0.0
277478    0.0
277639    0.0
278418    0.0
Name: Harry Potter and the Sorcerer's Stone (Book 1), Length: 888, dtype: float64
  • Print the titles of the books suggested by the nearest-neighbour search.
for i in range(len(suggestion)):
    print(book_pivot.index[suggestion[i]])
Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive',
       'The Cradle Will Fall'],
      dtype='object', name='title')
book_pivot.index[5]
'A Bend in the Road'
book_names = book_pivot.index
book_names[2]
'2nd Chance'
np.where(book_pivot.index == 'A Bend in the Road')[0][0]
5
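The `np.where(...)[0][0]` lookup raises an `IndexError` when the title is not in the index. As a hedged alternative, `Index.get_loc` does the same lookup and raises a `KeyError` that is easy to catch. The helper `find_position` below is a made-up name, and the three-title index is a toy stand-in for `book_pivot.index`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for book_pivot.index.
titles = pd.Index(["1984", "2nd Chance", "A Bend in the Road"], name="title")

# The lookup used above: positions where the comparison is True.
pos_where = np.where(titles == "A Bend in the Road")[0][0]

# Equivalent lookup on a unique index.
pos_get_loc = titles.get_loc("A Bend in the Road")


def find_position(index, title):
    """Return the row position of title, or None when it is absent."""
    try:
        return index.get_loc(title)
    except KeyError:
        return None


print(pos_where, pos_get_loc, find_position(titles, "Not In The Index"))
```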

Finding the image URL

ids = np.where(final_rating['title'] == "Harry Potter and the Chamber of Secrets (Book 2)")[0][0]
final_rating.iloc[ids]['image_url']
'http://images.amazon.com/images/P/0439064872.01.LZZZZZZZ.jpg'
  • Store the book titles for the row positions returned in suggestion by the nearest-neighbour search.
book_name = []
for book_id in suggestion:
    book_name.append(book_pivot.index[book_id])
    
    
book_name[0]
Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive',
       'The Cradle Will Fall'],
      dtype='object', name='title')
  • For each suggested title, find the position of its first matching row in final_rating.
ids_index = []
for name in book_name[0]: 
    ids = np.where(final_rating['title'] == name)[0][0]
    ids_index.append(ids)
  • Now look up the image URL at each position in ids_index.
for idx in ids_index:
    url = final_rating.iloc[idx]['image_url']
    print(url)
http://images.amazon.com/images/P/0439064872.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0439136369.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0439139597.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/043936213X.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0446604232.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0440115450.01.LZZZZZZZ.jpg
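The two loops above scan `final_rating` once per title. A possible alternative, sketched here on invented data with placeholder `example.com` URLs, builds a title-to-URL Series once and then looks up all suggested titles in a single call.

```python
import pandas as pd

# Toy stand-in for final_rating: a title appears once per rating row.
toy_final = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "image_url": [
        "http://example.com/a.jpg",
        "http://example.com/a.jpg",
        "http://example.com/b.jpg",
        "http://example.com/c.jpg",
    ],
})

# One row per title, indexed by title, so lookups are by label.
url_by_title = toy_final.drop_duplicates("title").set_index("title")["image_url"]

# .loc with a list keeps the order of the requested titles.
suggested = ["B", "A"]
urls = url_by_title.loc[suggested].tolist()
print(urls)
```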
  • Now save the model and the related objects as pickle files so they can be loaded later for prediction.
with open('artifacts/model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('artifacts/book_names.pkl', 'wb') as f:
    pickle.dump(book_names, f)
with open('artifacts/final_rating.pkl', 'wb') as f:
    pickle.dump(final_rating, f)
with open('artifacts/book_pivot.pkl', 'wb') as f:
    pickle.dump(book_pivot, f)
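Loading the artifacts back works the same way in reverse. The sketch below round-trips a toy `book_names` index through a temporary directory, so it runs without the real `artifacts/` folder; the two titles are placeholders.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for the real book_names index.
book_names = pd.Index(["1984", "2nd Chance"], name="title")

with tempfile.TemporaryDirectory() as artifacts_dir:
    path = os.path.join(artifacts_dir, "book_names.pkl")

    # Write the object, closing the file as soon as the dump finishes.
    with open(path, "wb") as f:
        pickle.dump(book_names, f)

    # Read it back, as a prediction script would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.tolist())
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.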

Testing the model

def recommend_book(book_name):
    # Row position of the searched title in the pivot table.
    book_id = np.where(book_pivot.index == book_name)[0][0]
    # The 6 nearest rows; the searched book itself is one of them.
    distance, suggestion = model.kneighbors(book_pivot.iloc[book_id,:].values.reshape(1,-1), n_neighbors=6)

    for i in range(len(suggestion)):
        titles = book_pivot.index[suggestion[i]]
        for j in titles:
            if j == book_name:
                print(f"You searched '{book_name}'\n")
                print("The suggestion books are: \n")
            else:
                print(j)
book_name = "Harry Potter and the Chamber of Secrets (Book 2)"
recommend_book(book_name)
You searched 'Harry Potter and the Chamber of Secrets (Book 2)'

The suggestion books are: 

Harry Potter and the Prisoner of Azkaban (Book 3)
Harry Potter and the Goblet of Fire (Book 4)
Harry Potter and the Sorcerer's Stone (Book 1)
Exclusive
The Cradle Will Fall
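For use outside a notebook it can help to return the suggestions instead of printing them, and to handle titles that are not in the pivot table. The following is a sketch under stated assumptions, not the function used above: `recommend_titles`, the toy pivot, and the toy titles are all made up, and the model is fitted on a small dense array rather than the sparse matrix.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy pivot table: rows are titles, columns are four hypothetical users.
toy_pivot = pd.DataFrame(
    [
        [9.0, 0.0, 8.0, 0.0],
        [8.0, 0.0, 9.0, 0.0],
        [0.0, 10.0, 0.0, 7.0],
        [0.0, 9.0, 0.0, 8.0],
    ],
    index=pd.Index(["Wizard 1", "Wizard 2", "Thriller 1", "Thriller 2"], name="title"),
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)


def recommend_titles(book_name, pivot, model, n_recommendations=2):
    """Return up to n_recommendations titles closest to book_name."""
    if book_name not in pivot.index:
        return []  # unknown title: nothing to recommend
    # Double brackets keep the row two-dimensional, as kneighbors expects.
    row = pivot.loc[[book_name]].values
    # Ask for one extra neighbour because the query book is returned too.
    _, neighbor_ids = model.kneighbors(row, n_neighbors=n_recommendations + 1)
    titles = pivot.index[neighbor_ids[0]]
    return [t for t in titles if t != book_name][:n_recommendations]


print(recommend_titles("Wizard 1", toy_pivot, toy_model, n_recommendations=1))
print(recommend_titles("Unknown Book", toy_pivot, toy_model))
```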