Goodreads — Exploratory Data Analysis and Discovering Relationships

Waleed Hashmi
10 min read · Nov 20, 2020

The goal of analyzing the Goodreads dataset is to understand how a book’s attributes relate to one another: how average ratings vary with other factors, how authors perform over the years, which books have suspicious (possibly fake) ratings, and which factors are associated with a book’s success today.

Column Description

  • bookID: the unique ID for each book/series
  • title: the title of the book
  • authors: the author of the book
  • average_rating: the average rating of the book, as given by users
  • language_code: the language of the book
  • num_pages: the number of pages in the book
  • ratings_count: the number of ratings the book has received
  • text_reviews_count: the number of text reviews left by users
  • publication_date: the date of publication (DD/MM/YYYY)

Part 1 — EDA on Titles

  1. Which are the top 20 books with the most occurrences in the list?
    We find which titles appear most often in the dataset, that is, books that have been published many times under the same name since their first edition. The count also includes supplementary material published for these books. Such books have been popular across generations, are widely considered essential reads, are referenced by many authors, and have stood the test of time.

According to the results, Macbeth, Collected Stories, and The Iliad are the top 3 titles, counting both repeated editions and related works by other authors.
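Counting title occurrences can be sketched in a few lines of pandas. This is only an illustration: the `books` DataFrame and its rows are invented stand-ins, and the only thing taken from the dataset is the `title` column name.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "title": ["Macbeth", "Macbeth", "The Iliad", "Macbeth", "The Iliad", "Dune"],
})

# value_counts() counts the rows per title and sorts them from most to least frequent.
top_titles = books["title"].value_counts().head(20)
print(top_titles)
```

Matching on the exact title string means editions with slightly different titles are counted separately, which may or may not be what you want.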

2. What is the distribution of books for all languages?
Which languages do authors and publishers favor, that is, in which languages are most books written or published?

The results show that English, together with its US and Great Britain variants, is the language most books are published in.

Exploring the relationship between Average Ratings and Rating Count

3. Plotting the Distribution of Average Ratings

The distribution above shows that most average ratings are centered around the 4.0 mark. One interpretation is that readers tend to enjoy the books they choose to read.

4. Which books have received the most ratings? Top 20 Most Voted Books (ratings_count)
In this section, we find the books that have been rated the most, i.e. those with the highest vote count, irrespective of the rating received.

The books with the highest vote count are:
1. Twilight with more than 4 million votes.
2. The Hobbit or There and Back Again with more than 2 million votes.
3. The Catcher in the Rye with more than 2 million votes.

Moreover, we see a trend in the results: many books in the list belong to a series, mostly fiction. This suggests that once readers begin a series, they are likely to read subsequent releases.

Furthermore, the earliest books in a series receive the most votes, and the number of voters goes down with subsequent releases. This suggests that the first books draw the most ratings and that readers lose interest as a series progresses.

5. Which books have the highest rating, given that they have at least a million votes? Top 20 Highly Rated Books (average_rating)
Here, we extract the books that have the highest average rating on the condition that they have been rated by at least a million users.

The Harry Potter series tops the chart, securing the first four positions with average ratings ranging from 4.42 to 4.57.

These results carry weight because each book has been rated by at least a million readers, a large sample, so the books are both popular and well liked.
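The "highest rating, given at least a million votes" query is a filter followed by a sort. Here is a minimal sketch, assuming a pandas DataFrame with the `average_rating` and `ratings_count` columns described earlier; the helper name `top_rated` and the sample rows are made up for illustration.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [4.8, 4.5, 3.6, 4.2],
    "ratings_count": [120, 1_800_000, 4_100_000, 1_000_000],
})

def top_rated(frame, min_votes=1_000_000, n=20):
    """Return the n best-rated books among those with at least min_votes ratings."""
    popular = frame[frame["ratings_count"] >= min_votes]
    return popular.sort_values("average_rating", ascending=False).head(n)

result = top_rated(books)
print(result[["title", "average_rating", "ratings_count"]])
```

Book A has the highest rating in the toy data but is dropped because it has only 120 votes, which is exactly the effect the vote threshold is meant to have.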

6. Scatter Plot — Relationship between Average Ratings and Rating Count.

From the scatterplot, we find that books with over a million votes are rare, yet these are the books whose average ratings are the most reliable, since they rest on the largest samples. We also observe that most books have a rating count below 0.25 million. Moreover, as the rating count increases, the average rating gets closer to 4.0.

7. Books that have received the most votes yet aren’t rated highly: Top Voted but not Top Rated.
Here, we find the books that have a high number of votes but are not highly rated. Such books are popular yet divisive: many people have read them, but a large share rated them poorly.

8. Books that have received high ratings but have a relatively low vote count: Top Rated but not Top Voted.
Here, we find the books that have a high rating but few votes. With so few votes, a high average is not a reliable estimate of how readers in general would rate the book, so the rating should not be considered credible.
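Both of these groups (items 7 and 8) come from combining a condition on votes with a condition on rating. The sketch below shows one way to express that with boolean masks; the cut-off values `HIGH_VOTES`, `LOW_VOTES`, and `HIGH_RATING` are hypothetical choices for the example, not the thresholds used for the plots, and the rows are invented.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "ratings_count": [2_000_000, 1_500_000, 40, 12, 300_000],
    "average_rating": [3.5, 4.5, 4.9, 2.8, 4.0],
})

# Hypothetical cut-offs; tune them to the data at hand.
HIGH_VOTES = 1_000_000
LOW_VOTES = 100
HIGH_RATING = 4.4

many_votes = books["ratings_count"] >= HIGH_VOTES
few_votes = books["ratings_count"] < LOW_VOTES
high_rating = books["average_rating"] >= HIGH_RATING

# Popular but not highly rated.
top_voted_not_top_rated = books[many_votes & ~high_rating]
# Highly rated but backed by very few votes.
top_rated_not_top_voted = books[few_votes & high_rating]

print(top_voted_not_top_rated)
print(top_rated_not_top_voted)
```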

9. What are the most common words in the titles of books with an average rating of at least 4.0?
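One simple way to count recurring title words is to tokenize each title and feed the tokens to a `collections.Counter`. The sketch below assumes a small hand-written stop-word set (`STOPWORDS`) and invented sample titles; a word-cloud library could be used instead.

```python
import re
from collections import Counter

import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "title": ["The Complete Stories", "Complete Poems of the Sea",
              "A Short History", "The Lost Stories"],
    "average_rating": [4.2, 4.1, 3.7, 4.0],
})

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "of", "a", "and", "in", "to"}

well_rated = books[books["average_rating"] >= 4.0]

words = Counter()
for title in well_rated["title"]:
    # Lowercase the title and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", title.lower())
    words.update(t for t in tokens if t not in STOPWORDS)

print(words.most_common(10))
```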

Part 2— EDA on Authors

10. Top 20 Frequently Published Authors
In this section, we find the authors that have been published the most i.e. authors with the most books.

Stephen King is the most published author, with 62 books. William Shakespeare and P.G. Wodehouse are tied for second with 45 books.

The plot above shows that Stephen King has the most books in the list. Although many of them may be different editions of the same book, this does not change the fact that his books are in demand.

From the names in the list, we can infer that most of these authors have either been writing for a long time and releasing books regularly, are widely liked, or have a cult following in a niche genre.

11. Top 20 Frequently Published Authors, Counting Only Books with Average Rating >= 4.0
Here, we look at the results of item 10 from a different perspective. Instead of counting all books when finding the most published authors, we count only those books with an average rating of at least 4.0.

P.G. Wodehouse has the most books with an average rating of at least 4.0.

Here, we see many of the same authors as in item 10, along with a number of new ones. Most of the authors in the previous result remain in the Top 20.

Authors at the lower end of this result haven’t published as much as those in the previous result, but their books are well received and maintain an average rating of at least 4.0.
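The difference between items 10 and 11 is only whether a rating filter is applied before counting. A minimal sketch, with invented rows and placeholder author names:

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "authors": ["Author X", "Author X", "Author Y", "Author X", "Author Y", "Author Z"],
    "average_rating": [4.1, 3.6, 4.3, 3.9, 4.0, 4.5],
})

# Item 10: count every book per author.
all_counts = books["authors"].value_counts()

# Item 11: count only books rated at least 4.0.
well_rated_counts = books.loc[books["average_rating"] >= 4.0, "authors"].value_counts()

print(all_counts.head(20))
print(well_rated_counts.head(20))
```

In the toy data, Author X leads the unfiltered count but drops behind Author Y once only well-rated books are counted, which is the kind of reshuffling the second ranking is meant to reveal.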

12. Average Rating of Authors with at least a million votes.
Here, we find the Top 20 authors based on the average ratings their books have received over time.

Here, we see that J.K. Rowling is well ahead of other writers. This is no surprise, as the Harry Potter series is one of the most popular and revered.

13. Performance of an author over time.
Here, we take the top 4 most published authors and assess their performance/average ratings over their career.

We’ll test out the performance of these four authors over time:

Stephen King

P.G. Wodehouse

William Shakespeare

Agatha Christie
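A per-author rating timeline can be built by extracting the year from `publication_date` and averaging `average_rating` per year. The sketch below assumes the DD/MM/YYYY format given in the column description; the helper `rating_by_year` and the sample rows are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "authors": ["Author X", "Author X", "Author X", "Author Y"],
    "average_rating": [4.0, 3.8, 4.2, 4.5],
    "publication_date": ["01/03/1983", "15/07/1983", "20/11/1990", "05/05/1990"],
})

# Assumes DD/MM/YYYY as stated in the column description; adjust if the file differs.
books["year"] = pd.to_datetime(books["publication_date"], format="%d/%m/%Y").dt.year

def rating_by_year(frame, author):
    """Mean average_rating per publication year for one author."""
    rows = frame[frame["authors"] == author]
    return rows.groupby("year")["average_rating"].mean()

x_by_year = rating_by_year(books, "Author X")
print(x_by_year)
```

Because the year comes from each edition’s publication date, the timeline reflects when editions in the dataset were published.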

13.1. Stephen King

We can see from the plot that Stephen King was at his best in 1983, 1986, and 1996, the three peaks of his career. His ratings tend to dip at the start of a decade, falling to 3.8 in both 1990 and 2000, and recover a few years in.

13.2. P.G. Wodehouse

P.G. Wodehouse maintained a rating of at least 4.0 for most of his career, peaking in 2007 and dipping below 4.0 only in 1991 and 2008.

13.3. William Shakespeare

We notice that William Shakespeare had the lowest rating among the four authors for a few years. However, he also achieved the highest rating among the four authors in 1991.

13.4. Agatha Christie

Agatha Christie has reached the 4.0 mark far less often than the others, yet her average rating remains consistent, with no sharp dips.

Part 3— Exploring further relationships — Setting up for ML

14. Is there a relationship between the number of text reviews and average rating?

We can observe from the plot that most of the ratings for the books lie around the 3.5–4.0 mark. However, text review counts are mostly very low, clustered below the 5000 mark. Let’s limit our view to books with at most 5000 text reviews to get a closer look:

On a closer look, we see that text review counts are densely clustered under the 1000 mark, leaving too few points elsewhere to establish a relationship. Moreover, however few the reviews, most of them are for books with an average rating of around 4.0.

The results point to two possible explanations: either the text reviews are not genuine, or any given book has a high chance of being well liked by its readers.

15. Is there a relationship between the number of pages and average rating?

We can observe from the plot that most books are under 1000 pages, but the presence of outliers above 1000 pages renders the graph hard to interpret. Let’s limit our view to 1000 pages to get a closer look:

From the given plot, we observe that the highest ratings (4.5–5.0) are usually for books in the 200–400 page range. We can infer that people prefer books with a moderate number of pages.
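Besides zooming the plot, the page-count pattern can be checked numerically by restricting to books of at most 1000 pages and summarizing ratings per page band. This is a sketch under assumptions: the 200-page band width and the sample rows are invented, and `num_pages` is assumed to be the page-count column.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "num_pages": [120, 250, 380, 640, 990, 3200],
    "average_rating": [3.8, 4.4, 4.2, 3.9, 4.0, 4.6],
})

# Limit the view to books with at most 1000 pages.
readable = books[books["num_pages"] <= 1000]

# Hypothetical 200-page bands; each band includes its upper edge.
bins = [0, 200, 400, 600, 800, 1000]
labels = ["1-200", "201-400", "401-600", "601-800", "801-1000"]
page_band = pd.cut(readable["num_pages"], bins=bins, labels=labels)

# Number of books and mean rating per band; empty bands are dropped.
summary = readable.groupby(page_band, observed=True)["average_rating"].agg(["count", "mean"])
print(summary)
```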

16. Is there a relationship between the number of ratings and average rating?

Due to outliers present, we limit our view to 2 million ratings to get a closer look:

From the results, we observe that most books have a rating count below 250,000. Moreover, as the rating count increases, the average rating gets closer to 4.0.

Next steps

Since a linear or polynomial relationship couldn’t be established between rating count and average rating, we’ll try an unsupervised learning algorithm, specifically K-Means clustering, to find groups in the rating count and average rating data.

Machine Learning — K-Means

Here we attempt to find a relationship or groups between the rating count and average rating value.

We’ll use the Elbow Curve method to determine the number of clusters present in the above data.

After running K-Means on the data for a range of K values, we get the following result.

The plot above shows that the elbow lies at K=5. The dataset is therefore divided into 5 clusters.
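For readers who want to reproduce an elbow curve, here is a minimal sketch using scikit-learn’s `KMeans`. The library choice, the range of K, and the synthetic three-blob data are all assumptions made for illustration; the article’s own data and settings are not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points in three well-separated blobs (not Goodreads data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate K and record the inertia
# (sum of squared distances from each point to its nearest center).
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 8), inertias):
    print(k, round(inertia, 1))
```

The elbow is the K after which inertia stops dropping sharply. One thing to keep in mind with real data: rating counts run into the millions while average ratings stay between 0 and 5, so without rescaling the features the clustering would be driven almost entirely by the rating count.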

Next, we visualize the clusters using a scatter plot:

We can observe that the dataset can be binned into 5 clusters; the stars indicate the cluster centers. However, the presence of outliers skews our results, which can be further improved.

Identifying and Removing Outliers
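The text does not state which outlier rule is applied here. As one common option, the sketch below uses the interquartile range (IQR) rule on `ratings_count`; the helper `drop_iqr_outliers`, the factor `k=1.5`, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "ratings_count": [10, 12, 15, 18, 20, 22, 25, 5000],
    "average_rating": [3.9, 4.0, 4.1, 3.8, 4.2, 4.0, 3.9, 4.1],
})

def drop_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

trimmed = drop_iqr_outliers(books, "ratings_count")
print(len(books), "->", len(trimmed))
```

On heavily skewed counts such as ratings, the IQR rule can remove many of the most popular books, so the threshold is worth checking against the plot before re-clustering.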

Plotting the K-Means clusters again using the dataset with outliers removed:

Conclusion

Now that we have optimized the clusters, we can form our inferences:

As the rating count increases, the clusters become sparser but less volatile: the average ratings of books with a high rating count have low variance, meaning that if a book has a high rating count, we can predict its rating with higher accuracy and confidence.

As the rating count decreases, the clusters become denser yet more volatile: the average ratings of books with a low rating count have high variance, meaning that if a book has a low rating count, our prediction of its rating will have lower accuracy and confidence.
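The variance claim can also be checked directly, without clustering, by comparing the spread of average ratings across rating-count buckets. The sketch below shows the computation only: the bucket edges are hypothetical and the rows are invented to mirror the pattern described, so the output is not evidence in itself.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Goodreads data.
books = pd.DataFrame({
    "ratings_count": [5, 8, 20, 40, 90, 150_000, 400_000, 900_000, 2_000_000],
    "average_rating": [2.1, 4.9, 3.0, 4.6, 3.7, 3.9, 4.1, 4.0, 4.0],
})

# Hypothetical bucket edges for low, medium, and high rating counts.
bins = [0, 100, 100_000, float("inf")]
labels = ["low", "medium", "high"]
bucket = pd.cut(books["ratings_count"], bins=bins, labels=labels)

# Standard deviation of average_rating within each non-empty bucket.
spread = books.groupby(bucket, observed=True)["average_rating"].std()
print(spread)
```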
