Tableau Case Study: IMDb Dataset

Leonie M Windari
Jan 1, 2021

Hello everyone!

Today I am going to practice on the IMDb dataset. You probably know IMDb, right? It’s a popular and authoritative source for movie, TV, and celebrity content. You can find ratings and reviews for the newest movies and TV shows there.

Note that this is based on an edureka YouTube video that I linked below. I use the dataset for practice, and at the end I will compare my results with the results from edureka.

You can download the dataset here.

Before we start, as always, we should first determine the purpose or goal of the data visualization. In this case study, we will conduct an Exploratory Data Analysis (EDA) on the IMDb dataset and try to understand the relationships between various parameters in the dataset using Tableau.

Here are some questions that we can ask for the purpose of this analysis.

1. Is there any relationship between movie budget and revenue?

If we were asked this question before even looking at the data, we would surely answer that the bigger the movie budget, the bigger the revenue. But from this graph, you can see that this guess is wrong!

Figure 1. Movie Budget vs Revenue

It turns out that there is no clear connection between movie budget and revenue in this graph. You can make a movie with a low budget and still get big revenue; likewise, spending a lot on a movie doesn’t ensure that it will get big revenue.

2. What are the duration outliers in various genres of movies?

Which movie genre do you think has the longest duration of all?

I’m not a big movie fan, but if I had to answer, I think Bollywood movies usually run longer than others (sadly, that genre is not in this dataset). They usually run about 150 minutes or more.

Figure 2. Duration Range in Various Genres

From the boxplot, we can see that the Action, Adventure, Biography, Crime, and Drama genres have a lot of outliers, while Family, Music, Musical, Romance, Sci-Fi, and Western have almost none. We can also see that the longest duration is in the Biography genre. That makes sense, since Biography movies usually run long.

The boxplot also shows the median and the upper and lower values, but sadly they are not visible in this static image. You can see them in the workbook I posted on Tableau Online. If you’re trying to make a movie, you can look up the median duration for the genre you’re trying to make!
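If you want to check the boxplot numbers outside Tableau, here is a minimal Python sketch of the median, quartiles, and outliers. It assumes the common 1.5 × IQR rule for the whiskers and the "inclusive" quartile method from Python's standard library; Tableau's exact quartile calculation may differ slightly. The function name `box_plot_summary` and the duration values are made up for illustration and are not from the IMDb dataset.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, whisker fences, and outliers using the 1.5 * IQR rule."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Anything outside the fences is drawn as an individual outlier point
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical durations (in minutes) for one genre
durations = [88, 92, 95, 98, 101, 104, 107, 110, 115, 240]
summary = box_plot_summary(durations)
print(summary)
```

In this made-up sample, only the 240-minute movie falls outside the fences, which is the kind of point that shows up as a dot above the box.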

3. What is the distribution of movie durations?

When we talk about distribution, distribution plots and histograms come to mind. For this dataset I am using a histogram, but you have to create the bins first. You can do this by simply clicking Create Bins on the field that you want. Tableau will suggest a bin size, or you can decide it on your own.

Figure 3. Distribution of Different Movies Duration

Here we can see that most of the movies fall into the 100–120 range. It means that most movies run 100 to 120 minutes, followed by 70–100 minutes and then 120–140 minutes.
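To make the idea of bins more concrete, here is a small Python sketch of fixed-width binning. I am assuming a bin size of 20 minutes and labeling each bin by its lower bound, which is how I understand Tableau bins to work. The `bin_counts` helper and the durations are made up for illustration.

```python
from collections import Counter


def bin_counts(values, bin_size):
    """Count how many values fall into each fixed-width bin.

    Each bin is labeled by its lower bound and covers
    [lower, lower + bin_size).
    """
    counts = Counter((v // bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))


# Hypothetical durations in minutes
durations = [85, 95, 102, 108, 110, 115, 119, 125, 131, 150]
histogram = bin_counts(durations, 20)

for lower, count in histogram.items():
    print(f"{lower}-{lower + 20}: {'#' * count}")
```

A smaller bin size shows more detail but a noisier shape, so it is worth trying a few sizes before settling on one.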

4. Does having more Facebook likes have an impact on revenue?

This question is really interesting because we can see whether having a lot of Facebook likes actually affects revenue. From that, we can judge whether marketing through Facebook posts helps a movie’s revenue.

Figure 4. Impact of Facebook Likes to Revenue
Figure 5. Impact of Facebook Likes to Revenue

From the graphs above, it turns out that the number of Facebook likes does not clearly affect revenue. Although the Revenue vs Movie Facebook Likes graph shows some visible relationship, you can also see that a movie can have a small number of Facebook likes and still get big revenue.

5. Is there any relationship between Facebook likes and IMDb voting?

In the previous question we saw that Facebook likes don’t have a relationship with revenue, so let’s see if they have some relationship with the IMDb score.

Figure 6. Relationship between IMDb Score and Facebook Likes
Figure 7. Relationship between IMDb Score and Facebook Likes

Here we can see that Facebook likes don’t affect the IMDb rating that much, except for Movie Facebook Likes. It makes sense, because a lot of Facebook likes for a movie means that a lot of people like the movie, leading to a higher IMDb score.

6. Correlation matrix between various numerical data points?

In the dataset we have a lot of features that we can analyze, and the features can be related to each other. How can we know which features are correlated? We can use a correlation matrix. A correlation matrix basically plots features against each other to show the correlation between each pair.

Figure 8. Correlation Matrix between Various Numerical Data Points

Here I’m just plotting some of the numerical features that I am curious about. We can see that IMDb Score has a correlation with Duration and Gross, and that Duration and Gross are related to each other. A correlation matrix is good if you only have a few features, but if you have more than 4 features it gets hard to see the correlation between each pair. Instead, you can use a heat map, or calculate the Pearson correlation coefficient and set it to differentiate by color.
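For the Pearson option, here is a minimal Python sketch of how the coefficient and a small correlation matrix can be computed by hand. It is not what Tableau does internally, just the textbook formula. The column names and numbers are made up for illustration and are not taken from the IMDb dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pearson coefficient for every pair of columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for three numerical features
columns = {
    "duration": [90, 100, 110, 120, 130],
    "gross": [20, 25, 28, 36, 41],
    "imdb_score": [7.1, 5.9, 6.8, 6.2, 7.4],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each value lies between -1 and 1, so it maps naturally onto a diverging color scale in a heat map, which stays readable with many more features than a grid of scatter plots.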

7. Is the genre budget changing over time?

You must’ve been curious about how movie budgets change over the years. Like anything else, of course movie budgets should increase over the years because of inflation. Just like how ice cream cost only about $1 in 2000, and now it can cost up to $5.

Figure 9. Movie Budget Over The Year

Here I plotted the median of the movie budget for each year. I use the median because I know that the movie budget has a lot of outliers, so an average won’t represent the movie budget in a particular year well.
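Here is a small Python sketch of why the median is the safer choice when there are outliers. It groups budgets by year and compares the mean with the median. The `budget_by_year` helper and the numbers (budgets in millions) are made up for illustration.

```python
import statistics
from collections import defaultdict


def budget_by_year(rows):
    """Group (year, budget) pairs by year and return the mean and median per year."""
    grouped = defaultdict(list)
    for year, budget in rows:
        grouped[year].append(budget)
    return {
        year: {
            "mean": statistics.mean(budgets),
            "median": statistics.median(budgets),
        }
        for year, budgets in sorted(grouped.items())
    }


# Hypothetical budgets in millions; 2005 contains one blockbuster outlier
rows = [
    (2005, 10), (2005, 12), (2005, 15), (2005, 200),
    (2006, 11), (2006, 14), (2006, 16),
]
yearly = budget_by_year(rows)

for year, stats in yearly.items():
    print(year, stats)
```

In this made-up sample, the single 200M movie pulls the 2005 mean up to 59.25 while the median stays at 13.5, close to what a typical movie cost that year.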

From the graph we can see that in some years the movie budget increases, but in other years it doesn’t (it stays constant or decreases). Since 1989, though, the movie budget seems to always stay above 15M. Earlier, we expected the movie budget to increase over the years (because of inflation), but here we can see that there is no steady increase.

So the answer is no.

It is changing, but not simply because of time. There are probably other factors affecting it, like genre, quality, storyline, actors, and so on.

8. What is the distribution of IMDb ratings among various genres?

We probably also wonder: is a specific genre more favorable to the audience than the others? Does a specific genre always have high IMDb ratings?

We can use the boxplot to see the distribution of IMDb rating in various genres and we can get some insight out of it.

Figure 10. IMDb Rating Distribution in Each Genre

Here we can see that the Documentary genre seems to have the highest IMDb score (based on the median value) and the Thriller genre seems to have the lowest. But we can also see that some genres don’t have enough data, so the plot doesn’t represent the distribution of IMDb ratings in those genres well.

Okay, so after I read the question again, I realized that you can either look at the distribution of IMDb ratings in each genre (Figure 10) or the distribution of genres across IMDb ratings (Figure 11).

Figure 11. Distribution of Genres in IMDb Ratings

From the graph we can see that most movies fall between IMDb ratings of 5 and 7, and that they are mostly from the Action, Comedy, and Drama genres. Either graph can be used, depending on what insight you want to get. If you want to know whether a genre affects the IMDb rating (e.g., the Documentary genre always having a rating below 7), you can use the graph in Figure 10. If you want to know which genre fills up the IMDb score of 6 (in this case Comedy), you can use the graph in Figure 11.

9. What is the most revenue-fetching category for a movie?

This is a good question if you want to make a movie and earn a lot of revenue. Here I will plot it over the past 20 years, and hopefully we can get some insight from it.

Figure 12. Revenue in Genre over the Year

Here we can see that the top genre varies by year. For example, in 1996 the biggest revenue is from the Animation genre, while in 2006 it is from Mystery. But overall we can conclude that the Animation genre gives the most revenue over the years, and we can see it in the graph below.

Figure 13. Revenue based on Genre

After comparing it with the answers from edureka, I found that a lot of my work is different, but it’s okay! It all depends on how you interpret the question, how you want to answer it, and how you want to visualize it.

I also learned to think first before doing. You have to visualize it in your head before moving your mouse to select features in Tableau. Sure, Tableau is really smart: you can just drag and drop some features and Tableau will do the magic for you. It will suggest graphs that you can choose from. But if you don’t understand the data well, if you don’t know what the graph means, it’s meaningless.

From this case study and the two previous case studies, what I learned is:

  1. I learned how to really think about and understand the questions before starting anything
  2. I learned that I should think first about how I want to visualize the data
  3. I learned how to form a hypothesis about the result
  4. I learned how to visualize data so people can understand it clearly and easily notice patterns or get some insight

These past 3 days have been really fun! Thank you, edureka, for the case study!

For the next 10 days or so, I will broaden my Tableau skills. After that, I will look for a dataset and try to visualize it so I can get some insight. Please stay tuned!
