Netflix Data Analysis

After binge-watching a lot of series and movies on Netflix, I finally decided to work on its dataset.

Find the dataset at - bit.ly/3cpx80r

This dataset consists of variables like type, title, director, cast, country, release_year, rating, duration, listed_in, description, etc.

In this analysis, I visualize the count of Movies and TV Shows in the type column, the top 10 countries where most Netflix content is released, the top 10 cast members featured in Netflix shows, the years with the most releases, the rating counts, the genres in which Netflix releases the most content, the most common words used in the descriptions, and more.

Libraries :

Pandas helps to read data stored in different formats (here, a CSV file).
Seaborn, Matplotlib, and wordcloud are used for visualization.
The re (regex) module is used for finding patterns.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from wordcloud import WordCloud, STOPWORDS

Reading and displaying the data :

head() displays the first rows of the dataframe; here, the first 10.

Netflix_df = pd.read_csv("netflix_titles.csv")
Netflix_df.head(10)

Note :

To check whether the dataset has any missing values, and to see the data type of each column, we use -

Netflix_df.info()
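
As a complement to info(), here is a minimal sketch of counting the missing values per column directly. The small sample frame is made up for illustration and only stands in for Netflix_df.

```python
import pandas as pd

# Made-up sample standing in for Netflix_df (not real dataset rows)
sample_df = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "director": ["Director X", None, "Director Y"],
    "country": [None, None, "India"],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

On the real data, the same call would be Netflix_df.isna().sum().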

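Why is the data type displayed as object instead of string?

Pandas stores its columns in NumPy arrays. A NumPy string type has a fixed width for every item, whereas the object type holds references to Python strings of any length, so the text columns are reported as object here.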
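
The difference between the two storage types can be sketched with NumPy alone. The values below are illustrative and are not taken from the dataset.

```python
import numpy as np

# Fixed-width string array: every slot is sized for the longest item (7 characters)
fixed = np.array(["Movie", "TV Show"])
print(fixed.dtype)

# Assigning a longer string silently truncates it to the fixed width
fixed[0] = "Documentary"
print(fixed[0])

# An object array stores references to ordinary Python strings of any length
flexible = np.array(["Movie", "TV Show"], dtype=object)
flexible[0] = "Documentary"
print(flexible[0])
```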

To check the names of the columns in the dataset :

Netflix_df.columns

Count of Movie and TV Shows :

Applying .unique() on a column of the data frame helps to find the unique values present in the column.
Applying .value_counts() on a column of the data frame helps to find the count associated with unique values present in the column.

Netflix_df["type"].unique()
Netflix_df["type"].value_counts()

#Count Plot
sns.countplot(x="type", data= Netflix_df)
plt.show()

#Pie Chart
count_values = Netflix_df["type"].value_counts()
plt.pie(count_values.values, labels = count_values.index, autopct='%1.1f%%')
plt.show()

We can observe that there are 4265 movies, which make up around 68.4% of the dataset, followed by 1969 TV shows.
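
The percentage shown on the pie chart can also be computed directly. This sketch rebuilds a type column from the two counts quoted above instead of reading the CSV, so the series itself is a stand-in for Netflix_df["type"].

```python
import pandas as pd

# Rebuild a type column from the counts quoted above (4265 movies, 1969 TV shows)
type_column = pd.Series(["Movie"] * 4265 + ["TV Show"] * 1969)

# normalize=True returns fractions instead of raw counts
shares = (type_column.value_counts(normalize=True) * 100).round(1)
print(shares)
```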

Top casts featured in Netflix shows :

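The cast column holds several comma-separated names per title, so it is split in the same way as the country column; the counts are then collected in a dictionary and converted to a dataframe (top_10_cast).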
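
The post does not show how top_10_cast is built, so the following is only a sketch of one possible split, dictionary, dataframe sequence. The sample series and the placeholder names are made up; the column names Cast and Number are taken from the plotting code.

```python
from collections import Counter

import pandas as pd

# Made-up sample standing in for Netflix_df["cast"] (placeholder names, not real data)
sample_cast = pd.Series([
    "Actor A, Actor B",
    "Actor A",
    None,
    "Actor B, Actor A, Actor C",
])

# Split each comma-separated entry and tally the names in a dictionary-like Counter
cast_counts = Counter()
for entry in sample_cast.dropna():
    cast_counts.update(name.strip() for name in entry.split(","))

# Convert the tallies to a dataframe with the column names used in the plot
top_cast = pd.DataFrame(cast_counts.most_common(10), columns=["Cast", "Number"])
print(top_cast)
```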

plt.figure(figsize=(10,6))
# bar plot with seaborn; x and y are passed as keywords, which recent seaborn versions require
sns.barplot(x='Cast', y='Number', data=top_10_cast[1:], palette="summer")
plt.xlabel("Cast", size=15)
plt.xticks(rotation=90)
plt.ylabel("Number of shows in which the cast featured", size=15)
plt.title("Top 10 casts featured in Netflix shows", size=18)
plt.show()

The most featured cast members in Netflix content are mainly Indian celebrities. Anupam Kher tops the list, followed by Shah Rukh Khan.

Top 10 years with most releases :

year_count = Netflix_df["release_year"].value_counts()
year_count = year_count[:10]
plt.figure(figsize=(10,5))
sns.barplot(x=year_count.index, y=year_count.values, alpha=0.8)
plt.title('Top 10 years with most releases')
plt.ylabel('Number of releases', fontsize=12)
plt.xlabel('year', fontsize=12)
plt.xticks(rotation=90)
plt.show()

The highest number of releases was observed in 2018. The number of releases increased from 2010 to 2018 and dropped to a certain extent in 2019.

The Rating distribution :

sns.countplot(x="rating", data=Netflix_df)
plt.xticks(rotation=90)
plt.show()

We can observe that most of the content is rated TV-MA, meaning it is intended for mature, adult audiences and is unsuitable for children under 17, followed by TV-14, where the content is unsuitable for children under 14.

G stands for general audiences, PG for parental guidance suggested, TV-G for content suitable for all ages, PG-13 for parents strongly cautioned, and so on.

The durations having maximum count :

The top 10 durations that account for most of the Netflix content.

duration_count = Netflix_df["duration"].value_counts()
duration_count = duration_count[:10]
duration_count

sns.barplot(x=duration_count.index, y=duration_count.values)
plt.xticks(rotation=90)
plt.show()

We can see that the count of TV Shows having just 1 season is approximately 1300, followed by 2 and 3 season TV shows. In the case of movies, we can observe that there are 111 movies having a duration of 90 mins.
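
Because the duration column mixes minutes and seasons, it can help to separate the number from the unit. This is a sketch using a regular expression through pandas' str.extract; the sample values and the column names value and unit are assumptions for illustration.

```python
import pandas as pd

# Made-up sample standing in for Netflix_df["duration"]
sample_duration = pd.Series(["90 min", "1 Season", "2 Seasons", "94 min"])

# Capture the leading number and the unit ("min", "Season" or "Seasons")
parts = sample_duration.str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Seasons?)")
parts["value"] = parts["value"].astype(int)

# Movies are measured in minutes, TV shows in seasons
movie_minutes = parts.loc[parts["unit"] == "min", "value"]
print(parts)
print(movie_minutes.mean())
```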

Top 10 Genre of Netflix :

The listed_in column is split in the same way as the country column, followed by conversion to a dictionary and then a dataframe (top_10_genre).

plt.figure(figsize=(10,6))
# bar plot with seaborn; x and y are passed as keywords, which recent seaborn versions require
sns.barplot(x='Genre', y='Number', data=top_10_genre, palette="summer")
plt.xlabel("Genre", size=15)
plt.xticks(rotation=90)
plt.ylabel("Number of shows in each genre", size=15)
plt.title("Top 10 Genre of Netflix", size=18)
plt.show()

The total number of International Movies in the dataset is 1927. It is the most common genre on Netflix, followed by Dramas and Comedies.

Word Frequency - Description :

The word cloud plot helps us find the words used most frequently in the description column. The larger the word, the higher its frequency. Here the most common words are life, find, family, love, and friend.

# Join all the descriptions into a single string
word = " ".join(review for review in Netflix_df["description"])

stopwords = set(STOPWORDS)

# Add extra commonly used words to the stop words (e.g. the, yes, no, bye, or, is)
stopwords.update(["the","is","yea","ok","okay","or","bye","no","will","yeah","I","almost","if","me","you","done","want","Ya"])

#Creating a word cloud 
wordcloud = WordCloud(width = 500, height =500 ,stopwords=stopwords, background_color="black",min_font_size = 10).generate(word)

plt.figure( figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
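
The word sizes in the cloud reflect word counts. The same idea can be sketched with the standard library only; the two descriptions and the short stop-word list below are made up and are much smaller than wordcloud's STOPWORDS.

```python
import re
from collections import Counter

# Made-up descriptions standing in for Netflix_df["description"]
descriptions = [
    "A family must find a new life after the storm.",
    "Two friends find love and a new family.",
]

# Small illustrative stop-word list (an assumption, far shorter than STOPWORDS)
stop_words = {"a", "the", "and", "must", "after", "two"}

# Lowercase the text, pull out alphabetic words, and drop the stop words
words = re.findall(r"[a-z']+", " ".join(descriptions).lower())
word_counts = Counter(w for w in words if w not in stop_words)
print(word_counts.most_common(3))
```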

Type distribution based on the top 10 years :

Important points -

Create a copy of the existing data frame.
In order to apply a year-wise filter, we need to first set the index as release_year and then filter out the data accordingly.

Netflix_df_2 = Netflix_df.copy()
Netflix_df_type2 = Netflix_df_2.set_index("release_year")
filter_year_type = Netflix_df_type2.loc[[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]]
filter_year_type
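
As an alternative to setting the index, a year range can be selected with a boolean mask, which does not depend on every listed year being present in the index. This is only a sketch on a made-up sample, not the approach used in this post.

```python
import pandas as pd

# Made-up sample standing in for Netflix_df
sample_df = pd.DataFrame({
    "title": ["Old Movie", "Show A", "Movie B", "Show C"],
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "release_year": [1998, 2012, 2018, 2020],
})

# between() is inclusive on both ends, so this keeps 2010 through 2019
in_range = sample_df["release_year"].between(2010, 2019)
filtered = sample_df[in_range]
print(filtered)
```

With this variant release_year stays a regular column, so it can be passed to seaborn by name, for example hue="release_year".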

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_year_type.index , data= filter_year_type)
plt.title("Type distribution - 2010 to 2019", size=18)
plt.show()

The number of TV Shows released by Netflix increased year by year, whereas the number of movies released decreased in 2018 and 2019.

Type distribution based on top 5 genre :

We know that the listed_in column comprises multiple values and the first thing that needs to be done is to split them and expand the data frame.

Netflix_df_1 = Netflix_df.copy()
df = Netflix_df_1.drop('listed_in', axis=1).join(Netflix_df_1.listed_in.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('listed_in'))
df

Here we can observe that blank values are generated in the column, so we need to filter them out.

# the mask is not named "filter" so that Python's built-in filter() is not shadowed
genre_filter = df["listed_in"] != ""
dfNew = df[genre_filter]
dfNew
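
Splitting on "," alone leaves a leading space on every genre after the first one, which is why labels such as " Dramas" appear in the filters of this post. A sketch of a variant that trims the spaces, using explode on a made-up sample:

```python
import pandas as pd

# Made-up sample standing in for Netflix_df
sample_df = pd.DataFrame({
    "title": ["Movie A", "Show B"],
    "listed_in": ["Dramas, International Movies", "International TV Shows, Dramas"],
})

# Split into lists, give each genre its own row, then trim the surrounding spaces
genres = sample_df.assign(listed_in=sample_df["listed_in"].str.split(","))
genres = genres.explode("listed_in").reset_index(drop=True)
genres["listed_in"] = genres["listed_in"].str.strip()
print(genres["listed_in"].value_counts())
```

After trimming, "Dramas" is a single label regardless of its position in the list, so counts based on it would differ from the ones obtained with the untrimmed labels.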

In order to filter data based on the top 5 genres, we first need to set the index of the data frame as the listed_in column.

Netflix_df_type3 = dfNew.set_index("listed_in")
Netflix_df_type3

Filtering the data frame based on the top 5 genres.

filter_genre_type = Netflix_df_type3.loc[["International Movies", " Dramas", " Comedies", " International TV Shows", " Documentaries"]]
filter_genre_type

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_genre_type.index , data= filter_genre_type)
plt.title("Type distribution - Top 5 genre", size=18)
plt.show()

Here we can observe that, among these five genres, TV Shows appear only under International TV Shows, while Movies are classified as International Movies, Dramas, Comedies, and Documentaries.

Type distribution based on Rating :

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = "rating", data= Netflix_df)
plt.title("Type distribution based on Rating", size=18)
plt.legend(loc="upper right")
plt.show()

Most of the Movies on Netflix are for mature, adult audiences and are unsuitable for children under 17, whereas most TV Shows have ratings of TV-MA and TV-14.
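
The numbers behind a grouped count plot like this can be read from a cross-tabulation. A minimal sketch on a made-up sample:

```python
import pandas as pd

# Made-up sample standing in for Netflix_df
sample_df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-MA", "TV-14", "TV-MA", "TV-14", "TV-MA"],
})

# Rows are types, columns are ratings, cells are title counts
rating_table = pd.crosstab(sample_df["type"], sample_df["rating"])
print(rating_table)
```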

Type distribution based on duration :

Make a copy of the data frame, then set the duration column as the index in order to apply the filter based on time span.

Netflix_df_4 = Netflix_df.copy()
Netflix_df_type4 = Netflix_df_4.set_index("duration")
Netflix_df_type4

We will plot Movies and TV Shows separately based on duration.

Movies -

Movies = Netflix_df_type4[Netflix_df_type4["type"] == "Movie"]
Movies

After filtering the type as Movies, we will filter based on duration.

filter_duration_Movie = Movies.loc[['90 min', '91 min', '92 min', '94 min', '95 min']]
filter_duration_Movie

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_duration_Movie.index, data= filter_duration_Movie)
plt.title("Movie distribution based on durations", size=18)
plt.legend(loc="upper right")
plt.show()

TV Shows -

The filtering process is similar to the movie type.

TV_Show = Netflix_df_type4[Netflix_df_type4["type"] == "TV Show"]
TV_Show

filter_duration_TV = TV_Show.loc[["1 Season", "2 Seasons", "3 Seasons"]]
filter_duration_TV

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_duration_TV.index, data= filter_duration_TV)
plt.title("TV distribution based on Seasons released", size=18)
plt.legend(loc="upper right")
plt.show()

Type distribution based on top 5 countries :

The country column consists of values separated by commas, so we need to split the values and expand the dataframe.

Netflix_df_8 = Netflix_df.copy()
df8 = Netflix_df_8.drop('country', axis=1).join(Netflix_df_8.country.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('country'))
df8

We can observe that there are blank values in the country column, therefore we will filter them out.

filter1 = df8["country"] != ""
dfNew1 = df8[filter1]
dfNew1

Filtering the data frame based on the top 5 countries.

filter_country_type = dfNew1.set_index("country")
filter_country_1 = filter_country_type.loc[["United States", " India", " United Kingdom", " Canada", " France"]]
filter_country_1
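
The list of top countries can also be derived from the data instead of being typed by hand. The following is a sketch on a made-up sample; top_n is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Made-up sample standing in for Netflix_df
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States, India", "India", "United States, Canada", None],
})

# One row per (title, country), with missing countries dropped and spaces trimmed
countries = sample_df.dropna(subset=["country"]).copy()
countries["country"] = countries["country"].str.split(",")
countries = countries.explode("country").reset_index(drop=True)
countries["country"] = countries["country"].str.strip()

# Take the most frequent names and keep only the rows for those countries
top_n = 2  # the post uses 5; 2 keeps the sample small
top_countries = countries["country"].value_counts().head(top_n).index
top_rows = countries[countries["country"].isin(top_countries)]
print(list(top_countries))
```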

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_country_1.index , data= filter_country_1)
plt.title("Type distribution - Top 5 Countries", size=18)
plt.show()

Top 5 countries and their corresponding releases from the year 2010 till 2019 :

Make a copy of the dataframe where we had filtered the data frame based on the top 5 countries.

filter_country_year = filter_country_1.copy()
filter_country_year

Reset the index of a dataframe to the original one and then set it as release_year.

filter_country_year.reset_index(inplace = True) 
filter_values_year = filter_country_year.set_index("release_year")
filter_values_year

Now filter the dataframe based on release_year from the year 2010 till the year 2019.

filter_year_country1 = filter_values_year.loc[[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]]
filter_year_country1

plt.figure(figsize=(15,8))
sns.countplot(x="country", hue = filter_year_country1.index , data= filter_year_country1)
plt.ylim(0,400)
plt.title("Top 5 countries - 2010 to 2019 distribution", size=18)
plt.show()

We can observe that the United States has the most releases. The releases increased from 2010 to 2018 and decreased in 2019.
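
To read exact numbers rather than bar heights, the counts per country and year can be laid out as a table. A sketch on a made-up sample in which release_year is a regular column:

```python
import pandas as pd

# Made-up sample standing in for the filtered top-5-countries data
sample_df = pd.DataFrame({
    "country": ["United States", "United States", "India", "United States", "India"],
    "release_year": [2017, 2018, 2018, 2018, 2019],
})

# Count titles per (country, year) pair, then pivot the years into columns
year_table = (
    sample_df.groupby(["country", "release_year"])
    .size()
    .unstack(fill_value=0)
)
print(year_table)
```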

Top 5 countries and their rating distribution :

Make a copy of the dataframe having data of only the top 5 countries, then set the index as the country column.

country_rating_df = filter_country_year.copy()
country_rating_1 = country_rating_df.set_index("country")

plt.figure(figsize=(15,8))
sns.countplot(x="country", hue = "rating" , data= country_rating_df)
plt.legend(loc="upper right")
plt.title("Top 5 countries Vs Rating distribution", size=18)
plt.show()

As the United States has the most releases, we can see a wide distribution of ratings across its movies and TV shows. Most of the releases are rated TV-MA.

Top 5 countries and their genre(top 5) distribution :

Make a copy of the dataframe holding only the top 5 countries. As the genre column consists of multiple values separated by commas, we split it and expand the data frame.

filter_country_listed_in = filter_country_year.copy()
df7 = filter_country_listed_in.drop('listed_in', axis=1).join(filter_country_listed_in.listed_in.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('listed_in'))
df7

Filtering out the blank data present in the column.

filter7 = df7["listed_in"] != ""
dfNew7 = df7[filter7]
dfNew7

Set the index based on the listed_in column and apply the filter of the top 5 genres.

dfNew7 = dfNew7.set_index("listed_in")
filter_country_listed_in = dfNew7.loc[["International Movies", " Dramas", " Comedies", " International TV Shows", " Documentaries"]]
filter_country_listed_in

plt.figure(figsize=(15,8))
sns.countplot(x="country", hue = filter_country_listed_in.index , data= filter_country_listed_in)
plt.legend(loc="upper right")
plt.title("Top 5 countries Vs Genre distribution", size=18)
plt.show()

The United States has the most releases in the drama and comedy genres, whereas Indian releases comprise only drama and comedy from the top 5 genres.

Top 10 release year Vs Top 5 genres :

Make a copy of the dataframe filtered based on the release_year column and reset the index to the default one.

year_listed_in = filter_year_type.copy()
year_listed_in.reset_index(inplace = True) 
year_listed_in

Splitting the rows based on comma values in listed_in and expanding the dataframe.

filter_year_listed_in = year_listed_in.copy()
df10 = year_listed_in.drop('listed_in', axis=1).join(year_listed_in.listed_in.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('listed_in'))
df10

Filter out the blank rows present in the listed_in column and set the index to "listed_in".

filter10 = df10["listed_in"] != ""
dfNew10 = df10[filter10]
dfNew10

dfNew10 = dfNew10.set_index("listed_in")
dfNew10

Filtering the dataframe based on the top 5 genres.

filter_year_listed_in = dfNew10.loc[["International Movies", " Dramas", " Comedies", " International TV Shows", " Documentaries"]]
filter_year_listed_in

plt.figure(figsize=(15,8))
sns.countplot(x="release_year", hue = filter_year_listed_in.index , data= filter_year_listed_in)
plt.legend(loc="upper left")
plt.title("Top 10 release year Vs Top 5 genres", size=18)
plt.show()

It is evident that the number of releases in the top 5 genres increased over time, except in 2019, when a drop in the number of releases was observed.

Hope you found it helpful and enjoyed the binge watch. Thank You! 😊