Netflix Data Analysis

After binge-watching a lot of series and movies on Netflix, I finally decided to work on its dataset.

The dataset is available at bit.ly/3cpx80r

It includes columns such as type, title, director, cast, country, release_year, rating, duration, listed_in, and description.

In this post, I visualize the count of Movies and TV Shows in the type column, the top 10 countries where Netflix content is released, the top 10 cast members featured in Netflix shows, the years with the most releases, the rating counts, the genres in which Netflix releases the most content, the most common words used in the descriptions, and more.

Libraries :

  • Pandas reads data in different formats (here, a CSV file) into a data frame.

  • Seaborn, Matplotlib, and wordcloud are used for visualization.

  • re (regular expressions) is used for finding patterns.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from wordcloud import WordCloud, STOPWORDS

Reading and displaying the data :

.head(10) displays the first 10 rows of the data frame.

Netflix_df = pd.read_csv("netflix_titles.csv")
Netflix_df.head(10)

1.PNG

Note :

To check whether the dataset has missing values and to see the data type of each column, we use -

Netflix_df.info()

2.PNG
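.info() reports non-null counts, so the number of missing values has to be read off by subtraction. A minimal sketch of getting the missing count per column directly is shown below. The toy frame and its values are made up for illustration and only stand in for Netflix_df.

```python
import pandas as pd

# Toy frame standing in for netflix_titles.csv (values are invented).
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, None],
    "country": ["India", "United States", None],
})

# Count the missing values in each column, largest first.
missing_per_column = toy_df.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```

On the real data, the same call would be Netflix_df.isna().sum().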

Why is Data Type displayed as Object instead of String?

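In the pandas version used here, text columns are stored as Python objects, so .info() reports them as object. A NumPy string type has a fixed width for every item, whereas object allows strings of any length.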
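The difference between a fixed-width string type and object can be seen in NumPy itself. The sketch below is only an illustration with made-up values: a fixed-width array truncates a longer string on assignment, while an object array keeps it whole.

```python
import numpy as np

# Fixed-width Unicode array: every item gets room for 2 characters.
fixed = np.array(["ab", "cd"])

# Object array: each item is a reference to an arbitrary Python string.
obj = np.array(["ab", "cd"], dtype=object)

# Assign a longer string to the first item of each array.
fixed[0] = "abcdef"   # does not fit, so only the first 2 characters are kept
obj[0] = "abcdef"     # stored in full

print(fixed.dtype, fixed[0])
print(obj.dtype, obj[0])
```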

To check the names of the columns in the dataset :

Netflix_df.columns

3.PNG

Count of Movie and TV Shows :

  1. Applying .unique() on a column of the data frame helps to find the unique values present in the column.
  2. Applying .value_counts() on a column of the data frame helps to find the count associated with unique values present in the column.
Netflix_df["type"].unique()
Netflix_df["type"].value_counts()

4.PNG

#Count Plot
sns.countplot(x="type", data= Netflix_df)
plt.show()
#Pie Chart
count_values = Netflix_df["type"].value_counts()
plt.pie(count_values.values, labels = count_values.index, autopct='%1.1f%%')
plt.show()

5.PNG

There are 4265 movies, which make up around 68.4% of the dataset, and 1969 TV shows.
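The percentages shown in the pie chart can also be computed as numbers. A small sketch, assuming a toy Series in place of Netflix_df["type"] (the values are invented):

```python
import pandas as pd

# Toy stand-in for Netflix_df["type"].
toy_type = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of counts; convert them to percent.
share = toy_type.value_counts(normalize=True).mul(100).round(1)
print(share)
```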

Top 10 countries where Netflix content is released :

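  • Each row of the country column can hold several values separated by commas, so we need to split those values.
# Append a comma to every entry so the joined text can be split on commas.
# Note: this overwrites the country column, and missing values become the text "nan,".
Netflix_df["country"] = [f"{code}," for code in Netflix_df["country"]]
# Entries are joined with a space, so the split values can carry a leading space (e.g. " India").
text1_country = " ".join(Netflix_df["country"])
text2_country = text1_country.split(",")
text2_country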

6.PNG

  • Now we count how many times each country appears, using a dictionary.
dict2 = {}
for values in text2_country:
    if values in dict2:
        dict2[values] += 1
    else:
        dict2[values] = 1
dict2

7.PNG

  • Convert the dictionary values to a data frame.
df_country_final = pd.DataFrame(dict2.items(), columns=["Country", "Number"])
df_country_final

8.PNG

  • Sort by the count in descending order to find the top 10 countries.
top_10_country = df_country_final.sort_values(by="Number", ascending=False).head(10)
top_10_country

9.PNG

  • Bar Plot -
plt.figure(figsize=(10,6))
# make bar plot with seaborn (x and y are passed as keywords, which newer seaborn versions require)
sns.barplot(x='Country', y='Number', data=top_10_country, palette="summer")
plt.xlabel("Country", size=15)
plt.xticks(rotation=90)
plt.ylabel("Number of shows", size=15)
plt.title("Top 10 Country where Netflix content is released", size=18)
plt.show()

10.PNG

The United States has the most releases, more than 2500, followed by India and then the United Kingdom.
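The split, count, and sort steps for the country column can also be written as one pandas chain. This is an alternative sketch, not the method used in this post: it drops missing values instead of counting them as "nan", and it strips the leading spaces, so labels such as " India" used later in the post would need adjusting. The toy Series is invented for illustration.

```python
import pandas as pd

# Toy stand-in for the original (unmodified) country column.
toy_country = pd.Series(
    ["United States, India", "India", None, "United Kingdom, United States"],
    name="country",
)

country_counts = (
    toy_country.dropna()      # skip rows without a country
    .str.split(",")           # one list of countries per row
    .explode()                # one row per country
    .str.strip()              # remove the leading space left by ", "
    .value_counts()           # count and sort in descending order
)
print(country_counts.head(10))
```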

Top 10 casts featured in Netflix shows :

The cast column is split in the same way as the country column, then counted with a dictionary and converted to a data frame (top_10_cast).
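The code for these steps is not shown in the post. A minimal sketch of how they might look is given below, using collections.Counter in place of the hand-written dictionary loop. The toy Series and the name top_cast_sketch are hypothetical and only for illustration.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for Netflix_df["cast"] (values are invented).
toy_cast = pd.Series(
    ["Anupam Kher, Shah Rukh Khan", "Anupam Kher", None],
    name="cast",
)

cast_counter = Counter()
for cell in toy_cast.dropna():
    # Split each row on commas and strip the surrounding spaces.
    cast_counter.update(name.strip() for name in cell.split(","))

# most_common(10) returns (name, count) pairs sorted by count, descending.
top_cast_sketch = pd.DataFrame(cast_counter.most_common(10), columns=["Cast", "Number"])
print(top_cast_sketch)
```

Because missing values are dropped here, there is no "nan" row to skip before plotting.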

plt.figure(figsize=(10,6))
# make bar plot with seaborn; [1:] drops the first row of top_10_cast before plotting
sns.barplot(x='Cast', y='Number', data=top_10_cast[1:], palette="summer")
plt.xlabel("Cast", size=15)
plt.xticks(rotation=90)
plt.ylabel("Number of shows in which the cast featured", size=15)
plt.title("Top 10 casts featured in Netflix shows", size=18)
plt.show()

11.PNG

The most featured cast members in Netflix content are mainly Indian celebrities. Anupam Kher tops the list, followed by Shah Rukh Khan.

Top 10 years with most releases :

year_count = Netflix_df["release_year"].value_counts()
year_count = year_count[:10]
plt.figure(figsize=(10,5))
sns.barplot(x=year_count.index, y=year_count.values, alpha=0.8)
plt.title('Top 10 years with most releases')
plt.ylabel('Number of releases', fontsize=12)
plt.xlabel('year', fontsize=12)
plt.xticks(rotation=90)
plt.show()

12.PNG

The highest number of releases was in 2018. Releases increased from 2010 to 2018 and then dropped somewhat in 2019.

The Rating distribution :

sns.countplot(x="rating", data=Netflix_df)
plt.xticks(rotation=90)
plt.show()

13.PNG

Most of the content is rated TV-MA (mature, adult audiences; unsuitable for children under 17), followed by TV-14 (unsuitable for children under 14).

G stands for general audiences, PG for parental guidance suggested, TV-G for content suitable for all ages, PG-13 for parents strongly cautioned, and so on.

The durations having maximum count :

  • The top 10 durations comprising major Netflix content.
duration_count = Netflix_df["duration"].value_counts()
duration_count = duration_count[:10]
duration_count

14.PNG

sns.barplot(x=duration_count.index, y=duration_count.values)
plt.xticks(rotation=90)
plt.show()

15.PNG

About 1300 TV Shows have just 1 season, followed by shows with 2 and 3 seasons. Among movies, 111 have a duration of 90 min.
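The duration column mixes minutes (movies) and seasons (TV shows) in one text field. A regular expression can separate the number from the unit, which makes numeric comparisons possible. The sketch below assumes every value looks like "90 min" or "2 Seasons"; the toy Series is invented for illustration.

```python
import pandas as pd

# Toy stand-in for Netflix_df["duration"].
toy_duration = pd.Series(["90 min", "1 Season", "2 Seasons", "125 min"], name="duration")

# Named groups become the column names of the resulting data frame.
parts = toy_duration.str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")
parts["value"] = parts["value"].astype(int)

# Keep only the movie durations, now as integers in minutes.
movie_minutes = parts.loc[parts["unit"] == "min", "value"]
print(movie_minutes.describe())
```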

Top 10 genres on Netflix :

The splitting is similar to the country column followed by conversion to a dictionary and then a dataframe.

plt.figure(figsize=(10,6))
# make bar plot with seaborn (x and y are passed as keywords, which newer seaborn versions require)
sns.barplot(x='Genre', y='Number', data=top_10_genre, palette="summer")
plt.xlabel("Genre", size=15)
plt.xticks(rotation=90)
plt.ylabel("Number of shows in each genre", size=15)
plt.title("Top 10 Genre of Netflix", size=18)
plt.show()

16.PNG

International Movies is the most released genre on Netflix, with 1927 titles in the dataset, followed by dramas and comedies.

Word Frequency - Description :

The word cloud shows which words appear most often in the description column. The larger the word, the higher its frequency. Here the most common words are life, find, family, love, and friend.

# Join all descriptions into a single string
word = " ".join(review for review in Netflix_df["description"])

stopwords = set(STOPWORDS)

# Add further common words to ignore (e.g. the, yes, no, bye, or, is)
stopwords.update(["the","is","yea","ok","okay","or","bye","no","will","yeah","I","almost","if","me","you","done","want","Ya"])

# Create the word cloud
wordcloud = WordCloud(width=500, height=500, stopwords=stopwords, background_color="black", min_font_size=10).generate(word)

plt.figure( figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

18.PNG
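To check the words in the cloud against actual numbers, the frequencies can be counted directly. This is a rough sketch with invented descriptions and a tiny hypothetical stop-word set; WordCloud applies its own text processing, so its sizes may not match these counts exactly.

```python
import re
from collections import Counter

# Toy stand-in for Netflix_df["description"] (texts are invented).
toy_descriptions = [
    "A family finds love.",
    "Love and life in the city.",
    "Family life.",
]

# Hypothetical stop-word set, much smaller than wordcloud's STOPWORDS.
stop = {"a", "and", "in", "the"}

# Lowercase everything and pull out runs of letters and apostrophes.
words = re.findall(r"[a-z']+", " ".join(toy_descriptions).lower())
freq = Counter(w for w in words if w not in stop)
print(freq.most_common(5))
```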

Type distribution based on the top 10 years :

Important points -

  • Create a copy of the existing data frame.

  • In order to apply a year-wise filter, we need to first set the index as release_year and then filter out the data accordingly.

Netflix_df_2 = Netflix_df.copy()
Netflix_df_type2 = Netflix_df_2.set_index("release_year")
filter_year_type = Netflix_df_type2.loc[[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]]
filter_year_type

19.PNG

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_year_type.index , data= filter_year_type)
plt.title("Type distribution - 2010 to 2019", size=18)
plt.show()

20.PNG

The number of TV Shows released increased year by year, whereas the number of movies released decreased in 2018 and 2019.
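The year filter can also be written without changing the index, by using a boolean mask. This is an alternative sketch with an invented toy frame, not the approach used in this post. One practical difference is that a mask simply returns no rows for a year that is absent from the data, whereas selecting a list of labels with .loc can raise a KeyError in recent pandas versions if one of them is missing.

```python
import pandas as pd

# Toy stand-in for Netflix_df (values are invented).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2009, 2010, 2015, 2020],
})

# between() is inclusive on both ends by default.
in_range = toy[toy["release_year"].between(2010, 2019)]
print(in_range)
```

Since release_year stays a regular column here, it can be passed to seaborn by name, for example hue="release_year".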

Type distribution based on top 5 genres :

  • The listed_in column holds multiple comma-separated values, so the first step is to split them and expand the data frame to one row per genre.
Netflix_df_1 = Netflix_df.copy()
df = Netflix_df_1.drop('listed_in', axis=1).join(Netflix_df_1.listed_in.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('listed_in'))
df

21.PNG

  • Blank values appear in the column, so we filter them out.
filter_genre = df["listed_in"] != ""  # named filter_genre to avoid shadowing the built-in filter()
dfNew = df[filter_genre]
dfNew

22.PNG

  • In order to filter data based on the top 5 genres, we first need to set the index of the data frame as the listed_in column.
Netflix_df_type3 = dfNew.set_index("listed_in")
Netflix_df_type3

23.PNG

  • Filtering the data frame based on the top 5 genres.
filter_genre_type = Netflix_df_type3.loc[["International Movies", " Dramas", " Comedies", " International TV Shows", " Documentaries"]]
filter_genre_type

24.PNG

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_genre_type.index , data= filter_genre_type)
plt.title("Type distribution - Top 5 genre", size=18)
plt.show()

25.PNG

Among the top 5 genres, International TV Shows is the only one that applies to TV Shows. Movies fall under International Movies, Dramas, Comedies, and Documentaries.
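The genre labels in the filter above differ in whether they start with a space (" Dramas" versus "International Movies"). Splitting on "," keeps the space that follows each comma, so the same genre can end up under two different labels depending on its position in the list. The sketch below illustrates this with invented data and shows that stripping the whitespace merges the labels; whether this affects the counts in this post depends on the actual data.

```python
import pandas as pd

# Toy stand-in for Netflix_df["listed_in"] (values are invented).
toy_genres = pd.Series(["Dramas, International Movies", "Comedies, Dramas"])

# Split on commas and expand to one row per genre, keeping the raw text.
raw = toy_genres.str.split(",").explode()

raw_counts = raw.value_counts()                # "Dramas" and " Dramas" are separate labels
clean_counts = raw.str.strip().value_counts()  # both are merged into "Dramas"

print(raw_counts)
print(clean_counts)
```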

Type distribution based on Rating :

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = "rating", data= Netflix_df)
plt.title("Type distribution based on Rating", size=18)
plt.legend(loc="upper right")
plt.show()

26.PNG

Most Movies on Netflix are for mature, adult audiences and unsuitable for children under 17. Most TV Shows are rated TV-MA or TV-14.

Type distribution based on duration :

  • Make a copy of the data frame, then set the duration column as the index so that we can filter by duration.
Netflix_df_4 = Netflix_df.copy()
Netflix_df_type4 = Netflix_df_4.set_index("duration")
Netflix_df_type4

27.PNG

  • We will make separate plots for Movies and TV Shows based on duration.

Movies -

Movies = Netflix_df_type4[Netflix_df_type4["type"] == "Movie"]
Movies

28.PNG

  • After filtering the type as Movies, we will filter based on duration.
filter_duration_Movie = Movies.loc[['90 min', '91 min', '92 min', '94 min', '95 min']]
filter_duration_Movie

29.PNG

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_duration_Movie.index, data= filter_duration_Movie)
plt.title("Movie distribution based on durations", size=18)
plt.legend(loc="upper right")
plt.show()

30.PNG

TV Shows -

The filtering process is similar to the movie type.

TV_Show = Netflix_df_type4[Netflix_df_type4["type"] == "TV Show"]
TV_Show
filter_duration_TV = TV_Show.loc[["1 Season", "2 Seasons", "3 Seasons"]]
filter_duration_TV
plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_duration_TV.index, data= filter_duration_TV)
plt.title("TV distribution based on Seasons released", size=18)
plt.legend(loc="upper right")
plt.show()

31.PNG

Type distribution based on top 5 countries :

  • The country column consists of values separated by commas hence we need to split the values and expand the dataframe.
Netflix_df_8 = Netflix_df.copy()
df8 = Netflix_df_8.drop('country', axis=1).join(Netflix_df_8.country.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('country'))
df8

32.PNG

  • We can observe that there are blank values in the country column, therefore we will filter them out.
filter1 = df8["country"] != ""
dfNew1 = df8[filter1]
dfNew1

33.PNG

  • Filtering the data frame based on the top 5 countries.
filter_country_type = dfNew1.set_index("country")
filter_country_1 = filter_country_type.loc[["United States", " India", " United Kingdom", " Canada", " France"]]
filter_country_1

34.PNG

plt.figure(figsize=(15,8))
sns.countplot(x="type", hue = filter_country_1.index , data= filter_country_1)
plt.title("Type distribution - Top 5 Countries", size=18)
plt.show()

35.PNG

Top 5 countries and their corresponding releases from the year 2010 till 2019 :

  • Make a copy of the dataframe where we had filtered the data frame based on the top 5 countries.
filter_country_year = filter_country_1.copy()
filter_country_year
  • Reset the index back to the default, then set release_year as the index.
filter_country_year.reset_index(inplace = True) 
filter_values_year = filter_country_year.set_index("release_year")
filter_values_year
  • Now filter the dataframe based on release_year from the year 2010 till the year 2019.
filter_year_country1 = filter_values_year.loc[[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]]
filter_year_country1

36.PNG

plt.figure(figsize=(15,8))
sns.countplot(x="country", hue = filter_year_country1.index , data= filter_year_country1)
plt.ylim(0,400)
plt.title("Top 5 countries - 2010 to 2019 distribution", size=18)
plt.show()

37.PNG

The United States has the most releases. Releases increased from 2010 to 2018 and decreased in 2019.
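The numbers behind a grouped count plot like this one can be tabulated with pd.crosstab, which makes it easier to read exact values per country and year. A minimal sketch with an invented toy frame standing in for the filtered data:

```python
import pandas as pd

# Toy stand-in for the filtered country/year data (values are invented).
toy = pd.DataFrame({
    "country": ["United States", "United States", "India", "India", "India"],
    "release_year": [2018, 2019, 2018, 2018, 2019],
})

# Rows are countries, columns are years, cells are the number of titles.
table = pd.crosstab(toy["country"], toy["release_year"])
print(table)
```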

Top 5 countries and their rating distribution :

  • Make a copy of the dataframe having data of only the top 5 countries, then set the index as the country column.
country_rating_df = filter_country_year.copy()
country_rating_1 = country_rating_df.set_index("country")
plt.figure(figsize=(15,8))
sns.countplot(x="country", hue = "rating" , data= country_rating_df)
plt.legend(loc="upper right")
plt.title("Top 5 countries Vs Rating distribution", size=18)
plt.show()

38.PNG

Because the United States has the most releases, it shows the widest spread of ratings across its movies and TV shows. Most of its releases are rated TV-MA.

Top 5 countries and their genre (top 5) distribution :

  • Make a copy of the data frame that holds only the top 5 countries. Because the listed_in column holds multiple comma-separated values, split it and expand the data frame.
filter_country_listed_in = filter_country_year.copy()
df7 = filter_country_listed_in.drop('listed_in', axis=1).join(filter_country_listed_in.listed_in.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('listed_in'))
df7
  • Filtering out the blank data present in the column.
filter7 = df7["listed_in"] != ""
dfNew7 = df7[filter7]
dfNew7
  • Set the index based on the listed_in column and apply the filter of the top 5 genres.
dfNew7 = dfNew7.set_index("listed_in")
filter_country_listed_in = dfNew7.loc[["International Movies", " Dramas", " Comedies", " International TV Shows", " Documentaries"]]
filter_country_listed_in

39.PNG

plt.figure(figsize=(15,8))
sns.countplot(x="country", hue = filter_country_listed_in.index , data= filter_country_listed_in)
plt.legend(loc="upper right")
plt.title("Top 5 countries Vs Genre distribution", size=18)
plt.show()

40.PNG

The United States has the most releases in the drama and comedy genres, whereas Indian releases within the top 5 genres consist only of dramas and comedies.

Top 10 release year Vs Top 5 genres :

  • Make a copy of the dataframe filtered based on the release_year column and reset the index to the default one.
year_listed_in = filter_year_type.copy()
year_listed_in.reset_index(inplace = True) 
year_listed_in
  • Splitting the rows based on comma values in listed_in and expanding the dataframe.
filter_year_listed_in = year_listed_in.copy()
df10 = year_listed_in.drop('listed_in', axis=1).join(year_listed_in.listed_in.str.split(",", expand=True).stack().reset_index(drop=True, level=1).rename('listed_in'))
df10
  • Filtering the blank rows present in the listed_in column and set the index to "listed_in".
filter10 = df10["listed_in"] != ""
dfNew10 = df10[filter10]
dfNew10
dfNew10 = dfNew10.set_index("listed_in")
dfNew10
  • Filtering the dataframe based on the top 5 genres.
filter_year_listed_in = dfNew10.loc[["International Movies", " Dramas", " Comedies", " International TV Shows", " Documentaries"]]
filter_year_listed_in
plt.figure(figsize=(15,8))
sns.countplot(x="release_year", hue = filter_year_listed_in.index , data= filter_year_listed_in)
plt.legend(loc="upper left")
plt.title("Top 10 release year Vs Top 5 genres", size=18)
plt.show()

41.PNG

Releases in the top 5 genres increased over time, except in 2019, when the number of releases dropped.

Hope you found it helpful and enjoyed the binge watch. Thank You! 😊