Box Office Data Analysis and Visualization Report
Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I work on the TMDB Box Office Prediction dataset available on Kaggle.
I’ll start by importing the libraries we need for this task.
import pandas as pd
# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')
Data Loading and Exploration
Once you have downloaded the data from Kaggle, you will have three files. As this is a prediction competition, you get train, test, and sample_submission files. For this project, my goal is only to perform data analysis and visualization, so I am going to ignore the test.csv and sample_submission.csv files.
Let’s load train.csv into a data frame using pandas.
%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')
# Output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms
About the dataset:
id: Integer unique id of each movie.
belongs_to_collection: Contains the TMDB Id, Name, Movie Poster, and Backdrop URL of a movie in JSON format.
budg...
Let’s have a look at the sample data.
train.head()
As we can see, some features contain dictionaries, so I am dropping all such columns for now.
train = train.drop(['belongs_to_collection', 'genres', 'crew',
                    'cast', 'Keywords', 'spoken_languages', 'production_companies',
                    'production_countries', 'tagline', 'overview', 'homepage'], axis=1)
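Instead of listing the dictionary-style columns by hand, they could be detected automatically. Here is a minimal sketch on a made-up frame. It assumes such columns hold strings that start with "[{", which is my guess at the format and not something confirmed above; the helper name looks_like_dict_list is hypothetical. Plain-text columns such as tagline or overview would not be caught by this check and would still need to be dropped by name.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["[{'id': 18, 'name': 'Drama'}]", "[{'id': 35, 'name': 'Comedy'}]"],
    "budget": [100, 250],
})

def looks_like_dict_list(column):
    """Return True if every non-null value is a string starting with '[{'."""
    values = column.dropna()
    return len(values) > 0 and all(
        isinstance(v, str) and v.startswith("[{") for v in values
    )

# Collect the dictionary-style columns, then drop them in one call.
dict_cols = [name for name in toy.columns if looks_like_dict_list(toy[name])]
flat = toy.drop(columns=dict_cols)
print(dict_cols)
print(flat.columns.tolist())
```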
Now it is time to look at the statistics of the data.
print("Shape of data is ")
train.shape
# Output
Shape of data is
(3000, 12)
Dataframe information.
train.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 3000 non-null   int64
 1   budget             3000 non-null   int64
 2   imdb_id            3000 non-null   object
 3   original_language  3000 non-null   object
 4   original_title     3000 non-null   object
 5   popularity         3000 non-null   float64
 6   poster_path        2999 non-null   object
 7   release_date       3000 non-null   object
 8   runtime            2998 non-null   float64
 9   status             3000 non-null   object
 10  title              3000 non-null   object
 11  revenue            3000 non-null   int64
dtypes: float64(2), int64(3), object(7)
memory usage: 281.4+ KB
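The info() output shows 2998 non-null values for runtime and 2999 for poster_path, so a few entries are missing. Here is a minimal sketch of counting missing values per column. The frame below is a made-up stand-in, not rows from train.csv.

```python
import pandas as pd

# Toy frame standing in for train; the values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [90.0, None, 120.0, None],
    "poster_path": ["/a.jpg", "/b.jpg", None, "/d.jpg"],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

The same two lines should work on train directly, which is a quick check before choosing a fill value for a column like runtime.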
Describe the dataframe.
train.describe()
Let’s create new columns for the release weekday, day of the month, month, and year.
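train['release_date'] = pd.to_datetime(train['release_date'], infer_datetime_format=True)
train['release_day'] = train['release_date'].apply(lambda t: t.day)
train['release_weekday'] = train['release_date'].apply(lambda t: t.weekday())
train['release_month'] = train['release_date'].apply(lambda t: t.month)
# Two-digit years can land in the wrong century, so shift any year after 2017 back by 100
train['release_year'] = train['release_date'].apply(lambda t: t.year if t.year < 2018 else t.year - 100)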
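The same date features can also be built with the vectorized .dt accessor instead of apply. This is only a sketch on made-up dates. The 2018 cutoff mirrors the code above, and the column name release_weekday is my assumption.

```python
import pandas as pd

# Toy release dates; illustrative values, not taken from train.csv.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(["2015-02-20", "2065-07-04", "1999-12-31"])
})

dates = toy["release_date"]
toy["release_day"] = dates.dt.day
toy["release_weekday"] = dates.dt.weekday  # Monday=0 ... Sunday=6
toy["release_month"] = dates.dt.month
# Keep years before 2018 as they are; shift later years back by a century.
toy["release_year"] = dates.dt.year.where(dates.dt.year < 2018, dates.dt.year - 100)
print(toy)
```

Here the 2065 date ends up with release_year 1965, while the other two years stay unchanged.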
Data Analysis and Visualization
Question 1: Which movie made the highest revenue?
train[train['revenue'] == train['revenue'].max()]
train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')
The Avengers made the highest revenue.
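For "which row has the maximum" questions like this one, idxmax and nlargest are a compact alternative to filtering and sorting. Here is a minimal sketch on made-up numbers; the titles and values are placeholders.

```python
import pandas as pd

# Toy data; titles and numbers are made up for illustration.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100, 250, 50],
    "revenue": [300, 900, 120],
})

# Row with the single highest revenue.
top_row = toy.loc[toy["revenue"].idxmax()]

# Top two rows by revenue, highest first.
top2 = toy.nlargest(2, "revenue")
print(top_row["title"])
print(top2[["title", "revenue"]])
```

Note that idxmax returns only the first match, so the boolean filter used above is the safer choice when ties matter.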
Question 2: Which movie has the highest budget?
train[train['budget'] == train['budget'].max()]
train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu
Pirates of the Caribbean: On Stranger Tides is the most expensive movie.
Question 3: Which movie is the longest?
train[train['runtime'] == train['runtime'].max()]
plt.hist(train['runtime'].fillna(0) / 60, bins=40);
plt.title('Distribution of length of film in hours', fontsize=16, color='white');
plt.xlabel('Duration of Movie in Hours')
plt.ylabel('Number of Movies')
train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','reve
Carlos is the longest movie, with a runtime of 338 minutes (5 hours and 38 minutes).
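The conversion from minutes to hours and minutes is a single divmod call. A small sketch, with a helper name of my own choosing:

```python
def minutes_to_hours_minutes(total_minutes):
    """Split a runtime in minutes into whole hours and leftover minutes."""
    hours, minutes = divmod(int(total_minutes), 60)
    return hours, minutes

hours, minutes = minutes_to_hours_minutes(338)
print(f"{hours} hours and {minutes} minutes")  # prints "5 hours and 38 minutes"
```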
Question 4: In which year were most movies released?
plt.figure(figsize=(20,12))
sns.countplot(x='release_year', data=train, edgecolor=(0, 0, 0))
plt.title("Movie Release count by Year",fontsize=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Movies Release')
plt.show()
train['release_year'].value_counts().head()
# Output
2013 141
2015 128
2010 126
2016 125
2012 125
Name: release_year, dtype: int64
In 2013, a total of 141 movies were released, the most of any year.
Question 5: Movies with the highest and lowest popularity
Most popular movie:
train[train['popularity']==train['popularity'].max()][['original_title','popularity','release_date','revenue']]
Least popular movie:
train[train['popularity']==train['popularity'].min()][['original_title','popularity','release_date','revenue']]
Let’s create a popularity distribution plot.
plt.figure(figsize=(20,12))
sns.histplot(train['popularity'], kde=False)
plt.title("Movie Popularity Count",fontsize=20)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.show()
Wonder Woman has the highest popularity at 294.33, whereas Big Time has the lowest, which is 0.
Question 6: In which month were most movies released from 1921 to 2017?
plt.figure(figsize=(20,12))
sns.countplot(x='release_month', data=train, edgecolor=(0, 0, 0))
plt.title("Movie Release count by Month",fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Release')
plt.show()
train['release_month'].value_counts()
# Output
9 362
10 307
12 263
8 256
4 245
3 238
6 237
2 226
5 224
11 221
1 212
7 209
Name: release_month, dtype: int64
Most movies were released in September, with 362 releases.
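Month numbers are easier to read as names. Here is a minimal sketch using the standard-library calendar module, with the top three counts copied from the output above. The variable names are my own.

```python
import calendar

# Top three month counts copied from the value_counts() output.
counts = {9: 362, 10: 307, 12: 263}

# Map month numbers to English month names (under the default locale).
named = {calendar.month_name[month]: n for month, n in counts.items()}
busiest = max(named, key=named.get)
print(named)
print(busiest)
```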
Question 7: On which day of the month are most movies released?
plt.figure(figsize=(20,12))
sns.countplot(x='release_day', data=train, edgecolor=(0, 0, 0))
plt.title("Movie Release count by Day of Month",fontsize=20)
plt.xlabel('Release Day')
plt.ylabel('Number of Movies Release')
plt.show()
train['release_day'].value_counts().head()
# Output
1 152
15 126
12 122
7 110
6 107
Name: release_day, dtype: int64
The highest number of movies, 152, were released on the first day of the month.
Question 8: On which day of the week are most movies released?
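For the day-of-week question, one possible approach is to turn the release dates into weekday names and count them. This is only a sketch on three made-up dates, not the result for train.csv.

```python
import pandas as pd

# Toy release dates; illustrative values only.
toy_dates = pd.Series(pd.to_datetime(["2015-02-20", "1999-12-31", "2000-01-01"]))

# Convert each date to its weekday name, then count occurrences.
weekday_counts = toy_dates.dt.day_name().value_counts()
print(weekday_counts)
```

In this toy sample two dates fall on a Friday and one on a Saturday, so Friday comes out on top.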