Box Office Data Visualization Report: Box Office Revenue Analysis and Visualization

Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset available on Kaggle.

I’ll start by importing the libraries we need for this task.

import pandas as pd

# for visualizations

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

plt.style.use('dark_background')

Data Loading and Exploration

Once you download the data from Kaggle, you will have three files. As this is a prediction competition, you get train.csv, test.csv, and sample_submission.csv. For this project, my goal is only to analyze and visualize the data, so I am going to ignore the test.csv and sample_submission.csv files.

Let’s load train.csv into a data frame using pandas.

%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')

# Output

CPU times: user 258 ms, sys: 132 ms, total: 389 ms

Wall time: 403 ms

About the dataset:

id: Integer unique id of each movie

belongs_to_collection: Contains the TMDB Id, Name, Movie Poster, and Backdrop URL of a movie in JSON format.

budg…

Let’s have a look at the sample data.

train.head()

As we can see, some features contain dictionaries, so I am dropping all such columns for now.

train = train.drop(['belongs_to_collection', 'genres', 'crew', 'cast', 'Keywords', 'spoken_languages', 'production_companies', 'production_countries', 'tagline', 'overview', 'homepage'], axis=1)
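Instead of listing the dictionary-valued columns by hand, they could also be detected programmatically. The sketch below runs on a tiny made-up frame, not on train.csv. The helper name `find_dict_like_columns` is hypothetical, and it assumes these columns store their dictionaries as strings beginning with `{` or `[{`.

```python
import pandas as pd


def find_dict_like_columns(df: pd.DataFrame) -> list:
    """Return columns whose first non-null value is a string that looks
    like a serialized dict or list of dicts (an assumed storage format)."""
    found = []
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            continue
        first = non_null.iloc[0]
        if isinstance(first, str) and first.lstrip().startswith(("{", "[{")):
            found.append(col)
    return found


# Made-up sample that mimics one dictionary-valued column
sample = pd.DataFrame({
    "title": ["A", "B"],
    "genres": ["[{'id': 35, 'name': 'Comedy'}]", None],
    "budget": [100, 200],
})

dict_cols = find_dict_like_columns(sample)
cleaned = sample.drop(columns=dict_cols)
```

Note that free-text columns such as tagline, overview, and homepage would not be caught by this check, so they would still need to be dropped by name.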

Now it's time to have a look at the statistics of the data.

print("Shape of data is ")

train.shape

# Output

Shape of data is

(3000, 12)

Dataframe information:

train.info()

# Output

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 12 columns):

 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 3000 non-null   int64
 1   budget             3000 non-null   int64
 2   imdb_id            3000 non-null   object
 3   original_language  3000 non-null   object
 4   original_title     3000 non-null   object
 5   popularity         3000 non-null   float64
 6   poster_path        2999 non-null   object
 7   release_date       3000 non-null   object
 8   runtime            2998 non-null   float64
 9   status             3000 non-null   object
 10  title              3000 non-null   object
 11  revenue            3000 non-null   int64

dtypes: float64(2), int64(3), object(7)

memory usage: 281.4+ KB
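The info output shows that poster_path has 2999 non-null values and runtime has 2998, so one and two values are missing respectively. A quick way to get that summary directly is to count nulls per column. This is a minimal sketch on made-up data; the `sample` frame is an assumption for illustration only.

```python
import pandas as pd

# Made-up sample with a few missing values
sample = pd.DataFrame({
    "runtime": [93.0, None, 105.0, None],
    "poster_path": ["/a.jpg", "/b.jpg", None, "/d.jpg"],
    "revenue": [10, 20, 30, 40],
})

# Count missing values per column, keep only columns that have any,
# and list the worst offenders first
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
```

On the real data, the same two lines applied to `train` should list only runtime and poster_path.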

Describe the dataframe:

train.describe()

Let’s create new columns for the release weekday, day, month, and year.

train['release_date'] = pd.to_datetime(train['release_date'], infer_datetime_format=True)

train['release_day'] = train['release_date'].apply(lambda t: t.day)

# ...

# two-digit years can parse into the future; shift those back a century
train['release_year'] = train['release_date'].apply(lambda t: t.year if t.year < 2018 else t.year - 100)
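The same date features can be derived without row-by-row lambdas by using the `.dt` accessor. The sketch below is an illustration on made-up dates. It assumes the dates are stored as month/day/two-digit-year strings, and the column names `release_weekday` and `release_month` are my own choice rather than names taken from the code above.

```python
import pandas as pd

# Made-up sample in an assumed month/day/two-digit-year format
sample = pd.DataFrame({"release_date": ["2/20/15", "8/6/04", "3/9/65"]})

dates = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# %y maps 00-68 to 2000-2068, so "65" parses as 2065.
# Any year at or past the cutoff is shifted back by a century.
CUTOFF_YEAR = 2018
years = dates.dt.year.where(dates.dt.year < CUTOFF_YEAR, dates.dt.year - 100)

sample["release_day"] = dates.dt.day
sample["release_weekday"] = dates.dt.weekday  # Monday=0 ... Sunday=6
sample["release_month"] = dates.dt.month
sample["release_year"] = years
```

The cutoff of 2018 mirrors the condition in the lambda above. It only works because no film in the data is old enough to be confused with a recent release.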

Data Analysis and Visualization


Question 1: Which movie made the highest revenue?

train[train['revenue'] == train['revenue'].max()]

train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')# Please

The Avengers made the highest revenue.
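The two lookups used for this question behave differently when there are ties, which is worth keeping in mind. Here is a small sketch on made-up numbers; the `sample` frame and its values are hypothetical.

```python
import pandas as pd

# Made-up sample with a tie for the highest revenue
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [50, 200, 120, 10],
    "revenue": [300, 900, 900, 40],
})

# idxmax returns the label of the first maximum only
top_one = sample.loc[sample["revenue"].idxmax(), "title"]

# A boolean mask against max() keeps every tied row
top_all = sample[sample["revenue"] == sample["revenue"].max()]

# nlargest gives the top-n rows, like sort_values(ascending=False).head(n)
top_three = sample.nlargest(3, "revenue")[["title", "revenue"]]
```

The mask form is the one used above with `train`, so it would return more than one row if several films shared the top revenue.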

Question 2: Which movie has the highest budget?

train[train['budget'] == train['budget'].max()]

train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu

Pirates of the Caribbean: On Stranger Tides is the most expensive movie.

Question 3: Which movie is the longest?

train[train['runtime'] == train['runtime'].max()]

plt.hist(train['runtime'].fillna(0) / 60, bins=40);

plt.title('Distribution of length of film in hours', fontsize=16, color='white');

plt.xlabel('Duration of Movie in Hours')

plt.ylabel('Number of Movies')


train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','reve

Carlos is the longest movie, with a runtime of 338 minutes (5 hours and 38 minutes).

Question 4: In which year were the most movies released?

plt.figure(figsize=(20,12))

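# assumed plotting call: only the edgecolor argument survives in the source
sns.countplot(x=train['release_year'].sort_values(), edgecolor=(0,0,0))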

plt.title("Movie Release count by Year",fontsize=20)

plt.xlabel('Release Year')

plt.ylabel('Number of Movies Release')

plt.show()


train['release_year'].value_counts().head()

# Output

2013 141

2015 128

2010 126

2016 125

2012 125

Name: release_year, dtype: int64

In 2013, a total of 141 movies were released, the most of any year in the dataset.
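`value_counts()` sorts by count in descending order, which is why the busiest year appears first. The sketch below shows how to pull out that year and its count directly, and how to reorder the counts chronologically for plotting. The `years` series is made up for illustration.

```python
import pandas as pd

# Made-up release years
years = pd.Series([2013, 2010, 2013, 2015, 2013, 2010], name="release_year")

counts = years.value_counts()   # sorted by count, largest first

busiest_year = counts.idxmax()  # the year with the most releases
busiest_count = counts.max()    # how many releases that year had

by_year = counts.sort_index()   # chronological order, handy for a bar chart
```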

Question 5: Movies with the highest and lowest popularity.
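One way to approach this question is to sort once by popularity and read off both ends of the ranking. This is only a sketch on made-up values; the titles and popularity scores below are hypothetical.

```python
import pandas as pd

# Made-up sample of popularity scores
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "popularity": [6.5, 294.3, 0.002, 12.1],
})

# Sort from most to least popular
ranked = sample.sort_values("popularity", ascending=False)

most_popular = ranked.head(2)   # top of the ranking
least_popular = ranked.tail(2)  # bottom of the ranking
```

On the real data the same pattern would apply to `train[['id', 'title', 'popularity']]`, with `head(10)` and `tail(10)` to match the top-ten tables used for the earlier questions.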
