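A Few Small Data Analysis Examples (from "Python for Data Analysis")
Small data analysis examples
The examples come from the book "Python for Data Analysis".
The datasets that may be used in this article come from: .
MovieLens 1M dataset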
First, we can use the read_table function (or, as in the code below, the equivalent pd.read_csv) to read the data in as DataFrames:
import pandas as pd
# The "::" separator is longer than one character, so the Python parsing engine is requested explicitly
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/users.dat", sep="::", header=None, names=unames, engine="python")
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/ratings.dat", sep="::", header=None, names=rnames, engine="python")
mnames = ["movie_id", "title", "genres"]
movies = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/movies.dat", sep="::", header=None, names=mnames, engine="python")
print(users.head())
# user_id gender age occupation zip
# 0 1 F 1 10 48067
# 1 2 M 56 16 70072
# 2 3 M 25 15 55117
# 3 4 M 45 7 02460
# 4 5 M 25 20 55455
print(ratings.head())
# user_id movie_id rating timestamp
# 0 1 1193 5 978300760
# 1 1 661 3 978302109
# 2 1 914 3 978301968
# 3 1 3408 4 978300275
# 4 1 2355 5 978824291
print(movies.head())
# movie_id title genres
# 0 1 Toy Story (1995) Animation|Children's|Comedy
# 1 2 Jumanji (1995) Adventure|Children's|Fantasy
# 2 3 Grumpier Old Men (1995) Comedy|Romance
# 3 4 Waiting to Exhale (1995) Comedy|Drama
# 4 5 Father of the Bride Part II (1995) Comedy
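A separator longer than one character, such as "::", is interpreted by pandas as a regular expression and is handled by the Python parsing engine. The following is a minimal sketch of that reading step, assuming a small in-memory sample in place of the real users.dat file (the three rows are copied from the output above; `sample` and `sample_users` are illustrative names):

```python
import io
import pandas as pd

# Three rows in the same "::"-separated layout as users.dat (copied from the output above).
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]

# engine="python" is requested explicitly because the separator has more than one character.
# dtype={"zip": str} is an assumption made here so that leading zeros survive in this sample.
sample_users = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    header=None,
    names=unames,
    engine="python",
    dtype={"zip": str},
)
print(sample_users)
```

Reading the zip code as a string matters for values such as 02460, which would lose the leading zero if the column were parsed as an integer.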
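To make later processing easier, we merge the three DataFrames into one. Both users and ratings have a user_id column, so we can use it as the merge key; the merged result and movies both have a movie_id column, which serves as the key for the second merge:
data = pd.merge(pd.merge(ratings,users),movies)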
print(data.head())
# user_id movie_id ... title genres
# 0 1 1193 ... One Flew Over the Cuckoo's Nest (1975) Drama
# 1 2 1193 ... One Flew Over the Cuckoo's Nest (1975) Drama
# 2 12 1193 ... One Flew Over the Cuckoo's Nest (1975) Drama
# 3 15 1193 ... One Flew Over the Cuckoo's Nest (1975) Drama
# 4 17 1193 ... One Flew Over the Cuckoo's Nest (1975) Drama
print(data.iloc[0])
# user_id 1
# movie_id 1193
# rating 5
# timestamp 978300760
# gender F
# age 1
# occupation 10
# zip 48067
# title One Flew Over the Cuckoo's Nest (1975)
# genres Drama
# Name: 0, dtype: object
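When no key is given, pd.merge joins on the column names that both frames share. A small sketch of this behavior on toy tables (the frames `ratings_s`, `users_s`, `movies_s` and their values are made up for illustration), compared with spelling the keys out through `on=`:

```python
import pandas as pd

# Toy stand-ins for the three tables (values are illustrative only).
ratings_s = pd.DataFrame({"user_id": [1, 2, 2], "movie_id": [10, 10, 20], "rating": [5, 3, 4]})
users_s = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_s = pd.DataFrame({"movie_id": [10, 20], "title": ["A (1990)", "B (1995)"]})

# With no `on` argument, pd.merge joins on the columns the two frames have in common.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Naming the keys explicitly performs the same joins and documents the intent.
explicit = ratings_s.merge(users_s, on="user_id").merge(movies_s, on="movie_id")

print(implicit.equals(explicit))
print(explicit)
```

Explicit keys are a reasonable default in longer scripts, because an unexpected shared column name would otherwise silently become part of the join key.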
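Next, we can use aggregation to compute the average rating that men and women gave each movie:
# Group by title and gender, average the ratings, then move gender into the columns
mean_data = data.groupby(["title","gender"])["rating"].mean().unstack()
print(mean_data.head())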
# gender F M
# title
# $1,000,000 Duck (1971) 3.375000 2.761905
# 'Night Mother (1986) 3.388889 3.352941
# 'Til There Was You (1997) 2.675676 2.733333
# 'burbs, The (1989) 2.793478 2.962085
# ...And Justice for All (1979) 3.828571 3.689024
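The title-by-gender table can be built in more than one way. As a sketch on made-up ratings (the frame `toy` is hypothetical), grouping and unstacking gives the same table as a single pivot_table call:

```python
import pandas as pd

# Made-up ratings for two movies.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [4, 2, 4, 5, 3],
})

# groupby + mean yields a Series indexed by (title, gender);
# unstack() moves the gender level into the columns.
by_groupby = toy.groupby(["title", "gender"])["rating"].mean().unstack()

# pivot_table builds the same title-by-gender table in one call.
by_pivot = toy.pivot_table(values="rating", index="title", columns="gender", aggfunc="mean")

print(by_groupby)
print(by_pivot)
```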
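In data analysis we often drop items that have too few samples. Here we use aggregation to remove the movies with fewer than 250 ratings:
ratings_by_title = data.groupby("title").size()
# Get the size of each group (the number of ratings per title)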
print(ratings_by_title.head())
# title
# $1,000,000 Duck (1971) 37
# 'Night Mother (1986) 70
# 'Til There Was You (1997) 52
# 'burbs, The (1989) 303
# ...And Justice for All (1979) 199
# dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >=250]
# Select the index entries (titles) with at least 250 ratings
mean_data = mean_data.loc[active_titles]
# Select rows by index to get the filtered result
print(mean_data.head())
# gender F M
# title
# 'burbs, The (1989) 2.793478 2.962085
# 10 Things I Hate About You (1999) 3.646552 3.311966
# 101 Dalmatians (1961) 3.791444 3.500000
# 101 Dalmatians (1996) 3.240000 2.911215
# 12 Angry Men (1957) 4.184397 4.328421
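The size-based filtering above can be sketched on a toy table, assuming a hypothetical threshold `min_ratings = 2` in place of 250. The second half shows GroupBy.filter, an alternative technique (not the one used in the text) that keeps the raw rows of the groups passing the test:

```python
import pandas as pd

# Made-up ratings: "A" has 3 ratings, "B" has 1, "C" has 2.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 3, 2, 1, 5],
})
min_ratings = 2  # hypothetical threshold standing in for 250

# Approach used in the text: count per title, keep the titles that pass, select by index.
counts = toy.groupby("title").size()
active = counts.index[counts >= min_ratings]
means = toy.groupby("title")["rating"].mean().loc[active]

# Alternative: GroupBy.filter keeps every row of the groups for which the function is True.
kept_rows = toy.groupby("title").filter(lambda g: len(g) >= min_ratings)

print(means)
print(kept_rows)
```

The index-based approach is convenient when the aggregated table already exists, while filter is handy when the remaining raw rows are needed for further grouping.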
Then we can use the sort_values method to get the top 10 movies rated highest by women:
top_female_ratings = mean_data.sort_values(by="F",ascending=False)
print(top_female_ratings.head(10))
# gender F M
# title
# Close Shave, A (1995) 4.644444 4.473795
# Wrong Trousers, The (1993) 4.588235 4.478261
# Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
# Wallace & Gromit: The Best of 4.563107 4.385075
# Schindler's List (1993) 4.562602 4.491415
# Shawshank Redemption, The (1994) 4.539075 4.560625
# Grand Day Out, A (1992) 4.537879 4.293255
# To Kill a Mockingbird (1962) 4.536667 4.372611
# Creature Comforts (1990) 4.513889 4.272277
# Usual Suspects, The (1995) 4.513317 4.518248
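Measuring rating disagreement
We can add a new column holding the difference between the mean male rating and the mean female rating to express the rating gap. Sorting by it in ascending order lists the movies that women preferred most relative to men: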
mean_data['diff']= mean_data["M"]- mean_data["F"]
print(mean_data.sort_values(by="diff").head())
# gender F M diff
# title
# Dirty Dancing (1987) 3.790378 2.959596 -0.830782
# Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
# Grease (1978) 3.975265 3.367041 -0.608224
# Little Women (1994) 3.870588 3.321739 -0.548849
# Steel Magnolias (1989) 3.901734 3.365957 -0.535777
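Because diff is computed as M minus F, the sign tells which group rated a movie higher. A small sketch on made-up averages (the frame `means` and its values are hypothetical) that reads the column from both ends and also ranks by absolute gap:

```python
import pandas as pd

# Made-up mean ratings per title.
means = pd.DataFrame(
    {"F": [3.8, 3.0, 4.1], "M": [3.0, 3.6, 4.0]},
    index=["A", "B", "C"],
)
means["diff"] = means["M"] - means["F"]

# Most negative diff: rated higher by women than by men.
female_pref = means.sort_values(by="diff").index[0]

# Most positive diff: rated higher by men than by women.
male_pref = means.sort_values(by="diff", ascending=False).index[0]

# Largest gap in either direction.
largest_gap = means["diff"].abs().idxmax()

print(female_pref, male_pref, largest_gap)
```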
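Suppose we want the most controversial movies (those with the widest spread of ratings) regardless of gender. We can measure this with the standard deviation or variance of the ratings:
rating_std_by_title = data.groupby("title")["rating"].std()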
rating_std_by_title = rating_std_by_title.loc[active_titles]
print(rating_std_by_title.sort_values(ascending=False).head(10))
# title
# Dumb & Dumber (1994) 1.321333
# Blair Witch Project, The (1999) 1.316368
# Natural Born Killers (1994) 1.307198
# Tank Girl (1995) 1.277695
# Rocky Horror Picture Show, The (1975) 1.260177
# Eyes Wide Shut (1999) 1.259624
# Evita (1996) 1.253631
# Billy Madison (1995) 1.249970
# Fear and Loathing in Las Vegas (1998) 1.246408
# Bicentennial Man (1999) 1.245533
# Name: rating, dtype: float64
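The standard deviation captures disagreement that the mean hides. In the following sketch, two made-up movies (the titles and ratings are illustrative) share the same average rating, but only one of them splits its audience:

```python
import pandas as pd

# Two made-up movies with the same mean rating but different spread.
toy = pd.DataFrame({
    "title": ["Polarizing"] * 4 + ["Consensus"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 3, 3],
})

# Compute the mean and the (sample) standard deviation for each title.
stats = toy.groupby("title")["rating"].agg(["mean", "std"])
print(stats.sort_values(by="std", ascending=False))
```

Note that pandas computes the sample standard deviation (ddof=1) by default, so titles with very few ratings give unstable values, which is one reason to filter by rating count first.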
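US Baby Names, 1880-2010
The accompanying babynames folder contains the name, sex, and count of babies born from 1880 to 2010, restricted to names that occur at least 5 times in a year (each year is stored as a separate txt file).
First, taking the 1880 data as an example, we read it in: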
names1880 = pd.read_csv("../pydata-book-2nd-edition/datasets/",
names=["name","sex","number"])
print(names1880.head())
# name sex number
# 0 Mary F 7065
# 1 Anna F 2604
# 2 Emma F 2003
# 3 Elizabeth F 1939
# 4 Minnie F 1746
We can look at the number of births for each sex:
print(names1880.groupby("sex").number.sum())
# F 90993
# M 110493
# Name: number, dtype: int64
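Because the data is spread across multiple files (one per year), the first step is to gather it into a single DataFrame and add a year label to every row:
years = range(1880, 2011)  # range excludes its end point, so 2011 covers 1880 through 2010
pieces = []
columns = ["name", "sex", "number"]
for year in years:
    path ="../pydata-book-2nd-edition/datasets/babynames/"% year
    frame = pd.read_csv(path, names=columns)
    frame["year"] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
# Concatenate the frames row-wise (the default) and rebuild the row index
print(names.head())
# name sex number year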
# 0 Mary F 7065 1880
# 1 Anna F 2604 1880
# 2 Emma F 2003 1880
# 3 Elizabeth F 1939 1880
# 4 Minnie F 1746 1880
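The read, label, and concatenate pattern above can be tried without the real files. This sketch assumes two hypothetical in-memory "files" (the dictionary `fake_files` and its counts are made up) in the same name,sex,number layout:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for two yearly files (counts are made up).
fake_files = {
    1880: "Mary,F,10\nJohn,M,12\n",
    1881: "Mary,F,8\nJohn,M,9\nAnna,F,5\n",
}

pieces = []
for year, text in fake_files.items():
    # The files have no header row, so the column names are supplied explicitly.
    frame = pd.read_csv(io.StringIO(text), names=["name", "sex", "number"])
    frame["year"] = year  # label every row with the year of its source file
    pieces.append(frame)

# ignore_index=True discards each frame's own 0..n index and builds a fresh one.
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Without ignore_index=True the row labels would restart at 0 for every year, so label-based lookups such as combined.loc[0] would return one row per file.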
With that done we can aggregate the data, for example by totaling male and female births for each year, and then visualize the result:
total_births = names.groupby(["year","sex"])["number"].sum().unstack()
print(total_births.head(10))
# sex F M
# year
# 1880 90993 110493
# 1881 91955 100748
# 1882 107851 113687
# 1883 112322 104632
# 1884 129021 114445
# 1885 133056 107802
# 1886 144538 110785
# 1887 145983 101412
# 1888 178631 120857
# 1889 178369 110590
import matplotlib.pyplot as plt
total_births.plot(title="Total births by sex and year")
plt.show()
Now we add a prop column giving each name's share of all births of the same sex in the same year:
def add_prop(group):
    group["prop"] = group.number / group.number.sum()
    return group
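One way to obtain such a per-group proportion is GroupBy.transform, which returns each group's total aligned to the original rows. This is a swapped-in technique on made-up data (the frame `toy` is hypothetical), offered as a sketch rather than the route add_prop takes:

```python
import pandas as pd

# Made-up counts for two years.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "number": [30, 10, 20, 5, 15],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# transform("sum") returns the (year, sex) group total repeated on every row of that group.
group_total = toy.groupby(["year", "sex"])["number"].transform("sum")
toy["prop"] = toy["number"] / group_total

# Sanity check: within each (year, sex) group the proportions should add up to 1.
prop_sums = toy.groupby(["year", "sex"])["prop"].sum()
print(toy)
print(prop_sums)
```

Checking that the proportions sum to 1 within every group is a cheap way to confirm that the grouping keys were chosen correctly.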