
Python Novel Analysis: Qidian Novel Data Analysis in Python

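As a bookworm of eight years, I naturally know Qidian well. Now that I have been learning data analysis, let's take a look at Qidian's data.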

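1 Getting the Data

First of all we need data: even the cleverest cook cannot make a meal without rice, and without data nothing else matters. No ready-made dataset was available, so the data had to be collected with a crawler; how the crawler works is not covered here. It was the first crawler I ever wrote or, more accurately, copied and pasted. It was poorly written and kept failing mid-run, so I had to crawl one novel category at a time. Fortunately Qidian (起点中文网) had no real anti-crawling measures, otherwise I would not have gotten any data at all.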

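The main fields scraped were:

Field               Description
id                  ID of the novel on Qidian
title               Novel title
author              Author
chapter_nums        Number of chapters
word_nums           Word count
last_update_date    Time of the most recent update
first_update_date   Time of the first update
category            Top-level category
sub_category        Second-level category
rate                Rating
discuss_nums        Number of discussions
click_nums          Number of clicks
commend_nums        Number of recommendations
sex                 Gender (male or female site)
crawl_time          Time the record was crawled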

The crawled data is stored in MySQL, mainly because I am familiar with MySQL, and with this much data a plain text file would not be a good fit.

2 Analysis

Imports and configuration

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba
import re
import PIL
from datetime import datetime
from datetime import timedelta
import seaborn as sns

# Chinese font setup (prevents garbled characters)
# Font size of the x- and y-axis tick labels
plt.style.use('ggplot')
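The two configuration comments above have no settings under them in the text as it survives. The following is only a sketch of what such a setup commonly looks like in matplotlib; the font name `SimHei`, the tick size of 14, and the helper name `configure_cjk_plotting` are assumptions for illustration, not values taken from the article.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt


def configure_cjk_plotting(font_name="SimHei", tick_size=14):
    """Hypothetical helper: apply the style first, then override fonts and tick sizes."""
    # Apply the style sheet first so it cannot overwrite the settings below
    plt.style.use("ggplot")
    # Use a font that contains Chinese glyphs (font_name must be installed; assumed here)
    plt.rcParams["font.sans-serif"] = [font_name]
    # Render the minus sign with an ASCII hyphen, which CJK fonts often lack otherwise
    plt.rcParams["axes.unicode_minus"] = False
    # Tick label sizes for both axes
    plt.rcParams["xtick.labelsize"] = tick_size
    plt.rcParams["ytick.labelsize"] = tick_size


configure_cjk_plotting()
```

If the chosen font is not installed, matplotlib falls back to its default font and Chinese labels may still render as empty boxes, so the font name has to match something available on the machine.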

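2.1 Reading the Data

The database is read through SQLAlchemy. SQLAlchemy cannot talk to the database by itself, so a MySQL driver has to be installed as well. Combined with pandas, SQLAlchemy makes loading the data very convenient.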

import sqlalchemy as sqla

# The mysqlconnector dialect needs the mysql-connector-python driver installed
db = sqla.create_engine('mysql+mysqlconnector://root:123456@localhost:3306/article_spider')
df_novels = pd.read_sql('select * from qidian_novel', db)
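To try the `pd.read_sql` step without a MySQL server, a small stand-in can be used. The sketch below swaps MySQL and SQLAlchemy for Python's built-in `sqlite3` module with an in-memory database; the reduced column set and the three rows are made up for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qidian_novel ("
    "id INTEGER, title TEXT, chapter_nums INTEGER, novel_status TEXT)"
)
# Made-up rows, only there so the query returns something
conn.executemany(
    "INSERT INTO qidian_novel VALUES (?, ?, ?, ?)",
    [(1, "Novel A", 1200, "完本"), (2, "Novel B", 5, "连载"), (3, "Novel C", 48, "连载")],
)
conn.commit()

# Same call shape as in the article: a SQL string plus a connection object
df_demo = pd.read_sql("select * from qidian_novel", conn)
conn.close()
print(df_demo.dtypes)
```

pandas infers the column types from the query result, so checking `dtypes` right after loading is a cheap way to spot columns that came back as text instead of numbers.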

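2.2 Data Preprocessing

Real-world data is usually dirty, and cleaning it takes up a lot of time. It is said that once you have done enough of this work, clean data actually starts to feel unsettling.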

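df_novels.info()

One column is completely empty: discuss_nums (the discussion count) was never scraped. Some of the other columns also contain null values.

Drop the empty column

df_novels = df_novels.drop('discuss_nums', axis=1)

Filter out rows with missing data

df_novels = df_novels.dropna(how='any')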
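The two cleaning steps can be tried on a toy frame. This sketch detects fully empty columns instead of naming `discuss_nums` by hand, which is a small variation on the article's code; the data is invented.

```python
import pandas as pd

# Toy data: one column is entirely empty, two rows have a missing value
raw = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "title": ["A", "B", None, "D"],
        "chapter_nums": [120.0, 3.0, 8.0, float("nan")],
        "discuss_nums": [None, None, None, None],
    }
)

# Columns in which every value is missing
empty_cols = [col for col in raw.columns if raw[col].isna().all()]

# Drop those columns first, then drop any row that still has a missing value
cleaned = raw.drop(columns=empty_cols).dropna(how="any")
print(empty_cols, len(raw), len(cleaned))
```

The order matters: calling `dropna(how='any')` before removing the empty column would discard every row, because each row has a missing value in that column.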

2.3 Data Analysis

2.3.1 Serializing vs. Completed

The crawl produced close to one million records, that is, close to one million novels.

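status_counts = df_novels['novel_status'].value_counts()
# Display labels in value_counts() order for this dataset: 连载, 完本, 暂停
status_labels = ['Serializing', 'Completed', 'Paused']
total = status_counts.sum()  # 966022 in the original run
plt.figure(figsize=(16, 8))
ax1 = plt.subplot(121)
sns.barplot(x=status_labels, y=status_counts.values, alpha=0.8)
ax1.set_ylabel('')
ax1.set_title('Novels by status', size=26)
# Annotate each bar with its count and its share of all novels
for x, y in zip(range(3), status_counts.values):
    ax1.text(x, y, '%d(%.1f%%)' % (y, y / total * 100), ha='center', va='bottom', fontsize=16, color='b')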

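The chart shows that completed novels account for only 6%, while novels still being serialized make up nearly 90%. Is Qidian really that lively?

Anyone familiar with how writing works on sites like Qidian or Zongheng will not be too surprised by this number. The barrier to entry is very low: you create an account and start writing, so a lot of people give it a try. Many novels have only a few chapters because the author wrote a little and then could not keep going. I once had the same idea myself, but never acted on it.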
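The shares quoted above can be computed without hardcoding the row count. A minimal sketch on invented data, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy status column: 18 serializing, 1 completed, 1 paused (made-up proportions)
statuses = pd.Series(["连载"] * 18 + ["完本"] + ["暂停"], name="novel_status")

counts = statuses.value_counts()
# normalize=True returns fractions of the total instead of raw counts
shares = statuses.value_counts(normalize=True).mul(100).round(1)

# Both series share the same index, so they align row by row
summary = pd.DataFrame({"count": counts, "share_pct": shares})
print(summary)
```

Looking values up by status name, as `summary.loc["完本"]` does, avoids relying on the order in which `value_counts()` happens to return the statuses.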

# Serializing novels with at most 10 chapters vs. more than 10 chapters
chapter_zeroToTen = df_novels[(df_novels['chapter_nums'] <= 10) & (df_novels['novel_status'] == '连载')]['id'].count()
chapter_all = df_novels[(df_novels['chapter_nums'] > 10) & (df_novels['novel_status'] == '连载')]['id'].count()
chapter_data = [chapter_zeroToTen, chapter_all]
labels = ['10 chapters or fewer', '11 chapters or more']
ax = plt.figure(figsize=(8, 8)).add_subplot(111)
patches, texts, autotexts = ax.pie(chapter_data, labels=labels, autopct='%.1f%%', startangle=90, colors=['wheat', 'skyblue'])
# Enlarge the slice labels and the percentage texts
for t in texts:
    t.set_size('xx-large')
for at in autotexts:
    at.set_size('xx-large')
plt.title('Share of serializing novels with 10 chapters or fewer', size=26)
plt.show()

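Novels with 10 chapters or fewer make up 67.1% of the serializing novels. A typical web novel is updated one to three times a day, and several hundred thousand novels cannot all have been started within the last few days, so the figure of more than 900,000 novels is heavily inflated.

The crawl did not capture the "signed novel" flag. Counting only signed novels would probably give a more meaningful total.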
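The 10-chapter split can also be expressed as the mean of a boolean mask, and `pd.cut` gives a finer breakdown. This is a sketch on invented data; the bucket edges 10 and 100 are arbitrary choices for illustration.

```python
import pandas as pd

# Toy data: five serializing novels and one completed novel
novels = pd.DataFrame(
    {
        "chapter_nums": [3, 8, 10, 45, 300, 12],
        "novel_status": ["连载", "连载", "连载", "连载", "完本", "连载"],
    }
)

serializing = novels[novels["novel_status"] == "连载"]

# The mean of a boolean mask is the fraction of rows where it is True
short_share = (serializing["chapter_nums"] <= 10).mean()

# Finer buckets; include_lowest=True keeps novels with 0 chapters in the first bucket
buckets = pd.cut(
    serializing["chapter_nums"],
    bins=[0, 10, 100, float("inf")],
    labels=["<=10", "11-100", ">100"],
    include_lowest=True,
).value_counts()

print(round(short_share * 100, 1), buckets.to_dict())
```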

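2.3.2 Gender

Qidian is split into a male site and a female site: the male site targets male readers and the female site targets female readers. Of course, nothing stops a male reader from reading novels on the female site.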

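sex_counts = df_novels['sex'].value_counts()
# Display labels in value_counts() order (the male site comes first because it has more novels)
labels = ['Male', 'Female']
ax = plt.figure(figsize=(8, 8)).add_subplot(111)
patches, texts, autotexts = ax.pie(sex_counts.values, labels=labels, autopct='%.1f%%', startangle=90, colors=['wheat', 'skyblue'])
# Enlarge the slice labels and the percentage texts
for t in texts:
    t.set_size('xx-large')
for at in autotexts:
    at.set_size('xx-large')
plt.title('Share of novels on the male and female sites', size=26)
plt.show()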

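The chart shows that Qidian is dominated by novels for male readers. In real life, too, the people hooked on web novels are mostly men, so it is natural that the male site has more novels.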
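The pie chart pairs a hardcoded label list with `value_counts()` output, which only lines up as long as the male site stays the larger group. A sketch of deriving the labels from the counted codes instead; the codes `gg` (male site) and `mm` (female site) are the ones the article filters on in section 2.3.3, while the `sex_labels` mapping and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy column using the article's codes: 'gg' = male site, 'mm' = female site
sex = pd.Series(["gg", "gg", "mm", "gg", "mm", "gg"], name="sex")

# Hypothetical mapping from stored code to display label
sex_labels = {"gg": "Male", "mm": "Female"}

sex_counts = sex.value_counts()
# Build the labels from the index so they always match the order of the counts
pie_labels = [sex_labels[code] for code in sex_counts.index]

print(list(zip(pie_labels, sex_counts.values)))
```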

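2.3.3 Novel Categories

People who have never read web novels often assume that reading web novels means reading wuxia (martial arts) novels. In fact there are many categories and wuxia is only one of them; among web novels it is actually not that popular. Because most serializing novels have only a few chapters, only completed novels are analyzed here.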

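# Completed novels only
novel_wb = df_novels[df_novels['novel_status'] == '完本']
# 'gg' marks novels on the male site
novels_gg = novel_wb[novel_wb['sex'] == 'gg']
novels_gg_counts = novels_gg.category.value_counts()
ax = plt.figure(figsize=(16, 8)).add_subplot(111)
sns.barplot(x=novels_gg_counts.index, y=novels_gg_counts, order=novels_gg_counts.index)
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top-level categories on the male site', size=26)
# Label each bar with its count
for a, b in zip(range(15), novels_gg_counts):
    ax.text(a, b, '%d' % b, ha='center', va='bottom', fontsize=16, color='b')
plt.show()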

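These are the categories on the male site. Xuanhuan (玄幻, fantasy) is by far the largest, while wuxia, the one people talk about most, is comparatively rare.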
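The per-site category counts can also be produced in one pass with `groupby`, rather than filtering each site separately. A minimal sketch on invented rows; the category 都市 is included only as toy data.

```python
import pandas as pd

# Toy data with the columns used in this section
novels = pd.DataFrame(
    {
        "sex": ["gg", "gg", "gg", "mm", "mm", "gg"],
        "category": ["玄幻", "玄幻", "武侠", "现代言情", "现代言情", "都市"],
        "novel_status": ["完本", "完本", "完本", "完本", "连载", "完本"],
    }
)

# Keep completed novels only, as the article does
completed = novels[novels["novel_status"] == "完本"]

# One count per (site, category) pair
category_counts = completed.groupby(["sex", "category"]).size()

# Largest category on the male site
top_gg = category_counts.xs("gg", level="sex").idxmax()
print(category_counts.to_dict(), top_gg)
```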

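# 'mm' marks novels on the female site
novels_mm = novel_wb[novel_wb['sex'] == 'mm']
novels_mm_counts = novels_mm.category.value_counts()
ax = plt.figure(figsize=(12, 8)).add_subplot(111)
sns.barplot(x=novels_mm_counts.index, y=novels_mm_counts, order=novels_mm_counts.index)
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top-level categories on the female site', size=26)
# Label each bar with its count
for a, b in zip(range(9), novels_mm_counts):
    ax.text(a, b, '%d' % b, ha='center', va='bottom', fontsize=16, color='b')
plt.show()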
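The two category charts repeat the same steps and hardcode the number of bars (`range(15)`, `range(9)`). One way to factor that out is a small helper that takes any `value_counts()` result. This sketch uses plain matplotlib rather than seaborn, and the name `plot_counts` and the demo data are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(counts, title, figsize=(12, 8)):
    """Draw one bar per entry of a value_counts() Series and label it with its count."""
    fig, ax = plt.subplots(figsize=figsize)
    positions = range(len(counts))
    ax.bar(positions, counts.values)
    ax.set_xticks(list(positions))
    ax.set_xticklabels(counts.index)
    ax.set_title(title, size=26)
    # The number of labels follows the data, so nothing has to be hardcoded
    for x, y in enumerate(counts.values):
        ax.text(x, y, "%d" % y, ha="center", va="bottom", fontsize=16, color="b")
    return ax


# Made-up categories with ASCII names so no special font is needed
demo_counts = pd.Series(["a", "a", "a", "b", "b", "c"]).value_counts()
ax = plot_counts(demo_counts, "Demo categories")
```

With the real data the call would look like `plot_counts(novels_mm_counts, 'Top-level categories on the female site')`, followed by `plt.show()` in a notebook.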

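Looking at the female site, there are comparatively few categories. We usually assume that novels for women are romance novels, and that is not wrong: romance dominates, it is just divided into several kinds of romance.

The categories discussed so far are all top-level categories, and each of them is further subdivided into many sub-categories. Here we only look at the sub-categories of the top-level categories with the most books.

Xuanhuan (玄幻)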

category_xh = novel_wb[novel_wb['category'] == '玄幻']
category_xh_counts = category_xh.sub_category.value_counts()
ax = plt.figure(figsize=(8, 6)).add_subplot(111)
sns.barplot(x=category_xh_counts.index, y=category_xh_counts, order=category_xh_counts.index)
ax.set_ylabel('')
ax.set_title('Sub-categories of xuanhuan novels', size=26)
# Label each bar with its count
for a, b in zip(range(4), category_xh_counts):
    ax.text(a, b, '%d' % b, ha='center', va='bottom', fontsize=16, color='b')
plt.show()

Xuanhuan has only a few sub-categories, and it is dominated by Eastern Fantasy (东方玄幻) and Otherworld Continent (异世大陆).

Modern Romance (现代言情)

category_xdyq = novel_wb[novel_wb['category'] == '现代言情']
category_xdyq_counts = category_xdyq.sub_category.value_counts()
ax = plt.figure(figsize=(14, 6)).add_subplot(111)
# Horizontal bars: sub-category names on the y-axis, counts on the x-axis
sns.barplot(y=category_xdyq_counts.index, x=category_xdyq_counts, order=category_xdyq_counts.index)
ax.set_xlabel('')
ax.set_title('Sub-categories of modern romance novels', size=26)

Published: 2025-03-07 10:30:28

Article link: https://www.yfs8.com/news/579652.html
