[Web Crawlers in Python (14)] Scraping Multiple Types of Music from Kugou with Python: A Step-by-Step Guide (Full Source Code Included)
Target: a song page on Kugou.
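Crawler logic:
Note:
The two URLs here are obtained in the reverse of the usual order. Previously we fetched the URL of the outer page first, followed it to the inner page, and then extracted that page's information.
To download music (targeted scraping), we first identify the URL of the music itself (pasting it into a browser plays the song directly; this is inner-page data), and then work back to the URL one level up (the resource-link URL; outer-page data).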
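1. Page structure analysis
1) Find the URL of the music to download
On the target page, right-click and choose 'Inspect', click 'Network' in the menu bar at the top right, and refresh the page. Then look in the lower-right pane for the request to a php file, click 'data' under the 'Preview' tab, and find 'play_url'. Copy its value and open it in a browser, and the song plays directly.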
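2) Find the resource-link URL
In the same panel, click the 'Headers' tab next to 'Preview'. The first entry under 'General' is the resource-link URL.
Further down the same tab there is another block of information; compare the contents of the URL against it.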
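3) Simplify the resource-link URL
The comparison shows that, apart from the site's domain, the URL is almost entirely a concatenation of identifiable fields. Try removing fields one at a time: first drop the trailing &_=1584364814789 and check whether the page still returns data, then drop the next field from the end, and so on until the request no longer returns data. Testing shows that once the hash field is removed, the site stops returning data. The simplified resource-link URL is therefore: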
https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19103336613592709623_1584364814787&hash=07606F202459F44A462013202A2839BD
–> Output: (at this point both URLs have been obtained)
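The trial-and-error trimming described in step 3) can also be scripted. The following is a minimal sketch, not the author's code: `trimmed_urls` is a hypothetical helper that yields candidate URLs with one more trailing query parameter removed each time, so that each candidate can be requested in turn. The example URL is illustrative; its scheme, full host name, and exact parameter list are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def trimmed_urls(url):
    """Yield (dropped_name, candidate_url), removing trailing query
    parameters one at a time until only the first one is left."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    while len(params) > 1:
        dropped_name, _ = params.pop()
        # safe='/' keeps 'play/getdata' readable instead of 'play%2Fgetdata'
        query = urlencode(params, safe='/')
        yield dropped_name, urlunsplit(parts._replace(query=query))


# Illustrative URL (assumed form of the full request)
full_url = (
    'https://wwwapi.kugou.com/yy/index.php?r=play/getdata'
    '&callback=jQuery19103336613592709623_1584364814787'
    '&hash=07606F202459F44A462013202A2839BD'
    '&_=1584364814789'
)

for dropped, candidate in trimmed_urls(full_url):
    print('without', dropped, '->', candidate)
```

Each candidate would then be opened (in a browser or with requests) to see whether data still comes back; the last candidate that still works is the simplified URL.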
2. Wrapping the first function
First import the required libraries and set the relevant parameters
import requests
import time
import math
import re
import os
import json
from bs4 import BeautifulSoup
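1) Setting the URL parameters
The base of the resource URL is the site's domain plus the file that returns the data (index.php?). The entries in data (the query parameters appended to the base URL) are the fields kept when simplifying the URL in the previous step. Removing the hash field made the URL stop returning data, so hash must be kept, and so must the fields that come before it.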
url ='https://wwwapi.kugou.com/yy/index.php?'
data ={
'r':'play/getdata',
'callback':'jQuery19108922952455208721_1584362904730',
'hash':'07606F202459F44A462013202A2839BD'
}
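The 1584362904730 inside the 'callback' parameter is a timestamp in milliseconds; it corresponds to the time() function of the time library (multiplied by 1000). You can therefore generate your own timestamp, representing the moment of access.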
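As a small illustration of this point: the numeric suffix of `callback` has 13 digits, which matches a Unix timestamp in milliseconds, so it can be regenerated with `time.time()`. This is a sketch under that assumption; `make_callback` is a hypothetical helper name, and the jQuery prefix is simply the one copied from the request above.

```python
import math
import time


def make_callback(prefix='jQuery19108922952455208721'):
    """Return a callback name whose suffix is the current time in milliseconds."""
    millis = math.floor(time.time() * 1000)
    return '{}_{}'.format(prefix, millis)


print(make_callback())
```

The returned string can be used as the value of 'callback' in the data dictionary.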
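2) Request headers
The User-Agent and Referer values can both be found in the current request, but it carries no cookie.
To get a cookie, click on any request on the page that involves a POST request.
The final request headers are: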
dic_headers ={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'cookie':'kg_mid=e9f7036c9e3f7b3b8e5f31d8c437a650; kg_dfid=1aF1fa3fRahL0i1GZz3RYp8h; _WCMID=1648cadf5e0f206e4bca9435; kg_dfid_collect=d41d8cd98f00b204e9800998ecf8427e; Hm_lvt_aedee6983d4cfc62f509129360d6bb3d=1584362882; Hm_lpvt_aedee6983d4cfc62f509129360d6bb3d=1584362905; kg_mid_temp=e9f7036c9e3f7b3b8e5f31d8c437a650',
    'Referer':'https://www.kugou.com/song/'
}
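A long cookie string copied from the browser is easy to damage when pasting (stray spaces, line breaks). One optional way to keep it readable is to split it into a dictionary, which requests accepts through its `cookies=` argument. The following is a minimal sketch; `parse_cookie_header` is a hypothetical helper, and the cookie string is shortened for illustration.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' cookie string into a dict,
    trimming stray whitespace around names and values."""
    cookies = {}
    for pair in raw.split(';'):
        if '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Shortened example; note the stray space after 'kg_dfid_collect='
raw_cookie = (
    'kg_mid=e9f7036c9e3f7b3b8e5f31d8c437a650; '
    'kg_dfid=1aF1fa3fRahL0i1GZz3RYp8h; '
    'kg_dfid_collect= d41d8cd98f00b204e9800998ecf8427e'
)
print(parse_cookie_header(raw_cookie))
```

The resulting dictionary could be passed as `requests.get(url, cookies=...)` instead of placing the raw string in the headers.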
3) Wrapping the function
① First version: fetch and print the returned text
def get_musci():
    url ='https://wwwapi.kugou.com/yy/index.php?'
    data ={
        'r':'play/getdata',
        # the suffix is the current time in milliseconds
        'callback':'jQuery19108922952455208721_{}'.format(math.floor(time.time()*1000)),
        'hash':'07606F202459F44A462013202A2839BD'
    }
    dic_headers ={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'cookie':'kg_mid=e9f7036c9e3f7b3b8e5f31d8c437a650; kg_dfid=1aF1fa3fRahL0i1GZz3RYp8h; _WCMID=1648cadf5e0f206e4bca9435; kg_dfid_collect=d41d8cd98f00b204e9800998ecf8427e; Hm_lvt_aedee6983d4cfc62f509129360d6bb3d=1584362882; Hm_lpvt_aedee6983d4cfc62f509129360d6bb3d=1584362905; kg_mid_temp=e9f7036c9e3f7b3b8e5f31d8c437a650',
        'Referer':'https://www.kugou.com/song/'
    }
    # send the GET request with the query parameters and headers
    html = requests.get(url, params=data, headers=dic_headers)
    print(html.text)

get_musci()
–> Output: (the same data the page returned when testing the simplified URL above)
jQuery19108922952455208721_1584362904730({"status":1,"err_code":0,"data":
{"hash":"07606F202459F44A462013202A2839BD","timelength":266045,"filesize":4263645,
"audio_name":"HITA - \u8d64\u4f36","have_album":1,
"album_name":"\u8d64\u4f36","album_id":"14939533",
"img":"http:\/\/imge.kugou\/stdmusic\/20190130\/20190130172751733550.jpg",
"have_mv":1,"video_id":"1449487","author_name":"HITA",
"song_name":"\u8d64\u4f36",
......
"play_backup_url":"https:\/\/webfs.cloud.kugou\/20200316223\/dc5586939d67e36282f1fdf34d313860\/G093\/M04\/1E\/15\/_YYBAFu5_rmAfzpPAEEO3Q5ZQDY336.mp3"
}
});
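② Clean the text and convert it into a usable data type
The output resembles what Tencent News returned in an earlier installment: it has to be converted into a type the program can work with before it can be loaded. Counting characters by hand to find the slice positions would be inefficient here, so use the .index method instead.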
    # inside get_musci(), after the request
    start = html.text.index('{')
    end = html.text.index('})')+1
    json_data = json.loads(html.text[start:end])
    print(json_data)
–> Output:
{'status':1,'err_code':0,'data':
{'hash':'07606F202459F44A462013202A2839BD','timelength':266045,
'filesize':4263645,'audio_name':'HITA - 赤伶','have_album':1,
'album_name':'赤伶','album_id':'14939533',
'img':'imge.kugou/stdmusic/20190130/20190130172751733550.jpg',
'have_mv':1,'video_id':'1449487','author_name':'HITA',
'song_name':'赤伶','lyrics':'\ufeff[id:$00000000]\r\n[ar:HITA]\r\n[ti:赤伶]\r\n......',
'author_id':'84981','privilege':8,'privilege2':'1000',
'play_url':'webfs.yun.kugou/202003162300/2842f18911bdac380c74dcc270a7ab21/G093/M04/1E/15/_YYBAFu5_rmAfzpPAEEO3Q5ZQDY336.mp3',
'authors':[{'author_id':'84981','is_publish':'1','sizable_avatar':'singerimg.kugou/uploadpic/softhead/{size}/20191128/20191128094941269.jpg', 'author_name':'HITA','avatar':'singerimg.kugou/uploadpic/softhead/400/20191128/20191128094941269.jpg'}],
'is_free_part':0,'bitrate':128,'audio_id':'44024421',
'play_backup_url':'webfs.cloud.kugou/202003162300/c30ab2260877af66297571054b9d03b2/G093/M04/1E/15/_YYBAFu5_rmAfzpPAEEO3Q5ZQDY336.mp3'}}
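As an alternative to locating the braces with `.index`, the JSONP wrapper can be stripped with a regular expression. This is a different technique from the one used in this article, shown only as a sketch: `unwrap_jsonp` is a hypothetical helper, and the sample string is a shortened version of the response above.

```python
import json
import re


def unwrap_jsonp(text):
    """Strip a `callback(...)` wrapper and parse the JSON inside it."""
    # callback name, opening '(', payload, closing ')' and optional ';'
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))


# Shortened sample of the response text
sample = (
    'jQuery19108922952455208721_1584362904730('
    '{"status":1,"err_code":0,"data":'
    '{"song_name":"\\u8d64\\u4f36","author_name":"HITA"}});'
)
data = unwrap_jsonp(sample)
print(data['data']['song_name'], data['data']['author_name'])
```

Either approach ends with the same dictionary; the regular expression additionally fails loudly when the response is not wrapped in a callback at all.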
③ Get the song name and the actual music URL
song_name = json_data['data']['song_name']
song_author = json_data['data']['author_name']
song_url = json_data['data']['play_url']
name = song_name+'-'+ song_author
print(song_url)
print(name)
–> Output: (the URL below is the playable audio link)
webfs.yun.kugou/202003162306/138ff21bf052e04ec4a7ef07ebd2c514/G093/M04/1E/15/_YYBAFu5_rmAfzpPAEEO3Q5ZQDY336.mp3
赤伶-HITA
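With the name and the audio URL in hand, the remaining step is writing the audio bytes to disk. The following is a minimal sketch of that step, not the author's code: `safe_filename` and `save_audio` are hypothetical helpers, and placeholder bytes stand in for the real download so the example runs without network access.

```python
import os
import re
import tempfile


def safe_filename(name, ext='.mp3'):
    """Replace characters that are not allowed in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return cleaned + ext


def save_audio(content, name, folder):
    """Write raw audio bytes to <folder>/<name>.mp3 and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_filename(name))
    with open(path, 'wb') as f:
        f.write(content)
    return path


# In the crawler, the bytes would come from something like
# requests.get(song_url, headers=dic_headers).content (assumption, not run here).
demo_folder = tempfile.mkdtemp()
path = save_audio(b'placeholder bytes', '赤伶-HITA', demo_folder)
print(path)
```

Sanitizing the name first matters because song titles may contain characters such as '/' or '?' that would otherwise break the file path.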