Scraping academic papers with a Python crawler

Introduction

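This is a very small crawler for downloading PDF papers from an academic search engine. Because the page content is generated by JavaScript, it has to be fetched dynamically, which is done here with selenium and chromedriver. If your computer can reach Google Scholar, you can change the start URL from 谷粉搜搜 (gfsoso) to Google Scholar. You can also change the keyword and the number of result pages to find the papers you need.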
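To make the "number of result pages" setting concrete, here is a minimal sketch of how the result-page URLs can be generated from a start offset. The helper name `page_urls` and the example base URL are hypothetical; the query parameters `start`, `as_sdt` and `as_ylo` are the ones the script in this post appends to the search URL.

```python
def page_urls(search_url, pages, per_page=10, year_from=2008):
    """Return one result-page URL per page, addressed by a start offset.

    search_url is assumed to already contain a query string (e.g. ?q=...),
    so the extra parameters are appended with '&'.
    """
    return [
        f"{search_url}&start={page * per_page}&as_sdt=0,5&as_ylo={year_from}"
        for page in range(pages)
    ]


if __name__ == "__main__":
    # Hypothetical search URL, used only for illustration.
    for url in page_urls("http://example.com/scholar?q=sentiment+lexicon", 3):
        print(url)
```

Changing `pages` controls how many result pages are visited, and `year_from` corresponds to the `as_ylo=2008` filter in the script.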

Python code

#!/usr/bin/python
# encoding=utf-8
# Written for Python 3 and Selenium 4.

__author__ = 'Administrator'

import os
import re
import time
import urllib.request

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

if __name__ == "__main__":
    # This path is cut off in the source; point it at your chromedriver executable.
    chromedriver = "C:\\Program Files\\Google\\Chrome\\"
    driver = webdriver.Chrome(service=Service(chromedriver))

    # Start URL exactly as given in the source. It looks truncated, so supply
    # the full address (including the scheme) before running.
    driver.get('www.gfsoso/scholar')

    # Type the keyword into the search box and submit the form.
    inputElement = driver.find_element(By.NAME, "q")
    searchWord = "sentiment lexicon"
    inputElement.send_keys(searchWord)
    inputElement.submit()
    currentURL = driver.current_url

    urlList = []
    localDir = 'down_pdf'
    os.makedirs(localDir, exist_ok=True)
    fileOut = os.path.join(localDir, searchWord + ".txt")

    with open(fileOut, 'a', encoding='utf-8') as fileOp:
        for i in range(0, 10):  # number of result pages to crawl
            driver.get(currentURL + "&start=" + str(i * 10) + "&as_sdt=0,5&as_ylo=2008")
            time.sleep(3)  # give the JavaScript time to render the page

            # Collect every link on the page that points to a PDF.
            for k in driver.find_elements(By.CSS_SELECTOR, "a"):
                try:
                    z = k.get_attribute("href")
                    if z and '.pdf' in z and z not in urlList:
                        urlList.append(z)
                        print(z)
                except Exception:
                    time.sleep(1)
                    continue

            # Save all article titles on the page to the txt file.
            for ct in driver.find_elements(By.CSS_SELECTOR, 'h3'):
                fileOp.write('%s\n' % ct.text)

    print(len(urlList))
    driver.quit()

    for everyURL in urlList:  # iterate over every PDF URL
        PDFName = None
        for item in everyURL.split('/'):  # split on "/" to find the PDF file name
            if re.match(r'.*\.pdf$', item):
                PDFName = item
        if PDFName is None:  # no path segment ends in .pdf, so skip this URL
            continue
        localPDF = os.path.join(localDir, searchWord + "_" + PDFName)
        try:
            # Download by URL and store under its file name in the local directory.
            urllib.request.urlretrieve(everyURL, localPDF)
        except Exception:
            continue
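The link filtering and file-name steps of the script can also be pulled out into small functions that are easy to check without a browser. This is only a sketch under stated assumptions: the names `unique_pdf_links` and `pdf_name_from_url` are hypothetical, and the file-name helper swaps the script's split-on-`/` plus regex approach for the standard library's `urllib.parse`, so a URL with a query string after `.pdf` still yields a name.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def unique_pdf_links(hrefs):
    """Keep hrefs that mention '.pdf', dropping None values and duplicates.

    The order of first appearance is preserved.
    """
    kept = []
    for href in hrefs:
        if href and ".pdf" in href and href not in kept:
            kept.append(href)
    return kept


def pdf_name_from_url(url):
    """Return the PDF file name at the end of the URL path, or None."""
    path = urlsplit(url).path  # ignores any ?query or #fragment
    name = unquote(posixpath.basename(path))
    return name if name.lower().endswith(".pdf") else None


if __name__ == "__main__":
    # Hypothetical href values, used only for illustration.
    links = unique_pdf_links([
        None,
        "http://example.org/papers/lexicon.pdf",
        "http://example.org/papers/lexicon.pdf",
        "http://example.org/about.html",
    ])
    for link in links:
        print(link, pdf_name_from_url(link))
```

Returning `None` when no file name is found lets the caller skip that URL instead of reusing a name left over from a previous iteration.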
