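Crawler Performance: Node.js vs. Python

December 25, 2016

The crawling project

Zhongchou (众筹网), projects currently crowdfunding: http://www.zhongchou.com/brow… We use this site as the example: crawl every project that is currently crowdfunding, collect the URL of each project's detail page, and save the URLs to a txt file.

Hands-on comparison

Python: original version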

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests, time
from BeautifulSoup import BeautifulSoup  # HTML parser (BeautifulSoup 3, Python 2)

# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',  # header values must be strings
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# Collect the list of project URLs
def getItems(allpage):
    no = 0
    items = open('pystandard.txt', 'a')
    for page in range(allpage):
        # The first page has no page suffix; later pages end in -p2, -p3, ...
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        # print url #①
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class": "ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            items.write(href + "\n")
            no += 1
    items.close()
    return no

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))

Results of five runs:

it takes 48.1727159614 Seconds to get 720 items

it takes 45.3397999415 Seconds to get 720 items

it takes 44.4811429862 Seconds to get 720 items

it takes 44.4619293082 Seconds to get 720 items

it takes 46.669706593 Seconds to get 720 items
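
The listing above measures elapsed time with time.clock(), which was removed in Python 3.8 (and which reports wall-clock time on Windows but processor time on Unix). For readers repeating the measurement on Python 3, a minimal sketch using time.perf_counter() might look like the following. The helper timed() and the stand-in fake_crawl() are hypothetical names introduced only for illustration; fake_crawl() simply assumes 24 items per page, matching 720 items over 30 pages.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) measured on a wall clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_crawl(pages):
    # Hypothetical stand-in for getItems(): sleeps briefly per page
    # and pretends every page yields 24 items.
    time.sleep(0.01 * pages)
    return pages * 24


if __name__ == "__main__":
    count, seconds = timed(fake_crawl, 3)
    print("it takes %s Seconds to get %s items" % (seconds, count))
```

Wrapping the call this way keeps the timing code out of the crawler itself, so the same helper can time each variant in this post.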

Python: multithreaded version

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests, time, threading
from BeautifulSoup import BeautifulSoup  # HTML parser (BeautifulSoup 3, Python 2)

# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',  # header values must be strings
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

items = open('pymulti.txt', 'a')
no = 0
lock = threading.Lock()

# Collect the list of project URLs
def getItems(urllist):
    # print urllist #①
    global items, no, lock
    for url in urllist:
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class": "ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            # The output file and the counter are shared, so guard them with the lock
            lock.acquire()
            items.write(href + "\n")
            no += 1
            # print no
            lock.release()

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    allthread = 30
    per = (int)(allpage / allthread)  # pages handled by each thread
    urllist = []
    ths = []
    for page in range(allpage):
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target=getItems, args=(urllist[i*(per):(i+1)*(per)],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))

Results of five runs:

it takes 45.5222291114 Seconds to get 720 items

it takes 46.7097831417 Seconds to get 720 items

it takes 45.5334646156 Seconds to get 720 items

it takes 48.0242797553 Seconds to get 720 items

it takes 44.804855018 Seconds to get 720 items

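The multithreaded version shows no advantage. Toggling the print statement marked #① on and off reveals that this so-called multithreaded version actually runs like a single thread.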

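Improving the Python version

Single-threaded

First, improve the HTML-parsing step. Analysis shows that

    lists = soup.findAll('a', attrs={"class": "siteCardICH3"})

is better than

    lists = soup.findAll(attrs={"class": "ssCardItem"})

because it finds the a elements directly instead of first finding each div and then the a inside it. (With this selector each match is itself the link, so its href is read from the match directly rather than through .a.)
Five runs after this change show an improvement: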

it takes 41.0018861912 Seconds to get 720 items

it takes 42.0260390497 Seconds to get 720 items

it takes 42.249635988 Seconds to get 720 items

it takes 41.295524133 Seconds to get 720 items

it takes 42.9022894154 Seconds to get 720 items
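
To illustrate the idea behind the selector change (match the link elements directly by class instead of matching their containers first), here is a small sketch. It deliberately uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; it is not the parser used in the measurements above. The sample markup, the class CardLinkParser and the function card_links() are assumptions made up for this example, and the real page structure may differ.

```python
from html.parser import HTMLParser


class CardLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# Hypothetical markup: a div.ssCardItem wrapping an a.siteCardICH3.
SAMPLE = """
<div class="ssCardItem">
  <a class="siteCardICH3" href="http://www.zhongchou.com/deal-show/id-000001">Project A</a>
</div>
<div class="ssCardItem">
  <a class="siteCardICH3" href="http://www.zhongchou.com/deal-show/id-000002">Project B</a>
</div>
"""


def card_links(html, css_class="siteCardICH3"):
    """Return the hrefs of all matching links in one pass over the document."""
    parser = CardLinkParser(css_class)
    parser.feed(html)
    return parser.hrefs


if __name__ == "__main__":
    for link in card_links(SAMPLE):
        print(link)
```

The links are picked up in a single pass, with no intermediate step that locates the enclosing div elements.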

Multithreaded

Change getItems(urllist) to getItems(urllist, thno), and add print thno, " begin at", time.clock() at the start of the function and print thno, " end at", time.clock() at the end. Result:

0 begin at 0.00100631078628

0 end at 1.28625832936

1 begin at 1.28703230691

1 end at 2.61739476075

2 begin at 2.61801291642

2 end at 3.92514717937

3 begin at 3.9255829208

3 end at 5.38870235361

4 begin at 5.38921134066

4 end at 6.670658786

5 begin at 6.67125734731

5 end at 8.01520989534

6 begin at 8.01566383155

6 end at 9.42006780585

7 begin at 9.42053340537

7 end at 11.0386755513

8 begin at 11.0391565464

8 end at 12.421359168

9 begin at 12.4218294329

9 end at 13.9932716671

10 begin at 13.9939957256

10 end at 15.3535799145

11 begin at 15.3540870354

11 end at 16.6968289314

12 begin at 16.6972665389

12 end at 17.9798803157

13 begin at 17.9804714125

13 end at 19.326706238

14 begin at 19.3271438455

14 end at 20.8744308886

15 begin at 20.8751017624

15 end at 22.5306500245

16 begin at 22.5311450156

16 end at 23.7781693541

17 begin at 23.7787245279

17 end at 25.1775114499

18 begin at 25.178350742

18 end at 26.5497330734

19 begin at 26.5501776789

19 end at 27.970799259

20 begin at 27.9712727895

20 end at 29.4595075375

21 begin at 29.4599959972

21 end at 30.9507299602

22 begin at 30.9513989679

22 end at 32.2762763982

23 begin at 32.2767182045

23 end at 33.6476256057

24 begin at 33.648137392

24 end at 35.1100517711

25 begin at 35.1104907783

25 end at 36.462657099

26 begin at 36.4632234696

26 end at 37.7908515759

27 begin at 37.7912845182

27 end at 39.4359928956

28 begin at 39.436448698

28 end at 40.9955021593

29 begin at 40.9960871912

29 end at 42.6425665264

it takes 42.6435882327 Seconds to get 720 items

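Clearly these threads did not run concurrently; they ran one after another, so the multithreading achieved nothing. What went wrong? In my loop, the two lines

    th.start()
    th.join()

come one right after the other. join() blocks the main thread until that thread finishes, so the next thread is not started until the previous one is done. The loop was changed to: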

for i in range(allthread):
    # print urllist[i*(per):(i+1)*(per)]
    th = threading.Thread(target=getItems, args=(urllist[i*(per):(i+1)*(per)], i))
    ths.append(th)
# Start every thread first ...
for th in ths:
    th.start()
# ... and only then wait for all of them
for th in ths:
    th.join()

Result:

0 begin at 0.0010814225325

1 begin at 0.00135201143191

2 begin at 0.00191744892518

3 begin at 0.0021311208492

4 begin at 0.00247495536449

5 begin at 0.0027334144167

6 begin at 0.00320601192551

7 begin at 0.00379011072218

8 begin at 0.00425431064445

9 begin at 0.00511692939449

10 begin at 0.0132038052264

11 begin at 0.0165926979253

12 begin at 0.0170886220634

13 begin at 0.0174665134574

14 begin at 0.018348726576

15 begin at 0.0189780790334

16 begin at 0.0201896641572

17 begin at 0.0220576606283

18 begin at 0.0231484138125

19 begin at 0.0238804034387

20 begin at 0.0273901280772

21 begin at 0.0300363009005

22 begin at 0.0362878375422

23 begin at 0.0395512329756

24 begin at 0.0431556637289

25 begin at 0.0459581249682

26 begin at 0.0482254733323

27 begin at 0.0535430117384

28 begin at 0.0584971212607

29 begin at 0.0598136762161

16 end at 65.2657542222

24 end at 66.2951247811

21 end at 66.3849747583

4 end at 66.6230160119

5 end at 67.5501632164

29 end at 67.7516992283

23 end at 68.6985322418

7 end at 69.1060433231

22 end at 69.2743398214

2 end at 69.5523713152

14 end at 69.6454986837

15 end at 69.8333400981

12 end at 69.9508018062

10 end at 70.2860348602

26 end at 70.3670659719

13 end at 70.3847232972

27 end at 70.3941635841

11 end at 70.5132838156

1 end at 70.7272351926

0 end at 70.9115253609

6 end at 71.0876563409

8 end at 71.1124805398

25 end at 71.1145248855

3 end at 71.4606034226

19 end at 71.6103622486

18 end at 71.6674453096

20 end at 71.725601862

17 end at 71.7778992318

9 end at 71.7847479301

28 end at 71.7921004837

it takes 71.7931912368 Seconds to get 720 items
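
The difference between the two loop shapes above (start() immediately followed by join(), versus starting every thread and joining afterwards) can be reproduced without any network access. The sketch below is written for Python 3 and replaces the page download with time.sleep(), which is an assumption made purely to keep the example self-contained; worker(), run_serialized() and run_concurrent() are illustrative names, not part of the original script.

```python
import threading
import time


def worker(delay):
    # Stand-in for one page download: the thread just waits,
    # much like a thread blocked on a network response.
    time.sleep(delay)


def run_serialized(n, delay):
    """start() followed at once by join(): each thread ends before the next begins."""
    begin = time.perf_counter()
    for _ in range(n):
        th = threading.Thread(target=worker, args=(delay,))
        th.start()
        th.join()
    return time.perf_counter() - begin


def run_concurrent(n, delay):
    """Start all threads, then join them all: the waiting periods overlap."""
    begin = time.perf_counter()
    threads = [threading.Thread(target=worker, args=(delay,)) for _ in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - begin


if __name__ == "__main__":
    print("serialized: %.2fs" % run_serialized(5, 0.1))
    print("concurrent: %.2fs" % run_concurrent(5, 0.1))
```

With a pure wait as the workload, the serialized loop takes roughly n times the delay while the concurrent loop takes roughly one delay. Real crawling also spends CPU time on parsing, which is why the measured results above do not improve in the same way.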

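Reflection

The multithreaded run above really is concurrent, but it takes far longer than the single-threaded run. I have not found the cause yet; my guess is that BeautifulSoup may not cope well with multiple threads. Advice is welcome. To test this idea, I decided to drop BeautifulSoup and use plain string searching instead. As before, the single-threaded version is changed first: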

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests, time

# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# Collect the list of project URLs
def getItems(allpage):
    no = 0
    data = set()  # a set removes duplicate links
    for page in range(allpage):
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        # print url #①
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        start = 5000  # skip the page header before searching
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # print "http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n"
            # time.sleep(100)
            # The nine characters after "deal-show/" identify the project
            data.add("http://www.zhongchou.com/deal-show/" + html[index+10:index+19] + "\n")
            start = index + 1000  # jump ahead to the next card
    items = open('pystandard.txt', 'a')
    items.write("".join(data))
    items.close()
    return len(data)

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))

Results of three runs:

it takes 11.6800132309 Seconds to get 720 items

it takes 11.3621804427 Seconds to get 720 items

it takes 11.6811991567 Seconds to get 720 items
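
Dropping BeautifulSoup cut the single-threaded time from about 42 s to under 12 s, which suggests that parsing, rather than downloading, dominated the earlier runs. One possible explanation for the slow threaded BeautifulSoup run, offered here as an assumption and not something this post measured, is that CPU-bound pure-Python work does not run in parallel across threads on a standard CPython build with the global interpreter lock. The Python 3 sketch below lets readers check this on their own machine; count_matches() is a made-up CPU-bound stand-in for parsing, and all names are illustrative.

```python
import threading
import time


def count_matches(text, needle):
    """CPU-bound pure-Python loop, standing in for HTML parsing work."""
    hits = 0
    size = len(needle)
    for i in range(len(text) - size + 1):
        if text[i:i + size] == needle:
            hits += 1
    return hits


def run_sequential(texts, needle):
    """Process the texts one after another in the calling thread."""
    return [count_matches(t, needle) for t in texts]


def run_threaded(texts, needle):
    """Process each text in its own thread and collect the results by index."""
    results = [None] * len(texts)

    def task(i):
        results[i] = count_matches(texts[i], needle)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(len(texts))]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


if __name__ == "__main__":
    texts = ["<a href='deal-show'>x</a>" * 20000 for _ in range(4)]
    for label, fn in (("sequential", run_sequential), ("threaded", run_threaded)):
        t0 = time.perf_counter()
        fn(texts, "deal-show")
        print("%s: %.2fs" % (label, time.perf_counter() - t0))
```

If the two timings come out similar, the threads are taking turns on the CPU rather than working at the same time, and only the time spent waiting on the network can overlap.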

The multithreaded version was then changed in the same way, replacing BeautifulSoup with string search.


Results of three runs:

it takes 1.4781525123 Seconds to get 720 items

it takes 1.44905954029 Seconds to get 720 items

it takes 1.49297891786 Seconds to get 720 items

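Multithreading is indeed many times faster than a single thread here (about 1.5 s versus about 11.6 s). And for a simple crawling task, the built-in string methods are much faster than parsing the HTML with BeautifulSoup (about 11.6 s versus about 42 s single-threaded).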
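
A minimal sketch of how string-search extraction can be combined with concurrent downloads is given below, for Python 3. It is not the script that produced the timings above: it uses concurrent.futures.ThreadPoolExecutor instead of hand-managed threads, and it takes the download function as a parameter (fetch, a hypothetical name) so that the logic can be tried without touching the network. The offsets 5000, 1000 and the nine-character id slice are copied from the single-threaded listing.

```python
from concurrent.futures import ThreadPoolExecutor

BASE = "http://www.zhongchou.com/deal-show/"
MARKER = "deal-show"


def extract_deal_urls(html, start=5000, step=1000):
    """Find project links by plain string search, as in the listing above."""
    found = set()
    pos = start
    while True:
        index = html.find(MARKER, pos)
        if index == -1:
            break
        # The nine characters after "deal-show/" are taken as the project id.
        found.add(BASE + html[index + 10:index + 19])
        pos = index + step
    return found


def page_urls(allpage):
    """Build the listing-page URLs; the first page has no -pN suffix."""
    first = "http://www.zhongchou.com/browse/di"
    return [first if p == 0 else "%s-p%d" % (first, p + 1) for p in range(allpage)]


def crawl(urls, fetch, workers=30):
    """Download pages concurrently with fetch(url) -> str and merge the links."""
    merged = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for links in pool.map(lambda u: extract_deal_urls(fetch(u)), urls):
            merged |= links
    return merged
```

With the requests library installed, a call such as crawl(page_urls(30), lambda u: requests.get(u, headers=headers).text) would mirror the experiment. Collecting results in the main thread avoids the shared counter and lock used in the earlier multithreaded listing.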

NodeJs


Results of five runs of the Node.js version that parses the HTML:

it takes 3.949 Seconds to get 720 items

it takes 3.642 Seconds to get 720 items

it takes 3.641 Seconds to get 720 items

it takes 3.938 Seconds to get 720 items

it takes 3.783 Seconds to get 720 items

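With the same HTML-parsing approach, Node.js thoroughly outperforms Python: roughly 3.6 to 3.9 s here, against more than 40 s for the BeautifulSoup versions. What about string search?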

var request = require("request");
var cheerio = require('cheerio'); // not used in this string-search version
var fs = require('fs');

var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array();
var urldata = new Array();
var mark = 0; // number of pages processed so far
var no = 0;

for (var i = 0; i < allpage; i++) {
    if (i == 0)
        urllist[i] = 'http://www.zhongchou.com/browse/di';
    else
        urllist[i] = 'http://www.zhongchou.com/browse/di-p' + (i + 1).toString();
    // console.log(urllist[i]);
    // Requests are issued without waiting; callbacks run as responses arrive
    request(urllist[i], function (error, resp, body) {
        if (!error && resp.statusCode == 200) {
            getUrl(body);
        }
    });
}

function getUrl(data) {
    mark += 1;
    var start = 5000;
    while (true) {
        var index1 = data.indexOf("deal-show", start);
        if (index1 == -1)
            break;
        var url = "http://www.zhongchou.com/deal-show/" + data.substring(index1 + 10, index1 + 19) + "\n";
        // console.log(url);
        // Skip links that were already collected
        if (urldata.indexOf(url) == -1) {
            urldata.push(url);
        }
        start = index1 + 1000;
    }
    if (mark == allpage) { // all pages have been processed
        // console.log(urldata);
        no = urldata.length;
        fs.writeFile('./nodestandard.txt', urldata.join(""), function (err) {
            if (err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2 - t1) / 1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}

Results of five runs:

it takes 3.695 Seconds to get 720 items

it takes 3.781 Seconds to get 720 items

it takes 3.94 Seconds to get 720 items

it takes 3.705 Seconds to get 720 items

it takes 3.601 Seconds to get 720 items

So in Node.js, string search takes about the same time as HTML parsing.

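Summary

Based on what I know and on this experiment, my conclusion is this: with multithreading, Python's download speed can beat Node.js, but Python is not as fast as Node.js at parsing web pages. After all, JavaScript was created for writing web pages, and a complex crawler cannot do everything with string searches.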

Reposted from 演道. For more timely articles on internet products and technology, visit http://go2live.cn

About The Author

bjmayor

Programmer and coder (PHP, Python, iOS, Android, Go), product manager, entrepreneur.
