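Automatically Scraping Blog Posts with colly
July 26, 2015
colly holds roughly the same position in Go that Scrapy holds in Python: both are heavyweights of the crawler world. This post uses colly to scrape blog post information, covering collector configuration, DOM node extraction with goquery, automatic pagination, CSV persistence, and JSON console output. The whole process is simple and direct.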
Code
The entry point for scraping is a community user's blog list page, for example https://learnku.com/blog/pardon
package main

import (
	"encoding/csv"
	"encoding/json"
	"log"
	"os"
	"regexp"
	"strconv"
	"strings"

	"github.com/gocolly/colly"
)

// Article holds the data scraped for one blog post.
type Article struct {
	ID       int    `json:"id,omitempty"`
	Title    string `json:"title,omitempty"`
	URL      string `json:"url,omitempty"`
	Created  string `json:"created,omitempty"`
	Reads    string `json:"reads,omitempty"`
	Comments string `json:"comments,omitempty"`
	Feeds    string `json:"feeds,omitempty"`
}

// idRe matches a run of digits; it is compiled once instead of per element.
var idRe = regexp.MustCompile(`\d+`)

// csvSave persists the scraped articles to a CSV file.
func csvSave(fName string, data []Article) error {
	file, err := os.Create(fName)
	if err != nil {
		return err
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	if err := writer.Write([]string{"ID", "Title", "URL", "Created", "Reads", "Comments", "Feeds"}); err != nil {
		return err
	}
	for _, v := range data {
		row := []string{strconv.Itoa(v.ID), v.Title, v.URL, v.Created, v.Reads, v.Comments, v.Feeds}
		if err := writer.Write(row); err != nil {
			return err
		}
	}
	// Flush buffered rows and report any write error.
	writer.Flush()
	return writer.Error()
}

func main() {
	articles := make([]Article, 0, 200)

	// 1. Prepare the collector instance.
	c := colly.NewCollector(
		// Enable local debugging:
		// colly.Debugger(&debug.LogDebugger{}),
		colly.AllowedDomains("learnku.com"),
		// Cache responses so pages are not downloaded twice:
		// colly.CacheDir("./learnku_cache"),
	)

	// 2. Parse the page data.
	c.OnHTML("div.blog-article-list > .event", func(e *colly.HTMLElement) {
		article := Article{
			Title: e.ChildText("div.content > div.summary"),
			URL:   e.ChildAttr("div.content a.title", "href"),
			Feeds: e.ChildText("div.item-meta > a:first-child"),
		}
		// Walk the links in the same group and pick fields by position.
		e.ForEach("div.content > div.meta > div.date>a", func(i int, el *colly.HTMLElement) {
			switch i {
			case 1:
				article.Created = el.Attr("data-tooltip")
			case 2:
				// Split on whitespace and keep the second token.
				if f := strings.Fields(el.Text); len(f) > 1 {
					article.Reads = f[1]
				}
			case 3:
				if f := strings.Fields(el.Text); len(f) > 1 {
					article.Comments = f[1]
				}
			}
		})
		// Extract the numeric ID from the URL and convert it to an int.
		if m := idRe.FindString(article.URL); m != "" {
			article.ID, _ = strconv.Atoi(m)
		}
		articles = append(articles, article)
	})

	// Follow pagination links.
	c.OnHTML("a[href].page-link", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Start crawling.
	if err := c.Visit("https://learnku.com/blog/pardon"); err != nil {
		log.Fatal(err)
	}

	// Output.
	if err := csvSave("pardon.csv", articles); err != nil {
		log.Fatalf("Cannot save %q: %s", "pardon.csv", err)
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", " ")
	enc.Encode(articles)

	// Print the collector's statistics.
	log.Println(c)
}
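The ID and counter extraction in the callback comes down to two string operations: taking the first run of digits from the article URL, and taking the second whitespace-separated token from a meta link's text. Below is a standalone sketch of both, with guards for missing matches. The helper names `articleID` and `secondField` and the sample text `"reads 650"` are assumptions for illustration; they are not part of the original program or the site's actual markup.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// digits matches a run of decimal digits.
var digits = regexp.MustCompile(`\d+`)

// articleID (hypothetical helper) returns the first run of digits in url as an int.
func articleID(url string) (int, error) {
	m := digits.FindString(url)
	if m == "" {
		return 0, fmt.Errorf("no numeric ID in %q", url)
	}
	return strconv.Atoi(m)
}

// secondField (hypothetical helper) returns the second whitespace-separated
// token of text, or "" when there are fewer than two tokens.
func secondField(text string) string {
	f := strings.Fields(text)
	if len(f) < 2 {
		return ""
	}
	return f[1]
}

func main() {
	id, err := articleID("https://learnku.com/articles/30604")
	fmt.Println(id, err) // 30604 <nil>

	// Assumed sample text; surrounding whitespace is ignored by strings.Fields.
	fmt.Println(secondField("  reads   650 ")) // 650
}
```

Returning an error or an empty string keeps one malformed list entry from aborting the whole crawl with an index-out-of-range panic.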
Output
Console output
....
  "id": 30604,
  "title": "教程: TodoMVC 与 director 路由",
  "url": "https://learnku.com/articles/30604",
  "created": "2019-07-01 12:42:01",
  "reads": "650",
  "comments": "0",
  "feeds": "0"
 },
 {
  "id": 30579,
  "title": "flaskr 进阶笔记",
  "url": "https://learnku.com/articles/30579",
  "created": "2019-06-30 19:01:04",
  "reads": "895",
  "comments": "0",
  "feeds": "0"
 },
 {
  "id": 30542,
  "title": "教程 Redis+ flask+vue 在线聊天",
  "url": "https://learnku.com/articles/30542",
  "created": "2019-06-29 12:19:45",
  "reads": "2760",
  "comments": "1",
  "feeds": "2"
 }
]
2019/12/20 15:50:14 Requests made: 5 (5 responses) | Callbacks: OnRequest: 0, OnHTML: 2, OnResponse: 0, OnError: 0
CSV file output
ID,Title,URL,Created,Reads,Comments,Feeds
37991,ferret 爬取动态网页,https://learnku.com/articles/37991,2019-12-15 10:43:03,219,0,3
37803,匿名类 与 索引重建,https://learnku.com/articles/37803,2019-12-09 19:35:09,323,1,0
37476,大话并发,https://learnku.com/articles/37476,2019-12-08 21:17:55,612,0,4
37738,三元运算符,https://learnku.com/articles/37738,2019-12-08 09:44:36,606,0,0
37719,笔试之 模板变量替换,https://learnku.com/articles/37719,2019-12-07 18:30:42,843,0,0
37707,笔试之 连续数增维,https://learnku.com/articles/37707,2019-12-07 13:50:17,872,0,0
37616,笔试之 一行代码求重,https://learnku.com/articles/37616,2019-12-05 12:10:24,792,0,0
....
Colly
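- Clean API
- Fast (>1k requests/sec on a single core)
- Manages request delays and maximum concurrency per domain
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Distributed scraping
- Automatic encoding of non-unicode responses
- Robots.txt support
- Google App Engine support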
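
The request-delay and per-domain concurrency features in this list are configured through a limit rule on the collector, usually together with asynchronous mode. Below is a minimal sketch that assumes the same learnku.com entry page as the example above. The delay and parallelism values are arbitrary illustrations, not settings from the original program.

```go
package main

import (
	"log"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("learnku.com"),
		// Asynchronous mode: Visit returns immediately, Wait blocks until done.
		colly.Async(true),
	)

	// Assumed values: at most 2 concurrent requests to matching domains,
	// with a 1 second delay between requests.
	err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*learnku.com*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}

	c.OnResponse(func(r *colly.Response) {
		log.Println("fetched", r.Request.URL, r.StatusCode)
	})

	if err := c.Visit("https://learnku.com/blog/pardon"); err != nil {
		log.Fatal(err)
	}
	// Block until all queued requests have finished.
	c.Wait()
}
```

In asynchronous mode, callbacks may run concurrently, so shared state such as the `articles` slice in the main example would need a mutex.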