vscrawler

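vscrawler is a crawler framework built for targeted data extraction. It is not a textbook crawler; strictly speaking, it is not a crawler at all. It has no notion of breadth-first traversal, and the URLs it deals with are not a topology graph of the web but a series of clearly defined extraction tasks.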

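A key feature of vscrawler is that it puts downloading and parsing in the same component, and it supports multi-user login by design. vscrawler was designed to fill gaps in webmagic in certain areas, although many of its own ideas are borrowed from webmagic. Thanks to Huang, the author of webmagic.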

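I started writing vscrawler after running into three problems while scraping 企信宝 (Qixinbao): getting past slider CAPTCHAs, multi-user login, and extraction across complex multi-step flows. vscrawler uses dungproxy as its network-layer API, so proxy support is built in. For now it is a small framework that I put together in under two days, so it may be rough in places, but it will keep improving.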

Maven coordinates: snapshot version

    <dependencies>
        <dependency>
            <groupId>com.virjar</groupId>
            <artifactId>vscrawler-core</artifactId>
            <version>0.0.1-SNAPSHOT</version>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <id>ossrh</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        </repository>
    </repositories>
    

Maven coordinates: release version

Coming soon.

vscrawler features

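Merged download and parsing components: In real scraping, data often cannot be obtained with a simple GET request as in an ordinary crawler. You may have to work through a complex authentication flow, and state is often linked across several pages. Fetching and parsing depend on each other, and it can take several rounds before the data is available. Merging download and parsing therefore gives users the most room to extend the authentication flow.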
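
To illustrate why interleaving fetching and parsing helps, here is a minimal sketch of a two-step flow in which the second request needs a token parsed out of the first response. Every name in it (`PageFetcher`, `TokenFlowProcessor`, the example.com URLs, the `token` form field) is an assumption made for illustration and is not part of the vscrawler API. The network is replaced by an in-memory fake so the sketch runs offline.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; this is not the vscrawler API.
interface PageFetcher {
    // Stand-in for the network layer: returns the body for a URL.
    String fetch(String url);
}

class TokenFlowProcessor {
    private static final Pattern TOKEN =
            Pattern.compile("name=\"token\" value=\"([^\"]+)\"");
    private final PageFetcher fetcher;

    TokenFlowProcessor(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    // Download and parse are interleaved: the second request depends on
    // a value parsed out of the first response.
    String process(String keyword) {
        String formPage = fetcher.fetch("http://example.com/search-form");
        Matcher m = TOKEN.matcher(formPage);
        if (!m.find()) {
            throw new IllegalStateException("token not found in form page");
        }
        String token = m.group(1);
        return fetcher.fetch(
                "http://example.com/search?q=" + keyword + "&token=" + token);
    }
}

class TokenFlowDemo {
    // In-memory fake site so the sketch runs without network access.
    static PageFetcher fakeSite() {
        final Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/search-form",
                "<input name=\"token\" value=\"abc123\">");
        pages.put("http://example.com/search?q=java&token=abc123",
                "result for java");
        return new PageFetcher() {
            @Override
            public String fetch(String url) {
                String body = pages.get(url);
                return body == null ? "403 forbidden" : body;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(new TokenFlowProcessor(fakeSite()).process("java"));
    }
}
```

If downloading and parsing were separate pipeline stages, the token would have to be passed between them through some shared state; keeping both in one method keeps the flow readable.
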
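Multi-user isolation and login-state maintenance: vscrawler wraps the concept of a session. A session is one successfully logged-in user whose state is fully isolated from every other user's. Concurrent login and scraping by multiple users are fully supported.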
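
One way to picture session isolation is to give every logged-in user a private cookie store. The sketch below does this with the JDK's `java.net.CookieManager`. The `UserSession` class, the `SESSIONID` cookie name, and the example URL are assumptions for illustration; vscrawler's real session type may look quite different.

```java
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

// Hypothetical class; not taken from the vscrawler source.
class UserSession {
    private final String userName;
    // Each session owns its cookie store, so login state never leaks
    // from one user to another.
    private final CookieManager cookieManager = new CookieManager();

    UserSession(String userName) {
        this.userName = userName;
    }

    String userName() {
        return userName;
    }

    // Simulates a successful login by storing a session cookie.
    void login(String site, String sessionId) {
        cookieManager.getCookieStore()
                .add(URI.create(site), new HttpCookie("SESSIONID", sessionId));
    }

    // All cookies currently held by this session only.
    List<HttpCookie> cookies() {
        return cookieManager.getCookieStore().getCookies();
    }
}

class UserSessionDemo {
    public static void main(String[] args) {
        UserSession alice = new UserSession("alice");
        UserSession bob = new UserSession("bob");
        alice.login("http://example.com", "s-1");
        System.out.println(alice.userName() + " cookies: " + alice.cookies().size());
        System.out.println(bob.userName() + " cookies: " + bob.cookies().size());
    }
}
```
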
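Abstract, fine-grained task seeds: A vscrawler seed is defined as a string rather than a URL. An ordinary crawler does a breadth-first traversal of the web starting from URLs. In a scraping scenario, however, a seed may be a page number, a keyword, a user plus a keyword, or a URL plus a timestamp (to scrape the same URL at different times). Newly produced seeds need not be URLs either; they can be keywords and so on. vscrawler's scheduling is thus a sequence of clearly targeted request-handling steps. It does not start from the home page as webmagic does and crawl for a long time without reaching the data of interest.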
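
A rough sketch of string seeds, assuming a made-up `keyword#page` seed format and a fixed page limit: the handler receives a seed, would fetch and parse the matching page, and returns follow-up seeds that are themselves plain strings rather than URLs. The class and method names are hypothetical and are not vscrawler's scheduler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: seeds are opaque strings, here "keyword#page".
class KeywordPageSeeds {
    static final int LAST_PAGE = 3; // assumed page limit for the demo

    // Handles one seed and returns follow-up seeds (which are not URLs).
    static List<String> handle(String seed) {
        String[] parts = seed.split("#");
        String keyword = parts[0];
        int page = Integer.parseInt(parts[1]);
        // A real handler would fetch and parse this page of results here.
        if (page < LAST_PAGE) {
            return Collections.singletonList(keyword + "#" + (page + 1));
        }
        return Collections.emptyList();
    }

    // Drains the queue, feeding newly produced seeds back in.
    static List<String> run(String firstSeed) {
        Deque<String> queue = new ArrayDeque<>();
        List<String> processed = new ArrayList<>();
        queue.add(firstSeed);
        while (!queue.isEmpty()) {
            String seed = queue.poll();
            processed.add(seed);
            queue.addAll(handle(seed));
        }
        return processed;
    }

    public static void main(String[] args) {
        // Prints [vscrawler#1, vscrawler#2, vscrawler#3]
        System.out.println(run("vscrawler#1"));
    }
}
```
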
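Proxy IP pool: vscrawler uses dungproxy as its proxy pool. dungproxy is a free proxy IP pool with its own free proxy resources, which is enough to get past ordinary blocking. As a proxy-pool solution, dungproxy also handles other proxy needs gracefully, such as plugging in other proxy sources. Its sequential-penalty container algorithm greatly improves the availability of high-quality IPs.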
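Event loop: vscrawler implements an event loop. While the crawler runs, event messages can be defined and hooked in easily and with low coupling, which makes it easy to extend functionality. For example: if a user's password suddenly becomes wrong, the account is unusable, and a listener for that event can notify an administrator and update the user's status in the database; if requests suddenly all fail and the failure-event rate rises, the target site may be down; or a listener can watch the crawler's configuration file and adjust crawler parameters dynamically.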
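
The idea of an event loop can be sketched with a single dispatch thread and per-topic handler lists. This is only an assumed minimal design, not vscrawler's implementation; `SimpleEventLoop`, `EventHandler`, and the topic name `user.passwordError` are invented for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical minimal event loop; names are illustrative only.
interface EventHandler {
    void onEvent(String topic, Object payload);
}

class SimpleEventLoop {
    private final Map<String, List<EventHandler>> handlers = new ConcurrentHashMap<>();
    // One dispatch thread: events are delivered in the order they were posted.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    void register(String topic, EventHandler handler) {
        handlers.computeIfAbsent(topic, k -> new CopyOnWriteArrayList<>()).add(handler);
    }

    // Returns immediately; the posting thread does not wait for handlers.
    void post(final String topic, final Object payload) {
        dispatcher.execute(new Runnable() {
            @Override
            public void run() {
                List<EventHandler> list = handlers.get(topic);
                if (list == null) {
                    return; // nobody is interested in this topic
                }
                for (EventHandler h : list) {
                    h.onEvent(topic, payload);
                }
            }
        });
    }

    // Stops accepting events and waits for queued ones to be delivered.
    boolean shutdownAndWait(long millis) {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class SimpleEventLoopDemo {
    public static void main(String[] args) {
        SimpleEventLoop loop = new SimpleEventLoop();
        loop.register("user.passwordError",
                (topic, payload) -> System.out.println("disable account: " + payload));
        loop.post("user.passwordError", "alice");
        loop.shutdownAndWait(1000);
    }
}
```

Because the crawler only posts events and handlers run on the dispatch thread, new reactions (notifying an administrator, updating a database) can be added without touching the code that detects the problem.
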
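Declarative, auto-registered events: On top of the event loop, and with the help of dynamic proxies, vscrawler implements an interface/implementation event extension mechanism. The sender dispatches an event by calling a method on an interface, and every class that implements that interface receives the event message. This makes extension points easy to mount, define, unmount, and invoke.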
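
The interface-based dispatch described here can be approximated with the JDK's `java.lang.reflect.Proxy`. In the sketch below, calling a method on the proxy returned by `sender` invokes the same method on every registered object that implements the interface. `InterfaceEventBus` and `UserStateListener` are hypothetical names, and the sketch dispatches synchronously for brevity, which the real framework may not do.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical event interface; vscrawler's own interfaces may differ.
interface UserStateListener {
    void onPasswordError(String userName);
}

class InterfaceEventBus {
    private final List<Object> listeners = new CopyOnWriteArrayList<>();

    void register(Object listener) {
        listeners.add(listener);
    }

    // Returns a proxy of the event interface. Calling a method on it
    // invokes the same method on every registered implementation.
    <T> T sender(final Class<T> eventType) {
        InvocationHandler handler = new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args)
                    throws Throwable {
                if (method.getDeclaringClass() == Object.class) {
                    // toString/hashCode/equals are answered by the handler itself.
                    return method.invoke(this, args);
                }
                for (Object listener : listeners) {
                    if (eventType.isInstance(listener)) {
                        try {
                            method.invoke(listener, args);
                        } catch (InvocationTargetException e) {
                            throw e.getCause(); // surface the listener's own exception
                        }
                    }
                }
                return null; // event methods are assumed to return void
            }
        };
        return eventType.cast(Proxy.newProxyInstance(
                eventType.getClassLoader(), new Class<?>[] {eventType}, handler));
    }
}

class InterfaceEventBusDemo {
    public static void main(String[] args) {
        InterfaceEventBus bus = new InterfaceEventBus();
        bus.register((UserStateListener) user -> System.out.println("notify admin: " + user));
        bus.register((UserStateListener) user -> System.out.println("mark disabled: " + user));
        bus.sender(UserStateListener.class).onPasswordError("alice");
    }
}
```
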
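Hot configuration: All vscrawler parameters are controlled dynamically through a configuration file. Unlike webmagic, vscrawler lets you change crawler parameters while the crawler is running; changes take effect immediately, and an event is sent to every interested component inside the crawler. This mechanism is built on the automatic event mechanism and is one application of the event loop.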
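
A hedged sketch of hot configuration: a holder swaps in freshly parsed properties and notifies listeners. How the change is detected (file watcher, polling timer) is left out, and `HotConfig`, `ConfigChangeListener`, and the `threadNumber` key are assumptions rather than vscrawler's actual names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical listener type for configuration changes.
interface ConfigChangeListener {
    void onConfigChange(Properties oldConfig, Properties newConfig);
}

class HotConfig {
    private volatile Properties current = new Properties();
    private final List<ConfigChangeListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(ConfigChangeListener listener) {
        listeners.add(listener);
    }

    String get(String key, String defaultValue) {
        return current.getProperty(key, defaultValue);
    }

    // Called whenever the config file is seen to have changed; a file
    // watcher or polling timer would pass the new file content in here.
    void reload(String propertiesText) {
        Properties fresh = new Properties();
        try {
            fresh.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable config", e);
        }
        Properties old = current;
        current = fresh; // readers see the new values from now on
        for (ConfigChangeListener listener : listeners) {
            listener.onConfigChange(old, fresh);
        }
    }
}

class HotConfigDemo {
    public static void main(String[] args) {
        HotConfig config = new HotConfig();
        config.addListener((oldC, newC) ->
                System.out.println("threadNumber is now " + newC.getProperty("threadNumber")));
        config.reload("threadNumber=10\n");
    }
}
```
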
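Automatic expression evaluation: For convenience, a numeric value in the configuration file can be written as an expression. For example, one hour can be configured as 1 * 60 * 60 * 1000, and the system converts it to 3600000.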
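
How such a value could be evaluated is sketched below with a small recursive-descent parser for integer expressions using `+ - * /` and parentheses. The source does not say which operators vscrawler supports or how it evaluates them, so the grammar and the `ConfigExpression` class are assumptions.

```java
// Hypothetical evaluator for integer expressions with + - * / and parentheses.
class ConfigExpression {
    private final String text;
    private int pos;

    private ConfigExpression(String text) {
        this.text = text;
    }

    static long eval(String expression) {
        ConfigExpression parser = new ConfigExpression(expression);
        long value = parser.sum();
        parser.skipSpaces();
        if (parser.pos != parser.text.length()) {
            throw new IllegalArgumentException("unexpected input at " + parser.pos);
        }
        return value;
    }

    // sum := product (('+' | '-') product)*
    private long sum() {
        long value = product();
        while (true) {
            if (accept('+')) {
                value += product();
            } else if (accept('-')) {
                value -= product();
            } else {
                return value;
            }
        }
    }

    // product := atom (('*' | '/') atom)*
    private long product() {
        long value = atom();
        while (true) {
            if (accept('*')) {
                value *= atom();
            } else if (accept('/')) {
                value /= atom();
            } else {
                return value;
            }
        }
    }

    // atom := digits | '(' sum ')'
    private long atom() {
        if (accept('(')) {
            long value = sum();
            if (!accept(')')) {
                throw new IllegalArgumentException("missing ')' at " + pos);
            }
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at " + pos);
        }
        return Long.parseLong(text.substring(start, pos));
    }

    // Consumes the character if it is next (ignoring spaces).
    private boolean accept(char c) {
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("1 * 60 * 60 *1000")); // prints 3600000
    }
}
```
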
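Dynamic thread pool: Suppose the crawler is fetching too fast and many requests fail, or it is fetching too slowly and you want to speed it up. Previously you might have had to stop the crawler and change the program (or, with a better design, read the thread count from a configuration file). That is no longer necessary: edit the thread count in vscrawler's configuration file and the thread pool immediately resizes to the configured number.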
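
With the JDK's `ThreadPoolExecutor`, a running pool can be resized through `setCorePoolSize` and `setMaximumPoolSize`. The sketch below wraps that in a hypothetical `ResizableCrawlerPool`; it shows one possible way to get the behavior described and is not taken from the vscrawler source. The two setters are called in an order that keeps the core size from exceeding the maximum at any point.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around a resizable JDK thread pool.
class ResizableCrawlerPool {
    private final ThreadPoolExecutor executor;

    ResizableCrawlerPool(int threads) {
        executor = new ThreadPoolExecutor(threads, threads, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    // Order matters: the core size must never exceed the maximum size.
    synchronized void resize(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        if (threads > executor.getMaximumPoolSize()) {
            executor.setMaximumPoolSize(threads); // grow: raise the ceiling first
            executor.setCorePoolSize(threads);
        } else {
            executor.setCorePoolSize(threads);    // shrink: lower the floor first
            executor.setMaximumPoolSize(threads);
        }
    }

    int configuredThreads() {
        return executor.getCorePoolSize();
    }

    void submit(Runnable task) {
        executor.execute(task);
    }

    void shutdown() {
        executor.shutdown();
    }
}

class ResizableCrawlerPoolDemo {
    public static void main(String[] args) {
        ResizableCrawlerPool pool = new ResizableCrawlerPool(4);
        pool.resize(16); // e.g. triggered by a configuration-change event
        System.out.println("threads: " + pool.configuredThreads());
        pool.shutdown();
    }
}
```

When the pool shrinks, tasks that are already running are not interrupted; surplus threads exit once they become idle.
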
And more: suggestions for other cool features are welcome.

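Why does vscrawler exist? It is not meant to take work away from existing crawler frameworks. The existing frameworks simply could not solve the problems I was facing, so I wrote my own.

One thing to state up front: I am not trying to save the world. The main purpose of this framework is to suit my own use. If you have a good idea or an issue, you are welcome to submit it, but I have no obligation to solve your usage problems. I say this because vscrawler is defined as a data-extraction tool rather than a simple crawler. Between flexibility and simplicity it usually chooses flexibility, which makes it harder to use than frameworks such as webmagic and WebCollector.

vscrawler is not an entry-level crawler framework and is not for beginners. If you find yourself asking questions like the following, this crawler is probably not for you: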

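  • My crawler keeps throwing timeout exceptions. What is the cause?
  • How do I submit a POST request?
  • The HTML is generated by JavaScript. How do I crawl it?
  • Why is the data missing from the page I downloaded when the browser shows it?
  • How do I download images?
  • The page requires login. What should I do?
  • I hit a CAPTCHA while scraping. What should I do?
  • Can someone help? My crawler suddenly started throwing socket timeout exceptions. What is going on, has my IP been blocked? Opening Baidu with HtmlUnit's getPage works fine, but opening the page I want to crawl throws a socket timeout exception.
  • How do I get this title?
  • How should I write the XPath rule?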

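You should be able to work out the answers to these questions yourself without asking others; otherwise the question is too broad for anyone else to answer. Anyone who asks questions like these will probably find vscrawler hard to use.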

Comparison and rants

Some people like to reinvent the wheel. As I see it, the reasons are:

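  • To learn: imitating a mature framework to build a demo, mostly as a student. Interviewers love to see this when you are job hunting.
  • To show off.
  • The products on the market are awkward to use: making an existing framework meet your needs would mean using it in ugly ways.
  • Not knowing the products on the market: someone else already has a complete implementation and you are not aware of it.
  • An implementation already exists, but you are too proud to use it: you build your own because it feels more comfortable to use.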

So what was the motive for writing vscrawler? Well, showing off.

About the author

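Why show off? Because I am too young. A while ago I wrote a proxy-layer middleware called dungproxy, and some users were quite worried when they found out they were using a framework written in the spare time of someone who had finished a bachelor's degree less than six months earlier. They felt adopting it was risky. So let me state up front: I graduated less than a year ago, and young people tend to be brash. If you think vscrawler's design is unreliable, you may well be right.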

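Author profile: graduated in 2016 from the Software College of Sichuan University. A long-time "expert" without engineering skills. Likes writing tools. Good at C, but has not written C in years. Currently works mainly in Java. Open-source projects so far: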

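Project | Description
spring, springMVC, mybatis code generator | University graduation project. The university says it holds the copyright on graduation projects, but all it did was put this one in an archive, so I will tweak it a little and make it my own.
dungproxy | Proxy IP middle layer. It grew out of a project requirement from my manager during a company internship in my third year of university. After much polishing, it is now used in production by some companies. Its server side is built on ssm-gen.
vscrawler | Scraping framework with built-in strategies for dealing with anti-crawler blocking, making it easy to obtain data from target sites flexibly. Not yet ready for production. Its network layer uses dungproxy.
jscrack | JS injection tool. Injects custom JS files into a target site in order to crack the site's encryption and decryption protocol.
jsrepair | JS deobfuscation tool. Beautifies obfuscated and minified JS code; main features are precomputation, formatting, and unrolling of comma expressions and ternary operators.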

More comparisons

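webmagic: As far as I know, many people use webmagic in production. Its code structure is excellent, its architecture is clean and clear, and it is fairly easy to extend. Its xsoup extractor is powerful and convenient.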

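WebCollector: Judging by the package name, this was probably also started by a student at university, perhaps at Hefei University of Technology? I think WebCollector's use of Berkeley DB is a good choice: it suits managing large numbers of tasks and resuming interrupted crawls. But because it uses the JDK's built-in URLConnection for network access, and because of how the code is structured, I find it hard to extend. In addition, its methods are too long, and it relies heavily on abstract classes, which Java's single inheritance makes awkward to extend.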

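SpiderMan: In an early introduction to webmagic, its author mentioned drawing on SpiderMan, which suggests SpiderMan is a fairly old crawler framework. Its distinguishing feature is that crawlers are driven by configuration files. Many people want this, but it seems hard to do well, because convenience and flexibility are difficult to balance. SpiderMan's configuration files inevitably reduce flexibility, and forcing in rules that XML cannot easily describe would make SpiderMan's syntax extremely complex. Most people probably have little desire to learn its XML syntax, since it is not an existing standard but something SpiderMan defines itself. Still, SpiderMan probably has quite a few people studying it: many so-called team leads hand this kind of work to their programmers. The lead decides a crawler is best implemented through configuration and tells the team to build it, and a quick Baidu search turns up SpiderMan. Another point: SpiderMan only supports Java 1.8, favors functional style, defines a lot of inner classes, and its code structure is not very clear. Bear in mind that most companies in China are probably still standardized on 1.7, so I expect quite a few people will get burned.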

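gatherplatform: I have not looked at it closely. The author has a very impressive QQ number, rarely speaks in the group, and when he does it is only to advertise; I have never seen him discuss technical matters or answer questions. From the description it also looks like a CMS-oriented crawler, fine for crawling articles and news. My point is that if you would use this, you might consider a crawler platform called 神箭手 (Shenjianshou) instead.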

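seimicrawler: I admire this one a lot. Sadly it arrived too late, because it follows almost the same path as webmagic (inspired by Python's Scrapy, and, finding XPath inconvenient, building a separate extension on jsoup that combines XPath with cssQuery). The author is an expert who, to solve the dynamic-page problem, wrapped a browser himself. Everyone else can only drive a local browser from the JDK, whereas his approach removes the Java layer and communicates directly across languages and machines. As I understand it, this is the most stable of all the browser-based solutions. His jsoupXpath is also more complete than webmagic's xsoup. It will be a key project to learn from for vscrawler's future architecture.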

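References

  • Main architecture: based on webmagic, keeping concepts such as process and pipeline
  • Seed management: based on WebCollector, using BerkeleyDB to manage URLs
  • Distributed solution: will be based on elastic-job (planned)
  • Extractor: will integrate jsoupXpath (planned)
  • Multi-site crawling: based on geccocrawler, using classloader hot loading (planned)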
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. 
For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: You must give any other recipients of the Work or Derivative Works a copy of this License; and You must cause any modified files to carry prominent notices stating that You changed the files; and You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; 
within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. 
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. 
END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "{}" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright 2017 virjar Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
