yangyn / cockroach

forked from Microgoople / cockroach 

cockroach crawler: yet another Java crawler implementation

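Introduction

cockroach ("Xiaoqiang", a Chinese nickname for the cockroach): I no longer remember why I picked this name. It is long and hard to remember, and misspelling it cost me a fair amount of time while coding.

This project is yet another side project I have started and may never finish. I have plenty of those already, so one more makes little difference.

It is a small, flexible, and robust crawler framework. Let's call it a framework for now.

How simple is it? A few lines of code are enough to create a crawler.

Environment

  • Java 8 (the program uses some features introduced in Java 8)
  • Maven

The sections below go through these qualities one by one:

Small

Create a new Maven project and add the dependency to the pom file: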

<dependency>
  <groupId>com.github.zhangyingwei</groupId>
  <artifactId>cockroach</artifactId>
  <version>1.0.5-Beta</version>
</dependency>

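If I ever forget to update this document, remember to use the latest version. The latest version. Latest.

Create a test class App.java in the project and add a main method.

Example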

public static void main(String[] args){
    CockroachConfig config = new CockroachConfig()
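                    .setAppName("I am a cockroach")
                    .setThread(2); // number of crawler threads
    CockroachContext context = new CockroachContext(config);
    TaskQueue queue = TaskQueue.of();
    context.start(queue);
    
    // The code above is already a complete crawler. The code below acts as a producer that writes tasks to the queue; as soon as a task is written, the crawler fetches it.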
    new Thread(() -> {
        int i = 0;
        while(true){
            i++;
            try {
                Thread.sleep(1000);
                String url = "http://www.xicidaili.com/wt/"+i;
                System.out.println(url);
                queue.push(new Task(url));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            if (i > 1000) {
                break;
            }
        }
    }).start();
}
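
The producer/consumer arrangement in the example can be illustrated without the library. The sketch below is not cockroach's implementation: it uses `java.util.concurrent.LinkedBlockingQueue` as a stand-in for `TaskQueue`, and the names `QueueSketch`, `crawlAll`, and the `STOP` marker are made up for illustration. Worker threads block on the queue and pick up a URL as soon as one is pushed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the TaskQueue/worker pairing; not the library's code.
class QueueSketch {
    // Sentinel value that tells a worker to stop.
    private static final String STOP = "__stop__";

    // Starts `workers` consumer threads, feeds them `urls`, and returns what they handled.
    static List<String> crawlAll(List<String> urls, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> handled = new CopyOnWriteArrayList<>();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take(); // blocks until a task is pushed
                        if (STOP.equals(url)) {
                            return;
                        }
                        handled.add(url); // a real worker would fetch the page here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (String url : urls) {
            queue.add(url); // producer side: tasks are consumed as they arrive
        }
        for (int i = 0; i < workers; i++) {
            queue.add(STOP); // one stop marker per worker
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> done = crawlAll(Arrays.asList("http://example.com/1", "http://example.com/2"), 2);
        System.out.println(done.size() + " tasks handled");
    }
}
```

Because the stop markers are queued after all URLs, every URL is taken before any worker exits. The order in which URLs are handled is not guaranteed.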

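Flexible

So where does the flexibility come in?

  • A custom HTTP client can be plugged in (optional; okhttp3 is used by default)
  • Result handling can be customized (optional; a printing handler is used by default)

Custom HTTP client

First, let's try a custom client: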

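import java.util.Map;

public class SelfHttpClient implements HttpClient {
    public HttpClient setProxy(HttpProxy proxy){
        // store the proxy settings here
        return this; // returning this is assumed, to allow chaining
    }

    public TaskResponse doGet(Task task) throws Exception{
        // implement the GET request here
        throw new UnsupportedOperationException("doGet is not implemented yet");
    }

    public HttpClient proxy(){
        // apply the proxy to the HTTP client here
        return this;
    }

    public TaskResponse doPost(Task task) throws Exception{
        // implement the POST request here
        throw new UnsupportedOperationException("doPost is not implemented yet");
    }

    public HttpClient setCookie(String cookie){
        // store the cookie here
        return this;
    }

    public HttpClient setHttpHeader(Map<String, String> httpHeader){
        // store the headers here
        return this;
    }
}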

Apply the custom HTTP client to the crawler

CockroachConfig config = new CockroachConfig()
    .setAppName("I am a cockroach")
    .setThread(2) // number of crawler threads
    .setHttpClient(SelfHttpClient.class);

Custom result handler

public class SelfStore implements IStore {
    @Override
    public void store(TaskResponse response) {
        System.out.println(response.getContent());
    }
}

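Here the result is simply printed. In a real application it could be saved to a database, a file, and so on. It is worth noting that when the result is HTML text, select("css selector") is also provided for processing it.

Apply the custom store to the crawler

CockroachConfig config = new CockroachConfig()
    .setAppName("I am a cockroach")
    .setThread(2) // number of crawler threads
    .setHttpClient(SelfHttpClient.class)
    .setStore(SelfStore.class);

Custom error handler

When an HTTP request for a page fails, the error is routed to the error handler. If no custom error handler is configured, DefaultTaskErrorHandler is used by default; it logs the error message. Its implementation is as follows.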

public class DefaultTaskErrorHandler implements ITaskErrorHandler {
    private Logger logger = Logger.getLogger(DefaultTaskErrorHandler.class);
    @Override
    public void error(Task task,String message) {
        logger.info("task error: "+message);
    }
}

To write a custom error handler, follow the code above: implement the ITaskErrorHandler interface and put your own handling logic in the error method.
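
As a minimal sketch of such a handler, the block below counts failures per URL in addition to printing them. To compile on its own it declares small stand-ins for `Task` and `ITaskErrorHandler`; the real types come from the cockroach library and may differ. In particular, `Task.getUrl()` and the `failureCount` helper are assumptions made for this illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-ins so the sketch compiles on its own; the real Task and
// ITaskErrorHandler come from the cockroach library and may differ.
class Task {
    private final String url;

    Task(String url) {
        this.url = url;
    }

    // Assumed accessor; the real Task may expose the URL differently.
    String getUrl() {
        return url;
    }
}

interface ITaskErrorHandler {
    void error(Task task, String message);
}

// Hypothetical handler that counts failures per URL instead of only logging them.
class SelfTaskErrorHandler implements ITaskErrorHandler {
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();

    @Override
    public void error(Task task, String message) {
        failures.merge(task.getUrl(), 1, Integer::sum); // thread-safe increment
        System.err.println("task error [" + task.getUrl() + "]: " + message);
    }

    // Number of errors reported so far for the given URL.
    int failureCount(String url) {
        return failures.getOrDefault(url, 0);
    }
}
```

Since the configuration takes a class rather than an instance, the handler presumably needs a no-argument constructor, which this sketch has by default.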

After writing a custom error handler, apply it to the crawler:

CockroachConfig config = new CockroachConfig()
    .setAppName("I am a cockroach")
    .setThread(2) // number of crawler threads
    .setHttpClient(SelfHttpClient.class)
    .setStore(SelfStore.class)
    .setTaskErrorHandler(SelfTaskErrorHandler.class);

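Robust

Robustness shows up mainly in the following areas:

Dealing with IP blocking

Dynamic proxies are used to solve this problem.

Using dynamic proxies

CockroachConfig config = new CockroachConfig()
    .setAppName("I am a cockroach")
    .setThread(2) // number of crawler threads
    .setHttpClient(SelfHttpClient.class)
    .setProxys("100.100.100.100:8888,101.101.101.101:8888");

As shown above, several proxy IPs can be configured. They are collected into a proxy pool, and before each request the crawler randomly picks one IP from the pool to use as the proxy.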
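
The pool idea can be sketched in a few lines of plain Java. This is an illustration only, not the library's code: `ProxyPoolSketch` and its methods are hypothetical names, and the parsing assumes the comma-separated `host:port` format shown in the `setProxys` call.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a proxy pool; cockroach's own pool may be built differently.
class ProxyPoolSketch {
    private final List<InetSocketAddress> proxies = new ArrayList<>();
    private final Random random;

    // Parses a comma-separated "host:port" list such as the one passed to setProxys.
    ProxyPoolSketch(String config, Random random) {
        this.random = random;
        for (String entry : config.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected host:port but got: " + entry);
            }
            // createUnresolved avoids a DNS lookup while parsing.
            proxies.add(InetSocketAddress.createUnresolved(parts[0], Integer.parseInt(parts[1])));
        }
    }

    // Picks one proxy uniformly at random for the next request.
    InetSocketAddress next() {
        return proxies.get(random.nextInt(proxies.size()));
    }

    int size() {
        return proxies.size();
    }

    public static void main(String[] args) {
        ProxyPoolSketch pool =
                new ProxyPoolSketch("100.100.100.100:8888,101.101.101.101:8888", new Random());
        System.out.println(pool.next());
    }
}
```

Passing the `Random` in from outside keeps the selection testable with a fixed seed.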

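Dealing with the user-agent in HTTP requests

The program implements a user-agent pool, and a random user-agent is taken from it for every request. 17 user-agents are currently built in. Exposing this as a configuration option for custom lists is under consideration (would that be useful?).

Exception handling in the program

I am not particularly good at exception handling, but I have tried to keep exceptions within a controllable range. The program defines many custom exceptions. I don't have much authority on this topic, so I won't go into detail; comments and suggestions are welcome.

So-called deep crawling

The program has no ready-made deep crawling implementation, because in most cases I don't find deep crawling very useful. Room has been left for it, though: you can extract the links from a page yourself and push them onto the task queue to achieve the effect of deep crawling.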

public class DemoStore implements IStore {

    private String id = NameUtils.name(DemoStore.class);

    public DemoStore() throws IOException {}

    @Override
    public void store(TaskResponse response) throws IOException {
        List<String> urls = response.select("a").stream().map(element -> element.attr("href")).collect(Collectors.toList());
        try {
            response.getQueue().push(urls);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
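
Raw href values are often relative, and the same link tends to appear on many pages, so it may help to clean up the list before pushing it onto the queue. The helper below is a sketch under that assumption and is not part of cockroach; `LinkFilterSketch` and `newLinks` are hypothetical names. It resolves each href against the page URL with `java.net.URI`, keeps only http(s) links, and drops links it has already returned.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of cockroach: cleans up hrefs before they are queued.
class LinkFilterSketch {
    private final Set<String> seen = new HashSet<>();

    // Resolves each href against the page URL and returns only http(s) links not seen before.
    synchronized List<String> newLinks(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            URI resolved;
            try {
                resolved = base.resolve(href.trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String scheme = resolved.getScheme();
            if (!"http".equals(scheme) && !"https".equals(scheme)) {
                continue; // skip mailto:, javascript:, and similar
            }
            String link = resolved.toString();
            if (seen.add(link)) { // add returns false for links already recorded
                result.add(link);
            }
        }
        return result;
    }
}
```

In a store like the one above, the list returned by `newLinks` would be pushed instead of the raw href list. The method is synchronized because several crawler threads may call the same store.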

Annotation support

I recently found some spare time to add annotation support. So what does a crawler look like with annotations?

@EnableAutoConfiguration
@AppName("hello spider")
@Store(PrintStore.class)
@AutoClose(true)
@ThreadConfig(num = 1)
@CookieConfig("asdfasdfasdfasdfasfasdfa")
@HttpHeaderConfig({
        "key1=value1",
        "key2=value2"
})
@ProxyConfig("1.1.1.1,2.2.2.2")
public class CockroachApplicationTest {
    public static void main(String[] args) throws Exception {
        TaskQueue queue = TaskQueue.of();
        queue.push(new Task("http://blog.zhangyingwei.com"));
        CockroachApplication.run(CockroachApplicationTest.class,queue);
    }
}

The above demonstrates nearly all of the annotations. Setting the demonstration aside, what do you need to write for just a simple demo?

@EnableAutoConfiguration
public class CockroachApplicationTest {
    public static void main(String[] args) throws Exception {
        TaskQueue queue = TaskQueue.of();
        queue.push(new Task("http://blog.zhangyingwei.com"));
        CockroachApplication.run(CockroachApplicationTest.class,queue);
    }
}

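That's right, it is that simple. This crawler fetches the page http://blog.zhangyingwei.com and prints the result. For result handling, the program uses the PrintStore class by default, which prints all results.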

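Dynamic header support

I recently wrote a crawler for job postings and ran into a problem when crawling Lagou. Crawling requires login, which can of course be solved by configuring a cookie, but Lagou's cookie includes anti-crawler validation: it contains a time value that has to change dynamically. That is how this feature came about.

It is used as follows:

Cookie generator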

// On the application class:
@CookieConfig(cookieGenerator = CookieGeneratorTest.class)

// The generator itself:
/**
 * Created by zhangyw on 2017/12/19.
 */
public class CookieGeneratorTest implements StringGenerator {

    @Override
    public String get(Task task) {
        String cookie = "v="+ UUID.randomUUID().toString();
        System.out.println(cookie);
        return cookie;
    }
}

Before every HTTP request, the program calls the generator's get method to obtain the cookie value for that request and attaches it to the HTTP request headers.
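
For the time-dependent cookie that motivated this feature, the value returned from get could be built along the following lines. This is a sketch only: the cookie field names `session` and `ts` are invented for illustration and are not Lagou's actual cookie format, and `TimestampCookieSketch` is a hypothetical helper that a generator's get method might delegate to.

```java
import java.util.function.LongSupplier;

// Hypothetical helper; the cookie field names are invented for illustration.
class TimestampCookieSketch {
    private final String session;
    private final LongSupplier clock;

    // The clock is injected so the output can be checked with a fixed time.
    TimestampCookieSketch(String session, LongSupplier clock) {
        this.session = session;
        this.clock = clock;
    }

    // Builds a fresh cookie string with the current time in seconds.
    String build() {
        long seconds = clock.getAsLong() / 1000;
        return "session=" + session + "; ts=" + seconds;
    }

    public static void main(String[] args) {
        TimestampCookieSketch cookie =
                new TimestampCookieSketch("abc123", System::currentTimeMillis);
        System.out.println(cookie.build());
    }
}
```

Because get is called before each request, every request would carry a timestamp computed at that moment rather than one fixed at startup.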

Header generator

Since the headers the program needs are map data, the header generator looks like this:

// On the application class:
@HttpHeaderConfig(headerGenerator = HeaderGeneratorTest.class)

// The generator itself:
/**
 * Created by zhangyw on 2017/12/19.
 */
public class HeaderGeneratorTest implements MapGenerator {
    private Map headers = new HashMap();
    @Override
    public Map get(Task task) {
        return headers;
    }
}

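Those are all the generators currently available. Note that the task object is passed into the generator, so the crawler can use different cookies/headers for different addresses.

Let me give an example after all: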

/**
 * Created by zhangyw on 2017/12/19.
 */
public class HeaderGeneratorTest implements MapGenerator {
    private Map headers = new HashMap();
    @Override
    public Map get(Task task) {
        if ("jobs.lagou".equals(task.getGroup())) {
            headers.put("key","value");
            return headers;
        } else {
            return null;
        }
    }
}

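OK, that's enough rambling for now.

About distributed crawling, I have something to say

These days every crawler on the internet has to claim to be distributed, which annoys me.

I have read the source code of several so-called distributed crawlers, and what they call distributed does not even qualify as pseudo-distributed. Does using Redis as message middleware make something distributed? That is not distributed at all. I originally planned to use Redis as message middleware as well, just to be able to claim the label, but halfway through I found the idea distasteful and deleted the code, which left the program clean and my conscience clear.

Distributed support is definitely something I will take on, though.

So my distributed design will include:

  • Distributed message middleware (possibly Redis, possibly my own implementation; to keep the program clean, most likely my own)
  • Distributed task scheduling
  • Distributed fault tolerance
  • Distributed transactions
  • Status monitoring

So this project keeps growing? That is a little scary. As for when it will be finished, or whether it can be finished at all, that depends on my mood.

In fact, to this day I still haven't been in the mood to work on the distributed part.

PS

Yesterday afternoon I ran a few dozen threads to crawl Zhihu, and the company's network administrator reported a suspected DoS attack, so I hurriedly moved the crawler to the cloud.

If you have read this far, well done. How about leaving a star? 😺😺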

License

Licensed under the Apache 2.0 license.
