208 Star 906 Fork 298

GVP科学大数据开源社区 / PiFlow

加入 Gitee
与超过 800 万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README_CN.md

πFlow是一个简单易用,功能强大的大数据流水线系统。

目录

特性

  • 简单易用

    • 可视化配置流水线
    • 监控流水线
    • 查看流水线日志
    • 检查点功能
    • 流水线调度
  • 扩展性强:

    • 支持自定义开发数据处理组件
  • 性能优越:

    • 基于分布式计算引擎Spark开发
  • 功能强大:

    • 提供100+的数据处理组件
    • 包括Hadoop 、Spark、MLlib、Hive、Solr、Redis、MemCache、ElasticSearch、JDBC、MongoDB、HTTP、FTP、XML、CSV、JSON等
    • 集成了微生物领域的相关算法

架构

要求

  • JDK 1.8
  • Scala-2.11.8
  • Apache Maven 3.1.0
  • Spark-2.1.0 及以上版本
  • Hadoop-2.6.0

开始

Build πFlow:

  • install external package

        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar
  • mvn clean package -Dmaven.test.skip=true

        [INFO] Replacing original artifact with shaded artifact.
        [INFO] Reactor Summary:
        [INFO]
        [INFO] piflow-project ..................................... SUCCESS [  4.369 s]
        [INFO] piflow-core ........................................ SUCCESS [01:23 min]
        [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
        [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
        [INFO] piflow-server ...................................... SUCCESS [02:05 min]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 06:01 min
        [INFO] Finished at: 2020-05-21T15:22:58+08:00
        [INFO] Final Memory: 118M/691M
        [INFO] ------------------------------------------------------------------------

运行 πflow Server:

  • Intellij上运行PiFlow Server:

    • 下载 piflow: git clone https://github.com/cas-bigdatalab/piflow.git

    • 将PiFlow导入到Intellij

    • 编辑配置文件config.properties

    • Build PiFlow jar包:

      • Run --> Edit Configurations --> Add New Configuration --> Maven
      • Name: package
      • Command line: clean package -Dmaven.test.skip=true -X
      • run 'package' (piflow jar file will be built in ../piflow/piflow-server/target/piflow-server-0.9.jar)
    • 运行 HttpService:

      • Edit Configurations --> Add New Configuration --> Application
      • Name: HttpService
      • Main class : cn.piflow.api.Main
      • Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6(change the path to your spark home)
      • run 'HttpService'
    • 测试 HttpService:

      • 运行样例流水线: ../piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
      • 需要修改API中的server ip 和 port
  • 通过Release版本运行PiFlow:

    • 根据需求下载最新版本PiFlow:
      https://github.com/cas-bigdatalab/piflow/releases/download/v1.0/piflow-server-v1.0.tar.gz

    • 解压piflow-server-v1.0.tar.gz:
      tar -zxvf piflow-server-v1.0.tar.gz

    • 编辑配置文件config.properties

    • 运行、停止、重启PiFlow Server
      start.sh、stop.sh、 restart.sh、 status.sh

    • 测试 PiFlow Server

      • 设置环境变量 PIFLOW_HOME
        • vim /etc/profile
          export PIFLOW_HOME=/yourPiflowPath
          export PATH=$PATH:$PIFLOW_HOME/bin

        • 运行如下命令
          piflow flow start example/mockDataFlow.json
          piflow flow stop appID
          piflow flow info appID
          piflow flow log appID

          piflow flowGroup start example/mockDataGroup.json
          piflow flowGroup stop groupId
          piflow flowGroup info groupId

  • 如何配置config.properties

    #spark and yarn config
    spark.master=yarn
    spark.deploy.mode=cluster
    
    #hdfs default file system
    fs.defaultFS=hdfs://10.0.86.191:9000
    
    #yarn resourcemanager.hostname
    yarn.resourcemanager.hostname=10.0.86.191
    
    #if you want to use hive, set hive metastore uris
    #hive.metastore.uris=thrift://10.0.88.71:9083
    
    #show data in log, set 0 if you do not want to show data in logs
    data.show=10
    
    #server port
    server.port=8002
    
    #h2db port
    h2.port=50002

运行πFlow Web请到如下链接,πFlow Server 与 πFlow Web版本要对应:

接口Restful API:

  • flow json(可查看piflow-bin/example文件夹下的流水线样例)

    flow example
    
          {
            "flow": {
              "name": "MockData",
              "executorMemory": "1g",
              "executorNumber": "1",
              "uuid": "8a80d63f720cdd2301723b7461d92600",
              "paths": [
                {
                  "inport": "",
                  "from": "MockData",
                  "to": "ShowData",
                  "outport": ""
                }
              ],
              "executorCores": "1",
              "driverMemory": "1g",
              "stops": [
                {
                  "name": "MockData",
                  "bundle": "cn.piflow.bundle.common.MockData",
                  "uuid": "8a80d63f720cdd2301723b7461d92604",
                  "properties": {
                    "schema": "title:String, author:String, age:Int",
                    "count": "10"
                  },
                  "customizedProperties": {
                  }
                },
                {
                  "name": "ShowData",
                  "bundle": "cn.piflow.bundle.external.ShowData",
                  "uuid": "8a80d63f720cdd2301723b7461d92602",
                  "properties": {
                    "showNumber": "5"
                  },
                  "customizedProperties": {
                  }
                }
              ]
            }
          }
         
  • CURL方式:

  • 命令行方式:

    • set PIFLOW_HOME
      vim /etc/profile
      export PIFLOW_HOME=/yourPiflowPath/piflow-bin
      export PATH=$PATH:$PIFLOW_HOME/bin

    • command example
      piflow flow start yourFlow.json
      piflow flow stop appID
      piflow flow info appID
      piflow flow log appID

      piflow flowGroup start yourFlowGroup.json
      piflow flowGroup stop groupId
      piflow flowGroup info groupId

Docker镜像

  • 拉取Docker镜像
    docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v1.1

  • 查看Docker镜像的信息
    docker images

  • 通过镜像Id运行一个Container,所有PiFlow服务会自动运行。请注意设置HOST_IP
    docker run -h master -itd --env HOST_IP=*.*.*.* --name piflow-v1.1 -p 6001:6001 -p 6002:6002 [imageID]

  • 访问 "HOST_IP:6001", 启动时间可能有些慢,需要等待几分钟

  • if somethings goes wrong, all the application are in /opt folder

页面展示

  • 登录:

  • 流水线列表:

  • 创建流水线:

  • 配置流水线:

  • 运行流水线:

  • 监控流水线:

  • 流水线日志:

  • 流水线组列表:

  • 配置流水线组:

  • 监控流水线组:

  • 运行态流水线列表:

  • 流水线模板列表:

  • 数据源:

  • 调度:

  • 自定义组件:

联系我们

  • Name:吴老师
  • Mobile Phone:18910263390
  • WeChat:18910263390
  • Email: wzs@cnic.cn
  • QQ Group:1003489545
  • WeChat group is valid for 7 days

简介

混合型科学大数据流水线系统,包含丰富的处理器组件,提供Shell、DSL、Web配置界面、任务调度、任务监控等功能 展开 收起
Scala 等 5 种语言
Apache-2.0
取消

发行版

暂无发行版

PiFlow

贡献者

全部

近期动态

加载更多
不能加载更多了
Scala
1
https://gitee.com/opensci/piflow.git
git@gitee.com:opensci/piflow.git
opensci
piflow
PiFlow
master

搜索帮助