同步操作将从 刘文聪/DataLineageResearch 强制同步,此操作会覆盖自 Fork 仓库以来所做的任何修改,且无法恢复!!!
确定后同步将在后台操作,完成时将刷新页面,请耐心等待。
数据平台建设过程中需要对数据血缘进行解析,通过对血缘数据的探索,可以快速获取数据,加快数据开发的效率。
对报表SQL进行解析获取语法树,从而进一步获取列级的血缘,生成报表页面与数据源的关系。
单张报表的血缘关系可以按照列级字段拆分成相应的血缘信息:
基于血缘信息可以计算不同报表之间血缘的相似程度,从而避免过多报表的重复开发与数据冗余。
可以通过语法树的解析得到列级的血缘。那么怎么得到语法树呢?
可以通过以上两种工具的使用,得到语法树,那么剩下的问题就是如何将语法树解析为血缘信息。
本版本使用一个Java POJO对象 LineageColumn 对报表中的列级血缘进行存储:
name | desc |
---|---|
targetColumnName | 目标字段,即SELECT的列 |
sourceDbName | 字段来源DB |
sourceTableName | 字段来源表 |
sourceColumnName | 字段来源列 |
expression | 表达式 |
isEnd | 是否结束标识 |
通过对语法树的头结点进行递归获取子树,最终获得树的叶子节点即为最终的血缘信息,以SQL为例
select
user_id as uid
,user_name as uname
from
(
select user_id, concat("test",user_name) as user_name
from user
)t
需要经过两层递归,过程为
最终血缘信息为
--- 输出列uid 数据来源于user表的user_id 列
uid from:{"expression":"user_id","isEnd":true,"sourceTableName":"user","targetColumnName":"user_id"}
-- 输出列uname 数据来源于常量字段test及 user表的user_name 列
uname from:{"expression":"concat('test', user_name)","isEnd":true,"sourceTableName":"user","targetColumnName":"user_name"}
uname from:{"expression":"concat('test', user_name)","isEnd":true,"sourceTableName":"user","targetColumnName":"'test'"}
对于报表A与报表B,分别解析两张报表SQL可以获取到血缘信息A与血缘信息B。
根据血缘信息A与血缘信息B的LineageColumn对象可以计算出血缘信息之间的相似度:
血缘信息相似度初步定义如下:
基于已爬取的76张可以被正确解析的SQL脚本, 分别生成脚本对应的血缘数据, 并对血缘数据之间两两计算相似度,可以得到2850个无重复两两组合,计算出相似度分布如下图:
x轴为相似度范围,纵轴为2850个组合中在该相似度范围内的组合数。
-- 报表A
select date_format(imp_date,'%Y-%m-%d') as imp_date,
case extra
when 'icon' then '主动'
when 'push' then 'push推送'
when 'weixin' then '微信'
when 'mobileQQpush' then '手Q'
else 'other'
end as extra,sum(value2) as uv
from t_1
where imp_date>='startDate' and imp_date<='endDate'
and indicator in ('indicator')
and subin=0
and extra in ('icon','push','weixin','mobileQQpush')
group by imp_date,case extra
when 'icon' then '主动'
when 'push' then 'push推送'
when 'weixin' then '微信'
when 'mobileQQpush' then '手Q'
else 'other'
end
order by imp_date
-- 报表B
select imp_date,
case extra
when 'icon' then '主动'
when 'push' then 'push推送'
when 'weixin' then '微信'
when 'mobileQQpush' then '手Q'
end as extra,sum(value1) as pv
from t_1
where imp_date>='startDate' and imp_date<='endDate'
and indicator in ('indicator')
and subin=0
and extra<>'other'
group by imp_date,extra
order by imp_date
相似度构成如下:
字段数量相似度:1.0
字段来源DB相似度:1.0
字段来源表相似度:1.0
字段来源列相似度:0.71
由于表名中默认不含有库名信息,所以默认在同一个库中,即字段来源DB相似度为1,可以看出 报表A和报表B之间的差别在于uv字段的字段来源列不同,而其他的大部分血缘信息包括表名库名等信息都是完全一致的。
-- 报表A
select date_format(imp_date,'%Y-%m-%d') as imp_date,
case extra
when 'icon' then '主动'
when 'push' then 'push推送'
when 'weixin' then '微信'
when 'mobileQQpush' then '手Q'
else 'other'
end as extra,sum(value2) as uv
from t_1
where imp_date>='startDate' and imp_date<='endDate'
and indicator in ('indicator')
and subin=0
and extra in ('icon','push','weixin','mobileQQpush')
group by imp_date,case extra
when 'icon' then '主动'
when 'push' then 'push推送'
when 'weixin' then '微信'
when 'mobileQQpush' then '手Q'
else 'other'
end
order by imp_date
-- 报表B
select imp_date,
case when c1 = 1 and c2 = 1 and c3 = 1 then '顶部下拉刷新'
when c1 = 1 and c2 = 1 and c3 = 2 then '频道tab点击刷新'
when c1 = 1 and c2 = 1 and c3 = 3 then '中部主动刷新p'
when c1 = 1 and c2 = 1 and c3 = 4 then '底部tab栏点击刷新'
when c1 = 1 and c2 = 1 and c3 = 9999 then '顶部主动刷新'
when c1 = 1 and c2 = 2 and c3 = 9999 then '底部主动刷新'
when c1 = 1 and c2 = 9999 and c3 = 9999 then '主动刷新'
when c1 = 2 and c2 = 9999 and c3 = 9999 then '自动刷新'
when c1 = 9999 and c2 = 9999 and c3 = 9999 then '汇总'
end mode
,
sum(uv) uv
from t_2
where imp_date>='startDate' and imp_date<='endDate'
and ssitename='9999'
and ostype = 'ostype' and (ista is null or ista = '-')
group by imp_date,c1,c2,c3
相似度构成如下:
字段数量相似度:1.0
字段来源DB相似度:1.0
字段来源表相似度:0.0
字段来源列相似度:0.1
由于两张报表所使用的数据源表完全不同,因此字段来源表相似度为0,但是两张报表的字段来源列中存在相同的字段命名 imp_date ,因此存在一定的字段来源列相似度。
select imp_date,
case devtype when 1 then 'all' when 2 then 'android' when 3 then 'ipad' when 4 then 'iphone' else devtype end as devtype,
max(case indicator when 105 then value end ) value_1,
max(case indicator when 105 then concat(rate1,'%') end ) rate1_1,
max(case indicator when 105 then concat(rate2,'%') end ) rate2_1,
max(case indicator when 106 then concat(value,'%') end ) value_2,
max(case indicator when 106 then concat(rate1,'%') end ) rate1_2,
max(case indicator when 106 then concat(rate2,'%') end ) rate2_2
from (
select imp_date,indicator,devtype,value,rate1,rate2
from t_1
where imp_date>='startDate' and imp_date<='endDate'
and indicator in ('indicator')
and subin=0
and devtype in (devtype)
) t
group by imp_date,devtype
order by imp_date desc,devtype
select
a.imp_date as imp_date, dau, pv, uv, avg_pv, video_pv, video_uv,avg_video_pv,pic_pv, pic_uv,avg_pic_pv,
play_vv, play_uv,avg_play_vv,play_video_vv, play_video_uv,avg_play_video_vv,play_pic_vv, play_pic_uv,avg_play_pic_vv, video_click_vv, video_click_uv
from(
select distinct
imp_date,
pv, uv, ROUND((pv) / (uv) , 2) AS avg_pv,
video_pv, video_uv, ROUND((video_pv) / (video_uv) , 2) AS avg_video_pv,
pic_pv, pic_uv, ROUND((pic_pv) / (pic_uv) , 2) AS avg_pic_pv,
(play_video_vv+play_pic_vv) as play_vv, play_uv, ROUND((play_video_vv+play_pic_vv) / (play_uv) , 2) AS avg_play_vv,
play_video_vv, play_video_uv, ROUND((play_video_vv) / (play_video_uv) , 2) AS avg_play_video_vv,
play_pic_vv, play_pic_uv, ROUND((play_pic_vv) / (play_pic_uv) , 2) AS avg_play_pic_vv,
video_click_vv, video_click_uv
from
t_2
where
imp_date>='startDate' and imp_date <= 'endDate'
and (aboutista is null or aboutista = 'all') and (site_name = 'all') and (activefrom is null or activefrom='all')
) a
join
(
select
imp_date,value as dau
from
t_3
where indicator='103' and devtype = 1 and imp_date>='startDate' and imp_date <= 'endDate'
)b
on a.imp_date = b.imp_date
order by imp_date desc
相似度构成如下:
字段数量相似度:0.06
字段来源DB相似度:0
字段来源表相似度:0.0
字段来源列相似度:0.1
由于两张报表的字段来源DB和字段来源表均不同,因此相似度在很低的一个取值范围内。
测试报表SQL数据文件路径:src/test/resources/sql_data 测试入口java脚本路径:src/test/java/com/lan/lineage/druid/LineageCalculator.java
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。