
昆明晨晟招标有限责任公司 / scrapy-51job

fix_old_data.py 1.42 KB
Joshua committed on 2021-03-26 16:55 · update
import re

import numpy as np
from elasticsearch import Elasticsearch


class PreprocessExpPipeline(object):
    # Matches experience ranges in years, e.g. "3-4", "5" or "0.5-1".
    exp_range = re.compile(r'((?:\d+\.)?\d+)\-?((?:\d+\.)?\d+)?')

    def process_item(self, item):
        exp = item['exp']
        if exp.strip() == '':
            # No experience information at all.
            item['exp_min'], item['exp_max'] = -1, -1
        elif exp.strip() == '无需经验':  # "no experience required"
            item['exp_min'], item['exp_max'] = 0, 0
        else:
            # Keep only the captured numbers; a single number yields
            # an equal min and max.
            range_ = [i for i in self.exp_range.search(exp).groups()
                      if i is not None]
            range_ = np.array(range_, dtype=float)
            item['exp_min'] = round(range_[0], 1)
            item['exp_max'] = round(range_[-1], 1)
        print(exp, item['exp_min'], item['exp_max'])
        return item


pl = PreprocessExpPipeline()

ES_HOST = '120.78.80.22'
ES_PORT = 10096
ES_USER = 'kabana'  # suck name
ES_PASS = 'iWhDIJuCEl8hBfId'
es = Elasticsearch([f'{ES_USER}:{ES_PASS}@{ES_HOST}:{ES_PORT}/'])

# Open a scroll over the whole 51job index, 20 documents per batch.
page = es.search(index='51job', scroll='2m', size=20)
sid = page['_scroll_id']
# Batch size drives the loop (hits['total'] is a dict on Elasticsearch 7+).
scroll_size = len(page['hits']['hits'])

# Start scrolling: process the current batch, then fetch the next one,
# so the first page returned by search() is not skipped.
while scroll_size > 0:
    for hit in page['hits']['hits']:
        body = pl.process_item(hit['_source'])
        # Re-index the enriched document under its original id.
        es.index(index='51job', body=body, id=hit['_id'])
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID.
    sid = page['_scroll_id']
    # Number of results returned in the last scroll batch.
    scroll_size = len(page['hits']['hits'])
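
As a quick sanity check, the pipeline can be exercised on its own, without touching Elasticsearch; the sample strings below are hypothetical and only illustrate how the regex maps typical 51job experience text to exp_min/exp_max, they are not data from the repository.

# Hypothetical standalone check of PreprocessExpPipeline (sample strings
# are illustrative, not taken from the 51job index).
for sample in ['3-4年经验', '5年以上经验', '无需经验', '   ']:
    pl.process_item({'exp': sample})
# Expected output (printed by process_item):
#   3-4年经验 3.0 4.0
#   5年以上经验 5.0 5.0
#   无需经验 0 0
#       -1 -1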
https://gitee.com/kmcsybw/scrapy-51job.git
git@gitee.com:kmcsybw/scrapy-51job.git