import re

import numpy as np
from elasticsearch import Elasticsearch


class PreprocessExpPipeline(object):
    # Matches "3-5", "1", "0.5-1", etc.; the second bound is optional.
    exp_range = re.compile(r'((?:\d+\.)?\d+)\-?((?:\d+\.)?\d+)?')

    def process_item(self, item):
        exp = item['exp']
        if exp.strip() == '':
            # No experience information given.
            item['exp_min'], item['exp_max'] = -1, -1
        elif exp.strip() == '无需经验':
            # "No experience required".
            item['exp_min'], item['exp_max'] = 0, 0
        else:
            range_ = [i for i in self.exp_range.search(exp).groups()
                      if i is not None]
            range_ = np.array(range_, dtype=float)
            item['exp_min'] = round(range_[0], 1)
            item['exp_max'] = round(range_[-1], 1)
        print(exp, item['exp_min'], item['exp_max'])
        return item
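A minimal sketch of the three parsing branches above. `parse_exp` is a hypothetical standalone mirror of `process_item`'s logic, and the sample strings are assumed listing formats, not taken from the crawled data:

```python
import re

import numpy as np

# Same pattern as PreprocessExpPipeline.exp_range.
exp_range = re.compile(r'((?:\d+\.)?\d+)-?((?:\d+\.)?\d+)?')


def parse_exp(exp):
    """Hypothetical mirror of PreprocessExpPipeline.process_item's branches."""
    if exp.strip() == '':
        return -1, -1                 # no experience information
    if exp.strip() == '无需经验':      # "no experience required"
        return 0, 0
    bounds = [g for g in exp_range.search(exp).groups() if g is not None]
    bounds = np.array(bounds, dtype=float)
    # A single number yields one group, so min and max coincide.
    return round(bounds[0], 1), round(bounds[-1], 1)


print(parse_exp('3-5年经验'))   # a range: min 3.0, max 5.0
print(parse_exp('1年经验'))     # a single bound: min and max both 1.0
print(parse_exp('无需经验'))    # 0, 0
print(parse_exp(''))            # -1, -1
```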
pl = PreprocessExpPipeline()

ES_HOST = '120.78.80.22'
ES_PORT = 10096
ES_USER = 'kabana'  # suck name
ES_PASS = 'iWhDIJuCEl8hBfId'
es = Elasticsearch([f'{ES_USER}:{ES_PASS}@{ES_HOST}:{ES_PORT}/'])

page = es.search(index='51job', scroll='2m', size=20)
sid = page['_scroll_id']
scroll_size = len(page['hits']['hits'])
# Start scrolling: process each batch before fetching the next,
# so the first page of hits is not skipped. (Note: page['hits']['total']
# is a dict in ES >= 7, so it cannot be used as the loop counter.)
while scroll_size > 0:
    for hit in page['hits']['hits']:
        body = pl.process_item(hit['_source'])
        es.index(index='51job', body=body, id=hit['_id'])
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results returned in the last scroll batch
    scroll_size = len(page['hits']['hits'])