
NGP / nytime_down

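This repository does not declare an open-source license file (LICENSE); before use, check the project description and the upstream dependencies of its code.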
generated_start_domain_url.py 1.15 KB
NGP committed on 2022-08-17 16:58 . 1 Data cleaning: switched from html2txt to justext
import pandas as pd
from datetime import datetime


def datelist(beginDate, endDate):
    # beginDate and endDate are strings such as '20160601', or datetime objects
    date_l = [datetime.strftime(x, '%Y-%m-%d') for x in list(pd.date_range(start=beginDate, end=endDate))]
    # print(date_l)
    return date_l


if __name__ == '__main__':
    dateList = datelist('20070101', '20220816')
    filter_list = []
    for i in range(0, len(dateList)):
        # One segment per day (i % 1 is always 0, so every date is kept)
        if (i % 1 == 0):
            filter_list.append(dateList[i].replace('-', ''))
    # Reverse the list so the newest date comes first
    filter_list.reverse()
    target_url_list = []
    # Skip the last element (hence len() - 1): each URL pairs a date with the one after it
    for i in range(0, len(filter_list) - 1):
        origin_url_part_1 = 'https://www.nytimes.com/search?dropmab=true&endDate='
        origin_url_part_2 = '&query=&sort=best&startDate='
        target_url = origin_url_part_1 + filter_list[i] + origin_url_part_2 + filter_list[i + 1]
        target_url_list.append(target_url)
    print(target_url_list)
    # Write one URL per line; the resource/ directory must already exist
    with open('resource/start_domain_url_list.txt', 'w') as f:
        for url in target_url_list:
            f.write(str(url))
            f.write('\n')
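
The URL construction above can also be sketched without pandas, using only the standard library. This is a minimal illustration, not part of the repository: `build_search_urls`, `BASE_URL`, and the `step_days` parameter are hypothetical names introduced here. It assumes the same query parameters, in the same order, as the script above.

```python
from datetime import date, timedelta
from typing import List
from urllib.parse import urlencode

# Hypothetical constant: the search endpoint used by the script above
BASE_URL = "https://www.nytimes.com/search"


def build_search_urls(begin: date, end: date, step_days: int = 1) -> List[str]:
    """Return one search URL per date window, newest window first."""
    if step_days < 1:
        raise ValueError("step_days must be at least 1")

    # Collect the window boundaries from begin to end (inclusive)
    days = []
    current = begin
    while current <= end:
        days.append(current)
        current += timedelta(days=step_days)

    # Newest first, mirroring filter_list.reverse() in the script
    days.reverse()

    urls = []
    # Pair each boundary with the next older one; the last boundary has no partner
    for newer, older in zip(days, days[1:]):
        params = {
            "dropmab": "true",
            "endDate": newer.strftime("%Y%m%d"),
            "query": "",
            "sort": "best",
            "startDate": older.strftime("%Y%m%d"),
        }
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls
```

Calling `build_search_urls(date(2007, 1, 1), date(2022, 8, 16))` should cover the same range as the script. A larger `step_days` widens each window, which is what changing the modulus in `i % 1` would do.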