# Province article scraping
A couple of scripts that scrape article text from several provincial
government websites for a university text analysis course.

We need the following listing pages from each site:
[Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
: pages 14-75
[Ningxia](http://wsjkw.nx.gov.cn/xwzx_279/tzgg/index.html)
: pages 11-42 (actually 8-44?)
[Shaanxi](http://sxwjw.shaanxi.gov.cn/zfxxgk/fdzdgknr/zcwj/xzgfxwj/index.html)
: pages 2-18 (actually 2-20?)
[Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
: pages 10-20

The websites all differ in subtle ways, so each one simply gets its own
folder and scripts. The scripts are simple enough that there is no need
to factor out shared code or do anything more complex. Written in
Python/JS where necessary, for educational purposes.
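
To make the page ranges listed above concrete, here is a minimal sketch of turning one range into a list of listing-page URLs. It assumes that the trailing `_10` in the Xinjiang link is the page number, which the link and its 10-20 range suggest but which has not been checked against the site. The names `XINJIANG_TEMPLATE` and `listing_urls` are hypothetical and are not taken from the scripts in this repository; the other sites may paginate differently.

```python
# Assumption: the page number is the trailing "_N" in the Xinjiang listing
# URL (the link above ends in "_10.shtml" and the range starts at page 10).
XINJIANG_TEMPLATE = (
    "http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_{page}.shtml"
)


def listing_urls(template: str, first: int, last: int) -> list[str]:
    """Return one listing-page URL for each page number in [first, last]."""
    if first > last:
        raise ValueError("first page must not exceed last page")
    return [template.format(page=n) for n in range(first, last + 1)]


if __name__ == "__main__":
    # Pages 10-20, the range listed for Xinjiang above.
    for url in listing_urls(XINJIANG_TEMPLATE, 10, 20):
        print(url)
```

Each per-province script could then fetch these URLs in turn and pull the article links out of each listing page; that part depends on each site's HTML and is left out here.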