# Province article scraping

A couple of scripts for scraping article text from various provincial
government websites for a university text analysis course.

We need the following sites and page ranges:

[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
: no page numbers are defined for this site

[Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
: pages 14-75

[Ningxia](http://wsjkw.nx.gov.cn/xwzx_279/tzgg/index.html)
: pages 11-42 (actually 8-44?)

[Shaanxi](http://sxwjw.shaanxi.gov.cn/zfxxgk/fdzdgknr/zcwj/xzgfxwj/index.html)
: pages 2-18 (actually 2-20?)

[Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
: pages 10-20

Each of the folders contains a zip with the dumped txt files. Each
zip also contains a `links.csv` file, which maps the txt files back
to their original links (in case some data sanitization is
necessary). The *exception* is Guangdong, where the links are in a
`links.txt` file instead: scraping that site was more difficult, and
the page is down now, so I can't fix this.

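
To trace a dumped txt file back to its source page, the `links.csv`
inside a zip can be read without unpacking the archive. The following
is only a minimal sketch: the function name `load_link_map` is
hypothetical, and it *assumes* that each row of `links.csv` holds the
txt file name in the first column and the original URL in the second,
with no header row. Check the actual file and adjust the column
indices if the layout differs. It does not cover the Guangdong
`links.txt` file.

```python
import csv
import io
import zipfile


def load_link_map(zip_path, links_name="links.csv"):
    """Return {txt file name: original URL} from the links file in a dump zip.

    ASSUMPTION: column 0 is the txt file name and column 1 is the URL,
    with no header row. This layout is a guess, not taken from the data.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the base name, since the zip may nest files in a folder.
        matches = [
            name for name in archive.namelist()
            if name.rsplit("/", 1)[-1] == links_name
        ]
        if not matches:
            raise FileNotFoundError(f"{links_name} not found in {zip_path}")

        with archive.open(matches[0]) as raw:
            # ZipFile.open yields bytes, so wrap it to get text for csv.reader.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Skip blank or incomplete rows instead of failing on them.
            return {row[0]: row[1] for row in csv.reader(text) if len(row) >= 2}
```

For example, `load_link_map("Qinghai/dump.zip")` (a made-up path)
would return a dictionary that can be used to look up the URL for any
txt file that needs to be re-checked or re-scraped.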