# Province article scraping
A couple of scripts to scrape article text from various provinces for
a text analysis university course.

We need:

- [Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html): pages 14-75
- [Ningxia](http://wsjkw.nx.gov.cn/xwzx_279/tzgg/index.html): pages 11-42
- [Shaanxi](http://sxwjw.shaanxi.gov.cn/zfxxgk/fdzdgknr/zcwj/xzgfxwj/index.html): pages 2-18
- [Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml): pages 10-20

The websites all have subtle differences, so there's simply a folder +
scripts for each (the scripts are simple enough that there's no need
for deduplication or anything complex). Written in Python/JS where
necessary, for educational purposes.
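The common shape of each per-site script is a loop over a range of index
pages. As a minimal sketch, assuming the usual government-portal pagination
pattern (`index.html` for page 1, `index_N.html` for later pages — the real
pattern differs per province, and `index_urls` is a hypothetical helper, not
part of the scripts):

```python
# Build index-page URLs for an inclusive page range.
# ASSUMPTION: pagination follows index.html, index_2.html, index_3.html, ...
# Each per-site script would hard-code its own actual pattern.

from urllib.parse import urljoin


def index_urls(base_url, first_page, last_page):
    """Return index-page URLs for pages first_page..last_page inclusive.

    Page 1 is assumed to be the bare index.html; page N (N > 1) is
    assumed to be index_N.html, resolved relative to base_url.
    """
    urls = []
    for page in range(first_page, last_page + 1):
        name = "index.html" if page == 1 else f"index_{page}.html"
        # urljoin replaces the final path segment of base_url
        urls.append(urljoin(base_url, name))
    return urls


# e.g. the Qinghai range from the list above (pages 14-75)
qinghai = index_urls(
    "https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html", 14, 75
)
```

Each script would then fetch these index pages, pull the article links out
of the listing, and download the article text — that part is where the
per-site differences live.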