# Province article scraping

A couple of scripts to scrape article text from various provinces for a university text analysis course.

We need:

- [Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531): no page numbers defined
- [Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html): pages 14-75
- [Ningxia](http://wsjkw.nx.gov.cn/xwzx_279/tzgg/index.html): pages 11-42 (actually 8-44?)
- [Shanxi](http://sxwjw.shaanxi.gov.cn/zfxxgk/fdzdgknr/zcwj/xzgfxwj/index.html): pages 2-18 (actually 2-20?)
- [Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml): pages 10-20

Each of the folders contains a zip with the dumped txt files. The zip also contains a `links.csv` file, which maps each txt file back to its original URL (in case some data sanitization is necessary).

*Except* for Guangdong, where the links are in a `links.txt` file instead, because scraping that site was more difficult and the page is down now, so I can't fix this.
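
To trace a dumped txt file back to its source page, something like the following might work. It is only a sketch: it assumes `links.csv` sits at the top level of the zip, has no header row, is UTF-8 encoded, and stores the txt filename in the first column and the original URL in the second. None of that is confirmed here, so adjust the column indices to the real layout. The function names `load_link_map` and `read_article` are made up for this example.

```python
import csv
import io
import zipfile


def load_link_map(zip_path, links_name="links.csv"):
    """Return {txt filename: original URL} from the links file inside a dump zip."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(links_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Assumed layout: column 0 = txt filename, column 1 = source URL.
            return {row[0]: row[1] for row in csv.reader(text) if len(row) >= 2}


def read_article(zip_path, txt_name):
    """Return the text of one dumped article, assuming UTF-8 encoding."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(txt_name).decode("utf-8")
```

With a hypothetical archive path such as `Qinghai/dump.zip`, `load_link_map("Qinghai/dump.zip")` would give a dictionary to look up the URL for any txt file that needs re-checking. The Guangdong `links.txt` has a different format, so it would need its own parser.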