From 27bb79cae7b59907553161424ec8edbf41298499 Mon Sep 17 00:00:00 2001
From: Tristan Daniël Maat
Date: Sat, 9 Apr 2022 23:54:19 +0100
Subject: [PATCH] Update readme

---
 README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index a94e6fe..92ef49c 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,9 @@ a text analysis university course.
 
 We need:
 
+[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
+: No page numbers defined for this
+
 [Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
 : page 14-75
 
@@ -17,7 +20,9 @@ We need:
 [Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
 : page 10-20
 
-The websites all have subtle differences, so there's simply a folder +
-scripts for each (the scripts are simple enough that there's no need
-for deduplication or anything complex). Written in python/js where
-necessary for educational purposes.
+Each of the folders contains a zip with the dumped txt files. In the
+zip, there is also a `links.csv` file, which maps the txt files back
+to their original URLs (in case some data sanitization is
+necessary). *Except* for Guangdong, where the links are in a
+`links.txt` file, because scraping those was more difficult and the
+page is down now, so I can't fix this.
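
A minimal sketch of reading one province dump as laid out in the new
README text above. Everything not stated in the patch is an assumption
here: the `load_dump` name and `qinghai.zip` filename are hypothetical,
`links.csv` is assumed to have headerless (filename, URL) rows, and the
txt files are assumed to be UTF-8.

```python
import csv
import io
import zipfile


def load_dump(zip_path):
    """Yield (txt filename, original URL, text) for one province dump."""
    with zipfile.ZipFile(zip_path) as zf:
        # links.csv maps each dumped txt file back to its source link;
        # the headerless (filename, URL) column order is an assumption.
        with zf.open("links.csv") as f:
            reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
            links = {row[0]: row[1] for row in reader}
        for name in zf.namelist():
            if name.endswith(".txt"):
                # UTF-8 is assumed; some government dumps may be GBK instead.
                yield name, links.get(name), zf.read(name).decode("utf-8")


# Hypothetical usage:
for name, url, text in load_dump("qinghai.zip"):
    print(name, url, len(text))
```

Guangdong would need separate handling, since its links live in a
`links.txt` file whose format isn't specified here; the sketch covers
only the `links.csv` layout.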