Update readme

parent f9aab0628e
commit 27bb79cae7

README.md (13 changed lines)
@@ -5,6 +5,9 @@ a text analysis university course.
 
 We need:
 
+[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
+: No page numbers defined for this
+
 [Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
 : page 14-75
 
@@ -17,7 +20,9 @@ We need:
 [Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
 : page 10-20
 
-The websites all have subtle differences, so there's simply a folder +
-scripts for each (the scripts are simple enough that there's no need
-for deduplication or anything complex). Written in python/js where
-necessary for educational purposes.
+Each of the folders contains a zip with the dumped txt files. In the
+zip, there is also a `links.csv` file, which links the txt files back
+up to their original links (in case some data sanitization is
+necessary). *Except* for Guangdong, where the links are in a
+`links.txt` file, because scraping those was more difficult and the
+page is down now so I can't fix this.
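The added paragraph describes each province folder as a zip of dumped txt files plus a `links.csv` that maps those files back to their original URLs. A minimal sketch of reading that mapping, in Python since the README mentions python/js scripts; the column order (filename, URL) and encoding are assumptions, not confirmed by the repo:

```python
import csv
import io
import zipfile


def load_link_map(zip_source, csv_name="links.csv"):
    """Map each dumped txt filename to its original URL.

    Assumes links.csv has at least two columns in the order
    (filename, url) and is UTF-8 encoded; the actual layout
    in the repo may differ.
    """
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(csv_name) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            return {row[0]: row[1] for row in reader if len(row) >= 2}
```

With a mapping like this, any sanitization issue found in a txt file can be traced back to its source page (Guangdong excepted, since its links live in a `links.txt` instead).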