From 27bb79cae7b59907553161424ec8edbf41298499 Mon Sep 17 00:00:00 2001
From: Tristan Daniël Maat
Date: Sat, 9 Apr 2022 23:54:19 +0100
Subject: [PATCH] Update readme

---
 README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index a94e6fe..92ef49c 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,9 @@ a text analysis university course.
 
 We need:
 
+[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
+: No page numbers defined for this
+
 [Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
 : page 14-75
 
@@ -17,7 +20,9 @@ We need:
 [Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
 : page 10-20
 
-The websites all have subtle differences, so there's simply a folder +
-scripts for each (the scripts are simple enough that there's no need
-for deduplication or anything complex). Written in python/js where
-necessary for educational purposes.
+Each of the folders contains a zip with the dumped txt files. In the
+zip, there is also a `links.csv` file, which maps the txt files back
+to their original URLs (in case some data sanitization is
+necessary). *Except* for Guangdong, where the links are in a
+`links.txt` file, because scraping those was more difficult and the
+page is down now, so I can't fix this.
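
A minimal sketch of reading one province dump as laid out in the new
README text above. Everything not stated in the patch is an assumption
here: the `load_dump` name and `qinghai.zip` filename are hypothetical,
`links.csv` is assumed to have headerless (filename, URL) rows, and the
txt files are assumed to be UTF-8.

```python
import csv
import io
import zipfile


def load_dump(zip_path):
    """Yield (txt filename, original URL, text) for one province dump."""
    with zipfile.ZipFile(zip_path) as zf:
        # links.csv maps each dumped txt file back to its source link;
        # the headerless (filename, URL) column order is an assumption.
        with zf.open("links.csv") as f:
            reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
            links = {row[0]: row[1] for row in reader}
        for name in zf.namelist():
            if name.endswith(".txt"):
                # UTF-8 is assumed; some government dumps may be GBK instead.
                yield name, links.get(name), zf.read(name).decode("utf-8")


# Hypothetical usage:
for name, url, text in load_dump("qinghai.zip"):
    print(name, url, len(text))
```

Guangdong would need separate handling, since its links live in a
`links.txt` file whose format isn't specified here; the sketch covers
only the `links.csv` layout.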