Update readme

main
Tristan Daniël Maat 2022-04-09 23:54:19 +01:00
parent f9aab0628e
commit 27bb79cae7
Signed by: tlater
GPG Key ID: 49670FD774E43268
1 changed files with 9 additions and 4 deletions

@@ -5,6 +5,9 @@ a text analysis university course.
We need:
[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
: No page numbers defined for this
[Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
: page 14-75
@@ -17,7 +20,9 @@ We need:
[Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
: page 10-20
The websites all have subtle differences, so each one simply gets its
own folder of scripts (the scripts are simple enough that there's no
need for deduplication or anything more complex). Written in Python/JS
where necessary, for educational purposes.
Each of the folders contains a zip with the dumped txt files. The zip
also contains a `links.csv` file, which maps the txt files back to
their original URLs (in case some data sanitization is necessary).
The *exception* is Guangdong, where the links are in a `links.txt`
file instead, because scraping those was more difficult and the page
is down now, so I can't fix this.
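
As a rough sketch of how the `links.csv` mapping might be used during
sanitization, the snippet below reads it straight out of a zip. The
column layout is not documented here, so this *assumes* header-less
rows of `filename,url` and that `links.csv` sits at the top level of
the archive; `load_link_map` is a hypothetical helper, not one of the
scripts in this repository.

```python
import csv
import io
import zipfile


def load_link_map(zip_path, csv_name="links.csv"):
    """Return {txt filename: original URL} read from a dump zip.

    Assumption: each row is `filename,url` with no header row.
    Adjust the column indices if the real files differ.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Open the CSV member directly, without extracting the archive.
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            # Skip blank or malformed rows that have fewer than two columns.
            return {row[0]: row[1] for row in reader if len(row) >= 2}
```

Calling `load_link_map("some_dump.zip")` (the path is a placeholder)
would then let you look up the source page for any txt file by name.
This will not work for Guangdong as-is, since its links live in a
`links.txt` file whose format may differ.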