Update readme
parent f9aab0628e
commit 27bb79cae7
13
README.md
@@ -5,6 +5,9 @@ a text analysis university course.
We need:

[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
: No page numbers defined for this

[Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
: page 14-75

@@ -17,7 +20,9 @@ We need:
[Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
: page 10-20

The websites all have subtle differences, so there's simply a folder +
scripts for each (the scripts are simple enough that there's no need
for deduplication or anything complex). Written in python/js where
necessary for educational purposes.
Each of the folders contains a zip with the dumped txt files. In the
zip, there is also a `links.csv` file, which links the txt files back
up to their original links (in case some data sanitization is
necessary). *Except* for Guangdong, where the links are in a
`links.txt` file, because scraping those was more difficult and the
page is down now so I can't fix this.
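
The README text in this diff describes a zip per folder that holds the dumped txt files plus a `links.csv` mapping them back to their source pages. A minimal sketch of how that mapping might be read is below. The README does not state the column layout, so this assumes each row is `<txt file name>,<original URL>` with no header row; the function name `load_link_index` and the archive path in the usage note are hypothetical. Guangdong's `links.txt` is not handled, since its format is not described.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def load_link_index(zip_path, index_name="links.csv"):
    """Return a dict mapping txt file names to their original URLs.

    Assumption (not confirmed by the README): every row of links.csv is
    `<txt file name>,<original URL>` and there is no header row.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # The index may sit inside a subfolder of the zip, so match on base name.
        member = next(
            (name for name in archive.namelist()
             if PurePosixPath(name).name == index_name),
            None,
        )
        if member is None:
            raise FileNotFoundError(f"{index_name} not found in {zip_path}")
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Skip blank or malformed rows that have fewer than two columns.
            return {row[0]: row[1] for row in csv.reader(text) if len(row) >= 2}
```

Calling something like `load_link_index("qinghai/dump.zip")` (path invented for illustration) would give a dictionary that can be used to look up the source page of any txt file that needs re-checking during data sanitization.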