Update readme

parent f9aab0628e
commit 27bb79cae7

README.md (13 changed lines)
@@ -5,6 +5,9 @@ a text analysis university course.
 
 We need:
 
+[Guangdong](http://wsjkw.gd.gov.cn/gkmlpt/mindex#2531)
+: No page numbers defined for this
+
 [Qinghai](https://wsjkw.qinghai.gov.cn/zwgk/xxgkml/index.html)
 : page 14-75
 
@@ -17,7 +20,9 @@ We need:
 [Xinjiang](http://wjw.xinjiang.gov.cn/hfpc/zcwj4/zfxxgk_gknrz_10.shtml)
 : page 10-20
 
-The websites all have subtle differences, so there's simply a folder +
-scripts for each (the scripts are simple enough that there's no need
-for deduplication or anything complex). Written in python/js where
-necessary for educational purposes.
+Each of the folders contains a zip with the dumped txt files. In the
+zip, there is also a `links.csv` file, which links the txt files back
+up to their original links (in case some data sanitization is
+necessary). *Except* for Guangdong, where the links are in a
+`links.txt` file, because scraping those was more difficult and the
+page is down now so I can't fix this.
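The added paragraph describes each province folder as a zip of dumped txt files plus a `links.csv` that maps those files back to their original URLs. A minimal sketch of reading that mapping, in Python since the README mentions python/js scripts; the column order (filename, URL) and encoding are assumptions, not confirmed by the repo:

```python
import csv
import io
import zipfile


def load_link_map(zip_source, csv_name="links.csv"):
    """Map each dumped txt filename to its original URL.

    Assumes links.csv has at least two columns in the order
    (filename, url) and is UTF-8 encoded; the actual layout
    in the repo may differ.
    """
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(csv_name) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            return {row[0]: row[1] for row in reader if len(row) >= 2}
```

With a mapping like this, any sanitization issue found in a txt file can be traced back to its source page (Guangdong excepted, since its links live in a `links.txt` instead).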