| Name | Last updated |
|---|---|
| guangdong | 12 months ago |
| ningxia | 12 months ago |
| qinghai | 12 months ago |
| shanxi | 12 months ago |
| utils | 12 months ago |
| .gitignore | 12 months ago |
| README.md | 12 months ago |
| flake.lock | 12 months ago |
| flake.nix | 12 months ago |
README.md
Province article scraping
A couple of scripts to scrape article text from various provinces for a text analysis university course.
We need:
- Guangdong
  - No page numbers defined for this
- Qinghai
  - pages 14-75
- Ningxia
  - pages 11-42 (actually 8-44?)
- Shanxi
  - pages 2-18 (actually 2-20?)
- Xinjiang
  - pages 10-20
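
The page ranges listed here could be kept in one place as a small lookup table. The following is only a sketch: `PAGE_RANGES` and `pages_for` are hypothetical names, not taken from the scripts in this repo, and the ranges use the first value given for each province (the alternatives in parentheses are unconfirmed and left out).

```python
# Inclusive page ranges per province, copied from the list in this README.
# Guangdong has no page numbers defined, so it maps to None.
PAGE_RANGES = {
    "guangdong": None,
    "qinghai": (14, 75),
    "ningxia": (11, 42),
    "shanxi": (2, 18),
    "xinjiang": (10, 20),
}


def pages_for(province):
    """Return the list of page numbers to scrape for a province.

    Returns an empty list when no page range is defined.
    """
    page_range = PAGE_RANGES[province]
    if page_range is None:
        return []
    first, last = page_range
    # range() excludes its end value, so add 1 to include the last page.
    return list(range(first, last + 1))
```

For example, `pages_for("xinjiang")` yields the pages 10 through 20, and `pages_for("guangdong")` yields an empty list.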
Each of the folders contains a zip with the dumped txt files. The zip
also contains a `links.csv` file, which maps the txt files back to
their original links (in case some data sanitization is necessary).
The exception is Guangdong, where the links are in a `links.txt` file
instead, because scraping those was more difficult and the page is
down now, so I can't fix this.
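
To look up the original link for a dumped txt file, the `links.csv` inside a zip can be read without unpacking the archive. This is a minimal sketch and makes assumptions that this README does not confirm: that each row holds the txt filename first and the link second, that there is no header row, and that the file is UTF-8. The function name `load_link_map` is hypothetical. It does not cover the Guangdong `links.txt` file, whose format is not described here.

```python
import csv
import io
import zipfile


def load_link_map(zip_path, csv_name="links.csv"):
    """Return a dict mapping txt filename -> original link.

    Assumed row layout (not confirmed by the README): filename, link.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Find links.csv wherever it sits inside the archive.
        member = next(n for n in archive.namelist() if n.endswith(csv_name))
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            # Skip blank or malformed rows with fewer than two columns.
            return {row[0]: row[1] for row in reader if len(row) >= 2}
```

If the real file has a header row or a different column order, the dict comprehension is the only part that needs adjusting.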