Province article scraping
A couple of scripts to scrape article text from various provinces for a text analysis university course.
We need:
- Guangdong
  - No page numbers defined for this
- Qinghai
  - pages 14-75
- Ningxia
  - pages 11-42 (actually 8-44?)
- Shanxi
  - pages 2-18 (actually 2-20?)
- Xinjiang
  - pages 10-20
Each of the folders contains a zip with the dumped txt files. Each zip
also contains a links.csv file, which maps the txt files back to their
original links (in case some data sanitization is necessary). The
exception is Guangdong, where the links are in a links.txt file
instead, because scraping those was more difficult and the page is
down now, so I can't fix this.
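
For anyone picking up the dumps, here is a minimal sketch of reading one zip back into memory. It is not part of the repo's scripts. The function name `load_dump` is made up for illustration, and the code assumes the txt files are UTF-8 and that links.txt holds one link per line. The column layout of links.csv is not documented here, so the rows are returned unparsed.

```python
import csv
import io
import zipfile


def load_dump(zip_path):
    """Return (texts, link_rows) for one province zip.

    texts maps each article .txt member name to its decoded contents.
    link_rows holds the raw rows of links.csv, or one single-item row
    per non-empty line of links.txt (the Guangdong case).
    """
    texts = {}
    link_rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            base = name.rsplit("/", 1)[-1]
            # Assumption: UTF-8 text; undecodable bytes are replaced.
            content = archive.read(name).decode("utf-8", errors="replace")
            if base == "links.csv":
                # Column layout is not assumed; rows are kept as-is.
                link_rows = list(csv.reader(io.StringIO(content, newline="")))
            elif base == "links.txt":
                # Assumption: one link per line.
                link_rows = [[line] for line in content.splitlines() if line.strip()]
            elif base.endswith(".txt"):
                texts[name] = content
    return texts, link_rows
```

Calling `load_dump` on one of the zips (the exact file names inside each folder are not listed here) gives a dict of article texts plus the link rows, which can then be matched up by hand once the links.csv columns are known.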