Province article scraping
A couple of scripts to scrape article text from various provinces for a text analysis university course.
We need:
- Guangdong
  - No page numbers defined for this
- Qinghai
  - page 14-75
- Ningxia
  - page 11-42 (actually 8-44?)
- Shanxi
  - page 2-18 (actually 2-20?)
- Xinjiang
  - page 10-20
Each of the folders contains a zip with the dumped txt files. Each
zip also includes a links.csv file, which maps the txt files back
to their original links (in case some data sanitization is
necessary). The exception is Guangdong, where the links are in a
links.txt file instead: scraping that site was more difficult, and
the page is down now, so I can't fix this.
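
To trace a dumped txt file back to its source, the links.csv inside a zip can be read without unpacking the archive. The snippet below is only a rough sketch: the function name `load_link_map` is made up for illustration, and it assumes each row of links.csv is `filename,url` with no header row, which is a guess at the layout and may need adjusting. It does not handle the Guangdong links.txt file.

```python
import csv
import io
import zipfile


def load_link_map(zip_path, csv_name="links.csv"):
    """Return {txt_filename: original_url} read from the links file in a dump zip.

    Assumption (not confirmed by this repo): every row is `filename,url`
    and there is no header row.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the basename so the file is also found inside a subfolder.
        member = next(
            (name for name in archive.namelist()
             if name.rsplit("/", 1)[-1] == csv_name),
            None,
        )
        if member is None:
            raise FileNotFoundError(f"{csv_name} not found in {zip_path}")
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Skip blank or malformed rows that have fewer than two columns.
            return {row[0]: row[1] for row in csv.reader(text) if len(row) >= 2}
```

It would be called with the path to one of the zips, for example `load_link_map("qinghai/dump.zip")`, where the zip file name is a placeholder for whatever the archive in that folder is actually called.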