Province article scraping
A couple of scripts to scrape article text from the websites of various Chinese provinces for a university text analysis course.
We need:
- Qinghai
  - pages 14-75
- Ningxia
  - pages 11-42 (actually 8-44?)
- Shanxi
  - pages 2-18 (actually 2-20?)
- Xinjiang
  - pages 10-20
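The page ranges above could be kept in one small table so each script knows which listing pages to walk. This is only a sketch: `PAGE_RANGES`, `pages_for`, `page_urls`, and the `{page}` URL template are hypothetical names for illustration, not something taken from the scripts in this repo, and the uncertain alternative ranges (e.g. 8-44 for Ningxia) are left out.

```python
# Hypothetical page ranges copied from the list above (inclusive on both ends).
# The alternatives marked with "?" in the list are deliberately not encoded.
PAGE_RANGES = {
    "qinghai": (14, 75),
    "ningxia": (11, 42),
    "shanxi": (2, 18),
    "xinjiang": (10, 20),
}


def pages_for(province):
    """Return the list of page numbers to scrape for a province."""
    first, last = PAGE_RANGES[province]
    return list(range(first, last + 1))


def page_urls(province, url_template):
    """Fill an assumed URL template containing a {page} placeholder, once per page."""
    return [url_template.format(page=page) for page in pages_for(province)]
```

Each site's real listing URL would have to be substituted for the template, since the actual URL patterns are not given here.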
The websites all differ in subtle ways, so each province simply gets its own folder and scripts (the scripts are simple enough that there's no need for deduplication or anything complex). Written in Python/JS where necessary, for educational purposes.
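To give a rough idea of what one of these per-site scripts might do, here is a minimal sketch of pulling paragraph text out of an article page using only the standard library. It assumes the article body sits in `<p>` elements, which will not hold for every site; `ParagraphText` and `extract_article_text` are hypothetical names, not the ones used in this repo.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (an assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # how many <p> elements we are currently inside
        self._current = []     # text fragments of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Paragraph closed: keep it unless it was only whitespace.
                text = "".join(self._current).strip()
                if text:
                    self.paragraphs.append(text)
                self._current = []

    def handle_data(self, data):
        # Text outside <p> (titles, navigation, etc.) is ignored.
        if self._depth > 0:
            self._current.append(data)


def extract_article_text(html):
    """Return the page's paragraph text, one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```

Because every site structures its pages a little differently, a selector like this would likely need adjusting per province, which is in line with keeping a separate folder for each.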