Province article scraping

A couple of scripts to scrape article text from the websites of various provinces, for a university text analysis course.

We need:

Qinghai: pages 14-75
Ningxia: pages 11-42 (actually 8-44?)
Shanxi: pages 2-18 (actually 2-20?)
Xinjiang: pages 10-20
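
As a rough sketch of how a per-province script might walk the page ranges listed above, the snippet below stores them as inclusive `(first, last)` pairs and expands them into page numbers. The names `PAGE_RANGES`, `pages_for`, and `page_urls`, as well as the URL template, are hypothetical and not taken from the actual scripts; the ranges marked "actually ...?" are left at their first-listed values.

```python
# Inclusive page ranges copied from the list above.
# The alternatives marked "actually ...?" are NOT resolved here.
PAGE_RANGES = {
    "Qinghai": (14, 75),
    "Ningxia": (11, 42),
    "Shanxi": (2, 18),
    "Xinjiang": (10, 20),
}


def pages_for(province):
    """Return the page numbers to scrape for a province (both ends inclusive)."""
    first, last = PAGE_RANGES[province]
    return list(range(first, last + 1))


def page_urls(province, url_template):
    """Build one URL per page from a hypothetical template with a {page} slot."""
    return [url_template.format(page=p) for p in pages_for(province)]


if __name__ == "__main__":
    for name in PAGE_RANGES:
        pages = pages_for(name)
        print(f"{name}: {len(pages)} pages ({pages[0]}-{pages[-1]})")
```

Each site paginates differently, so the template passed to `page_urls` would have to be adapted per province; the example only illustrates the inclusive range handling.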

The websites all have subtle differences, so each province simply gets its own folder and scripts (the scripts are simple enough that there's no need to deduplicate code or do anything more complex). Written in Python/JS where necessary for educational purposes.