# Province article scraping

A couple of scripts to scrape article text from various provinces for
a university text analysis course.

We need:

Qinghai
: pages 14-75

Ningxia
: pages 11-42

Shanxi
: pages 2-18

Xinjiang
: pages 10-20

The websites all have subtle differences, so there's simply a folder with
scripts for each (the scripts are simple enough that there's no need
for deduplication or anything complex). Written in Python/JS where
necessary for educational purposes.
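As a rough illustration only, the page ranges listed above could be kept as data and expanded into the page numbers each per-province script has to visit. This is a minimal sketch, not the scripts in this repository: the names `PAGE_RANGES`, `page_numbers`, and `page_urls` are hypothetical, the ranges are assumed to be inclusive on both ends, and the URL template is a placeholder rather than any real site's layout.

```python
# Page ranges copied from the list above; assumed inclusive on both ends.
PAGE_RANGES = {
    "Qinghai": (14, 75),
    "Ningxia": (11, 42),
    "Shanxi": (2, 18),
    "Xinjiang": (10, 20),
}


def page_numbers(province):
    """Return every listing page number to visit for a province."""
    first, last = PAGE_RANGES[province]
    return list(range(first, last + 1))


def page_urls(province, template):
    """Fill a URL template (hypothetical format) with each page number."""
    return [template.format(page=n) for n in page_numbers(province)]


if __name__ == "__main__":
    # example.com and the "page" query parameter are placeholders.
    for url in page_urls("Shanxi", "https://example.com/articles?page={page}")[:3]:
        print(url)
```

Because each website differs, a real script would likely swap in its own template (or its own way of reaching a page) while reusing the same range data.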