Tristan Daniël Maat f7cf03d442
Add dumped qinghai articles
2022-04-09 19:57:08 +01:00
guangdong Add typescript-language-server 2022-04-09 17:43:37 +01:00
qinghai Add dumped qinghai articles 2022-04-09 19:57:08 +01:00
Readme.md Add page URLs to Readme 2022-04-09 17:45:35 +01:00
flake.lock Initial commit 2022-04-09 14:44:18 +01:00
flake.nix Add zip and unzip 2022-04-09 19:31:46 +01:00

Readme.md

Province article scraping

A couple of scripts to scrape article text from various provinces for a text analysis university course.

We need:

- Qinghai: pages 14-75
- Ningxia: pages 11-42
- Shanxi: pages 2-18
- Xinjiang: pages 10-20
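
As a rough sketch, the page ranges above could be kept in one small Python table that each province's script reads from. The names `PAGE_RANGES` and `pages_for` are hypothetical and not taken from the repository, and the ranges are assumed to be inclusive at both ends.

```python
# Page ranges from the list above, as (first_page, last_page).
# Assumption: both ends are inclusive.
PAGE_RANGES = {
    "Qinghai": (14, 75),
    "Ningxia": (11, 42),
    "Shanxi": (2, 18),
    "Xinjiang": (10, 20),
}


def pages_for(province: str) -> range:
    """Return the page numbers to scrape for one province (hypothetical helper)."""
    first, last = PAGE_RANGES[province]
    # range() excludes its stop value, so add 1 to include the last page.
    return range(first, last + 1)


if __name__ == "__main__":
    for province in PAGE_RANGES:
        pages = pages_for(province)
        print(f"{province}: {len(pages)} pages ({pages[0]} to {pages[-1]})")
```

Keeping the ranges as data makes it easy to check how many listing pages each province needs before any requests are sent.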

The websites all differ in subtle ways, so each province simply gets its own folder with its own scripts (the scripts are simple enough that there's no need for deduplication or anything more complex). Written in Python or JavaScript where necessary, for educational purposes.
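
To give an idea of how small such a per-province script can be, here is a minimal sketch of the text-extraction step using only the Python standard library. It assumes the article body sits in `<p>` elements, which may not hold for any given province site, and the names `ParagraphText` and `extract_article_text` are hypothetical, not the repository's actual code.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (assumed article markup)."""

    def __init__(self):
        super().__init__()
        self._in_p = False    # True while inside a <p>
        self._chunks = []     # text pieces of the current paragraph
        self.paragraphs = []  # finished, non-empty paragraphs

    def _flush(self):
        text = "".join(self._chunks).strip()
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_article_text(html: str) -> str:
    """Return the paragraph text of one article page, one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep text from a trailing, unclosed <p>
    return "\n".join(parser.paragraphs)
```

A site that wraps its articles differently would need its own variant of the parser, which is the reason for keeping one folder per province.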