guangdong Add guangdong readme 2022-04-09 23:34:56 +01:00
ningxia Add ningxia file dump 2022-04-09 22:37:27 +01:00
qinghai Add dumped qinghai articles 2022-04-09 19:57:08 +01:00
shanxi shanxi: Add readme 2022-04-09 23:48:31 +01:00
utils Implement scrape utils 2022-04-09 23:06:46 +01:00
.gitignore Ignore article directories 2022-04-09 22:35:15 +01:00
flake.lock Initial commit 2022-04-09 14:44:18 +01:00
flake.nix Add linkutils 2022-04-09 22:33:43 +01:00
README.md Update readme 2022-04-09 23:54:19 +01:00

Province article scraping

A couple of scripts to scrape article text from various provinces for a text analysis university course.

We need:

- Guangdong: no page numbers defined
- Qinghai: pages 14-75
- Ningxia: pages 11-42 (actually 8-44?)
- Shanxi: pages 2-18 (actually 2-20?)
- Xinjiang: pages 10-20

Each province folder contains a zip with the dumped txt files. The zip also includes a links.csv file that maps each txt file back to its original link, in case some data sanitization is necessary. Guangdong is the exception: its links are in a links.txt file, because scraping that site was more difficult and the page is now down, so this can't be fixed.
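
If you need to trace a txt file back to its source, something like the following Python sketch may help. It is not part of the repo's utils. The function name `load_link_map` is made up for illustration, and the column layout of links.csv (file name first, URL second, no header row) is an assumption that should be checked against an actual dump before relying on it.

```python
import csv
import io
import zipfile


def load_link_map(zip_path, csv_name="links.csv"):
    """Return {txt_file_name: original_link} read from the links file in a dump zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # Find the links file wherever it sits inside the archive.
        member = next(
            name for name in archive.namelist()
            if name.rsplit("/", 1)[-1] == csv_name
        )
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            # ASSUMPTION: column 0 is the txt file name, column 1 is the link.
            return {row[0]: row[1] for row in reader if len(row) >= 2}
```

Calling `load_link_map` on one of the province zips should then give a dictionary keyed by txt file name. This does not cover Guangdong, since the layout of its links.txt is not described here.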