Tristan Daniël Maat 27bb79cae7 Update readme		2022-04-09 23:54:19 +01:00
guangdong	Add guangdong readme	2022-04-09 23:34:56 +01:00
ningxia	Add ningxia file dump	2022-04-09 22:37:27 +01:00
qinghai	Add dumped qinghai articles	2022-04-09 19:57:08 +01:00
shanxi	shanxi: Add readme	2022-04-09 23:48:31 +01:00
utils	Implement scrape utils	2022-04-09 23:06:46 +01:00
.gitignore	Ignore article directories	2022-04-09 22:35:15 +01:00
flake.lock	Initial commit	2022-04-09 14:44:18 +01:00
flake.nix	Add linkutils	2022-04-09 22:33:43 +01:00
README.md	Update readme	2022-04-09 23:54:19 +01:00

Province article scraping

A couple of scripts to scrape article text from various provinces for a text analysis university course.

We need the following provinces and page ranges:

Guangdong: no page range defined
Qinghai: pages 14-75
Ningxia: pages 11-42 (actually 8-44?)
Shanxi: pages 2-18 (actually 2-20?)
Xinjiang: pages 10-20
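
For scripting, the page ranges listed above could be kept in one place as data. The following is only a sketch: the names `PAGE_RANGES` and `pages_for` are hypothetical and not part of the scripts in utils, and it assumes both ends of each range are inclusive. The uncertain alternatives marked with "actually ...?" are deliberately left out.

```python
from typing import Optional

# Inclusive (first, last) page ranges copied from the list above.
# None means no page range is defined for that province.
# Hypothetical structure: the real scripts may store this differently.
PAGE_RANGES: dict[str, Optional[tuple[int, int]]] = {
    "guangdong": None,
    "qinghai": (14, 75),
    "ningxia": (11, 42),
    "shanxi": (2, 18),
    "xinjiang": (10, 20),
}


def pages_for(province: str) -> list[int]:
    """Return the page numbers to visit for a province, both ends inclusive."""
    bounds = PAGE_RANGES[province]
    if bounds is None:
        return []
    first, last = bounds
    return list(range(first, last + 1))
```

For example, `pages_for("qinghai")` yields the 62 page numbers from 14 through 75, and `pages_for("guangdong")` yields an empty list.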

Each folder contains a zip of the dumped txt files. The zip also includes a links.csv file, which maps the txt files back to their original links (in case some data sanitization is necessary). Guangdong is the exception: its links are in a links.txt file, because scraping those was more difficult and the page is down now, so I can't fix this.
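
To trace a txt file back to its source, the links.csv inside a zip can be read without unpacking the archive. This is a minimal sketch under assumptions the README does not confirm: the function name `load_links` is hypothetical, the file is assumed to be UTF-8 encoded, and each row is assumed to have the form `filename,link` with no header row. If the real file has a header or a different column order, the row handling needs adjusting.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def load_links(zip_path: str) -> dict[str, str]:
    """Map txt file names to their original links.

    Assumes rows of the form `filename,link` (an assumption, not confirmed
    by the README). Rows with fewer than two columns are skipped.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Find links.csv wherever it sits inside the archive.
        name = next(
            n for n in archive.namelist()
            if PurePosixPath(n).name == "links.csv"
        )
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return {row[0]: row[1] for row in csv.reader(text) if len(row) >= 2}
```

This does not apply to the Guangdong zip, which has a links.txt instead; there the `next(...)` call raises `StopIteration` because no links.csv is found.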