Tristan Daniël Maat 27bb79cae7
Update readme
12 months ago
guangdong Add guangdong readme 12 months ago
ningxia Add ningxia file dump 12 months ago
qinghai Add dumped qinghai articles 12 months ago
shanxi shanxi: Add readme 12 months ago
utils Implement scrape utils 12 months ago
.gitignore Ignore article directories 12 months ago
README.md Update readme 12 months ago
flake.lock Initial commit 12 months ago
flake.nix Add linkutils 12 months ago

README.md

Province article scraping

A couple of scripts to scrape article text from various provinces for a text analysis university course.

We need:

- Guangdong
  - No page numbers defined for this
- Qinghai
  - page 14-75
- Ningxia
  - page 11-42 (actually 8-44?)
- Shanxi
  - page 2-18 (actually 2-20?)
- Xinjiang
  - page 10-20

Each of the folders contains a zip with the dumped txt files. The zip also contains a links.csv file, which maps the txt files back to their original links (in case some data sanitization is necessary). The exception is Guangdong, where the links are in a links.txt file instead: scraping those was more difficult, and the page is down now, so I can't fix this.
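
To illustrate how the dumps could be consumed, here is a minimal Python sketch that reads one of the zips and pairs each txt file with its original link. It is not part of the repository's utils. The `load_articles` function, the UTF-8 encoding, and above all the links.csv layout (one `<txt file name>,<original link>` row per article, no header row) are assumptions, so check them against the actual file before relying on this.

```python
import csv
import io
import zipfile


def load_articles(zip_path, links_name="links.csv"):
    """Return {txt file name: (original link or None, article text)}.

    Hypothetical helper: the links.csv column layout and the UTF-8
    encoding are assumptions, not something the README specifies.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()

        # Assumption: each row is `<txt file name>,<original link>`, no header.
        links = {}
        link_members = [n for n in names if n.rsplit("/", 1)[-1] == links_name]
        if link_members:
            with archive.open(link_members[0]) as raw:
                text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text_stream):
                    if len(row) >= 2:
                        links[row[0].strip()] = row[1].strip()

        articles = {}
        for name in names:
            base = name.rsplit("/", 1)[-1]
            # Skip non-article members, including Guangdong's links.txt.
            if not base.endswith(".txt") or base == "links.txt":
                continue
            body = archive.read(name).decode("utf-8")
            # Files without a matching row simply get None as their link.
            articles[base] = (links.get(base), body)
        return articles
```

Calling `load_articles` with the path to one of the province zips returns a dictionary keyed by txt file name. For Guangdong every link will be `None`, since that zip carries a links.txt instead of a links.csv and its format is not described here.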