diff --git a/Readme.md b/Readme.md
new file mode 100644
index 0000000..51ef5c8
--- /dev/null
+++ b/Readme.md
@@ -0,0 +1,23 @@
+# Province article scraping
+
+A couple of scripts to scrape article text from various provinces for
+a text analysis university course.
+
+We need:
+
+Qinghai
+: page 14-75
+
+Ningxia
+: page 11-42
+
+Shanxi
+: page 2-18
+
+Xinjiang
+: page 10-20
+
+The websites all have subtle differences, so there's simply a folder +
+scripts for each (the scripts are simple enough that there's no need
+for deduplication or anything complex). Written in python/js where
+necessary for educational purposes.