From 9030da9a0c34f4547070b8dae7160966fea341a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tristan=20Dani=C3=ABl=20Maat?=
Date: Sat, 9 Apr 2022 17:43:47 +0100
Subject: [PATCH] Add Readme

---
 Readme.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 Readme.md

diff --git a/Readme.md b/Readme.md
new file mode 100644
index 0000000..51ef5c8
--- /dev/null
+++ b/Readme.md
@@ -0,0 +1,23 @@
+# Province article scraping
+
+A couple of scripts to scrape article text from various provinces for
+a text analysis university course.
+
+We need:
+
+Qinghai
+: page 14-75
+
+Ningxia
+: page 11-42
+
+Shanxi
+: page 2-18
+
+Xinjiang
+: page 10-20
+
+The websites all have subtle differences, so there's simply a folder +
+scripts for each (the scripts are simple enough that there's no need
+for deduplication or anything complex). Written in python/js where
+necessary for educational purposes.