NLTK Corpus Reader for Extracted Corpus
Yesterday I wrote a blog post about [extracting a corpus]({% post_url 2016-04-10-extract-ddl-corpus %}) from a directory of Markdown files, such as a blog deployed with Silvrback or Jekyll. In this post, I’ll briefly show how to use the built-in CorpusReader objects in NLTK to stream that data to NLTK’s segmentation and tokenization preprocessing functions for analytics. The dataset I’ll be working with is the District Data Labs Blog, in particular the state of the blog as of today. The dataset can be downloaded from the ddl corpus, which also includes the code from this post so you can use it to perform other analytics. ...
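To make the idea concrete, here is a minimal sketch of pointing one of NLTK’s built-in readers at a directory of Markdown files. It is not the code from the ddl corpus: the temporary directory, the two file names, and their contents are hypothetical stand-ins for the extracted blog posts, and `PlaintextCorpusReader` is just one reasonable choice of reader for this kind of data.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical stand-in for the extracted corpus: two tiny Markdown files
# written to a temporary directory so the sketch runs on its own.
root = tempfile.mkdtemp()
docs = {
    "first-post.md": "Hello corpus. This is a post.\n\nSecond paragraph here.\n",
    "second-post.md": "Another post, with more words.\n",
}
for name, text in docs.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(text)

# The second argument is a regular expression (not a glob) that selects
# which files under the root belong to the corpus.
corpus = PlaintextCorpusReader(root, r".*\.md")

# File ids are the matched paths relative to the corpus root.
fileids = corpus.fileids()

# words() streams tokens from disk; the default word tokenizer is
# regex-based and splits punctuation from alphanumeric runs.
words = list(corpus.words("first-post.md"))

print(fileids)
print(words[:3])
```

The reader also exposes `sents()` and `paras()` for sentence and paragraph segmentation. Those rely on NLTK’s Punkt sentence tokenizer by default, so they need the corresponding model data to be downloaded first, which is why the sketch sticks to `words()`.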