In today’s addition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from markdown or asciidoc files:
So this is pretty simple, just use regular expressions to look for lines that start with one or more "#"
or "="
(for markdown and asciidoc, respectively) and print them out with an indent according to their depth (e.g. indent ##
heading 2 one block). Because this script goes from top to bottom, you get a quick view of the document structure without creating a nested data structure under the hood. I’ve also implemented some simple type detection using common extensions to decide which regex to use.
The result is a quick view of the structure of a markup file, especially when they can get overly large. From the Markdown of one of my longer blog posts:
- A Practical Guide to Anonymizing Datasets with Python
- Anonymizing CSV Data
- Generating Fake Data
- Creating A Provider
- Maintaining Data Quality
- Domain Distribution
- Realistic Profiles
- Fuzzing Fake Names from Duplicates
- Conclusion
- Acknowledgments
- Footnotes
And from the first chapter of Applied Text Analysis with Python:
- Language and Computation
-
-
- What is Language?
- Identifying the Basic Units of Language
- Formal vs. Natural Languages
- Formal Languages
- Natural Languages
- Language Models
- Language Features
- Contextual Features
- Structural Features
- The Academic State of the Art
- Tools for Natural Language Processing
- Language Aware Data Products
- Conclusion
Ok, so clearly there are some bugs, those two blank -
bullet points are a note callout which has the form:
[NOTE]
====
Insert note text here.
====
Therefore misidentifying the first and second ====
as a level 4 heading. I tried a couple of regular expression fixes for this, but couldn’t exactly get it. The next step is to add a simple loop to do multiple paths so that I can print out the table of contents for an entire directory (e.g. to get the TOC for the entire book where one chapter == one file).