Extracting a TOC from Markup

In today’s edition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from Markdown or AsciiDoc files.

The approach is simple: use regular expressions to find lines that start with one or more "#" or "=" characters (for Markdown and AsciiDoc, respectively) and print them with an indent that matches their depth (e.g. a "## heading 2" line is indented one block). Because the script reads from top to bottom, you get a quick view of the document structure without building a nested data structure under the hood. I’ve also implemented some simple type detection that uses common file extensions to decide which regex to apply.
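The script itself does not appear on this page, so here is a minimal sketch of the approach just described, not the original code. The extension table, the function names, and the four-space indent are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical extension-to-pattern table (an assumption, not the original script's).
# Each pattern captures the run of marker characters and the rest of the line.
MARKDOWN_HEADING = re.compile(r"^(#+)\s*(.*)$")
ASCIIDOC_HEADING = re.compile(r"^(=+)\s*(.*)$")
HEADING_PATTERNS = {
    ".md": MARKDOWN_HEADING,
    ".markdown": MARKDOWN_HEADING,
    ".adoc": ASCIIDOC_HEADING,
    ".asciidoc": ASCIIDOC_HEADING,
}


def extract_toc(lines, pattern, indent="    "):
    """Return one bullet line per heading, indented by heading depth."""
    toc = []
    for line in lines:
        match = pattern.match(line)
        if match is None:
            continue
        depth = len(match.group(1))  # number of "#" or "=" characters
        title = match.group(2).strip()
        toc.append(f"{indent * (depth - 1)}- {title}")
    return toc


def toc_for_path(path):
    """Pick a pattern from the file extension and extract the TOC."""
    path = Path(path)
    pattern = HEADING_PATTERNS.get(path.suffix.lower())
    if pattern is None:
        raise ValueError(f"unrecognized markup extension: {path.suffix!r}")
    with path.open(encoding="utf-8") as f:
        return extract_toc(f, pattern)
```

Calling `print("\n".join(toc_for_path("post.md")))` (with a hypothetical file name) would print the outline. Because the pattern accepts a bare run of marker characters, a line such as `====` is also reported, as a heading with an empty title.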

The result is a quick view of the structure of a markup file, which is especially handy when the file gets very large. From the Markdown of one of my longer blog posts:

- A Practical Guide to Anonymizing Datasets with Python
    - Anonymizing CSV Data
        - Generating Fake Data
        - Creating A Provider
    - Maintaining Data Quality
        - Domain Distribution
        - Realistic Profiles
        - Fuzzing Fake Names from Duplicates
    - Conclusion
        - Acknowledgments
        - Footnotes

And from the first chapter of Applied Text Analysis with Python:

- Language and Computation
    - What is Language?
        - Identifying the Basic Units of Language
        - Formal vs. Natural Languages
            - Formal Languages
            - Natural Languages
    - Language Models
        - Language Features
        - Contextual Features
        - Structural Features
        - The Academic State of the Art
    - Tools for Natural Language Processing
    - Language Aware Data Products
    - Conclusion

OK, so clearly there are some bugs. The two blank "-" bullet points come from a note callout, which has the form:

[NOTE]
====
Insert note text here.
====

The script therefore misidentifies the first and second ==== as level 4 headings. I tried a couple of regular expression fixes for this, but couldn’t quite get them to work. The next step is to add a simple loop over multiple paths so that I can print out the table of contents for an entire directory (e.g. to get the TOC for the entire book, where one chapter == one file).
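One possible fix, offered as a sketch only: AsciiDoc section titles put a space between the equals signs and the title text, so a pattern that requires that space plus at least one non-space character skips bare ==== delimiter lines. The same sketch includes the multi-path loop. The function names here are hypothetical, and the pattern has not been tested against the book's actual source files.

```python
import re

# Require whitespace and a non-space character after the marker run,
# so a bare "====" block delimiter no longer matches.
STRICT_ADOC_HEADING = re.compile(r"^(=+)[ \t]+(\S.*)$")


def adoc_toc(lines, indent="    "):
    """Return one bullet line per AsciiDoc section title, skipping bare delimiters."""
    toc = []
    for line in lines:
        match = STRICT_ADOC_HEADING.match(line)
        if match:
            depth = len(match.group(1))
            toc.append(f"{indent * (depth - 1)}- {match.group(2).strip()}")
    return toc


def toc_for_paths(paths):
    """Concatenate the TOC of each file, in the order the paths are given."""
    toc = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            toc.extend(adoc_toc(f))
    return toc
```

For a whole book, something like `toc_for_paths(sorted(glob.glob("book/*.adoc")))` would work, where the directory name is made up and the chapter files are assumed to sort in reading order.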