Plain text won. As the lingua franca for interacting with LLMs, Markdown is now ubiquitous. To wit: Microsoft recently released MarkItDown, an open-source tool that converts various types of files (including DOCX, PPTX, and PDF) to Markdown.
If you’re into AI, this is a godsend. It’s easier for an LLM to process a text string than a binary blob. Much of the stuff that matters most in many of these files can be rendered as plain text anyway.
MarkItDown is only the latest. Lots of other tools now import and export Markdown. For some classes of apps, it’s become almost an expected feature. There are even apps like Obsidian that use plain text Markdown as their standard data format.
This is a big deal if you work with LLMs. I’ve written about my experiments using GraphRAG for information architecture (IA) work. One limitation of that approach is that the corpus for building the knowledge graph must consist of plain text.
As a result, I’ve limited my experiments to web pages and the occasional PDF file, both of which can be easily converted to plain text. But much content in today’s corporate world lives in PowerPoint decks, Word documents, and Excel files. Making them accessible to the LLM required extra steps.
MarkItDown automates the conversion of such files to plain text. That makes building a more comprehensive corpus more feasible. It opens up tools like GraphRAG for use cases that would’ve been impractical before.
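To make that concrete, here's a rough sketch of what building such a corpus could look like in Python. It's an illustration, not a recipe: `build_corpus`, `OFFICE_SUFFIXES`, and the `convert` parameter are names I made up for this example. The converter is passed in as a plain function, so the sketch doesn't depend on any particular tool.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical default: the file types mentioned in this post.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx", ".pdf"}


def build_corpus(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[Path], str],
    suffixes: Iterable[str] = OFFICE_SUFFIXES,
) -> List[Path]:
    """Mirror src_dir into out_dir, writing one Markdown file per document.

    `convert` is any function that takes a file path and returns Markdown
    text. It is an assumption of this sketch, not a specific library's API.
    """
    src, out = Path(src_dir), Path(out_dir)
    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(src.rglob("*")):
        # Skip directories and file types we don't want in the corpus.
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Keep the original extension in the name (report.docx.md) so that
        # report.docx and report.pdf don't overwrite each other.
        target = out / path.relative_to(src)
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(path), encoding="utf-8")
        written.append(target)
    return written
```

With MarkItDown installed, the converter could be something like `lambda p: MarkItDown().convert(str(p)).text_content`. That's based on the usage shown in the project's README as I understand it, so check the current docs before relying on it. The resulting folder of Markdown files is the kind of plain-text corpus a tool like GraphRAG expects.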
That’s exciting enough. But the big picture is what matters most: plain text – the most vanilla of digital formats! – is back.
There was a time when closed, proprietary formats seemed destined to rule. In the work world, information was shared in DOCs, XLSs, etc. Design work happened in PSDs and AI files. (That’s Adobe Illustrator, not artificial intelligence.)
Sure, these formats provide capabilities plain text can’t match. They’re likely also the most efficient way of saving that type of data. But they have downsides.
For one, they’re not as portable to other apps. (Often, by design.) That’s a big deal, particularly if you aim to save stuff for the long term. While there will likely be apps that open PSDs and DOCs for many years, there are more obscure proprietary formats that aren’t as widely supported. If you have lots of data stored in such formats, you could be stuck in a few decades as the computing world moves on.
Now, much of my computing life revolves around three apps: a web browser, a text editor, and a terminal. This gives me peace of mind and unprecedented control. Beyond long-term compatibility, storing data in plain text makes it easier to process, integrate with other apps, back up, etc.
The current situation is a significant improvement over where the computing world was headed decades ago. Plain text is as standard and open as it gets. It’s the one format that’s likely to be supported the longest and most widely. We have LLMs to thank for its resurgence.