In a previous post, I explained how AI can help tag posts in a blog. That approach eases the burden of implementing an existing organizational scheme. But what if you don’t have a taxonomy to begin with? How can you use AI to help define categories?
I’m always looking for opportunities to experiment with AI for information architecture work. Before trying it with client projects, I prefer to experiment with my own stuff. With this in mind, I used AI to curate a “year in review” episode of my podcast.
The experiment was partly successful, and this approach can help with other common IA challenges. Let’s dive in to see how I did it.
Project Objectives
Here's the problem to be solved: I'm apprehensive about publishing episodes between Christmas and New Year's. Lots of folks are on vacation, so downloads are down. Releasing an interview that week is unfair to the guest.
Compilation episodes are a common alternative. I released one in 2021 featuring five themes that emerged in the year, with episode highlights for each. The final result was effective but took longer to produce than I expected. So, I skipped doing one in 2022.
I want to produce similar shows faster. The biggest time sink in 2021 was reviewing each of the year’s episodes to identify themes and clips. Finding patterns in text is what language models are good at, so it’s what I set out to do. And, of course, categorizing content is central to many other IA challenges.
I expected two things from the AI:
- A set of categories (themes)
- Clips from individual episodes that fit those themes
That was the extent of it. I didn't want a final edit, just a first draft I could build on. That's how I use AI: not as a replacement but as an assistant.
Preparing the Content
The first step was preparing the content the AI would analyze. I had it relatively easy: each podcast episode includes a full transcript. Since the show’s site is published using Jekyll, these transcripts are in plain text (Markdown) format.
I collected the year’s transcripts and copied them to a temporary directory on my Mac. The command line is faster than the Finder for this kind of thing:
cp ~/Sites/theinformed_life/_posts/2023-* ~/Desktop/temp/
This command copied all transcript files that start with the character sequence 2023- to a directory called temp on my Desktop:
2023-01-01-episode-104-marcia-bates.md
2023-01-14-episode-105-david-rose.md
2023-01-29-episode-106-leonie-watson.md
2023-02-12-episode-107-michael-becker.md
2023-02-26-episode-108-carrie-hane.md
Etc.
Great first step. But whole episodes weren't granular enough for the AI to find meaningful patterns: a single conversation can touch on several themes. The content needed to be broken into smaller pieces.
Show transcripts have chapters that make ideal break points. Each starts with an H3 heading. In Markdown, H3s start with three hashes (###), like this:
### Chapter Title
So the next step was breaking up each transcript every time the ### string came up. Again, this goes faster using the command line. I used a tool called gcsplit (the GNU version of csplit, which Homebrew's coreutils package installs on the Mac) to split the text files at specific delimiters:
for file in *.md; do gcsplit "$file" -f "$file-section-" '/\#\#\#/' '{*}'; done
This command loops through each file ending in .md in the current directory and splits it wherever it finds ###. (The pattern is quoted, and the hashes escaped, so the shell and the pattern matcher treat them literally.)
Running this command in the temp directory yielded the following results:
2023-01-01-episode-104-marcia-bates.md
2023-01-01-episode-104-marcia-bates.md-section-00
2023-01-01-episode-104-marcia-bates.md-section-01
2023-01-01-episode-104-marcia-bates.md-section-02
2023-01-01-episode-104-marcia-bates.md-section-03
2023-01-01-episode-104-marcia-bates.md-section-04
2023-01-01-episode-104-marcia-bates.md-section-05
2023-01-01-episode-104-marcia-bates.md-section-06
2023-01-01-episode-104-marcia-bates.md-section-07
2023-01-14-episode-105-david-rose.md
2023-01-14-episode-105-david-rose.md-section-00
2023-01-14-episode-105-david-rose.md-section-01
2023-01-14-episode-105-david-rose.md-section-02
2023-01-14-episode-105-david-rose.md-section-03
2023-01-14-episode-105-david-rose.md-section-04
2023-01-14-episode-105-david-rose.md-section-05
Etc.
Each section file contains one chapter of one episode. Note that I directed gcsplit to include the original filename in the output so I could trace each chapter back to its source episode. I then moved these section files to a directory called chapters. These are the files the AI would analyze.
But there was still one step left. All episodes begin with an introduction and end with the guest telling listeners where they can follow up. I didn't want these chapters in the corpus, so I deleted all section files ending in -00 and the last section file for each episode.
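One way to script that cleanup, assuming the section numbers are zero-padded (gcsplit's default) so the outro always sorts last:

cd chapters
rm *-section-00
for prefix in $(ls *-section-* | sed 's/-section-[0-9]*$//' | sort -u); do
  rm "$(ls "$prefix"-section-* | sort | tail -n 1)"
done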
With those tweaks, I was ready to begin working with the AI.
Working With the AI
At this point, the content was ready to go. The next step was to have the AI process these section files to spot affinities. As with my previous experiment, I used Simon Willison's open source llm tool to access OpenAI's API from the command line.
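If you want to try this yourself, setup looks like this: install llm, add the llm-cluster plugin used below, and store your OpenAI API key.

pip install llm
llm install llm-cluster
llm keys set openai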
First, I had to create a database of embeddings for the podcast chapters. Embeddings represent content items as locations in a multi-dimensional space; an item's position in that space captures its semantic meaning. (Read Willison's explanation.)
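To build an intuition for them, you can embed a single string from the command line; for ada-002, llm prints a list of 1,536 numbers representing that string's position in the space:

llm embed -m ada-002 -c 'How do people organize information?'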
The llm tool also has a command for creating these embeddings in bulk:
llm embed-multi chapters \
  -m ada-002 \
  --files chapters '*.md-*' \
  -d chapters.db \
  --store
This embed-multi command creates a collection called chapters in a SQLite database called chapters.db. It also creates embeddings for all files in the chapters directory whose names contain the string .md- and stores them (along with their content) in this database. I specified that I wanted to use OpenAI's ada-002 model. (You can use other models, including local ones.)
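Though I didn't need it for this project, the database can also be queried directly: llm's similar command returns the stored chapters closest in meaning to an arbitrary phrase, which makes for a quick sanity check of the embeddings.

llm similar chapters -d chapters.db -c 'personal knowledge management'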
With this database in place, I used the llm-cluster plugin to find possible groupings among content items (in this case, podcast episode chapters), with GPT-4 summarizing each group.
llm cluster chapters 6 \
  -d chapters.db \
  --model gpt-4 \
  --summary > clusters.json
This plugin produces clusters (lists) of related items. In this case, I told it to create six clusters from the embeddings in the chapters collection of the chapters database. I also told it to add a summary description to each cluster. The output is then saved to a JSON file.
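For reference, clusters.json has roughly this shape (abbreviated here to a single cluster with one illustrative item):

[
  {
    "id": "0",
    "items": [
      {
        "id": "2023-01-01-episode-104-marcia-bates.md-section-03",
        "content": "..."
      }
    ],
    "summary": "Exploring the Intersection of Music, Information Encoding, and Appreciation"
  }
]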
At this point, I was done with the AI. The six clusters were the first draft of the episode themes. The next step was manually reviewing the content in these proposed clusters.
Manual Edits
The JSON file generated by the plugin included six clusters — i.e., possible themes — and a list of the section files included in each. I next needed to create a separate directory for each theme, containing its proposed section files, so I could review them.
Parsing JSON by hand isn’t fun, so I exported the data in different formats using the jq command line tool. For example, the following command
jq -r '.[] | .id + " - " + .summary' clusters.json
produced a list of the clusters and their summaries:
0 - "Exploring the Intersection of Music, Information Encoding, and Appreciation"
1 - "Conversations on Digital Culture, Writing, and Anthropology"
2 - "Exploring Curiosity, Reading Habits, and Library Organization"
3 - "Discussions on Design Thinking and Systems Approach"
4 - "Exploring Knowledge Management and Digital Tools"
5 - "Content Creation and Information Architecture Strategies"
I also used jq to list the individual section files included in each cluster:
jq -r '.[] | select(.id | contains("0")) | .items[].id' clusters.json > theme-0.md
Here, 0 is the index for the first cluster above. I re-ran the command for clusters 1, 2, 3, etc. (I'd use a for loop if dealing with more than a few clusters.)
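Such a loop might look like this; note it matches each id exactly rather than using contains, which could grab the wrong clusters if there were ten or more:

for i in 0 1 2 3 4 5; do
  jq -r --arg i "$i" '.[] | select(.id == $i) | .items[].id' clusters.json > "theme-$i.md"
done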
Next, I created a directory called themes and subdirectories for each theme, named 0, 1, 2, 3, 4, and 5. I then copied the appropriate sections from the chapters directory to each numbered subdirectory inside themes. Again, this is faster using the command line:
while read -r p; do cp "chapters/$p" "themes/0/"; done < theme-0.md
As above, I repeated this command six times, changing the number each time. (Again, I’d use a for loop if dealing with more than a few themes.) After running this command, I had six theme directories containing section files from various episodes.
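For the record, that loop version might look something like this, creating each theme directory and copying its sections in one pass:

for i in 0 1 2 3 4 5; do
  mkdir -p "themes/$i"
  while read -r p; do cp "chapters/$p" "themes/$i/"; done < "theme-$i.md"
done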
Finally, I reviewed each theme to see which section files made the most sense. Some I discarded; others I kept. I also played around with the themes themselves. I built the final script as I went, including relevant episode highlights. (Noting their sources, of course.)
Final Result
The episode went live on December 31, right on schedule. This post might give the impression that it was a straightforward process. It wasn’t. I had to experiment to get useful results.
For example, I tried different models for clustering. (I got better results from gpt-4 than ada-002.) I also experimented with different numbers of clusters, from four (which lumped too many unrelated conversations together) to ten (which produced a few clusters drawn from a single episode).
You'll notice the final episode highlights four themes (not six), and their names differ from those listed above. Since the point was to showcase broad themes, I ignored clusters that included sections from only one or two shows. Also, the AI's summaries were too literal and granular to work as theme names, so I wrote my own.
Which is to say, the final product is the result of manual categorization. The AI suggested possible groupings, but I defined the final themes. Some chapters fit better with my groupings than the AI’s, so I moved them. I also chose the final labels.
After all, the point wasn’t having the AI create the final structure. Instead, I wanted an acceptable first draft. The AI as assistant, not as replacement. It performed well in that role.
To summarize, the steps were:
- Defining project objectives
- Defining the right level of content granularity
- Preparing content
- Having the AI create embeddings for the content
- Having the AI define clusters from the embeddings
- Manually curating the results
Is the result perfect? No. But it’s as good as I would’ve produced on my own. That said, I spent almost as much time on it as on the 2021 compilation — but that’s because I was experimenting with a new approach. Efficiency isn’t my goal when learning new things. And having learned this technique, I’ll be more efficient the next time I must categorize lots of content. Hopefully, you will, too.