david.delagneau 73166b585f feat: add tooling and documentation for archiving Discourse content via Charles Proxy .chlsx sessions.

2026-05-19 14:28:45 -06:00

4.6 KiB

type	status	target	purpose
copilot-prompt	ready	github-copilot	Parse Charles .chlsx sessions to create a searchable Discourse archive

Copilot Prompt — Charles Discourse Archiver

Paste this into GitHub Copilot on the corporate device.

Prompt

You are helping me build a local searchable archive of a Discourse forum from captured Charles Proxy session files.

Background

I browse a Discourse forum in my browser while Charles Proxy records traffic. I save the session as a .chlsx file. Inside that file are all the HTTP request/response pairs for the pages I visited — including Discourse API calls that return structured JSON (topics, posts, categories, user profiles).

I need you to extract only the Discourse content and organize it into a Markdown archive that:

- Is searchable by an AI in future sessions
- Preserves topic titles, post authors, dates, and content
- Groups by category
- Deduplicates topics that appear across multiple sessions

File format: `.chlsx`

.chlsx is a ZIP archive. Inside are numbered XML files (e.g. 00001.xml, 00/00001.xml). Each XML file represents one HTTP request/response pair with this structure:

```xml
<session>
  <request>
    <method>GET</method>
    <url>https://forum.example.com/t/123.json</url>
    <protocol>HTTP/1.1</protocol>
    <header name="Cookie">...</header>
    <body></body>
  </request>
  <response>
    <status>200</status>
    <protocol>HTTP/1.1</protocol>
    <header name="Content-Type">application/json; charset=utf-8</header>
    <body>{"id": 123, "title": "Some Topic", "post_stream": {...}}</body>
  </response>
  <timing>
    <start>2026-01-15T10:30:00.000Z</start>
    <duration>450</duration>
  </timing>
</session>
```
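
A minimal sketch of reading that structure, assuming the ZIP-of-XML layout described above holds for the files at hand (the standard-library `zipfile.is_zipfile` can confirm this before parsing). The names `Exchange`, `parse_exchange`, and `iter_exchanges` are hypothetical helpers, not part of Charles or Discourse.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import Iterator, NamedTuple, Optional


class Exchange(NamedTuple):
    """One request/response pair, reduced to the fields the archive needs."""
    url: str
    status: int
    content_type: str
    body: str


def parse_exchange(xml_bytes: bytes) -> Optional[Exchange]:
    """Parse one <session> document; return None if it is malformed."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return None
    url = root.findtext("request/url")
    status = root.findtext("response/status")
    if not url or not status or not status.strip().isdigit():
        return None
    content_type = ""
    for header in root.findall("response/header"):
        if header.get("name", "").lower() == "content-type":
            content_type = header.text or ""
    body = root.findtext("response/body") or ""
    return Exchange(url.strip(), int(status.strip()), content_type, body)


def iter_exchanges(chlsx_path) -> Iterator[Exchange]:
    """Yield every parsable exchange; the archive is opened read-only."""
    with zipfile.ZipFile(chlsx_path, "r") as archive:
        # namelist() includes entries in subdirectories such as 00/00001.xml
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".xml"):
                exchange = parse_exchange(archive.read(name))
                if exchange is not None:
                    yield exchange
```

Opening the archive in `"r"` mode means the original file is never written to, and malformed entries are skipped rather than aborting the whole run.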

Discourse API endpoints to extract

What	URL pattern	JSON fields
Latest topics	`/latest.json`	`topic_list.topics[].{id, title, slug, category_id, created_at, last_posted_at}`
Category index	`/categories.json`	`category_list.categories[].{id, name, slug}`
Single topic (with posts)	`/t/{id}.json`	`id, title, slug, category_id, post_stream.posts[].{username, cooked, created_at, post_number}`
Topic with page	`/t/{id}/{page}.json`	Same as above, paginated
User activity	`/u/{username}/activity.json`	`user_actions[]`
Search results	`/search.json?q=...`	`topics[]`, `posts[]`
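
One way to match request URLs against the table above is a small list of regular expressions applied to the URL path. This is a sketch that covers only the patterns listed; `ENDPOINT_PATTERNS` and `classify` are hypothetical names, and a real forum may serve additional URL shapes not handled here.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# (kind, compiled pattern) pairs mirroring the endpoint table.
ENDPOINT_PATTERNS = [
    ("latest", re.compile(r"^/latest\.json$")),
    ("categories", re.compile(r"^/categories\.json$")),
    ("topic", re.compile(r"^/t/(?P<id>\d+)(?:/(?P<page>\d+))?\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
    ("search", re.compile(r"^/search\.json$")),
]


def classify(url: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (kind, captured params) for a known endpoint, else None."""
    path = urlsplit(url).path  # the query string is ignored for matching
    for kind, pattern in ENDPOINT_PATTERNS:
        match = pattern.match(path)
        if match:
            params = {k: v for k, v in match.groupdict().items() if v is not None}
            return kind, params
    return None
```

Anything that returns `None` (CSS, JS, images, analytics, and so on) can simply be skipped.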

What to do

1. Open the .chlsx file as a ZIP archive.
2. List all XML files inside (both flat and in subdirectories).
3. For each XML file, parse it and check if the request URL matches one of the Discourse endpoints above.
4. Skip: CSS, JS, images, font files, analytics, CDN assets, and any non-Discourse endpoint.
5. Parse the JSON response body from <response><body>.
6. Create this folder structure as output:

```
discourse-archive/
├── categories.json          # All categories found
├── index.md                 # Master index (table of all topics with ID, title, date, category, URL)
├── topics/
│   ├── 123-your-topic-slug.md
│   ├── 456-another-topic.md
│   └── ...
```
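
The folder layout above could be produced along these lines. This is a sketch only: `build_index` and `write_archive` are hypothetical helpers, the index column order is an assumption based on the description of index.md, and topics whose category is not in the captured category list are labeled "Unknown".

```python
import json
from pathlib import Path


def build_index(topics: list, category_names: dict, base_url: str) -> str:
    """Render the master index as a Markdown table, sorted by topic ID."""
    lines = [
        "# Discourse Archive Index",
        "",
        "| ID | Title | Date | Category | URL |",
        "| --- | --- | --- | --- | --- |",
    ]
    for topic in sorted(topics, key=lambda t: t["id"]):
        category = category_names.get(topic.get("category_id"), "Unknown")
        title = topic["title"].replace("|", "\\|")  # keep table cells intact
        date = (topic.get("created_at") or "")[:10]
        url = f"{base_url.rstrip('/')}/t/{topic['slug']}/{topic['id']}"
        lines.append(f"| {topic['id']} | {title} | {date} | {category} | {url} |")
    return "\n".join(lines) + "\n"


def write_archive(root: Path, topics: list, categories: list, base_url: str) -> None:
    """Create discourse-archive/ with categories.json, index.md, and topics/."""
    (root / "topics").mkdir(parents=True, exist_ok=True)
    (root / "categories.json").write_text(
        json.dumps(categories, indent=2), encoding="utf-8"
    )
    names = {c["id"]: c["name"] for c in categories}
    (root / "index.md").write_text(
        build_index(topics, names, base_url), encoding="utf-8"
    )
```

Per-topic files would then be written into `root / "topics"` using the `{id}-{slug}.md` naming shown in the tree.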

Markdown format per topic

Each topic file should be a clean Markdown document with YAML frontmatter:

```markdown
---
id: 123
title: "Your Topic Title"
slug: your-topic-slug
category: "Category Name"
created: 2026-01-15
updated: 2026-01-16
url: https://forum.example.com/t/your-topic-slug/123
---

# Your Topic Title

**Category**: Category Name

---

## Post 1 — @username1 (2026-01-15T10:30:00Z)

Post content here (HTML stripped, plain Markdown preferred).

---

## Post 2 — @username2 (2026-01-16T14:00:00Z)

More content.

---
```
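
A sketch of rendering one topic into the format above. `render_topic` is a hypothetical helper; it assumes `created` and `updated` can be taken from the first and last post timestamps, and it accepts a caller-supplied `to_markdown` converter for the `cooked` HTML (the default passes the text through unchanged). Titles are quoted with `json.dumps`, since a JSON string is also a valid double-quoted YAML scalar.

```python
import json
from typing import Callable


def render_topic(
    topic: dict,
    category_name: str,
    base_url: str,
    to_markdown: Callable[[str], str] = lambda cooked: cooked,
) -> str:
    """Return the Markdown document (frontmatter plus posts) for one topic."""
    posts = sorted(topic["post_stream"]["posts"], key=lambda p: p["post_number"])
    created = posts[0]["created_at"][:10] if posts else ""
    updated = posts[-1]["created_at"][:10] if posts else ""
    url = f"{base_url.rstrip('/')}/t/{topic['slug']}/{topic['id']}"
    lines = [
        "---",
        f"id: {topic['id']}",
        f"title: {json.dumps(topic['title'], ensure_ascii=False)}",
        f"slug: {topic['slug']}",
        f"category: {json.dumps(category_name, ensure_ascii=False)}",
        f"created: {created}",
        f"updated: {updated}",
        f"url: {url}",
        "---",
        "",
        f"# {topic['title']}",
        "",
        f"**Category**: {category_name}",
        "",
        "---",
        "",
    ]
    for post in posts:
        lines += [
            # \u2014 is the dash used in the post headings of the template
            f"## Post {post['post_number']} \u2014 @{post['username']} ({post['created_at']})",
            "",
            to_markdown(post["cooked"]).strip(),
            "",
            "---",
            "",
        ]
    return "\n".join(lines)
```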

Deduplication rules

- If the same topic ID appears in multiple .chlsx files, keep the one with the most recent last_posted_at.
- If a session has page 2+ of a topic (/t/123/2.json), merge its posts with those from page 1, ordered by post_number.
- Never duplicate posts within a topic; two posts with the same post_number are the same post.
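
These rules might be sketched as a single merge step applied each time a topic payload is encountered. `merge_topic` is a hypothetical helper; it assumes both payloads carry `last_posted_at` as ISO 8601 UTC strings in the same format (so plain string comparison orders them correctly) and that `post_number` uniquely identifies a post within a topic.

```python
from typing import Optional


def merge_topic(existing: Optional[dict], incoming: dict) -> dict:
    """Combine two captures of the same topic ID into one record."""
    if existing is None:
        return incoming

    def stamp(topic: dict) -> str:
        return topic.get("last_posted_at") or ""

    # The capture with the most recent last_posted_at supplies the metadata.
    if stamp(incoming) >= stamp(existing):
        newer, older = incoming, existing
    else:
        newer, older = existing, incoming

    # Key posts by post_number so pages merge and duplicates collapse;
    # the newer capture is applied last, so its version of a post wins.
    posts = {}
    for source in (older, newer):
        for post in source.get("post_stream", {}).get("posts", []):
            posts[post["post_number"]] = post

    merged = dict(newer)
    merged["post_stream"] = {
        **newer.get("post_stream", {}),
        "posts": [posts[number] for number in sorted(posts)],
    }
    return merged
```

Keeping a dictionary of `topic_id -> merged record` and calling `merge_topic(store.get(topic_id), payload)` for each capture gives the same result regardless of the order in which session files are processed.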

What to do with the output

Place the resulting discourse-archive/ folder in a location I can reference in future Copilot sessions. I will point Copilot to that folder when I need to search past Discourse conversations.

Constraints

- Do not modify the original .chlsx file.
- Do not upload or send the extracted data anywhere — keep it local.
- If a topic has no readable content (deleted, access restricted), note it in the index but skip the full extraction.
- HTML in cooked fields should be converted to readable plain text / Markdown (Discourse stores posts as HTML in the JSON).
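
For the HTML conversion, a rough standard-library sketch is shown below; it keeps everything local and needs no third-party packages. `cooked_to_markdown` is a hypothetical helper that handles only a handful of common tags (paragraphs, line breaks, list items, bold, italic, inline code, links), so quotes, images, tables, and Discourse-specific markup would need extra handling.

```python
import re
from html.parser import HTMLParser


class _CookedToText(HTMLParser):
    """Collect text from cooked HTML, emitting simple Markdown markers."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}
    BLOCK_END = {"p", "ul", "ol", "blockquote", "pre", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n- ")
        elif tag == "br":
            self.parts.append("\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_END:
            self.parts.append("\n\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.parts.append(f"]({self.href})" if self.href else "]")
            self.href = None

    def handle_data(self, data):
        # Drop whitespace-only runs containing newlines (markup indentation).
        if data.strip() or "\n" not in data:
            self.parts.append(data)


def cooked_to_markdown(cooked: str) -> str:
    """Convert a Discourse `cooked` HTML string to readable Markdown-ish text."""
    parser = _CookedToText()
    parser.feed(cooked)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+\n", "\n", text)   # trim trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text.strip()
```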

First action

Ask me for:

1. The path to the .chlsx file (or files)
2. The Discourse base URL (so you can construct canonical topic URLs)
3. Where I want the output folder created
