| type | status | target | purpose |
|---|---|---|---|
| copilot-prompt | ready | github-copilot | Parse Charles .chlsx sessions to create a searchable Discourse archive |
Copilot Prompt — Charles Discourse Archiver
Paste this into GitHub Copilot on the corporate device.
Prompt
You are helping me build a local searchable archive of a Discourse forum from captured Charles Proxy session files.
Background
I browse a Discourse forum in my browser while Charles Proxy records traffic. I save the session as a .chlsx file. Inside that file are all the HTTP request/response pairs for the pages I visited — including Discourse API calls that return structured JSON (topics, posts, categories, user profiles).
I need you to extract only the Discourse content and organize it into a Markdown archive that:
- Is searchable by an AI in future sessions
- Preserves topic titles, post authors, dates, and content
- Groups by category
- Deduplicates topics that appear across multiple sessions
File format: .chlsx
.chlsx is a ZIP archive. Inside are numbered XML files (e.g. 00001.xml, 00/00001.xml). Each XML file represents one HTTP request/response pair with this structure:
<session>
<request>
<method>GET</method>
<url>https://forum.example.com/t/123.json</url>
<protocol>HTTP/1.1</protocol>
<header name="Cookie">...</header>
<body></body>
</request>
<response>
<status>200</status>
<protocol>HTTP/1.1</protocol>
<header name="Content-Type">application/json; charset=utf-8</header>
<body>{"id": 123, "title": "Some Topic", "post_stream": {...}}</body>
</response>
<timing>
<start>2026-01-15T10:30:00.000Z</start>
<duration>450</duration>
</timing>
</session>
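As a rough illustration of reading one such entry, the sketch below pulls the request URL, response status, and response body out of a single XML document with Python's standard `xml.etree.ElementTree`. The element names are taken from the structure shown above and have not been checked against a real Charles export; `parse_entry` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def parse_entry(xml_text):
    """Return (url, status, body) for one request/response XML document."""
    root = ET.fromstring(xml_text)
    # Paths are relative to the <session> root element shown above.
    url = root.findtext("request/url", default="")
    status_text = root.findtext("response/status", default="")
    status = int(status_text) if status_text.isdigit() else None
    body = root.findtext("response/body", default="")
    return url, status, body

# Made-up sample mirroring the structure described in this section.
sample = """<session>
  <request>
    <method>GET</method>
    <url>https://forum.example.com/t/123.json</url>
  </request>
  <response>
    <status>200</status>
    <body>{"id": 123, "title": "Some Topic"}</body>
  </response>
</session>"""

url, status, body = parse_entry(sample)
```

If real session files wrap bodies differently (for example base64-encoded or in CDATA with an encoding attribute), the body handling would need to be adapted.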
Discourse API endpoints to extract
| What | URL pattern | JSON fields |
|---|---|---|
| Latest topics | /latest.json | topic_list.topics[].{id, title, slug, category_id, created_at, last_posted_at} |
| Category index | /categories.json | category_list.categories[].{id, name, slug} |
| Single topic (with posts) | /t/{id}.json | id, title, slug, category_id, post_stream.posts[].{username, cooked, created_at, post_number} |
| Topic with page | /t/{id}/{page}.json | Same as above, paginated |
| User activity | /u/{username}/activity.json | user_actions[] |
| Search results | /search.json?q=... | topics[], posts[] |
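One possible way to match captured URLs against the table above is a small list of regular expressions applied to the URL path. This is only a sketch: the kind labels (`latest`, `topic`, and so on) and the `classify` helper are hypothetical names, and it assumes the forum is served from the site root rather than a subfolder.

```python
import re
from urllib.parse import urlparse

# One pattern per row of the endpoint table; labels are illustrative only.
ENDPOINT_PATTERNS = [
    ("latest", re.compile(r"^/latest\.json$")),
    ("categories", re.compile(r"^/categories\.json$")),
    ("topic", re.compile(r"^/t/(?P<id>\d+)(?:/(?P<page>\d+))?\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
    ("search", re.compile(r"^/search\.json$")),
]

def classify(url):
    """Return (kind, params) for a Discourse endpoint URL, or None to skip it."""
    path = urlparse(url).path  # drops the query string, e.g. ?q=...
    for kind, pattern in ENDPOINT_PATTERNS:
        match = pattern.match(path)
        if match:
            return kind, match.groupdict()
    return None
```

Anything that returns `None` (CSS, JS, images, analytics, and other hosts' traffic) can be skipped without parsing its response body.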
What to do
- Open the .chlsx file as a ZIP archive.
- List all XML files inside (both flat and in subdirectories).
- For each XML file, parse it and check if the request URL matches one of the Discourse endpoints above.
- Skip: CSS, JS, images, font files, analytics, CDN assets, and any non-Discourse endpoint.
- Parse the JSON response body from <response><body>.
- Create this folder structure as output:
discourse-archive/
├── categories.json # All categories found
├── index.md # Master index (table of all topics with ID, title, date, category, URL)
├── topics/
│ ├── 123-your-topic-slug.md
│ ├── 456-another-topic.md
│ └── ...
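The opening steps of the list above (open the archive, list the XML members) might look like the following sketch. It relies on the ZIP layout described in this prompt, so it checks `zipfile.is_zipfile` first and fails loudly if a given file is not actually a ZIP archive. `iter_xml_entries` is a hypothetical helper name, and the in-memory archive at the bottom is made-up demo data.

```python
import io
import zipfile

def iter_xml_entries(chlsx_path):
    """Yield (member_name, xml_text) for every .xml member, flat or nested."""
    if not zipfile.is_zipfile(chlsx_path):
        raise ValueError(f"{chlsx_path} is not a ZIP archive")
    # Mode "r" is read-only, so the original file is never modified.
    with zipfile.ZipFile(chlsx_path, "r") as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".xml"):
                yield name, archive.read(name).decode("utf-8", errors="replace")

# Demo: an in-memory archive standing in for a real session file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("00001.xml", "<session/>")
    demo.writestr("00/00002.xml", "<session/>")
    demo.writestr("meta.txt", "ignored")
buffer.seek(0)

names = [name for name, _ in iter_xml_entries(buffer)]
```

Because `namelist()` returns full member paths, files in subdirectories such as `00/00002.xml` are picked up without any extra directory walking.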
Markdown format per topic
Each topic file should be a clean Markdown document with YAML frontmatter:
---
id: 123
title: "Your Topic Title"
slug: your-topic-slug
category: "Category Name"
created: 2026-01-15
updated: 2026-01-16
url: https://forum.example.com/t/your-topic-slug/123
---
# Your Topic Title
**Category**: Category Name
---
## Post 1 — @username1 (2026-01-15T10:30:00Z)
Post content here (HTML stripped, plain Markdown preferred).
---
## Post 2 — @username2 (2026-01-16T14:00:00Z)
More content.
---
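A sketch of how one topic could be rendered into this layout is shown below. Several choices here are assumptions rather than part of the format: `render_topic` and its `html_to_text` parameter are hypothetical, the created and updated dates are taken from the first and last post, titles are quoted with `json.dumps` (a JSON string is also a valid double-quoted YAML string), and blank lines are placed between body blocks so that a `---` rule is not read as a heading underline.

```python
import json

def render_topic(topic, category_name, base_url, html_to_text=lambda html: html):
    """Return (filename, markdown_text) for one topic dict from /t/{id}.json."""
    posts = sorted(topic["post_stream"]["posts"], key=lambda p: p["post_number"])
    # Assumption: first/last post timestamps stand in for created/updated.
    created = posts[0]["created_at"][:10] if posts else ""
    updated = posts[-1]["created_at"][:10] if posts else ""
    url = f"{base_url.rstrip('/')}/t/{topic['slug']}/{topic['id']}"

    front = [
        "---",
        f"id: {topic['id']}",
        f"title: {json.dumps(topic['title'], ensure_ascii=False)}",
        f"slug: {topic['slug']}",
        f"category: {json.dumps(category_name, ensure_ascii=False)}",
        f"created: {created}",
        f"updated: {updated}",
        f"url: {url}",
        "---",
    ]
    body = [f"# {topic['title']}", f"**Category**: {category_name}", "---"]
    for post in posts:
        # \u2014 is the dash used in the template's post headings.
        body.append(
            f"## Post {post['post_number']} \u2014 @{post['username']} ({post['created_at']})"
        )
        body.append(html_to_text(post["cooked"]))
        body.append("---")

    filename = f"{topic['id']}-{topic['slug']}.md"
    return filename, "\n".join(front) + "\n\n" + "\n\n".join(body) + "\n"
```

Passing a real HTML-to-text converter as `html_to_text` keeps the rendering logic separate from the cleanup of `cooked` content.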
Deduplication rules
- If the same topic ID appears in multiple .chlsx files, keep the one with the most recent last_posted_at.
- If a session has page 2+ of a topic (/t/123/2.json), merge the posts with page 1.
- Never duplicate posts within a topic.
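These rules could be implemented roughly as follows. The sketch keys posts by `post_number` so a post can never appear twice, and lets the capture with the newer `last_posted_at` win when the same post shows up in two sessions. It assumes each topic payload carries a `last_posted_at` field and that timestamps are UTC ISO 8601 strings in a uniform format, so plain string comparison orders them correctly. `merge_topic` and the `archive` layout are hypothetical.

```python
def merge_topic(archive, capture):
    """Merge one captured topic payload into archive (a dict keyed by topic id)."""
    topic_id = capture["id"]
    meta = {key: value for key, value in capture.items() if key != "post_stream"}
    new_posts = {post["post_number"]: post for post in capture["post_stream"]["posts"]}

    existing = archive.get(topic_id)
    if existing is None:
        archive[topic_id] = {"meta": meta, "posts": new_posts}
        return

    # Uniform ISO 8601 UTC strings sort chronologically as plain text.
    newer = (capture.get("last_posted_at") or "") >= (
        existing["meta"].get("last_posted_at") or ""
    )
    if newer:
        existing["meta"] = meta
    for number, post in new_posts.items():
        # Add unseen posts always; overwrite seen ones only from a newer capture.
        if newer or number not in existing["posts"]:
            existing["posts"][number] = post
```

Paginated responses for the same topic pass through the same function, so page 2 of a capture simply contributes additional post numbers to the entry created by page 1.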
What to do with the output
Place the resulting discourse-archive/ folder in a location I can reference in future Copilot sessions. I will point Copilot to that folder when I need to search past Discourse conversations.
Constraints
- Do not modify the original .chlsx file.
- Do not upload or send the extracted data anywhere; keep it local.
- If a topic has no readable content (deleted, access restricted), note it in the index but skip the full extraction.
- HTML in cooked fields should be converted to readable plain text / Markdown (Discourse stores posts as HTML in the JSON).
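For the last constraint, a dependency-free way to turn `cooked` HTML into readable text is the standard library's `html.parser`. The sketch below is deliberately minimal: it keeps text, puts block elements on their own lines, and turns list items into `- ` bullets. It does not preserve links, emphasis, or indentation inside code blocks, and `CookedToText` and `cooked_to_text` are hypothetical names.

```python
from html.parser import HTMLParser

class CookedToText(HTMLParser):
    """Collect text from post HTML, inserting line breaks around block tags."""
    BLOCK_TAGS = {"p", "div", "ul", "ol", "blockquote", "pre",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; and friends
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n- ")
        elif tag == "br" or tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def cooked_to_text(cooked):
    """Convert a post's cooked HTML into plain text with collapsed blank lines."""
    parser = CookedToText()
    parser.feed(cooked)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    kept = []
    for line in lines:
        # Keep a blank line only when it follows a non-blank line.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

If richer Markdown output is wanted (links, bold, code fences), a dedicated HTML-to-Markdown library would likely be a better fit than extending this parser.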
First action
Ask me for:
- The path to the .chlsx file (or files)
- The Discourse base URL (so you can construct canonical topic URLs)
- Where I want the output folder created