---
type: copilot-prompt
status: ready
target: github-copilot
purpose: Parse Charles .chlsx sessions to create a searchable Discourse archive
---

# Copilot Prompt — Charles Discourse Archiver

Paste this into GitHub Copilot on the corporate device.

---

## Prompt

You are helping me build a local searchable archive of a Discourse forum from captured Charles Proxy session files.

### Background

I browse a Discourse forum in my browser while Charles Proxy records traffic. I save the session as a `.chlsx` file. Inside that file are all the HTTP request/response pairs for the pages I visited, including Discourse API calls that return structured JSON (topics, posts, categories, user profiles).

I need you to extract only the Discourse content and organize it into a Markdown archive that:
- Is searchable by an AI in future sessions
- Preserves topic titles, post authors, dates, and content
- Groups by category
- Deduplicates topics that appear across multiple sessions

### File format: `.chlsx`

`.chlsx` is a ZIP archive. Inside are numbered XML files, either at the top level (e.g. `00001.xml`) or in subdirectories (e.g. `00/00001.xml`). Each XML file represents one HTTP request/response pair with this structure:

```xml
<session>
  <request>
    <method>GET</method>
    <url>https://forum.example.com/t/123.json</url>
    <protocol>HTTP/1.1</protocol>
    <header name="Cookie">...</header>
    <body></body>
  </request>
  <response>
    <status>200</status>
    <protocol>HTTP/1.1</protocol>
    <header name="Content-Type">application/json; charset=utf-8</header>
    <body>{"id": 123, "title": "Some Topic", "post_stream": {...}}</body>
  </response>
  <timing>
    <start>2026-01-15T10:30:00.000Z</start>
    <duration>450</duration>
  </timing>
</session>
```
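As a starting point, the structure above could be read roughly as follows. This is a minimal Python sketch that assumes the layout described in this prompt (a ZIP of per-transaction XML files using the `<request>`, `<url>`, `<response>`, `<status>`, and `<body>` elements shown). The names `Transaction` and `iter_transactions` are hypothetical, and a real Charles export may differ, so check against an actual file before relying on it.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Iterator, NamedTuple, Optional


class Transaction(NamedTuple):
    url: str
    status: Optional[int]
    body: str


def iter_transactions(chlsx_path: str) -> Iterator[Transaction]:
    """Yield one Transaction per XML member of the archive (opened read-only)."""
    with zipfile.ZipFile(chlsx_path, "r") as archive:
        # namelist() includes members in subdirectories such as 00/00001.xml.
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # skip members that are not well-formed XML
            url = root.findtext("request/url")
            if not url:
                continue
            status_text = (root.findtext("response/status") or "").strip()
            status = int(status_text) if status_text.isdigit() else None
            body = root.findtext("response/body") or ""
            yield Transaction(url.strip(), status, body)
```

The archive is only ever opened in read mode, which matches the requirement that the original `.chlsx` file stays unmodified.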

### Discourse API endpoints to extract

| What | URL pattern | JSON fields |
|---|---|---|
| Latest topics | `/latest.json` | `topic_list.topics[].{id, title, slug, category_id, created_at, last_posted_at}` |
| Category index | `/categories.json` | `category_list.categories[].{id, name, slug}` |
| Single topic (with posts) | `/t/{id}.json` | `id, title, slug, category_id, post_stream.posts[].{username, cooked, created_at, post_number}` |
| Topic with page | `/t/{id}/{page}.json` | Same as above, paginated |
| User activity | `/u/{username}/activity.json` | `user_actions[]` |
| Search results | `/search.json?q=...` | `topics[]`, `posts[]` |
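One possible way to match request URLs against the table above is a small list of path patterns. The sketch below is an assumption-laden illustration: the labels and the function name `classify_url` are hypothetical, only the exact URL shapes listed in the table are recognized, and anything on a different host than the forum base URL is treated as non-Discourse traffic.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Path patterns taken from the endpoint table; the labels are hypothetical.
ENDPOINT_PATTERNS = [
    ("latest", re.compile(r"^/latest\.json$")),
    ("categories", re.compile(r"^/categories\.json$")),
    ("topic", re.compile(r"^/t/(?P<id>\d+)\.json$")),
    ("topic_page", re.compile(r"^/t/(?P<id>\d+)/(?P<page>\d+)\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
    ("search", re.compile(r"^/search\.json$")),
]


def classify_url(url: str, base_url: str) -> Optional[tuple]:
    """Return (label, captured groups) for a listed endpoint, else None."""
    target, base = urlsplit(url), urlsplit(base_url)
    if target.netloc.lower() != base.netloc.lower():
        return None  # different host: CDN, analytics, other sites
    path = target.path  # the query string (e.g. ?q=...) is not part of the path
    base_path = base.path.rstrip("/")
    if base_path:  # forum served from a subfolder of the host
        if not path.startswith(base_path + "/"):
            return None
        path = path[len(base_path):]
    for label, pattern in ENDPOINT_PATTERNS:
        match = pattern.match(path)
        if match:
            return label, match.groupdict()
    return None
```

Because CSS, JS, images, and fonts never match one of these patterns, they fall through to `None` without needing a separate skip list. If the captured traffic uses other URL shapes, the pattern list would need extending.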

### What to do

1. **Open the `.chlsx` file** as a ZIP archive.
2. **List all XML files** inside (both flat and in subdirectories).
3. **For each XML file**, parse it and check if the request URL matches one of the Discourse endpoints above.
4. **Skip**: CSS, JS, images, font files, analytics, CDN assets, and any non-Discourse endpoint.
5. **Parse the JSON response body** from `<response><body>`.
6. **Create this folder structure** as output:

```
discourse-archive/
├── categories.json      # All categories found
├── index.md             # Master index (table of all topics with ID, title, date, category, URL)
└── topics/
    ├── 123-your-topic-slug.md
    ├── 456-another-topic.md
    └── ...
```
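To make the layout above concrete, the file names and the master index could be produced along these lines. This is a sketch only: `topic_filename` and `render_index` are hypothetical helpers, and the dictionary keys (`id`, `title`, `created`, `category`, `url`) are assumed names for values already extracted from the JSON.

```python
import re


def topic_filename(topic_id: int, slug: str) -> str:
    """Build '<id>-<slug>.md', keeping only filesystem-safe characters."""
    safe_slug = re.sub(r"[^a-z0-9-]+", "-", slug.lower()).strip("-")
    return f"{topic_id}-{safe_slug}.md" if safe_slug else f"{topic_id}.md"


def render_index(topics: list) -> str:
    """Render index.md as a Markdown table, sorted by topic ID."""
    lines = [
        "# Index",
        "",
        "| ID | Title | Date | Category | URL |",
        "|---|---|---|---|---|",
    ]
    for t in sorted(topics, key=lambda t: t["id"]):
        title = t["title"].replace("|", "\\|")  # keep table cells intact
        lines.append(
            f"| {t['id']} | {title} | {t['created']} | {t['category']} | {t['url']} |"
        )
    return "\n".join(lines) + "\n"
```

Sorting by ID keeps the index stable between runs, so re-running the archiver on new sessions produces small, readable diffs.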

### Markdown format per topic

Each topic file should be a clean Markdown document with YAML frontmatter:

```markdown
---
id: 123
title: "Your Topic Title"
slug: your-topic-slug
category: "Category Name"
created: 2026-01-15
updated: 2026-01-16
url: https://forum.example.com/t/your-topic-slug/123
---

# Your Topic Title

**Category**: Category Name

---

## Post 1 — @username1 (2026-01-15T10:30:00Z)

Post content here (HTML stripped, plain Markdown preferred).

---

## Post 2 — @username2 (2026-01-16T14:00:00Z)

More content.

---
```
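A renderer for this template might look like the sketch below. It makes several assumptions: the topic is a plain dictionary whose posts have already been flattened into a `posts` list, each post carries a hypothetical `text` key holding the already-converted body, and `updated` is taken from `last_posted_at` when present. Titles and category names are written with `json.dumps` so that quotes inside them are escaped in the frontmatter.

```python
import json


def render_topic(topic: dict, category_name: str, base_url: str) -> str:
    """Render one topic (with its posts) in the Markdown layout shown above."""
    url = f"{base_url.rstrip('/')}/t/{topic['slug']}/{topic['id']}"
    updated = topic.get("last_posted_at") or topic["created_at"]
    lines = [
        "---",
        f"id: {topic['id']}",
        f"title: {json.dumps(topic['title'], ensure_ascii=False)}",
        f"slug: {topic['slug']}",
        f"category: {json.dumps(category_name, ensure_ascii=False)}",
        f"created: {topic['created_at'][:10]}",  # date part of the ISO timestamp
        f"updated: {updated[:10]}",
        f"url: {url}",
        "---",
        "",
        f"# {topic['title']}",
        "",
        f"**Category**: {category_name}",
        "",
        "---",
        "",
    ]
    for post in sorted(topic["posts"], key=lambda p: p["post_number"]):
        # \u2014 is the dash used in the post headings of the template.
        heading = f"## Post {post['post_number']} \u2014 @{post['username']} ({post['created_at']})"
        lines += [heading, "", post["text"].strip(), "", "---", ""]
    return "\n".join(lines)
```

Timestamps are passed through exactly as captured, so a heading may show fractional seconds (for example `10:30:00.000Z`) if that is what the JSON contained.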

### Deduplication rules

- If the same topic ID appears in multiple `.chlsx` files, keep the one with the most recent `last_posted_at`.
- If a session has page 2+ of a topic (`/t/123/2.json`), merge the posts with page 1.
- Never duplicate posts within a topic.
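The three rules above can be combined into a single merge step. The sketch below is one interpretation, not the only valid one: topic metadata comes from whichever capture has the later `last_posted_at`, posts from both captures are unioned by `post_number`, and the newer capture wins when the same post appears twice. The helpers `merge_topic` and `add_capture`, and the flattened `posts` key, are hypothetical.

```python
def merge_topic(existing: dict, incoming: dict) -> dict:
    """Merge two captures of the same topic ID into one record."""
    # ISO 8601 UTC timestamps in the same format compare correctly as strings.
    if (incoming.get("last_posted_at") or "") >= (existing.get("last_posted_at") or ""):
        newer, older = incoming, existing
    else:
        newer, older = existing, incoming
    # Key posts by post_number so a post can never appear twice.
    posts = {p["post_number"]: p for p in older.get("posts", [])}
    posts.update({p["post_number"]: p for p in newer.get("posts", [])})
    merged = {**older, **newer}  # newer metadata overrides older
    merged["posts"] = [posts[n] for n in sorted(posts)]
    return merged


def add_capture(archive: dict, topic: dict) -> None:
    """Insert a captured topic into the archive, merging on topic ID."""
    topic_id = topic["id"]
    if topic_id in archive:
        archive[topic_id] = merge_topic(archive[topic_id], topic)
    else:
        archive[topic_id] = topic
```

Unioning posts rather than replacing the whole topic means a later session that only captured page 2 does not discard the page 1 posts saved earlier.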

### What to do with the output

Place the resulting `discourse-archive/` folder in a location I can reference in future Copilot sessions. I will point Copilot to that folder when I need to search past Discourse conversations.

### Constraints

- Do not modify the original `.chlsx` file.
- Do not upload or send the extracted data anywhere; keep it local.
- If a topic has no readable content (deleted, access restricted), note it in the index but skip the full extraction.
- Convert the HTML in `cooked` fields to readable plain text or Markdown (Discourse returns post bodies as HTML in the JSON).
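For the HTML conversion, a standard-library approach is sketched below so that nothing needs to be installed or fetched. It is deliberately simple and makes assumptions about what the text should look like: block elements become line breaks, list items become `- ` bullets, and link targets are appended in parentheses. The names `_CookedToText` and `cooked_to_text` are hypothetical, and indentation inside `<pre>` blocks is not preserved.

```python
from html.parser import HTMLParser


class _CookedToText(HTMLParser):
    BLOCK_TAGS = {"p", "div", "ul", "ol", "blockquote", "pre", "tr",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n- ")
        elif tag == "br" or tag in self.BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.parts.append(f" ({self._href})")
            self._href = None
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Drop whitespace-only runs containing newlines (markup between blocks).
        if data.strip() or "\n" not in data:
            self.parts.append(data)


def cooked_to_text(cooked: str) -> str:
    """Convert a Discourse `cooked` HTML string to readable plain text."""
    parser = _CookedToText()
    parser.feed(cooked)
    parser.close()
    result = []
    for line in (l.strip() for l in "".join(parser.parts).splitlines()):
        if line or (result and result[-1]):  # collapse runs of blank lines
            result.append(line)
    return "\n".join(result).strip()
```

For example, `cooked_to_text('<p>Hello &amp; <a href="https://example.com">welcome</a></p>')` returns `Hello & welcome (https://example.com)`.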

### First action

Ask me for:
1. The path to the `.chlsx` file (or files)
2. The Discourse base URL (so you can construct canonical topic URLs)
3. Where I want the output folder created