---
type: copilot-prompt
status: ready
target: github-copilot
purpose: Parse Charles .chlsx sessions to create a searchable Discourse archive
---

# Copilot Prompt — Charles Discourse Archiver

Paste this into GitHub Copilot on the corporate device.

---

## Prompt

You are helping me build a local searchable archive of a Discourse forum from captured Charles Proxy session files.

### Background

I browse a Discourse forum in my browser while Charles Proxy records traffic. I save the session as a `.chlsx` file. Inside that file are all the HTTP request/response pairs for the pages I visited, including Discourse API calls that return structured JSON (topics, posts, categories, user profiles).

I need you to extract only the Discourse content and organize it into a Markdown archive that:

- Is searchable by an AI in future sessions
- Preserves topic titles, post authors, dates, and content
- Groups by category
- Deduplicates topics that appear across multiple sessions

### File format: `.chlsx`

`.chlsx` is a ZIP archive. Inside are numbered XML files (e.g. `00001.xml`, `00/00001.xml`). Each XML file represents one HTTP request/response pair with this structure:

```xml
GET https://forum.example.com/t/123.json HTTP/1.1

...

200 HTTP/1.1

application/json; charset=utf-8

{"id": 123, "title": "Some Topic", "post_stream": {...}}

2026-01-15T10:30:00.000Z

450
```

### Discourse API endpoints to extract

| What | URL pattern | JSON fields |
|---|---|---|
| Latest topics | `/latest.json` | `topic_list.topics[].{id, title, slug, category_id, created_at, last_posted_at}` |
| Category index | `/categories.json` | `category_list.categories[].{id, name, slug}` |
| Single topic (with posts) | `/t/{id}.json` | `id, title, slug, category_id, post_stream.posts[].{username, cooked, created_at, post_number}` |
| Topic with page | `/t/{id}/{page}.json` | Same as above, paginated |
| User activity | `/u/{username}/activity.json` | `user_actions[]` |
| Search results | `/search.json?q=...` | `topics[]`, `posts[]` |

### What to do

1. **Open the `.chlsx` file** as a ZIP archive.
2. **List all XML files** inside (both flat and in subdirectories).
3. **For each XML file**, parse it and check whether the request URL matches one of the Discourse endpoints above.
4. **Skip**: CSS, JS, images, font files, analytics, CDN assets, and any non-Discourse endpoint.
5. **Parse the JSON response body** from the response's body element in the XML.
6. **Create this folder structure** as output:

```
discourse-archive/
├── categories.json   # All categories found
├── index.md          # Master index (table of all topics with ID, title, date, category, URL)
└── topics/
    ├── 123-your-topic-slug.md
    ├── 456-another-topic.md
    └── ...
```

### Markdown format per topic

Each topic file should be a clean Markdown document with YAML frontmatter:

```markdown
---
id: 123
title: "Your Topic Title"
slug: your-topic-slug
category: "Category Name"
created: 2026-01-15
updated: 2026-01-16
url: https://forum.example.com/t/your-topic-slug/123
---

# Your Topic Title

**Category**: Category Name

---

## Post 1 — @username1 (2026-01-15T10:30:00Z)

Post content here (HTML stripped, plain Markdown preferred).

---

## Post 2 — @username2 (2026-01-16T14:00:00Z)

More content.

---
```

### Deduplication rules

- If the same topic ID appears in multiple `.chlsx` files, keep the one with the most recent `last_posted_at`.
- If a session has page 2+ of a topic (`/t/123/2.json`), merge the posts with page 1.
- Never duplicate posts within a topic.

### What to do with the output

Place the resulting `discourse-archive/` folder in a location I can reference in future Copilot sessions. I will point Copilot to that folder when I need to search past Discourse conversations.

### Constraints

- Do not modify the original `.chlsx` file.
- Do not upload or send the extracted data anywhere; keep it local.
- If a topic has no readable content (deleted, access restricted), note it in the index but skip the full extraction.
- HTML in `cooked` fields should be converted to readable plain text / Markdown (Discourse stores posts as HTML in the JSON).

### First action

Ask me for:

1. The path to the `.chlsx` file (or files)
2. The Discourse base URL (so you can construct canonical topic URLs)
3. Where I want the output folder created
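
### Optional starting point

The listing, URL-matching, and page-merging steps described above could begin from something like the sketch below. Treat it as a rough starting point under stated assumptions, not a required implementation. The names `ENDPOINT_PATTERNS`, `classify_url`, `list_xml_members`, and `merge_posts` are placeholders I made up. The regular expressions assume exactly the URL shapes in the endpoint table, and using `post_number` as the key for removing duplicate posts is my assumption. The sketch deliberately leaves out XML parsing, so adapt that part to the element names actually present in the session files.

```python
import re
import zipfile
from urllib.parse import urlparse

# Hypothetical pattern table: one regex per endpoint in the table above.
# Patterns are matched against the URL path only, so query strings
# such as "?q=..." are ignored.
ENDPOINT_PATTERNS = [
    ("latest", re.compile(r"^/latest\.json$")),
    ("categories", re.compile(r"^/categories\.json$")),
    ("topic", re.compile(r"^/t/(?P<id>\d+)(?:/(?P<page>\d+))?\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
    ("search", re.compile(r"^/search\.json$")),
]


def classify_url(url):
    """Return (kind, params) for a known Discourse endpoint, else None."""
    path = urlparse(url).path
    for kind, pattern in ENDPOINT_PATTERNS:
        match = pattern.match(path)
        if match:
            return kind, match.groupdict()
    return None  # CSS, JS, images, analytics, and everything else


def list_xml_members(chlsx_path):
    """List XML entries in the ZIP, both flat and inside subdirectories.

    The archive is opened read-only, so the original file is not modified.
    """
    with zipfile.ZipFile(chlsx_path, "r") as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".xml")
        )


def merge_posts(pages):
    """Merge post lists from several pages of one topic without duplicates.

    Assumption: post_number identifies a post within a topic. When the same
    post_number appears more than once, the later copy wins.
    """
    merged = {}
    for posts in pages:
        for post in posts:
            merged[post["post_number"]] = post
    return [merged[number] for number in sorted(merged)]
```

For example, `classify_url("https://forum.example.com/t/123/2.json")` yields `("topic", {"id": "123", "page": "2"})`, while an asset URL yields `None` and can be skipped.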