fidelity-ai-workspace/ai/discourse-archive/copilot-prompt-charles-discourse-archiver.md

---
type: copilot-prompt
status: ready
target: github-copilot
purpose: Parse Charles .chlsx sessions to create a searchable Discourse archive
---

# Copilot Prompt — Charles Discourse Archiver

Paste this into GitHub Copilot on the corporate device.

---

## Prompt

You are helping me build a local searchable archive of a Discourse forum from captured Charles Proxy session files.

### Background

I browse a Discourse forum in my browser while Charles Proxy records traffic. I save the session as a `.chlsx` file. Inside that file are all the HTTP request/response pairs for the pages I visited — including Discourse API calls that return structured JSON (topics, posts, categories, user profiles).

I need you to extract only the Discourse content and organize it into a Markdown archive that:
- Is searchable by an AI in future sessions
- Preserves topic titles, post authors, dates, and content
- Groups by category
- Deduplicates topics that appear across multiple sessions

### File format: `.chlsx`

`.chlsx` is a ZIP archive. Inside are numbered XML files, either at the archive root (e.g. `00001.xml`) or in numbered subdirectories (e.g. `00/00001.xml`). Each XML file represents one HTTP request/response pair with this structure:

```xml
<session>
  <request>
    <method>GET</method>
    <url>https://forum.example.com/t/123.json</url>
    <protocol>HTTP/1.1</protocol>
    <header name="Cookie">...</header>
    <body></body>
  </request>
  <response>
    <status>200</status>
    <protocol>HTTP/1.1</protocol>
    <header name="Content-Type">application/json; charset=utf-8</header>
    <body>{"id": 123, "title": "Some Topic", "post_stream": {...}}</body>
  </response>
  <timing>
    <start>2026-01-15T10:30:00.000Z</start>
    <duration>450</duration>
  </timing>
</session>
```
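Reading one such entry could look like the sketch below. It assumes the element names shown in the sample above (`request/url`, `response/status`, `response/body`); the function name `parse_entry` is hypothetical, and real captures may encode bodies differently.

```python
import json
import xml.etree.ElementTree as ET


def parse_entry(xml_text):
    """Return (url, status, parsed_json_or_None) for one request/response pair."""
    root = ET.fromstring(xml_text)
    url = root.findtext("request/url", default="")
    # An empty or missing <status> is treated as 0 rather than raising.
    status = int(root.findtext("response/status", default="0") or 0)
    body = root.findtext("response/body", default="") or ""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        data = None  # not JSON (HTML page, image, empty body, ...)
    return url, status, data
```

Entries where the third value is `None` can be skipped without further inspection.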

### Discourse API endpoints to extract

| What | URL pattern | JSON fields |
|---|---|---|
| Latest topics | `/latest.json` | `topic_list.topics[].{id, title, slug, category_id, created_at, last_posted_at}` |
| Category index | `/categories.json` | `category_list.categories[].{id, name, slug}` |
| Single topic (with posts) | `/t/{id}.json` | `id, title, slug, category_id, post_stream.posts[].{username, cooked, created_at, post_number}` |
| Topic with page | `/t/{id}/{page}.json` | Same as above, paginated |
| User activity | `/u/{username}/activity.json` | `user_actions[]` |
| Search results | `/search.json?q=...` | `topics[]`, `posts[]` |
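The table above could be turned into a URL classifier along these lines. This is a minimal sketch that matches only the URL path, using exactly the patterns listed; the names `ENDPOINTS` and `classify` are hypothetical, and any URL shape not in the table (for example a topic URL that includes a slug) would need its own pattern.

```python
import re
from urllib.parse import urlparse

# One (label, pattern) pair per row of the endpoint table.
ENDPOINTS = [
    ("latest", re.compile(r"^/latest\.json$")),
    ("categories", re.compile(r"^/categories\.json$")),
    ("topic", re.compile(r"^/t/(?P<id>\d+)(?:/(?P<page>\d+))?\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
    ("search", re.compile(r"^/search\.json$")),
]


def classify(url):
    """Return (label, captured_params) for a Discourse endpoint, else None."""
    path = urlparse(url).path  # drops the query string, e.g. ?q=...
    for label, pattern in ENDPOINTS:
        match = pattern.match(path)
        if match:
            return label, match.groupdict()
    return None
```

Anything that returns `None` (CSS, JS, images, analytics) is not a Discourse endpoint from the table and can be ignored.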

### What to do

1. **Open the `.chlsx` file** as a ZIP archive.
2. **List all XML files** inside (both flat and in subdirectories).
3. **For each XML file**, parse it and check if the request URL matches one of the Discourse endpoints above.
4. **Skip**: CSS, JS, images, font files, analytics, CDN assets, and any non-Discourse endpoint.
5. **Parse the JSON response body** from `<response><body>`.
6. **Create this folder structure** as output:

```
discourse-archive/
├── categories.json          # All categories found
├── index.md                 # Master index (table of all topics with ID, title, date, category, URL)
└── topics/
    ├── 123-your-topic-slug.md
    ├── 456-another-topic.md
    └── ...
```
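Steps 1 and 2 could be sketched as follows, assuming the session file really is a ZIP archive as described in this prompt. The archive is opened read-only, so the original file is never modified; `iter_xml_members` is a hypothetical helper name.

```python
import zipfile


def iter_xml_members(chlsx_path):
    """Yield (member_name, xml_bytes) for every XML file in the archive."""
    # Mode "r" only reads; the .chlsx file itself is left untouched.
    with zipfile.ZipFile(chlsx_path, "r") as archive:
        # namelist() includes members in subdirectories such as "00/00001.xml".
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".xml"):
                yield name, archive.read(name)
```

Each yielded payload can then be parsed and checked against the endpoint patterns before any output is written.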

### Markdown format per topic

Each topic file should be a clean Markdown document with YAML frontmatter:

```markdown
---
id: 123
title: "Your Topic Title"
slug: your-topic-slug
category: "Category Name"
created: 2026-01-15
updated: 2026-01-16
url: https://forum.example.com/t/your-topic-slug/123
---

# Your Topic Title

**Category**: Category Name

---

## Post 1 — @username1 (2026-01-15T10:30:00Z)

Post content here (HTML stripped, plain Markdown preferred).

---

## Post 2 — @username2 (2026-01-16T14:00:00Z)

More content.

---
```
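Rendering one topic into this layout might look like the sketch below. It assumes the topic dict carries the fields named in the endpoint table (`id`, `title`, `slug`, `created_at`, `last_posted_at`, `post_stream.posts[]`); the function names and the `html_to_text` parameter are hypothetical, with the default passing `cooked` through unchanged.

```python
import json


def topic_filename(topic):
    """Build the per-topic file name, e.g. '123-your-topic-slug.md'."""
    return f"{topic['id']}-{topic['slug']}.md"


def render_topic(topic, category_name, base_url, html_to_text=lambda html: html):
    """Return the Markdown document (frontmatter plus posts) for one topic."""
    created = topic["created_at"][:10]  # keep only YYYY-MM-DD
    updated = (topic.get("last_posted_at") or topic["created_at"])[:10]
    url = f"{base_url.rstrip('/')}/t/{topic['slug']}/{topic['id']}"
    lines = [
        "---",
        f"id: {topic['id']}",
        # json.dumps yields a double-quoted string with quotes escaped.
        f"title: {json.dumps(topic['title'], ensure_ascii=False)}",
        f"slug: {topic['slug']}",
        f"category: {json.dumps(category_name, ensure_ascii=False)}",
        f"created: {created}",
        f"updated: {updated}",
        f"url: {url}",
        "---",
        "",
        f"# {topic['title']}",
        "",
        f"**Category**: {category_name}",
        "",
        "---",
        "",
    ]
    posts = sorted(topic["post_stream"]["posts"], key=lambda p: p["post_number"])
    for post in posts:
        lines += [
            f"## Post {post['post_number']} — @{post['username']} ({post['created_at']})",
            "",
            html_to_text(post["cooked"]).strip(),
            "",
            "---",
            "",
        ]
    return "\n".join(lines)
```

Posts are sorted by `post_number` so the file reads in thread order even if pages were captured out of order.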

### Deduplication rules

- If the same topic ID appears in multiple `.chlsx` files, keep the copy with the most recent `last_posted_at`.
- If a session contains page 2+ of a topic (`/t/123/2.json`), merge its posts with those from page 1, ordered by `post_number`.
- Never duplicate posts within a topic: treat `post_number` as the unique key for a post.
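One way these rules could be implemented is sketched below. It assumes `last_posted_at` values are ISO 8601 UTC strings in a consistent format, so plain string comparison orders them correctly, and that `post_number` uniquely identifies a post within a topic. The names `merge_topic` and `merge_all` are hypothetical.

```python
def merge_topic(a, b):
    """Merge two captures of the same topic into a single record."""
    # Topic-level metadata comes from the capture with the later last_posted_at.
    a_newer = (a.get("last_posted_at") or "") >= (b.get("last_posted_at") or "")
    newer, older = (a, b) if a_newer else (b, a)
    posts = {}
    for capture in (older, newer):  # newer capture overwrites older posts
        for post in capture.get("post_stream", {}).get("posts", []):
            posts[post["post_number"]] = post
    merged = dict(newer)
    merged["post_stream"] = {
        **newer.get("post_stream", {}),
        "posts": [posts[n] for n in sorted(posts)],
    }
    return merged


def merge_all(captures):
    """Fold a sequence of topic captures into {topic_id: merged_topic}."""
    topics = {}
    for topic in captures:
        tid = topic["id"]
        topics[tid] = merge_topic(topics[tid], topic) if tid in topics else topic
    return topics
```

Keying posts by `post_number` handles both cases at once: later pages add new posts, and repeated captures of the same page replace rather than duplicate them.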

### What to do with the output

Place the resulting `discourse-archive/` folder in a location I can reference in future Copilot sessions. I will point Copilot to that folder when I need to search past Discourse conversations.

### Constraints

- Do not modify the original `.chlsx` file.
- Do not upload or send the extracted data anywhere — keep it local.
- If a topic has no readable content (deleted or access-restricted), note it in the index but skip the full extraction.
- HTML in `cooked` fields should be converted to readable plain text / Markdown (Discourse stores posts as HTML in the JSON).
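The `cooked` conversion could be approximated with the standard library alone, as in the sketch below. It handles only a handful of tags (paragraphs, line breaks, list items, links, inline code) and is an illustration rather than a complete HTML-to-Markdown converter; `CookedToText` and `cooked_to_text` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class CookedToText(HTMLParser):
    """Collect readable text from Discourse 'cooked' HTML."""

    BLOCK_TAGS = {"p", "div", "blockquote", "pre", "ul", "ol", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; and friends
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "code":
            self.parts.append("`")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag == "code":
            self.parts.append("`")
        elif tag == "a":
            self.parts.append(f"]({self.href})" if self.href else "]")
            self.href = None
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def cooked_to_text(cooked):
    """Convert one 'cooked' HTML string to plain text with light Markdown."""
    parser = CookedToText()
    parser.feed(cooked)
    parser.close()
    # Collapse the runs of blank lines produced by adjacent block tags.
    return re.sub(r"\n{3,}", "\n\n", "".join(parser.parts)).strip()
```

Tags not listed here are dropped while their text is kept, which is usually enough for search purposes.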

### First action

Ask me for:
1. The path to the `.chlsx` file (or files)
2. The Discourse base URL (so you can construct canonical topic URLs)
3. Where I want the output folder created