fidelity-ai-workspace/workspaces/fidelity/inbox/discourse-archive/copilot-prompt-charles-discourse-archiver.md
david.delagneau 1ad707373a Add daily logs and templates for project fidelity
- Created daily log entries for May 13, 14, 18, 19, 20, and 21, capturing work done, findings, and next steps.
- Established a daily logs index for easy navigation of daily notes.
- Developed templates for daily logs, decisions, meeting notes, people, systems, and work items to standardize documentation.
- Introduced base files for filtering and displaying various types of project knowledge, including daily notes, decisions, people, systems, work items, and workstreams.
- Added maps for current work, fidelity apps, and fidelity domain to enhance project navigation and context.
2026-05-21 12:28:07 -06:00


type: copilot-prompt
status: ready
target: github-copilot
purpose: Parse Charles .chlsx sessions to create a searchable Discourse archive

Copilot Prompt — Charles Discourse Archiver

Paste this into GitHub Copilot on the corporate device.


Prompt

You are helping me build a local searchable archive of a Discourse forum from captured Charles Proxy session files.

Background

I browse a Discourse forum in my browser while Charles Proxy records traffic. I save the session as a .chlsx file. Inside that file are all the HTTP request/response pairs for the pages I visited — including Discourse API calls that return structured JSON (topics, posts, categories, user profiles).

I need you to extract only the Discourse content and organize it into a Markdown archive that:

  • Is searchable by an AI in future sessions
  • Preserves topic titles, post authors, dates, and content
  • Groups by category
  • Deduplicates topics that appear across multiple sessions

File format: .chlsx

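A .chlsx file is assumed here to be a ZIP archive containing numbered XML files (e.g. 00001.xml, 00/00001.xml), where each XML file represents one HTTP request/response pair with the structure below. Confirm this against a real capture before parsing; if the file is instead a single XML document, read the request/response pairs from it directly.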

<session>
  <request>
    <method>GET</method>
    <url>https://forum.example.com/t/123.json</url>
    <protocol>HTTP/1.1</protocol>
    <header name="Cookie">...</header>
    <body></body>
  </request>
  <response>
    <status>200</status>
    <protocol>HTTP/1.1</protocol>
    <header name="Content-Type">application/json; charset=utf-8</header>
    <body>{"id": 123, "title": "Some Topic", "post_stream": {...}}</body>
  </response>
  <timing>
    <start>2026-01-15T10:30:00.000Z</start>
    <duration>450</duration>
  </timing>
</session>
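
A minimal Python sketch of reading one such XML file, assuming the element names shown above (`request/url`, `response/status`, `response/header`, `response/body`, `timing/start`). The function name `parse_pair` and the returned dictionary keys are illustrative, not part of any Charles tooling, and a real capture may use a different layout.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional


def parse_pair(xml_text: str) -> Optional[dict]:
    """Return the URL, capture time, and decoded JSON body of one pair, or None."""
    root = ET.fromstring(xml_text)
    url = root.findtext("request/url")
    status = root.findtext("response/status")

    # Look up the Content-Type header case-insensitively.
    content_type = ""
    for header in root.findall("response/header"):
        if header.get("name", "").lower() == "content-type":
            content_type = header.text or ""

    # Only successful JSON responses are of interest.
    if not url or status != "200" or "json" not in content_type.lower():
        return None

    body = root.findtext("response/body") or ""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # truncated or non-JSON body despite the header

    return {
        "url": url,
        "captured_at": root.findtext("timing/start"),
        "payload": payload,
    }
```

Returning `None` for anything that is not a 200 JSON response keeps the caller's loop simple: static assets and failed requests drop out before any endpoint matching happens.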

Discourse API endpoints to extract

| What | URL pattern | JSON fields |
|---|---|---|
| Latest topics | /latest.json | topic_list.topics[].{id, title, slug, category_id, created_at, last_posted_at} |
| Category index | /categories.json | category_list.categories[].{id, name, slug} |
| Single topic (with posts) | /t/{id}.json | id, title, slug, category_id, post_stream.posts[].{username, cooked, created_at, post_number} |
| Topic with page | /t/{id}/{page}.json | Same as above, paginated |
| User activity | /u/{username}/activity.json | user_actions[] |
| Search results | /search.json?q=... | topics[], posts[] |
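
One way to match request URLs against the endpoint list is a small set of regular expressions applied to the URL path. This is a sketch under the assumption that the patterns listed above are exhaustive for the forum in question; the labels (`latest`, `topic_page`, and so on) and the `classify` function are hypothetical names chosen for illustration.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit

# (label, pattern) pairs, one per endpoint listed above.
ENDPOINT_PATTERNS = [
    ("latest", re.compile(r"^/latest\.json$")),
    ("categories", re.compile(r"^/categories\.json$")),
    ("topic", re.compile(r"^/t/(?P<id>\d+)\.json$")),
    ("topic_page", re.compile(r"^/t/(?P<id>\d+)/(?P<page>\d+)\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
    ("search", re.compile(r"^/search\.json$")),
]


def classify(url: str) -> Optional[Tuple[str, dict]]:
    """Return (label, captured path parts) for a Discourse endpoint, else None."""
    # Match on the path only, so query strings such as ?q=... are ignored.
    path = urlsplit(url).path
    for label, pattern in ENDPOINT_PATTERNS:
        match = pattern.match(path)
        if match:
            return label, match.groupdict()
    return None
```

Anything that returns `None` (CSS, JS, images, analytics) can be skipped without inspecting the response body.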

What to do

  1. Open the .chlsx file as a ZIP archive.
  2. List all XML files inside (both flat and in subdirectories).
  3. For each XML file, parse it and check if the request URL matches one of the Discourse endpoints above.
  4. Skip: CSS, JS, images, font files, analytics, CDN assets, and any non-Discourse endpoint.
  5. Parse the JSON response body from <response><body>.
  6. Create this folder structure as output:
discourse-archive/
├── categories.json          # All categories found
├── index.md                 # Master index (table of all topics with ID, title, date, category, URL)
└── topics/
    ├── 123-your-topic-slug.md
    ├── 456-another-topic.md
    └── ...
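
Steps 1, 2, and 6 could be sketched in Python as follows, assuming the .chlsx file really is a ZIP archive as described. The helper names (`iter_xml_members`, `prepare_output`, `topic_path`) are hypothetical, and the archive is opened read-only so the original capture is never modified.

```python
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_xml_members(chlsx_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (member name, XML text) for every .xml entry, flat or nested."""
    if not zipfile.is_zipfile(chlsx_path):
        raise ValueError(f"{chlsx_path} is not a ZIP archive")
    # Read mode leaves the original capture untouched.
    with zipfile.ZipFile(chlsx_path, "r") as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".xml"):
                yield name, archive.read(name).decode("utf-8", errors="replace")


def prepare_output(parent_dir: str) -> Path:
    """Create discourse-archive/ and its topics/ subfolder; return the root."""
    root = Path(parent_dir) / "discourse-archive"
    (root / "topics").mkdir(parents=True, exist_ok=True)
    return root


def topic_path(root: Path, topic_id: int, slug: str) -> Path:
    """Build the per-topic file path, e.g. topics/123-your-topic-slug.md."""
    return root / "topics" / f"{topic_id}-{slug}.md"
```

`namelist()` returns full member paths, so entries in subdirectories such as `00/00001.xml` are picked up without any recursive walk.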

Markdown format per topic

Each topic file should be a clean Markdown document with YAML frontmatter:

---
id: 123
title: "Your Topic Title"
slug: your-topic-slug
category: "Category Name"
created: 2026-01-15
updated: 2026-01-16
url: https://forum.example.com/t/your-topic-slug/123
---

# Your Topic Title

**Category**: Category Name

---

## Post 1 — @username1 (2026-01-15T10:30:00Z)

Post content here (HTML stripped, plain Markdown preferred).

---

## Post 2 — @username2 (2026-01-16T14:00:00Z)

More content.

---
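
The template above could be produced by a function along these lines. It is a sketch, not a fixed implementation: it assumes the topic dictionary carries `created_at` and `last_posted_at`, and that each post already has a `text` key holding the converted content (a hypothetical field, not one Discourse returns).

```python
import json


def render_topic(topic: dict, posts: list, category_name: str, base_url: str) -> str:
    """Render one topic as Markdown with YAML frontmatter."""
    created = topic["created_at"][:10]  # keep only YYYY-MM-DD
    updated = (topic.get("last_posted_at") or topic["created_at"])[:10]
    url = f"{base_url.rstrip('/')}/t/{topic['slug']}/{topic['id']}"
    lines = [
        "---",
        f"id: {topic['id']}",
        # json.dumps yields a double-quoted string with quotes escaped,
        # which is also a valid YAML double-quoted scalar.
        f"title: {json.dumps(topic['title'], ensure_ascii=False)}",
        f"slug: {topic['slug']}",
        f"category: {json.dumps(category_name, ensure_ascii=False)}",
        f"created: {created}",
        f"updated: {updated}",
        f"url: {url}",
        "---",
        "",
        f"# {topic['title']}",
        "",
        f"**Category**: {category_name}",
        "",
        "---",
        "",
    ]
    for post in sorted(posts, key=lambda p: p["post_number"]):
        lines += [
            # \u2014 is the dash used in the post heading of the template.
            f"## Post {post['post_number']} \u2014 @{post['username']} ({post['created_at']})",
            "",
            post["text"],
            "",
            "---",
            "",
        ]
    return "\n".join(lines)
```

Quoting the title and category through `json.dumps` avoids broken frontmatter when a title contains a colon or a double quote.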

Deduplication rules

  • If the same topic ID appears in multiple .chlsx files, keep the one with the most recent last_posted_at.
  • If a session has page 2+ of a topic (/t/123/2.json), merge those posts with the posts from page 1.
  • Never duplicate posts within a topic; use post_number to identify a post.
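
One possible reading of these rules, sketched in Python: keep a single record per topic ID, take the topic metadata from the capture with the latest `last_posted_at`, and key posts by `post_number` so that pages and repeated captures merge without duplicates. The `store` layout and the `merge_topic` name are assumptions made for this example.

```python
def merge_topic(store: dict, topic: dict) -> None:
    """Merge one captured topic payload into store, keyed by topic ID."""
    posts = {
        post["post_number"]: post
        for post in topic.get("post_stream", {}).get("posts", [])
    }
    existing = store.get(topic["id"])
    if existing is None:
        store[topic["id"]] = {"meta": topic, "posts": posts}
        return

    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    new_stamp = topic.get("last_posted_at") or ""
    old_stamp = existing["meta"].get("last_posted_at") or ""
    if new_stamp > old_stamp:
        # Newer capture: replace the metadata and overwrite overlapping posts.
        existing["meta"] = topic
        existing["posts"].update(posts)
    else:
        # Same age or older: only add posts that are not stored yet.
        for number, post in posts.items():
            existing["posts"].setdefault(number, post)
```

Because posts live in a dictionary keyed by `post_number`, feeding the same page twice, or page 1 and page 2 in any order, leaves exactly one copy of each post.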

What to do with the output

Place the resulting discourse-archive/ folder in a location I can reference in future Copilot sessions. I will point Copilot to that folder when I need to search past Discourse conversations.

Constraints

  • Do not modify the original .chlsx file.
  • Do not upload or send the extracted data anywhere — keep it local.
  • If a topic has no readable content (deleted or access-restricted), list it in the index but do not create a topic file for it.
  • Convert the HTML in cooked fields to readable plain text or Markdown (Discourse returns post content as HTML in the JSON).
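
For the HTML conversion, a standard-library sketch is shown below. It handles only paragraphs, line breaks, list items, links, and inline code, which is an assumption about what typical posts contain; quotes, images, tables, and fenced code would need more handling or a dedicated converter. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class CookedToText(HTMLParser):
    """Collect readable text from a post's cooked HTML."""

    OPEN_BREAK = {"p", "div", "br", "blockquote", "pre", "h1", "h2", "h3", "h4"}
    CLOSE_BREAK = {"p", "div", "blockquote", "pre", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n- ")
        elif tag in self.OPEN_BREAK:
            self.parts.append("\n")
        elif tag == "code":
            self.parts.append("`")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            if self.href:
                self.parts.append(f"]({self.href})")
            self.href = None
        elif tag == "code":
            self.parts.append("`")
        elif tag in self.CLOSE_BREAK:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def cooked_to_text(cooked: str) -> str:
    """Convert cooked HTML to plain text with light Markdown for links and code."""
    parser = CookedToText()
    parser.feed(cooked)
    parser.close()
    kept = []
    for line in "".join(parser.parts).splitlines():
        line = line.rstrip()
        # Drop leading blank lines and collapse runs of blank lines to one.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

For example, `cooked_to_text('<p>Hello <a href="https://example.com">world</a></p>')` returns `Hello [world](https://example.com)`.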

First action

Ask me for:

  1. The path to the .chlsx file (or files)
  2. The Discourse base URL (so you can construct canonical topic URLs)
  3. Where I want the output folder created