feat: add tooling and documentation for archiving Discourse content via Charles Proxy .chlsx sessions.

2026-05-19 14:28:45 -06:00
parent f726814811
commit 73166b585f
8 changed files with 322 additions and 3 deletions
--- a/ai/discourse-archive/charles-session-format.md
+++ b/ai/discourse-archive/charles-session-format.md
@@ -0,0 +1,104 @@
+# Charles Session File Format (.chlsx)
+
+Reference for AI agents that need to parse Charles Proxy session files.
+
+---
+
+## File Format
+
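+`.chlsx` is treated here as a **ZIP archive** containing numbered XML files, each representing one HTTP request/response pair. Confirm this against a real capture before relying on it: Charles labels `.chlsx` as "XML Session File" in its export dialog, and a file exported that way may be a single XML document rather than an archive.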
+
+### Structure inside the ZIP
+
+```
+session.chlsx
+├── 00001.xml
+├── 00002.xml
+├── 00003.xml
+├── ...
+└── 00/
+    ├── 00001.xml
+    ├── 00002.xml
+    └── ...
+```
+
+The tree above shows both layouts side by side for illustration. In a given archive, files are either flat at the root or grouped in two-digit subdirectories (`00/`, `01/`, etc.), depending on session size.
+
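A minimal sketch of listing the entries, assuming the ZIP layout described above. `list_entry_names` is a hypothetical helper, not part of Charles or of this commit's tooling. It relies only on Python's standard `zipfile` module, accepts both the flat and the subdirectory layout, and refuses files that are not ZIP archives at all.

```python
import io
import zipfile


def list_entry_names(chlsx_file):
    """Return the XML member names of a session archive, sorted by name."""
    # Guard: fail loudly if the file does not have the assumed ZIP layout.
    if not zipfile.is_zipfile(chlsx_file):
        raise ValueError("not a ZIP archive; the file may use a different layout")
    with zipfile.ZipFile(chlsx_file) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".xml")
        )


# Demo: an in-memory archive that mixes both layouts plus one unrelated file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00001.xml", "<session/>")
    archive.writestr("00/00002.xml", "<session/>")
    archive.writestr("notes.txt", "ignored")

print(list_entry_names(buffer))
```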
+### XML Structure Per File
+
+Each XML file contains:
+
+- **Request**: method, URL, protocol, headers, body
+- **Response**: status, protocol, headers, body
+- **Timing**: start time, duration
+
+Key XML elements:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<session>
+  <request>
+    <method>GET</method>
+    <url>https://discourse.example.com/t/123.json</url>
+    <protocol>HTTP/1.1</protocol>
+    <header name="Accept">application/json</header>
+    <header name="Cookie">_t=abc123</header>
+    <body></body>
+  </request>
+  <response>
+    <status>200</status>
+    <protocol>HTTP/1.1</protocol>
+    <header name="Content-Type">application/json; charset=utf-8</header>
+    <body>{"id": 123, "title": "...", "post_stream": {...}}</body>
+  </response>
+  <timing>
+    <start>2026-01-15T10:30:00.000Z</start>
+    <duration>450</duration>
+  </timing>
+</session>
+```
+
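As a rough illustration, one entry in the layout shown above could be read with the standard `xml.etree.ElementTree` module. `parse_entry` and the keys of the dictionary it returns are hypothetical names chosen for this sketch, and the element names are taken from the example rather than from any Charles specification.

```python
import json
import xml.etree.ElementTree as ET


def parse_entry(xml_text):
    """Extract the fields of interest from one request/response XML entry."""
    root = ET.fromstring(xml_text)
    request = root.find("request")
    response = root.find("response")
    if request is None or response is None:
        return None  # entry does not follow the assumed layout
    # Header names are matched case-insensitively, as in HTTP.
    content_type = next(
        (h.text or "" for h in response.findall("header")
         if h.get("name", "").lower() == "content-type"),
        "",
    )
    return {
        "method": request.findtext("method", default=""),
        "url": request.findtext("url", default=""),
        "status": int(response.findtext("status", default="0")),
        "content_type": content_type,
        "body": response.findtext("body", default=""),
    }


# Demo entry, trimmed from the example above.
SAMPLE = (
    "<session>"
    "<request><method>GET</method>"
    "<url>https://discourse.example.com/t/123.json</url></request>"
    "<response><status>200</status>"
    '<header name="Content-Type">application/json; charset=utf-8</header>'
    '<body>{"id": 123, "title": "Hello"}</body></response>'
    "</session>"
)

entry = parse_entry(SAMPLE)
topic = json.loads(entry["body"])
print(entry["method"], entry["status"], topic["title"])
```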
+### .chls vs .chlsx vs .chlsj
+
+| Extension | Format | Notes |
+|---|---|---|
+| `.chls` | Binary | Charles's native save format; harder to parse |
+| `.chlsx` | ZIP + XML | **Prefer this**. Readable with standard ZIP and XML libraries |
+| `.chlsj` | JSON | Less common; each session is one JSON file with an array of request/response objects |
+
+**Recommendation**: Save or export sessions from Charles as `.chlsx` (File → Export Session..., then choose the XML Session File type; the menu path can differ between Charles versions).
+
+---
+
+## Discourse API Endpoints to Look For
+
+These are the endpoints worth extracting from a Charles session:
+
+| Purpose | URL pattern | Parsing target |
+|---|---|---|
+| Topic feed | `/latest.json` | `topic_list.topics[]` |
+| Category topics | `/c/{slug}.json` (newer Discourse versions: `/c/{slug}/{id}.json`) | `topic_list.topics[]` |
+| Single topic | `/t/{id}.json` | Topic metadata plus posts in `post_stream.posts[]` |
+| Posts in topic | `/t/{id}.json?page={n}` | Paginated `post_stream.posts[]` |
+| Search | `/search.json?q=...` | `topics[]`, `posts[]` |
+| User activity | `/u/{username}/activity.json` | User posts/topics |
+
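One way to apply the table is to match the URL path against a small list of regular expressions. The sketch below is an assumption-laden illustration: `ENDPOINT_PATTERNS`, the label strings, and `classify` are hypothetical names, and the patterns cover only the URL shapes listed in the table (real captures may also contain slugged topic URLs, which are not handled here).

```python
import re
from urllib.parse import urlsplit

# (label, pattern) pairs; the first match wins. Labels are illustrative only.
ENDPOINT_PATTERNS = [
    ("topic_feed", re.compile(r"^/latest\.json$")),
    ("category_topics", re.compile(r"^/c/.+\.json$")),
    ("single_topic", re.compile(r"^/t/\d+\.json$")),
    # The /t/{id}/{n}.json form, in case a capture uses a path segment
    # instead of a ?page= query string.
    ("topic_posts", re.compile(r"^/t/\d+/\d+\.json$")),
    ("search", re.compile(r"^/search\.json$")),
    ("user_activity", re.compile(r"^/u/[^/]+/activity\.json$")),
]


def classify(url):
    """Return the endpoint label for a URL, or None if it is not of interest."""
    path = urlsplit(url).path  # the query string is ignored for matching
    for label, pattern in ENDPOINT_PATTERNS:
        if pattern.match(path):
            return label
    return None


for url in (
    "https://discourse.example.com/latest.json",
    "https://discourse.example.com/t/123.json?page=2",
    "https://discourse.example.com/assets/app.js",
):
    print(url, "->", classify(url))
```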
+---
+
+## Extraction Strategy for AI
+
+1. **Open the `.chlsx` as a ZIP** (it is not encrypted)
+2. **Iterate over all XML files** inside
+3. For each XML, check if the request URL matches a Discourse API endpoint
+4. Extract the JSON response body from `<response><body>`
+5. Parse the JSON and convert to Markdown
+6. Organize by topic ID + title for easy search
+
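The steps above can be sketched end to end as follows. This is a hedged illustration, not the tooling added in this commit: it assumes the ZIP and XML layout described earlier, handles only `/t/{id}.json` responses, and uses hypothetical helper names (`extract_topics`, `topic_to_markdown`, `_demo_entry`). The `username` and `cooked` fields come from Discourse's topic JSON; `cooked` is rendered HTML and is copied through unchanged rather than converted.

```python
import io
import json
import re
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import urlsplit

TOPIC_PATH = re.compile(r"^/t/\d+\.json$")


def topic_to_markdown(topic):
    """Render a topic dict as a simple Markdown document."""
    lines = [f"# {topic.get('title', 'Untitled')}", ""]
    for post in topic.get("post_stream", {}).get("posts", []):
        lines += [f"## {post.get('username', 'unknown')}", "",
                  post.get("cooked", ""), ""]
    return "\n".join(lines)


def extract_topics(chlsx_file):
    """Map (topic ID, title) to Markdown for every topic response found."""
    topics = {}
    with zipfile.ZipFile(chlsx_file) as archive:          # step 1
        for name in sorted(archive.namelist()):           # step 2
            if not name.lower().endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            url = root.findtext("request/url", default="")
            if not TOPIC_PATH.match(urlsplit(url).path):  # step 3
                continue
            body = root.findtext("response/body", default="")  # step 4
            try:
                topic = json.loads(body)                  # step 5
            except json.JSONDecodeError:
                continue  # not JSON after all; skip it
            if not isinstance(topic, dict) or "id" not in topic:
                continue
            key = (topic["id"], topic.get("title", ""))   # step 6
            topics[key] = topic_to_markdown(topic)
    return topics


def _demo_entry(url, body):
    """Demo-only helper: build one XML entry in the assumed layout."""
    session = ET.Element("session")
    ET.SubElement(ET.SubElement(session, "request"), "url").text = url
    ET.SubElement(ET.SubElement(session, "response"), "body").text = body
    return ET.tostring(session)


demo_topic = {"id": 123, "title": "Hello", "post_stream": {
    "posts": [{"username": "alice", "cooked": "<p>Hi</p>"}]}}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00001.xml", _demo_entry(
        "https://discourse.example.com/t/123.json", json.dumps(demo_topic)))
    archive.writestr("00002.xml", _demo_entry(
        "https://discourse.example.com/logo.png", "binary data"))

archive_topics = extract_topics(buffer)
print(archive_topics[(123, "Hello")])
```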
+---
+
+## Common Pitfalls
+
+- Some responses are paginated (`/t/{id}.json?page=1`). Collect all pages for completeness.
+- Binary responses (images, JS bundles) should be skipped.
+- The same topic may appear multiple times across Charles sessions; deduplicate by topic ID and keep the copy with the most recent last-updated timestamp.
+- Session cookies captured in Charles have usually expired by the time the AI reads them. Treat them as sensitive regardless and leave them out of the archive; only the response data matters.
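
The first three pitfalls can be handled with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical, `last_posted_at` stands in for the "last updated" timestamp (substitute whichever field your captures carry), and timestamps are assumed to be ISO 8601 strings in one consistent format so that plain string comparison orders them correctly.

```python
def is_json_response(content_type):
    """True for JSON responses; images, JS bundles, etc. are skipped."""
    return content_type.split(";")[0].strip().lower() == "application/json"


def merge_pages(pages):
    """Combine the posts of several pages of one topic, without duplicates."""
    posts = {}
    for page in pages:
        for post in page.get("post_stream", {}).get("posts", []):
            posts[post["id"]] = post
    return [posts[post_id] for post_id in sorted(posts)]


def dedupe_topics(topics, timestamp_field="last_posted_at"):
    """Keep one copy per topic ID, preferring the newest timestamp."""
    newest = {}
    for topic in topics:
        current = newest.get(topic["id"])
        stamp = topic.get(timestamp_field) or ""
        if current is None or stamp > (current.get(timestamp_field) or ""):
            newest[topic["id"]] = topic
    return list(newest.values())


captured = [
    {"id": 1, "title": "A", "last_posted_at": "2026-01-10T00:00:00.000Z"},
    {"id": 1, "title": "A (edited)", "last_posted_at": "2026-01-15T00:00:00.000Z"},
    {"id": 2, "title": "B", "last_posted_at": "2026-01-12T00:00:00.000Z"},
]
unique = dedupe_topics(captured)
print([topic["title"] for topic in unique])

pages = [
    {"post_stream": {"posts": [{"id": 1}, {"id": 2}]}},
    {"post_stream": {"posts": [{"id": 2}, {"id": 3}]}},
]
merged = merge_pages(pages)
print([post["id"] for post in merged])
```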