feat: add tooling and documentation for archiving Discourse content via Charles Proxy .chlsx sessions.
ai/discourse-archive/charles-session-format.md
@@ -0,0 +1,104 @@
# Charles Session File Format (.chlsx)

Reference for AI agents that need to parse Charles Proxy session files.

---

## File Format

`.chlsx` is a **ZIP archive** containing numbered XML files, each representing one HTTP request/response pair.
### Structure inside the ZIP

```
session.chlsx
├── 00001.xml
├── 00002.xml
├── 00003.xml
├── ...
└── 00/
    ├── 00001.xml
    ├── 00002.xml
    └── ...
```

Files may be flat at the root or grouped in two-digit subdirectories (`00/`, `01/`, etc.) depending on session size.
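Because files can sit either at the root or inside subdirectories, a reader should not hard-code one layout. The sketch below lists every `.xml` member regardless of depth. It assumes the archive layout described in this document, which is worth confirming against a real export; the function name `list_transaction_files` and the demo archive contents are illustrative, not part of Charles.

```python
import io
import zipfile


def list_transaction_files(chlsx):
    """Return the names of all XML members in a session archive, sorted.

    `chlsx` may be a path or a seekable file-like object. Members are
    matched by suffix only, so flat and `00/`-style layouts both work.
    """
    with zipfile.ZipFile(chlsx) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".xml")]
    # Zero-padded names sort correctly with a plain lexicographic sort.
    return sorted(names)


# Demo: an in-memory archive mixing the flat and subdirectory layouts.
demo_archive = io.BytesIO()
with zipfile.ZipFile(demo_archive, "w") as archive:
    archive.writestr("00002.xml", "<session/>")
    archive.writestr("00001.xml", "<session/>")
    archive.writestr("00/00001.xml", "<session/>")
    archive.writestr("notes.txt", "not a transaction")
demo_archive.seek(0)
demo_names = list_transaction_files(demo_archive)
print(demo_names)
```

Matching on the suffix rather than on a fixed directory depth keeps the same code usable for small and large sessions.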
### XML Structure Per File

Each XML file contains:

- **Request**: method, URL, protocol, headers, body
- **Response**: status, protocol, headers, body
- **Timing**: start time, duration

Key XML elements:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<session>
  <request>
    <method>GET</method>
    <url>https://discourse.example.com/t/123.json</url>
    <protocol>HTTP/1.1</protocol>
    <header name="Accept">application/json</header>
    <header name="Cookie">_t=abc123</header>
    <body></body>
  </request>
  <response>
    <status>200</status>
    <protocol>HTTP/1.1</protocol>
    <header name="Content-Type">application/json; charset=utf-8</header>
    <body>{"id": 123, "title": "...", "post_stream": {...}}</body>
  </response>
  <timing>
    <start>2026-01-15T10:30:00.000Z</start>
    <duration>450</duration>
  </timing>
</session>
```
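One way to turn a file shaped like the example above into a plain dictionary is sketched here with the standard library's `xml.etree.ElementTree`. The element names (`session`, `request`, `response`, `timing`, `header`, `body`) are taken from the example in this document and should be checked against a real file; `parse_transaction` and `_section` are hypothetical helper names.

```python
import json
import xml.etree.ElementTree as ET


def _section(node):
    """Flatten one <request>, <response>, or <timing> element into a dict."""
    fields = {"headers": {}}
    if node is None:
        return fields
    for child in node:
        if child.tag == "header":
            # Headers carry their name in an attribute, their value as text.
            fields["headers"][child.get("name", "")] = child.text or ""
        else:
            fields[child.tag] = child.text or ""
    return fields


def parse_transaction(xml_bytes):
    """Parse one transaction file (layout assumed from this document)."""
    root = ET.fromstring(xml_bytes)
    return {
        "request": _section(root.find("request")),
        "response": _section(root.find("response")),
        "timing": _section(root.find("timing")),
    }


# Demo input: same shape as the example above, with a valid JSON body.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<session>
  <request>
    <method>GET</method>
    <url>https://discourse.example.com/t/123.json</url>
    <header name="Accept">application/json</header>
    <body></body>
  </request>
  <response>
    <status>200</status>
    <header name="Content-Type">application/json; charset=utf-8</header>
    <body>{"id": 123, "title": "Hello"}</body>
  </response>
  <timing>
    <start>2026-01-15T10:30:00.000Z</start>
    <duration>450</duration>
  </timing>
</session>
"""

parsed = parse_transaction(SAMPLE)
topic = json.loads(parsed["response"]["body"])
print(parsed["request"]["method"], parsed["response"]["status"], topic["id"])
```

All values come back as strings, so callers convert `status` and `duration` to integers themselves if they need numbers.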
### .chls vs .chlsx vs .chlsj

| Extension | Format | Notes |
|---|---|---|
| `.chls` | Binary | Legacy format, harder to parse |
| `.chlsx` | ZIP + XML | **Prefer this**. Most common modern format |
| `.chlsj` | JSON | Newer, less common; each session is one JSON file with an array of request/response objects |

**Recommendation**: Configure Charles to save as `.chlsx` (File → Save Session As... → choose `.chlsx`).

---
## Discourse API Endpoints to Look For

These are the endpoints worth extracting from a Charles session:

| Purpose | URL pattern | Parsing target |
|---|---|---|
| Topic feed | `/latest.json` | `topic_list.topics[]` |
| Category topics | `/c/{slug}.json` | `topic_list.topics[]` |
| Single topic | `/t/{id}.json` | The full topic with posts |
| Posts in topic | `/t/{id}/{page}.json` | Paginated posts |
| Search | `/search.json?q=...` | `topics[]`, `posts[]` |
| User activity | `/u/{username}/activity.json` | User posts/topics |
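The URL patterns in the table can be matched with a small list of regular expressions applied to the URL path. This is a minimal sketch: the labels (`topic_feed`, `single_topic`, and so on) and the function name `classify_url` are made up for illustration, and the patterns cover only the forms listed in the table. Query strings such as `?q=...` or `?page=1` are ignored because only the path is matched.

```python
import re
from urllib.parse import urlsplit

# (label, compiled pattern) pairs; labels are hypothetical names.
ENDPOINT_PATTERNS = [
    ("topic_feed", re.compile(r"^/latest\.json$")),
    ("category_topics", re.compile(r"^/c/.+\.json$")),
    ("topic_posts_page", re.compile(r"^/t/(?P<id>\d+)/(?P<page>\d+)\.json$")),
    ("single_topic", re.compile(r"^/t/(?P<id>\d+)\.json$")),
    ("search", re.compile(r"^/search\.json$")),
    ("user_activity", re.compile(r"^/u/(?P<username>[^/]+)/activity\.json$")),
]


def classify_url(url):
    """Return (label, captured_fields) for a known endpoint, else None."""
    path = urlsplit(url).path
    for label, pattern in ENDPOINT_PATTERNS:
        match = pattern.match(path)
        if match:
            return label, match.groupdict()
    return None


# Demo: one match per kind of result.
base = "https://discourse.example.com"
print(classify_url(base + "/t/123.json"))
print(classify_url(base + "/t/123/2.json"))
print(classify_url(base + "/search.json?q=zip"))
print(classify_url(base + "/assets/app.js"))
```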
---

## Extraction Strategy for AI

1. **Open the `.chlsx` as a ZIP** (it is not encrypted)
2. **Iterate over all XML files** inside
3. For each XML, check if the request URL matches a Discourse API endpoint
4. Extract the JSON response body from `<response><body>`
5. Parse the JSON and convert to Markdown
6. Organize by topic ID + title for easy search
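The six steps above can be sketched end to end as follows. This is an illustration under several assumptions rather than a finished tool: it handles only the single-topic endpoint, it relies on the XML layout shown in this document, it reads the `post_stream.posts[]`, `username`, and `cooked` fields from the topic JSON, and it copies `cooked` (rendered HTML) through unchanged instead of converting it to real Markdown. The names `extract_topics` and `topic_to_markdown` are hypothetical.

```python
import io
import json
import re
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import urlsplit
from xml.sax.saxutils import escape

TOPIC_PATH = re.compile(r"^/t/(\d+)\.json$")  # single-topic endpoint only


def topic_to_markdown(topic):
    """Step 5: render a topic dict as a Markdown-style document."""
    lines = [f"# {topic.get('title', 'Untitled')}", ""]
    for post in topic.get("post_stream", {}).get("posts", []):
        lines += [f"## {post.get('username', 'unknown')}", ""]
        # `cooked` is HTML; a real converter would go here (assumption).
        lines += [post.get("cooked", ""), ""]
    return "\n".join(lines).rstrip() + "\n"


def extract_topics(chlsx):
    """Steps 1-6: return {(topic_id, title): markdown} for one session."""
    topics = {}
    with zipfile.ZipFile(chlsx) as archive:                 # step 1
        for name in sorted(archive.namelist()):             # step 2
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # skip malformed files instead of aborting
            url = root.findtext("request/url", default="")
            if not TOPIC_PATH.match(urlsplit(url).path):    # step 3
                continue
            body = root.findtext("response/body", default="")  # step 4
            try:
                topic = json.loads(body)                    # step 5
            except json.JSONDecodeError:
                continue
            if not isinstance(topic, dict):
                continue
            key = (topic.get("id"), topic.get("title", ""))  # step 6
            topics[key] = topic_to_markdown(topic)
    return topics


# Demo: one topic response and one unrelated asset request.
payload = json.dumps({
    "id": 123,
    "title": "Hello",
    "post_stream": {"posts": [{"username": "alice", "cooked": "<p>Hi</p>"}]},
})
topic_xml = (
    "<session><request><url>https://discourse.example.com/t/123.json</url>"
    f"</request><response><body>{escape(payload)}</body></response></session>"
)
asset_xml = (
    "<session><request><url>https://discourse.example.com/assets/app.js</url>"
    "</request><response><body>binary</body></response></session>"
)
demo_session = io.BytesIO()
with zipfile.ZipFile(demo_session, "w") as archive:
    archive.writestr("00001.xml", topic_xml)
    archive.writestr("00002.xml", asset_xml)
demo_session.seek(0)
demo_topics = extract_topics(demo_session)
print(demo_topics[(123, "Hello")])
```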
---

## Common Pitfalls

- Some responses are paginated (for example `/t/{id}.json?page=1`, in addition to the `/t/{id}/{page}.json` form). Collect all pages for completeness.
- Non-JSON responses (images, JS bundles) should be skipped.
- The same topic may appear in several Charles sessions; deduplicate by topic ID and keep the copy with the latest last-updated timestamp.
- Session cookies captured in Charles have usually expired by the time the AI reads them; only the response data matters.
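Two of these pitfalls lend themselves to small helpers, sketched below. The choice of `last_posted_at` as the "last updated" field is an assumption (a capture may expose a different timestamp field), and the comparison relies on all timestamps being ISO 8601 strings in the same format and time zone so that string order matches time order. The function names are illustrative.

```python
def is_json_response(headers):
    """Return True when the Content-Type header says application/json.

    Header names are compared case-insensitively; parameters such as
    `charset=utf-8` are ignored.
    """
    for name, value in headers.items():
        if name.lower() == "content-type":
            return value.split(";", 1)[0].strip().lower() == "application/json"
    return False


def dedupe_topics(topics, timestamp_field="last_posted_at"):
    """Keep one dict per topic ID, preferring the newest timestamp."""
    newest = {}
    for topic in topics:
        current = newest.get(topic["id"])
        stamp = topic.get(timestamp_field) or ""
        if current is None or stamp > (current.get(timestamp_field) or ""):
            newest[topic["id"]] = topic
    return list(newest.values())


# Demo: topic 1 was captured twice, in two different sessions.
captured = [
    {"id": 1, "title": "Old copy", "last_posted_at": "2026-01-10T08:00:00.000Z"},
    {"id": 2, "title": "Other", "last_posted_at": "2026-01-12T09:00:00.000Z"},
    {"id": 1, "title": "New copy", "last_posted_at": "2026-01-15T10:30:00.000Z"},
]
unique = dedupe_topics(captured)
print([t["title"] for t in unique])
print(is_json_response({"Content-Type": "application/json; charset=utf-8"}))
```

Filtering on `Content-Type` before calling `json.loads` avoids spending time on image and script bodies, and deduplicating after parsing keeps the newest capture of each topic.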