# Xu Ben Source Map v0.1

This map defines the source layers for the Xu Ben / civilisation-diagnostician turn project. It is a planning document, not a completed bibliography.

## Source Layers

| Source | Layer | Known / Expected Range | Current Status | Metadata Quality | Crawl Suitability | Research Function | Limitation |
|---|---|---:|---|---|---|---|---|
| 爱思想徐贲专栏 | Mainland authorised/reposted archive | 2000s-2026 | **89 body-level records** (all genres); 457 index links (snapshot 2026-06-16) | Good article URLs, titles, categories, dates after fetch | Moderate; rate-limited, must crawl slowly | Mainland-circulable long-run archive; first test of institutional vs subjectivity vocabulary | Platform/archive selection bias; not all Xu Ben writing; body pages contain template pollution |
| 财新博客 | Mainland blog/archive | 2011-2020 | **290 posts** confirmed (xuben.blog.caixin.com, surveyed 2026-06-17). Peak: 2012 (89), 2014 (65). Last post ~2020 | Good dates/titles/URLs | Likely good; public blog | Older mainland-facing material; blog genre differs from Aisixiang academic reprints | Blog stopped ~2020; does not cover AI-era period |
| 新京报书评周刊 | Mainland media/book-review layer | 2019-2026, especially 2025-2026 AI cluster | Known recent Xu Ben AI article cluster | Good dates/titles if pages accessible | Low-to-moderate; media pages may change | Media-publishing matrix; how book-review media packages AI/humanism/civilisation questions | Not author archive; heavily shaped by interview/promotion format |
| 澎湃思想市场 | Mainland media/book-review layer | 2010s-2026 | Small number confirmed (surveyed 2026-06-17): a digital-age interview (数码时代访谈), a World Book Day interview (世界读书日专访), and a memory-and-witness essay (记忆见证文) | Good dates/titles | Low-to-moderate | Compare media framing of public intellectuals; cohort-level discourse | Few articles; interview/promotion format |
| 上海书评 | Mainland / semi-mainland review layer | 2010s-2026 | Not mapped | Likely good | Unknown | Comparable book-review/interview corpus | May have paywall or site search limits |
| 端传媒 | Overseas/lower-censorship layer | 2015-2021+ | **Not found** in web search (surveyed 2026-06-17); likely paywalled/not indexed | Unknown | Low; paywall | Test whether institutional vocabulary persists outside mainland | Paywalled; may require manual access |
| RFA / VOA interviews or essays | Overseas/lower-censorship layer | scattered | **Not found** in web search (surveyed 2026-06-17) | Good dates if found | Good for metadata | Lower-censorship political vocabulary comparison | No confirmed Xu Ben content |
| 纵览中国 | Overseas/lower-censorship layer | unknown | **Not confirmed** (surveyed 2026-06-17); Wikipedia lists Xu Ben but site search inconclusive | Unknown | Unknown | Possible archive of institution/political commentary | Need manual site search |
| 独立中文笔会 | Overseas/lower-censorship layer | unknown | Not mapped | Unknown | Unknown | Possible literary/public-intellectual archive | May contain reposts rather than original publication |
| 中国数字时代 | Censorship archive layer | 2010s-2026 | **~520 articles** (52 pages on the CDT tag page; surveyed 2026-06-17) | Good enough for archive metadata | Metadata first; body only if needed | **Largest known Xu Ben corpus outside Aisixiang**; tracks deleted/sensitive discourse | Selection bias by censorship/sensitivity; many entries may be reposts. Direct fetch returns 403; use web search |
| 中国数字空间 / 404文库 | Censorship archive layer | scattered | Not mapped | Good if entries exist | Metadata first | Identify texts or themes that survive as censorship archive records | Not directly comparable with normal platforms |
| Book titles, TOCs, interviews, publisher pages | Publication layer | 2005-2026 | Partially known from seed note | Good for titles/dates | Manual / metadata only | Track self-positioning: researcher -> educator -> civilisation diagnostician | Cannot substitute for text-level analysis |
| Early academic papers / English monographs | Academic baseline | 1990s-2000s | Not mapped | Good if library records available | Manual / metadata only | Establish original problem-space and method baseline | Different language/genre from Chinese public writing |
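
Several rows above call for slow, rate-limited crawling. The following is a minimal sketch of a request throttle, assuming a fixed minimum gap between requests. The class name `PoliteThrottle` and the 5-second interval in the usage note are illustrative assumptions, not values taken from any platform's documented limits.

```python
import time
from typing import Callable, Optional


class PoliteThrottle:
    """Enforce a minimum gap between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock    # injectable so the logic can be tested offline
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the last allowed request

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A crawler would call `PoliteThrottle(5.0).wait()` before each fetch. The first call returns immediately, and later calls sleep only for whatever part of the interval has not yet elapsed.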

## How To Use This Map

Do not combine sources into one pooled corpus until each layer's function is clear.

Use source layers as comparisons:

- **Aisixiang vs overseas sources**: tests whether differences reflect platform filtering or author-level change.
- **Aisixiang vs CDT/404**: tests whether institutional language disappears from mainland visibility and survives only in censorship archives.
- **Media/book-review layer vs author archives**: tests whether "civilisation diagnosis" is platform-amplified by interview and promotion formats.
- **Books vs essays/interviews**: tests whether the author's self-positioning changed in long-form publication, not just media output.
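
One way to keep layers separate while still comparing them is to tag every record with its layer and compute statistics per layer instead of over a pooled corpus. The sketch below assumes a simple "share of records mentioning any term" measure. The `Record` type, the layer labels, and the term list are hypothetical placeholders, not the project's actual vocabulary lists.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence


@dataclass(frozen=True)
class Record:
    layer: str  # e.g. "aisixiang" or "cdt"; labels are hypothetical
    text: str


def term_rate_by_layer(
    records: Iterable[Record], terms: Sequence[str]
) -> Dict[str, float]:
    """Return {layer: share of records that mention at least one term}."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec.layer] += 1
        if any(term in rec.text for term in terms):
            hits[rec.layer] += 1
    return {layer: hits[layer] / totals[layer] for layer in totals}


def compare(rates: Dict[str, float], layer_a: str, layer_b: str) -> float:
    """Difference in rates between two layers (layer_a minus layer_b)."""
    return rates[layer_a] - rates[layer_b]
```

Because each rate is computed within a layer, a large layer such as CDT cannot swamp a small one. `compare(rates, "cdt", "aisixiang")` then expresses one of the pairwise tests listed above as a single number.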

## Survey Log

- 2026-06-17 (cc): web search survey of all platforms. Results updated in table above.
- Key finding: **CDT (~520 articles) and 财新博客 (290 posts)** are the two largest unmapped sources. Aisixiang (457 index links, 89 body-level records) is the only completed first-round source.
- Overseas sources (端传媒, RFA, 纵览中国) returned no confirmed results via web search. They may require manual site searches, or they may genuinely hold no Xu Ben content.
- Note (from 猪猪): `site:` web search results are often incomplete compared to actual site content; platform-native search functions are more reliable where available.

## Next Mapping Tasks

1. ~~Search each source for "徐贲" and record result counts~~ — done (2026-06-17). See survey log above.
2. **Priority 1**: 财新博客 (290 posts, 2011-2020) — design index scraper; same-layer comparison with Aisixiang. Blog format = shorter, more frequent, different genre.
3. **Priority 2**: CDT (~520 articles) — cannot crawl directly (403); need alternative access strategy (manual sampling, web search extraction, or CDT API if available). This is the cross-platform comparison core.
4. **Lower priority**: 澎湃, 新京报 — small number of articles; useful for media-framing analysis but not for corpus-level comparison.
5. **Deferred**: 端传媒, RFA, 纵览中国 — no confirmed content found; revisit if manual search reveals material.
6. Record URL patterns and whether pages expose stable dates.
7. Mark whether each source is original publication, authorised repost, interview, censorship archive, or book metadata.
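
Tasks 6 and 7 can be captured in a small per-source record. The sketch below is one possible shape, assuming the five source types named in task 7. The field names, the `is_mapped` rule, and the example values are illustrative assumptions, not an agreed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceType(Enum):
    ORIGINAL_PUBLICATION = "original publication"
    AUTHORISED_REPOST = "authorised repost"
    INTERVIEW = "interview"
    CENSORSHIP_ARCHIVE = "censorship archive"
    BOOK_METADATA = "book metadata"


@dataclass
class SourceEntry:
    name: str
    layer: str
    source_type: SourceType
    url_pattern: Optional[str] = None    # None until a pattern is recorded
    stable_dates: Optional[bool] = None  # None means "not yet checked"

    def is_mapped(self) -> bool:
        """A source counts as mapped once both open questions are answered."""
        return self.url_pattern is not None and self.stable_dates is not None


# Illustrative entry only; the URL pattern is a placeholder, not a real one.
entry = SourceEntry(
    name="example source",
    layer="Mainland blog/archive",
    source_type=SourceType.ORIGINAL_PUBLICATION,
)
entry.url_pattern = "https://example.org/post/{id}"
entry.stable_dates = True
```

Using `None` for unanswered fields keeps "not yet checked" distinct from "checked and false", which matters while most sources are still unmapped.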
