
Companion resource

Corpus Building Wizard

Answer seven questions. Get a starter kit for Claude Code or Codex that turns a folder of source files into an analysis-ready text corpus.

Steven Denney, Leiden University. Companion to the Building a Corpus primer on Thesis & Research Supervision.

Before you start: assumptions and useful links

This is the computational path for corpus building, not a beginner's introduction. It assumes you've already thought about what corpus you want and why. Both OCR-heavy work and non-PDF source extraction run through the relevant skills.

For the conceptual primer (what a corpus is, planning, selection criteria, ethics, organizing files), read Building a Corpus on Thesis & Research Supervision.

New to Claude Code or Codex? The README has a curated resource list: Anthropic and OpenAI docs plus practitioner guides.

Rather skip the form and see what it produces? Start with a student-scale example. Either way, the starter kit has three parts:

1. A route: API, ALICE/HPC, local GPU, or text extraction first.
2. An AI handoff: a Claude Code or Codex prompt already filled with your project details.
3. A corpus check: quality gates before you scale up or cite the result.

Tell me about your corpus

Use rough answers. The starter kit is meant to begin an inspectable workflow, not decide the method for you.

Compute access: ALICE, LUCDH, and the other options

ALICE

Leiden's HPC cluster. A40 (48 GB) and A100 (80 GB) GPUs, free for students and staff with a research case.

Request an account through Leiden IT; approval takes a few days. ALICE wiki →

LUCDH Digital Lab

Walk-in workstations in the Huizinga building (room 0.09). Two desktops with discrete GPUs for local work, plus specialty kit like Transkribus imaging and VR.

Open 10–17 during semesters, no reservation needed for the general space. Email digital-lab@hum.leidenuniv.nl for specialty equipment. Digital Lab page →

Local files aren't retained between sessions; use external or cloud storage.

LUCDH AI Lab

Deeper compute for deep learning and HPC, granted on request.

Contact Jelena Prokic if your project outgrows a single workstation. AI Lab page →

Different SLURM cluster

Non-Leiden HPC: your host institution or a national cluster.

The alice-vllm-deploy skill covers the patterns; most job scripts port cleanly with minor edits to partition names and module loads.
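
In practice, "minor edits" usually means swapping a few `#SBATCH` header lines and the module loads. A minimal sketch of the lines that typically differ between sites; the partition, GRES, and module names here are illustrative, so check your cluster's documentation for the real ones:

```bash
#SBATCH --partition=gpu-medium   # partition name differs per cluster
#SBATCH --gres=gpu:1             # some sites use --gpus=1 instead of --gres
module load Python/3.11          # module names differ per site: run `module avail`
```

Everything else in the job script (the Python invocation, data paths, environment activation) usually carries over unchanged.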

Local GPU

A home machine with an NVIDIA GPU. Two realistic examples for student-scale OCR:

  • Running on 8 GB (RTX 4060, RTX 4060 Ti 8 GB): Qwen3-VL-8B at 4-bit is tight but workable; drop the input resolution a little to stay comfortable. Expect ~10 s/page.
  • Running on 12 GB (RTX 3060 12 GB, RTX 4070): Qwen3-VL-8B fits with room to breathe. ~8 s/page.

Check yours with nvidia-smi. 16 GB+ (RTX 4070 Ti Super, 4080) steps you up to a 13B model; 24 GB+ (RTX 4090) runs a 32B. Under 8 GB, take the API path instead.
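
The VRAM-to-model tiers above can be scripted as a quick check. A minimal sketch, assuming an NVIDIA driver is installed; the thresholds mirror the guide's rules of thumb, not hard limits (quantization and input resolution shift them):

```python
import subprocess

def recommended_model(vram_gb: float) -> str:
    """Map total VRAM to the model tier suggested above."""
    if vram_gb >= 24:
        return "32B-class VLM"
    if vram_gb >= 16:
        return "13B-class VLM"
    if vram_gb >= 8:
        return "Qwen3-VL-8B at 4-bit"
    return "API path instead"

def local_vram_gb() -> float:
    """Query total VRAM of GPU 0 via nvidia-smi (reports MiB)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        text=True)
    return int(out.splitlines()[0]) / 1024

if __name__ == "__main__":
    try:
        print(recommended_model(local_vram_gb()))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not found: no NVIDIA GPU or driver on this machine")
```

Running it on an RTX 3060 12 GB, for example, would report the 8B tier; under 8 GB it points you to the API path, matching the guidance above.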

Laptop only

Cloud API path. Works on any machine with internet.

A 750-page corpus costs roughly $8–15 on Claude or GPT, around $3 on Gemini Flash. You supply the API key.
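
The cost range above comes down to simple token arithmetic. A minimal sketch; the per-token prices and the ~1,600-tokens-per-page-image figure are illustrative assumptions, not quoted rates, so check current provider pricing before budgeting:

```python
def ocr_cost_usd(pages: int, tokens_per_page: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Rough OCR cost: each page sends ~tokens_per_page image tokens in.
    Assumes the extracted text comes back at about half that token count."""
    in_tok = pages * tokens_per_page
    out_tok = pages * tokens_per_page // 2
    return (in_tok / 1e6 * price_in_per_mtok
            + out_tok / 1e6 * price_out_per_mtok)

# Hypothetical rates for illustration: $3/M input, $15/M output tokens.
print(ocr_cost_usd(750, 1600, 3.0, 15.0))
```

With these assumed numbers a 750-page corpus lands around $12.60, inside the $8–15 range quoted above; a cheaper model with rates an order of magnitude lower lands near the Gemini Flash figure.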