
Companion resource

Corpus Building Wizard

Answer seven questions. Get a starter kit for Claude Code or Codex that turns a folder of source files into an analysis-ready text corpus.

Steven Denney, Leiden University. Companion to the Building a Corpus primer on Thesis & Research Supervision.

Before you start: assumptions and useful links

This is the computational path for corpus building, not a beginner's introduction. It assumes you've already thought about what corpus you want and why. Both OCR-heavy work and non-PDF source extraction run through the relevant skills.

For the conceptual primer (what a corpus is, planning, selection criteria, ethics, organizing files), read Building a Corpus on Thesis & Research Supervision.

New to Claude Code or Codex? The README has a curated resource list: Anthropic and OpenAI docs plus practitioner guides.

Rather skip the form and see what it produces? Start with a student-scale example. Either way, the starter kit has three parts:

1. A route: API, ALICE/HPC, local GPU, or text extraction first.
2. An AI handoff: a Claude Code or Codex prompt already filled with your project details.
3. A corpus check: quality gates before you scale up or cite the result.

Tell me about your corpus

Use rough answers. The starter kit is meant to begin an inspectable workflow, not decide the method for you.

Compute access: ALICE, LUCDH, and the other options

ALICE

Leiden's HPC cluster. A40 (48 GB) and A100 (80 GB) GPUs, free for students and staff with a research case.

Request an account through Leiden IT; approval takes a few days. ALICE wiki →

LUCDH Digital Lab

Walk-in workstations in the Huizinga building (room 0.09). Two desktops with discrete GPUs for local work, plus specialty kit like Transkribus imaging and VR.

Open 10–17 during semesters, no reservation needed for the general space. Email digital-lab@hum.leidenuniv.nl for specialty equipment. Digital Lab page →

Local files aren't retained between sessions; use external or cloud storage.

LUCDH AI Lab

Deeper compute for deep learning and HPC, granted on request.

Contact Jelena Prokic if your project outgrows a single workstation. AI Lab page →

Different SLURM cluster

Non-Leiden HPC: your host institution or a national cluster.

The alice-vllm-deploy skill covers the patterns; most job scripts port cleanly with minor edits to partition names and module loads.
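
In practice, "minor edits" usually means swapping a few `#SBATCH` header lines and the module loads. A minimal sketch of the lines that typically differ between sites; the partition, GRES, and module names here are illustrative, so check your cluster's documentation for the real ones:

```bash
#SBATCH --partition=gpu-medium   # partition name differs per cluster
#SBATCH --gres=gpu:1             # some sites use --gpus=1 instead of --gres
module load Python/3.11          # module names differ per site: run `module avail`
```

Everything else in the job script (the Python invocation, data paths, environment activation) usually carries over unchanged.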

Local GPU

A home machine with an NVIDIA GPU. Two realistic examples for student-scale OCR:

  • Running on 8 GB (RTX 4060, RTX 4060 Ti 8 GB): Qwen3-VL-8B at 4-bit is tight but workable; drop the input resolution a little to stay comfortable. Expect ~10 s/page.
  • Running on 12 GB (RTX 3060 12 GB, RTX 4070): Qwen3-VL-8B fits with room to breathe. ~8 s/page.

Check yours with nvidia-smi. 16 GB+ (RTX 4070 Ti Super, 4080) steps you up to a 13B model; 24 GB+ (RTX 4090) runs a 32B. Under 8 GB, take the API path instead.
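
The VRAM-to-model tiers above can be scripted as a quick check. A minimal sketch, assuming an NVIDIA driver is installed; the thresholds mirror the guide's rules of thumb, not hard limits (quantization and input resolution shift them):

```python
import subprocess

def recommended_model(vram_gb: float) -> str:
    """Map total VRAM to the model tier suggested above."""
    if vram_gb >= 24:
        return "32B-class VLM"
    if vram_gb >= 16:
        return "13B-class VLM"
    if vram_gb >= 8:
        return "Qwen3-VL-8B at 4-bit"
    return "API path instead"

def local_vram_gb() -> float:
    """Query total VRAM of GPU 0 via nvidia-smi (reports MiB)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        text=True)
    return int(out.splitlines()[0]) / 1024

if __name__ == "__main__":
    try:
        print(recommended_model(local_vram_gb()))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not found: no NVIDIA GPU or driver on this machine")
```

Running it on an RTX 3060 12 GB, for example, would report the 8B tier; under 8 GB it points you to the API path, matching the guidance above.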

Laptop only

Cloud API path. Works on any machine with internet.

A 750-page corpus costs roughly $8–15 on Claude or GPT, around $3 on Gemini Flash. You supply the API key.
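
The cost range above comes down to simple token arithmetic. A minimal sketch; the per-token prices and the ~1,600-tokens-per-page-image figure are illustrative assumptions, not quoted rates, so check current provider pricing before budgeting:

```python
def ocr_cost_usd(pages: int, tokens_per_page: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Rough OCR cost: each page sends ~tokens_per_page image tokens in.
    Assumes the extracted text comes back at about half that token count."""
    in_tok = pages * tokens_per_page
    out_tok = pages * tokens_per_page // 2
    return (in_tok / 1e6 * price_in_per_mtok
            + out_tok / 1e6 * price_out_per_mtok)

# Hypothetical rates for illustration: $3/M input, $15/M output tokens.
print(ocr_cost_usd(750, 1600, 3.0, 15.0))
```

With these assumed numbers a 750-page corpus lands around $12.60, inside the $8–15 range quoted above; a cheaper model with rates an order of magnitude lower lands near the Gemini Flash figure.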