Scenario: Small corpus on a local GPU
Who this is for: a student or staff member with a home machine and a consumer NVIDIA GPU. 8–12 GB VRAM is plenty for small-to-medium student projects with an 8B vision-language model. No cloud costs, no HPC queue.
The situation
A grad student studying parliamentary speeches has 75 English-language committee reports (~750 pages) from a government portal. Hardware: RTX 3060 with 12 GB VRAM. Analysis target: spaCy for named-entity extraction, then pandas for downstream counts.
Wizard answers
| Question | Answer |
|---|---|
| How many pages? | 501–1,000 |
| What compute? | Local GPU with less than 16 GB |
| Languages/scripts? | Latin only |
| Document type? | Reports |
| What software? | Python |
| Resource constraints? | No budget; fast iteration wanted |
Route: Local GPU path. hf-transformers-ocr covers BnB NF4 loading, inference patterns, cache hygiene.
Starter kit
Recommended path: Local GPU
Corpus scale: roughly 750 pages
Est. cost: free (your electricity)
Est. time: ~2 min model load + ~100 min sequential OCR + ~10 min assembly
Read first
- corpus-from-pdfs: end-to-end framing
- corpus-metadata-design: schema for pandas / spaCy
- hf-transformers-ocr: BnB NF4 loading, inference, cache hygiene
Paste into Claude Code / Codex
I have 75 English committee reports (roughly 750 pages) and a local RTX 3060 with 12 GB VRAM. I want to use Qwen3-VL-8B-Instruct with BitsAndBytes NF4 quantization. Analysis target is spaCy in Python. Walk me through the corpus-from-pdfs pipeline. Help me set up the model loading script, run OCR, and assemble the corpus. Keep CUDA memory tidy — clear cache every 5–10 pages.
Or launch a fresh session directly
claude "I have 75 English committee reports (roughly 750 pages) and a local RTX 3060 with 12 GB VRAM. I want to use Qwen3-VL-8B-Instruct with BitsAndBytes NF4 quantization. Analysis target is spaCy in Python. Walk me through the corpus-from-pdfs pipeline."
Files to copy
templates/prompts.py.template → prompts.py (Pattern A, English, no character enumeration needed)
Commands (pre-filled)
python3 scripts/inventory_builder.py --source-dir ./reports --output manifest.csv
# OCR step — HF Transformers client built with Claude Code:
# loads Qwen3-VL-8B with BnB NF4, processes each page,
# writes ocr_output/<pdf_id>/results_raw.json
python3 scripts/corpus_assembler.py --ocr-dir ocr_output --manifest manifest.csv --output corpus/
Walkthrough
Note: runs on any machine with an 8 GB+ CUDA GPU. At ~8 s/page for an 8B model on a 12 GB consumer card, budget roughly 100 minutes for 750 pages.
Step 1: Inventory (local)
python3 scripts/inventory_builder.py --source-dir ./reports --output manifest.csv
Found 75 PDFs under /home/you/reports
Manifest: manifest.csv
Total PDFs: 75
Total pages: 748
Step 2: Fill prompts.py
Pattern A, English. No character enumeration needed — English has no tricky diacritics.
PROMPTS = {
    "english": (
        "Extract all text from this document image. "
        "Transcribe every character exactly as it appears. "
        "Output in markdown format preserving headings, paragraphs, "
        "footnotes, and tables. "
        "If the page has no text at all, output only: [NO_TEXT]. "
        "Do not translate or interpret the text."
    ),
}
LANGUAGE_PROMPT_MAP = {"english": "english"}
Step 3: Build the HF Transformers client (inside Claude Code)
The student pastes the wizard's prompt. Claude Code reads hf-transformers-ocr and helps write a ~100-line hf_ocr.py script. Key patterns the agent applies from the skill (a loading sketch follows the list):
- AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct", ...) with min_pixels / max_pixels set
- AutoModelForImageTextToText.from_pretrained(...) with BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
- attn_implementation="flash_attention_2" with fallback to "eager"
- enable_thinking=False and do_sample=False on generation
- torch.cuda.empty_cache() every 5–10 pages (12 GB fills up fast)
- Native image extraction via PyMuPDF's page.get_images(); rasterize for VLM at 200 DPI
- Resumable: skip PDFs whose results_raw.json already exists
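A minimal loading sketch applying these patterns. This is illustrative, not the exact script Claude Code produces: the min_pixels / max_pixels values and the try/except fallback are assumptions, and device_map="auto" assumes accelerate is installed.

import torch
from transformers import (AutoModelForImageTextToText, AutoProcessor,
                          BitsAndBytesConfig)

MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"

# NF4 4-bit quantization keeps the 8B weights comfortably inside 12 GB of VRAM
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=256 * 28 * 28,    # illustrative bounds; tune per the skill
    max_pixels=1280 * 28 * 28,
)

def load_model(attn):
    return AutoModelForImageTextToText.from_pretrained(
        MODEL_ID,
        quantization_config=bnb,
        attn_implementation=attn,
        device_map="auto",       # assumes accelerate is installed
    )

try:
    model = load_model("flash_attention_2")
except Exception:
    model = load_model("eager")  # fallback when flash-attn is unavailable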
A trimmed version of the inference loop:
def ocr_page(model, processor, image_path, prompt, max_new_tokens=8192):
    messages = [{"role": "user", "content": [
        {"type": "image", "image": str(image_path)},
        {"type": "text", "text": prompt},
    ]}]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt", enable_thinking=False,
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # drop the prompt tokens so only the generated transcription is decoded
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
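And a sketch of the per-PDF loop around it, showing the cache-hygiene and resumability patterns from the list above. The ocr_pdf name and the page_images argument (a list of rasterized page paths) are illustrative; the output layout matches the results_raw.json convention above, and the 5-page cadence is one point in the skill's 5–10 page range.

import gc
import json
from pathlib import Path

import torch

def ocr_pdf(model, processor, pdf_id, page_images, prompt, out_dir="ocr_output"):
    out_path = Path(out_dir) / pdf_id / "results_raw.json"
    if out_path.exists():
        return  # resumable: this PDF already finished
    out_path.parent.mkdir(parents=True, exist_ok=True)
    results = []
    for i, image_path in enumerate(page_images):
        text = ocr_page(model, processor, image_path, prompt)
        results.append({"page": i, "text": text})
        if (i + 1) % 5 == 0:
            # cache hygiene: a 12 GB card fills up fast without this
            gc.collect()
            torch.cuda.empty_cache()
    out_path.write_text(json.dumps(results, ensure_ascii=False, indent=2))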
Step 4: Run
python3 hf_ocr.py --manifest manifest.csv --output ocr_output/
Expected timing on RTX 3060 (12 GB):
Loading Qwen/Qwen3-VL-8B-Instruct (NF4)... 1m 48s
Processing aggr_committee_2019_001.pdf (12 pages)...
page 0: 8s page 1: 7s page 2: 9s ...
[cache cleared]
Processing aggr_committee_2019_002.pdf (9 pages)...
...
~8 s/page × 748 pages ≈ 100 min. The display can sleep during this and the run continues, as long as the machine itself doesn't suspend. If the student closes their terminal, they lose the run unless they used tmux or nohup; Claude Code will have suggested one of those.
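For instance, with nohup (one standard way to detach a long run; the redirect captures the per-page log):

nohup python3 hf_ocr.py --manifest manifest.csv --output ocr_output/ > ocr.log 2>&1 &
tail -f ocr.log    # watch progress without being tied to the run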
Step 5: Assemble
python3 scripts/corpus_assembler.py \
--ocr-dir ocr_output \
--manifest manifest.csv \
--output corpus/
corpus: 748 pages total, 745 with text
JSON: corpus/corpus.json
CSV : corpus/corpus.csv (745 rows)
Text density: 99.6%
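A quick sanity check before analysis. The pdf_id / page / text column names match the Step 6 usage below; the authoritative schema lives in corpus-metadata-design:

import pandas as pd

df = pd.read_csv("corpus/corpus.csv")
print(len(df))                 # expect 745, matching the assembler output
print(df.columns.tolist())     # should include pdf_id, page, text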
Step 6: Analyze
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_trf")
df = pd.read_csv("corpus/corpus.csv")

entities = []
for _, row in df.iterrows():
    for ent in nlp(row["text"]).ents:
        entities.append({"pdf_id": row["pdf_id"], "page": row["page"],
                         "text": ent.text, "label": ent.label_})
ents_df = pd.DataFrame(entities)
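From here, the downstream counts the student wants are pandas one-liners. Two examples (ORG is a standard spaCy entity label; the column names come from the loop above):

label_counts = ents_df["label"].value_counts()           # entities per label
top_orgs = (ents_df.loc[ents_df["label"] == "ORG", "text"]
            .value_counts()
            .head(20))                                   # most-mentioned organizations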
What would change at a different scale or hardware?
- Less than 8 GB VRAM: even the smallest VLMs struggle at useful speed. Use the API path instead (see small_api.html).
- 16 GB+ (RTX 4070 Ti Super, 4080): step up to a 13B model for better accuracy on dense or stylized text. Still fits at 4-bit.
- 24 GB+ (RTX 4090): the 32B tier opens up: best accuracy for consumer-grade work, especially on multi-column or historical text.
- MoE model wanted (Qwen3.5-35B-A3B): BnB NF4 is incompatible with MoE on transformers v5. Use vLLM + GPTQ via ALICE instead; hf-transformers-ocr spells this out.
- Larger corpus (>5k pages): runtime scales linearly, roughly 11 hours for 5k pages on a 3060. Consider whether the API is actually simpler at that point.
- Non-Latin / historical script: add character enumeration to Pattern A as in the ALICE scenario.