Data Flow: PDF to Results

This document traces a reference’s journey from PDF file to final validation result.

Pipeline Overview

PDF file
  │
  ▼
┌──────────────────┐
│  File Dispatch   │  hallucinator-ingest
│  (PDF/BBL/BIB/   │  Detects file type, extracts from archives
│   archive)       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Text Extraction  │  hallucinator-parsing + hallucinator-pdf-mupdf
│  (PdfBackend)    │  MuPDF extracts raw text with ligature expansion
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Section Detection│  hallucinator-parsing/src/section.rs
│                  │  Locates "References" / "Bibliography" header
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Segmentation    │  hallucinator-parsing/src/section.rs
│                  │  Splits section into individual references
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Title/Author     │  hallucinator-parsing/src/title.rs, authors.rs
│  Extraction      │  Parses title, authors, DOI, arXiv ID per ref
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Skip Filtering   │  hallucinator-parsing/src/extractor.rs
│                  │  Removes URL-only and short-title refs
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Validation     │  hallucinator-core (pool, orchestrator, db/*)
│   Pool           │  Concurrent DB queries with early exit
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Result Assembly  │  hallucinator-core/src/pool.rs
│                  │  Merge local+remote results, retraction check
└────────┬─────────┘
         │
         ▼
  Vec<ValidationResult>

Stage 1: File Dispatch

Crate: hallucinator-ingest

The ingest crate handles file type detection and archive extraction:

- PDF files: passed to the PDF extraction pipeline
- BBL/BIB files: parsed by hallucinator-bbl (LaTeX bibliography entries)
- Archives (.tar.gz, .zip): extracted as a stream via ArchiveIterator; each contained PDF is processed independently
- Size limits: a configurable max_archive_size_mb prevents resource exhaustion
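
The dispatch rules above can be sketched as follows. This is a minimal illustration, not the crate's code: `FileKind`, `detect_kind`, and `within_size_limit` are hypothetical names, and it assumes dispatch keys on the file extension (the real crate may inspect file contents instead).

```rust
/// Hypothetical file categories; the real crate's types may differ.
#[derive(Debug, PartialEq, Eq)]
enum FileKind {
    Pdf,
    Bbl,
    Bib,
    Archive,
    Unsupported,
}

/// Classify a file by its name, case-insensitively.
/// Assumption: dispatch keys on the extension only.
fn detect_kind(file_name: &str) -> FileKind {
    let name = file_name.to_ascii_lowercase();
    // Test ".tar.gz" as a whole suffix: the last extension alone is just "gz".
    if name.ends_with(".tar.gz") || name.ends_with(".zip") {
        FileKind::Archive
    } else if name.ends_with(".pdf") {
        FileKind::Pdf
    } else if name.ends_with(".bbl") {
        FileKind::Bbl
    } else if name.ends_with(".bib") {
        FileKind::Bib
    } else {
        FileKind::Unsupported
    }
}

/// Check an archive's size against a limit given in megabytes,
/// in the spirit of `max_archive_size_mb`.
fn within_size_limit(size_bytes: u64, max_archive_size_mb: u64) -> bool {
    size_bytes <= max_archive_size_mb * 1024 * 1024
}
```

For example, `detect_kind("bundle.tar.gz")` yields `FileKind::Archive`, and the size check would run before any entry of the archive is read.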

Stage 2: Text Extraction

Crates: hallucinator-parsing + hallucinator-pdf-mupdf

The PdfBackend trait abstracts text extraction. The MuPDF backend:

1. Opens the PDF and iterates page by page
2. Extracts raw text blocks
3. Expands ligatures (ﬁ → fi, ﬂ → fl, ﬀ → ff, etc.)
4. Fixes hyphenation: distinguishes syllable breaks from compound words using a suffix heuristic
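
The two text clean-up steps can be sketched in plain Rust. The ligature table covers the five common Latin ligatures in the Unicode Alphabetic Presentation Forms block. The hyphenation rule is only one possible reading of "suffix heuristic": the suffix list and the function names are assumptions made for illustration.

```rust
/// Replace common Latin ligature code points with their ASCII expansions.
fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical suffix list; the list used by the real heuristic is not documented here.
const SYLLABLE_SUFFIXES: [&str; 5] = ["tion", "ment", "ing", "ness", "able"];

/// Join a word that was split by a hyphen at a line break.
/// `head` is the part before the hyphen, `tail` the first word of the next line.
/// Assumption: if `tail` begins with a known suffix, the hyphen was a syllable
/// break and is dropped; otherwise the word is treated as a compound and the
/// hyphen is kept.
fn join_hyphenated(head: &str, tail: &str) -> String {
    let is_syllable_break = SYLLABLE_SUFFIXES.iter().any(|s| tail.starts_with(s));
    if is_syllable_break {
        format!("{}{}", head, tail)
    } else {
        format!("{}-{}", head, tail)
    }
}
```

Under these assumptions, "detec-" followed by "tion" becomes "detection", while "cross-" followed by "validation" stays "cross-validation". A heuristic of this kind will misjudge some words, so it trades precision for simplicity.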

Stage 3: Section Detection

File: hallucinator-parsing/src/section.rs

Locates the references section by scanning for header patterns:

- Primary: References, Bibliography, REFERENCES, BIBLIOGRAPHY
- End markers: Appendix, Acknowledgments, Supplementary, Author Contributions

The section text between the header and the first end marker (or EOF) is extracted.

If no header is found, the parser falls back to the last 30% of the document text.
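
A minimal sketch of this logic, assuming headers and end markers sit on their own lines. The function name and the case-insensitive comparison are illustrative assumptions (the documented patterns list title-case and upper-case forms only).

```rust
const HEADERS: [&str; 2] = ["references", "bibliography"];
const END_MARKERS: [&str; 4] = [
    "appendix",
    "acknowledgments",
    "supplementary",
    "author contributions",
];

/// Return the slice of `text` that holds the references section.
fn find_references_section(text: &str) -> &str {
    let mut offset = 0; // byte offset of the current line within `text`
    let mut start: Option<usize> = None; // byte offset just after the header line

    for line in text.split_inclusive('\n') {
        let lowered = line.trim().to_lowercase();
        if let Some(begin) = start {
            // Inside the section: stop at the first end marker.
            if END_MARKERS.iter().any(|m| lowered.starts_with(m)) {
                return &text[begin..offset];
            }
        } else if HEADERS.iter().any(|h| lowered == *h) {
            start = Some(offset + line.len());
        }
        offset += line.len();
    }

    match start {
        // Header found but no end marker: the section runs to EOF.
        Some(begin) => &text[begin..],
        // No header: fall back to the last 30% of the text.
        None => {
            let mut cut = text.len() * 7 / 10;
            // Move forward to a valid UTF-8 boundary before slicing.
            while !text.is_char_boundary(cut) {
                cut += 1;
            }
            &text[cut..]
        }
    }
}
```

The fallback cut is computed in bytes, which is why the boundary adjustment is needed for non-ASCII text.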

Stage 4: Reference Segmentation

File: hallucinator-parsing/src/section.rs

Individual references are split using priority-ordered strategies:

| Priority | Strategy | Pattern | Example |
|----------|----------|---------|---------|
| 1 | IEEE | `[1]`, `[2]`, … | `[1] A. Author, "Title..."` |
| 2 | Numbered | `1.`, `2.`, … | `1. Author, Title...` |
| 3 | ML author-based | Full names / initials | `Author, A. B. (2023). Title...` |
| 4 | Springer/Nature | Uppercase + `(YYYY)` | `AUTHOR, A. Title. J. (2023)` |
| 5 | Fallback | Double newline | Blank line between refs |

The system tries each strategy and picks the one that produces the most valid segments. For IEEE and numbered styles, a sequential check ensures numbering is contiguous.
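
As an illustration of "try each strategy, keep the best", the sketch below implements only two of the five strategies (IEEE markers and the blank-line fallback). All function names are hypothetical, and counting every non-empty segment as "valid" is a simplifying assumption.

```rust
/// Parse a leading IEEE marker such as "[12]" and return its number.
fn parse_marker(line: &str) -> Option<usize> {
    let rest = line.strip_prefix('[')?;
    let end = rest.find(']')?;
    let digits = &rest[..end];
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Split on lines that start with "[n]". Returns `None` when no marker is
/// found or when the numbering is not contiguous (1, 2, 3, ...).
fn split_ieee(section: &str) -> Option<Vec<String>> {
    let mut refs: Vec<String> = Vec::new();
    for line in section.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        match parse_marker(trimmed) {
            Some(n) => {
                if n != refs.len() + 1 {
                    return None; // gap or repeat in the numbering
                }
                refs.push(trimmed.to_string());
            }
            None => {
                // Continuation line: append it to the current reference.
                if let Some(last) = refs.last_mut() {
                    last.push(' ');
                    last.push_str(trimmed);
                }
            }
        }
    }
    if refs.is_empty() { None } else { Some(refs) }
}

/// Fallback: split on blank lines and collapse internal whitespace.
fn split_blank_lines(section: &str) -> Vec<String> {
    section
        .split("\n\n")
        .map(|chunk| chunk.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|chunk| !chunk.is_empty())
        .collect()
}

/// Run the strategies and keep the one with the most segments.
fn segment(section: &str) -> Vec<String> {
    // Listed from lowest to highest priority: `max_by_key` returns the last
    // maximum, so a tie goes to the higher-priority strategy.
    let candidates = vec![
        split_blank_lines(section),
        split_ieee(section).unwrap_or_default(),
    ];
    candidates
        .into_iter()
        .max_by_key(|c| c.len())
        .unwrap_or_default()
}
```

The contiguity check is what keeps a bracketed citation inside a reference body, such as a stray `[7]`, from being mistaken for the start of a new entry.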

Stage 5: Title and Author Extraction

Files: hallucinator-parsing/src/title.rs, authors.rs, identifiers.rs

For each segmented reference:

- DOI extraction: regex `10\.\d+/[^\s]+`
- arXiv ID extraction: regex for arXiv:YYMM.NNNNN patterns
- Title extraction: two strategies tried in order:
  - Quoted strings (e.g., "Title Here")
  - Capitalized word sequences between author and venue patterns
- Author extraction: format-specific parsing for IEEE, ACM, USENIX, AAAI, NeurIPS styles
- Em-dash handling: ——— means “same authors as previous reference”
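
Two of these steps are easy to show without a regex engine. The sketch below hand-rolls the DOI pattern ("10.", digits, "/", then a run of non-whitespace) and the quoted-string title strategy. The function names are hypothetical, and trimming trailing punctuation from the DOI or title is an added assumption, not something the documented pattern does.

```rust
/// Find the first DOI-like token: "10." + digits + "/" + non-whitespace run.
fn extract_doi(reference: &str) -> Option<&str> {
    let mut search_from = 0;
    while let Some(pos) = reference[search_from..].find("10.") {
        let start = search_from + pos;
        let after_prefix = &reference[start + 3..];
        let digits = after_prefix
            .bytes()
            .take_while(|b| b.is_ascii_digit())
            .count();
        let after_digits = &after_prefix[digits..];
        if digits > 0 && after_digits.starts_with('/') {
            // Length of the suffix after "/", up to the next whitespace or EOF.
            let suffix_len = after_digits[1..]
                .find(char::is_whitespace)
                .unwrap_or(after_digits.len() - 1);
            if suffix_len > 0 {
                let end = start + 3 + digits + 1 + suffix_len;
                // Assumption: a trailing '.' or ',' is sentence punctuation.
                let doi = reference[start..end]
                    .trim_end_matches(|c: char| c == '.' || c == ',');
                return Some(doi);
            }
        }
        search_from = start + 3; // not a DOI here; keep scanning
    }
    None
}

/// First strategy for titles: the text between the first pair of quotes.
/// Accepts straight quotes and typographic quotes.
fn extract_quoted_title(reference: &str) -> Option<&str> {
    let open = reference.find(|c: char| c == '"' || c == '\u{201C}')?;
    let open_len = reference[open..].chars().next()?.len_utf8();
    let rest = &reference[open + open_len..];
    let close = rest.find(|c: char| c == '"' || c == '\u{201D}')?;
    let title = rest[..close].trim().trim_end_matches(',');
    if title.is_empty() { None } else { Some(title) }
}
```

When the quoted-string strategy returns `None`, the second strategy (capitalized word sequences between author and venue patterns) would take over; it is not sketched here.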

Stage 6: Skip Filtering

File: hallucinator-parsing/src/extractor.rs

References are skipped (not validated) if:

- URL-only: the reference is just a URL to a non-academic site (GitHub, docs, etc.)
- Short title: the title has fewer than 5 words (prone to false matches), unless a DOI or arXiv ID is present
- No title: no title could be extracted

Skip statistics are tracked and reported: total_raw, url_only, short_title, no_title.
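
The filter and its counters might look like the sketch below. The `SkipStats` field names come from the list above; `ParsedRef`, `filter_refs`, and the order in which the rules are checked are assumptions for illustration.

```rust
/// Counters named after the reported skip statistics.
#[derive(Debug, Default, PartialEq)]
struct SkipStats {
    total_raw: usize,
    url_only: usize,
    short_title: usize,
    no_title: usize,
}

/// Hypothetical view of a parsed reference, reduced to what the filter needs.
struct ParsedRef<'a> {
    title: Option<&'a str>,
    has_identifier: bool, // a DOI or arXiv ID was extracted
    is_url_only: bool,
}

const MIN_TITLE_WORDS: usize = 5;

/// Split references into those worth validating and skip statistics.
fn filter_refs<'a>(refs: Vec<ParsedRef<'a>>) -> (Vec<ParsedRef<'a>>, SkipStats) {
    let mut stats = SkipStats {
        total_raw: refs.len(),
        ..SkipStats::default()
    };
    let mut kept = Vec::new();
    for r in refs {
        if r.is_url_only {
            stats.url_only += 1;
            continue;
        }
        match r.title {
            None => stats.no_title += 1,
            // A short title is tolerated when an identifier can confirm the match.
            Some(t) if !r.has_identifier
                && t.split_whitespace().count() < MIN_TITLE_WORDS =>
            {
                stats.short_title += 1
            }
            Some(_) => kept.push(r),
        }
    }
    (kept, stats)
}
```

With this shape, `total_raw` minus the three skip counters equals the number of references passed on to validation.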

Stage 7: Validation

Crate: hallucinator-core (see Concurrency Model for the full deep dive)

Each reference goes through:

1. Coordinator picks up the reference from the job queue
2. Local DB query (DBLP offline, ACL offline): inline, < 1 ms
3. If verified locally → skip all remote DBs, emit result immediately
4. Cache pre-check: synchronously check the cache for all remote DBs
5. If verified from cache → skip all drainers
6. Fan out cache-miss DBs to per-DB drainer queues
7. Drainer queries DB: rate-limited HTTP call
8. Author validation: compare PDF authors against DB authors
9. Early exit: if any drainer verifies, the others skip remaining work
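
The local short-circuit and the early-exit flag can be illustrated with standard-library threads and an `AtomicBool`. This is a toy model under stated assumptions: `MockDb`, `drain_one`, and `validate` are hypothetical, one scoped thread stands in for each drainer, and caching, rate limiting, and author validation are left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Stand-in for a database: just the titles it can confirm.
struct MockDb {
    known_titles: Vec<&'static str>,
}

impl MockDb {
    fn knows(&self, title: &str) -> bool {
        self.known_titles
            .iter()
            .any(|t| t.eq_ignore_ascii_case(title))
    }
}

/// One drainer's work for one reference.
/// Returns `None` if the query was skipped because another drainer had
/// already verified the reference, otherwise `Some(found)`.
fn drain_one(db: &MockDb, title: &str, verified: &AtomicBool) -> Option<bool> {
    if verified.load(Ordering::Acquire) {
        return None; // early exit: someone else already verified
    }
    let found = db.knows(title);
    if found {
        verified.store(true, Ordering::Release);
    }
    Some(found)
}

/// Local lookup first; fan out to remote DBs only on a local miss.
fn validate(title: &str, local: &MockDb, remotes: &[MockDb]) -> bool {
    if local.knows(title) {
        return true; // verified locally, no remote work at all
    }
    let verified = AtomicBool::new(false);
    thread::scope(|scope| {
        for db in remotes {
            let flag = &verified;
            scope.spawn(move || {
                drain_one(db, title, flag);
            });
        }
    });
    verified.load(Ordering::Acquire)
}
```

Which drainers actually skip their query depends on thread timing, so the flag bounds wasted work rather than eliminating it.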

Database Query Flow (per reference, per DB)

Drainer receives job
  │
  ├─ Already verified? → skip
  ├─ Cancelled? → skip
  ├─ Requires DOI but ref has none? → skip
  │
  ▼
Rate limit acquire (governor token)
  │
  ▼
Cache check
  ├─ Cache hit → return cached result
  │
  ▼
HTTP request (with timeout)
  │
  ├─ Success + title found → author validation
  │     ├─ Authors match → set verified flag
  │     └─ Authors don't match → record mismatch
  ├─ Success + title not found → NoMatch
  ├─ 429 Rate Limited → adaptive backoff + retry
  └─ Error/Timeout → record failure
  │
  ▼
Cache insert (if successful)
  │
  ▼
Decrement remaining counter
  ├─ Not last → done
  └─ Last drainer → finalize result
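
Two parts of this flow lend themselves to a short sketch: classifying the HTTP response, and detecting the last drainer with an atomic counter. The names `Outcome`, `classify`, and `Collector` are hypothetical; `None` as a status models a timeout or transport error.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Possible results of one DB query, following the branches in the diagram.
#[derive(Debug, PartialEq)]
enum Outcome {
    Verified,
    AuthorMismatch,
    NoMatch,
    RetryAfterBackoff,
    Failed,
}

/// Map a response to an outcome. `status` is `None` on timeout or network error.
fn classify(status: Option<u16>, title_found: bool, authors_match: bool) -> Outcome {
    match status {
        Some(429) => Outcome::RetryAfterBackoff,
        Some(code) if (200..300).contains(&code) => {
            if !title_found {
                Outcome::NoMatch
            } else if authors_match {
                Outcome::Verified
            } else {
                Outcome::AuthorMismatch
            }
        }
        _ => Outcome::Failed,
    }
}

/// Tracks how many drainers still owe a result for one reference.
struct Collector {
    remaining: AtomicUsize,
}

impl Collector {
    fn new(drainers: usize) -> Self {
        Collector {
            remaining: AtomicUsize::new(drainers),
        }
    }

    /// Called once per drainer. Returns `true` for exactly one caller:
    /// the one whose decrement brings the counter to zero.
    fn finish_one(&self) -> bool {
        // `fetch_sub` returns the previous value, so 1 means "I was last".
        self.remaining.fetch_sub(1, Ordering::AcqRel) == 1
    }
}
```

Because `fetch_sub` is atomic, exactly one drainer observes the value 1 even when several finish at the same time, so finalization runs once without a lock.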

Stage 8: Result Assembly

File: hallucinator-core/src/pool.rs (finalize_collector)

When the last drainer for a reference completes:

1. Merge local and remote DbResult lists
2. Determine status: Verified (any DB matched) > AuthorMismatch (title found, wrong authors) > NotFound
3. SearxNG fallback: if still NotFound and SearxNG is configured, try web search as a last resort
4. DOI info: mark the DOI as valid/invalid based on the DOI backend result
5. Retraction info: use inline retraction data extracted from the CrossRef response (no extra API call)
6. Emit events: ProgressEvent::Warning (if DBs timed out) + ProgressEvent::Result
7. Send the result via a oneshot channel back to the caller
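
The status precedence in the second step maps naturally onto a derived ordering. In this sketch `Status`, `DbOutcome`, and `final_status` are hypothetical stand-ins for the real result types; the only thing taken from the text is the precedence Verified > AuthorMismatch > NotFound.

```rust
/// Final status, declared from lowest to highest precedence so that the
/// derived `Ord` matches Verified > AuthorMismatch > NotFound.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    NotFound,
    AuthorMismatch,
    Verified,
}

/// Hypothetical per-database outcome.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DbOutcome {
    Match,
    AuthorMismatch,
    NoMatch,
    Failed,
}

/// Merge local and remote outcomes and pick the strongest status.
fn final_status(local: &[DbOutcome], remote: &[DbOutcome]) -> Status {
    local
        .iter()
        .chain(remote)
        .map(|outcome| match outcome {
            DbOutcome::Match => Status::Verified,
            DbOutcome::AuthorMismatch => Status::AuthorMismatch,
            DbOutcome::NoMatch | DbOutcome::Failed => Status::NotFound,
        })
        .max()
        .unwrap_or(Status::NotFound) // no database answered at all
}
```

A single match from any database therefore outweighs any number of mismatches or failures from the others.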

Output Types

The final Vec<ValidationResult> can be:

- Displayed in the CLI with colored output
- Navigated in the TUI with sorting/filtering
- Streamed via SSE in the web interface
- Exported to JSON/CSV/Markdown/Text/HTML via hallucinator-reporting
- Returned as Python objects via hallucinator-python

See Export Formats for output schema details.
