Hallucinator TUI Design Document

Design only. No implementation decisions about backend capabilities — if the backend can’t support something yet, that’s a separate problem.

Who uses this and why

Area chairs / senior PC members reviewing 20-100 submissions for a venue. They need to triage: which papers have suspicious references, and how suspicious? They don’t read every result — they scan for red flags, drill in when something looks off, then move on.

Reviewers checking a handful of assigned papers (3-8 typically). More likely to read results carefully. May re-run individual papers after authors revise.

Demo / conference hallway context. Someone shows this to a colleague. First impression matters — it should look like a tool built by someone who gives a shit. But the flash has to be load-bearing: every visual element should communicate something useful.

Design principles

Information density over decoration. CS people read dense UIs comfortably. Don’t waste space on padding when you could show data. Think Bloomberg terminal, not macOS settings.
Glanceable status. At any point you should be able to look at the screen for <1 second and know: how far along are we, is anything wrong, what needs my attention.
Progressive disclosure. Summary first, details on demand. The batch view shows paper-level status. Drill into a paper to see references. Drill into a reference to see per-database results.
Don’t block the user. Analysis takes time. The user should be able to browse already-completed results while new papers are still running.
Adaptive layout. Must be usable at 80x24 (cramped but functional) and take advantage of 200+ column modern terminals. Not two separate layouts — one layout that flexes.

Screens

There are four screens. You’re always on exactly one.

Screen 1: Queue

The entry point. Shows all papers (1 or 50) and their status.

 HALLUCINATOR                                          ██████░░░░ 12/50
─────────────────────────────────────────────────────────────────────────
 #   Paper                          Refs  ✓   ⚠   ✗   ☠   Status
─────────────────────────────────────────────────────────────────────────
  1  arxiv_2024_llm_survey.pdf       38  34   2   1   1   DONE
  2  neurips_submission_042.pdf       27  18   1   0   0   ████░░ 19/27
  3  review_response_v2.pdf          15   —   —   —   —   QUEUED
  4  transformer_scaling.pdf          —   —   —   —   —   QUEUED
  5  rlhf_safety_paper.pdf            —   —   —   —   —   QUEUED
 ...
 48  federated_privacy.pdf            —   —   —   —   —   QUEUED
 49  code_generation_bench.pdf        —   —   —   —   —   QUEUED
 50  multimodal_reasoning.pdf         —   —   —   —   —   QUEUED
─────────────────────────────────────────────────────────────────────────
 ✓ 34 verified  ⚠ 3 mismatch  ✗ 1 not found  ☠ 1 retracted     3:42 elapsed

 [enter] open  [r] retry failed  [d] delete  [a] add files  [q] quit

Columns:

# — sequential, stable. Not the filename, because filenames are long.
Paper — truncated filename. Full path shown on hover/focus.
Refs — total reference count (blank until PDF is parsed).
✓ ⚠ ✗ ☠ — counts by verdict. These are the triage signal.
Status — QUEUED, inline progress bar while running, DONE when finished. If errors occurred during parsing/extraction: ERROR.

Behavior:

Papers are listed in queue order. Currently-running papers float to the top of the “not done” section (done papers above, then active, then queued).
The cursor (highlighted row) selects a paper. Press Enter to go to Screen 2.
While papers are running, counts update live. The overall progress bar in the header updates.
Bottom row shows aggregate totals across all completed papers.

Sorting: Default is queue order. Allow re-sort by column (keybind or click header). Most useful sort: by ✗ descending — puts the most suspicious papers at top. A reviewer running 50 papers wants to see “which 5 papers have the most not-found references” immediately.

Why this works for 1 paper: If you pass a single PDF, this screen still appears but with one row. It shows the progress bar filling up, counts incrementing. When done, it auto-focuses that row so pressing Enter takes you straight to the results. Feels natural, not like a degenerate case of a batch view.

Filtering: Simple text filter on filename. Type / to start filtering (vim convention). Also filter by status: e.g., f cycles through “all → has problems → done → running → queued”. The most common filter is “show me only the papers that have problems.”

Screen 2: Paper

Shows all references for one paper and their verdicts.

 HALLUCINATOR > arxiv_2024_llm_survey.pdf               ██████████ 38/38
─────────────────────────────────────────────────────────────────────────
 #   Reference                                     Verdict     Source
─────────────────────────────────────────────────────────────────────────
  1  Vaswani et al. "Attention Is All You Need"    ✓ verified  arXiv
  2  Brown et al. "Language Models are Few-Shot..." ✓ verified  S2
  3  Wei et al. "Chain-of-Thought Prompting..."    ✓ verified  CrossRef
  4  Smith & Jones "Recursive Self-Improvement..." ✗ not found  —
  5  Zhang et al. "Emergent Abilities of..."       ⚠ mismatch  DBLP
  6  Chen et al. "Evaluating Large Language..."    ✓ verified  arXiv
 ...
 37  Wang et al. "Constitutional AI..."            ✓ verified  S2
 38  Davis "On the Retraction of..."               ☠ retracted CrossRef
─────────────────────────────────────────────────────────────────────────
 ✓ 34 verified  ⚠ 2 mismatch  ✗ 1 not found  ☠ 1 retracted

 [enter] details  [r] retry  [esc] back  [e] export  [s] sort

Columns:

# — reference number as it appears in the paper.
Reference — authors + truncated title. Quoted title portion to visually separate it from authors.
Verdict — icon + word. Color-coded: green/yellow/red/magenta.
Source — which database confirmed it (for verified) or — for not found. If multiple DBs confirmed, show the fastest one (the one that actually ended the search via early exit).

Behavior:

If analysis is still running, references appear as they’re processed. Unprocessed references show as dim/grey with ⏳ pending or ⟳ checking status.
Enter on a reference opens Screen 3 (detail view).
r on a specific reference retries just that one.
R (shift) retries all failed/not-found references for this paper.

Active reference animation: The reference currently being checked gets a subtle indicator — a spinner or a cycling set of dots. Nothing aggressive. Just enough to show “this one is live.” If multiple references are being checked concurrently (which they are — 4 at a time), all active ones show the indicator.

Problem-first ordering: Default sort is by reference number (paper order). But s cycles through sort modes, and sort-by-verdict puts not-found and retracted at the top. This is the thing the user actually cares about — “show me the problems.”

Export: e opens a small modal/prompt: export format (json / csv / markdown / plain text) and destination (file path, clipboard). Exports the results for this paper only. From the Queue screen, e exports all papers.

Screen 3: Reference Detail

Full detail on one reference. This is the “prove it” screen — when you see a suspicious result you drill in here to understand why.

 HALLUCINATOR > arxiv_2024_llm_survey.pdf > [4]
─────────────────────────────────────────────────────────────────────────

 REFERENCE [4]
 Smith, J. and Jones, A. (2024)
 "Recursive Self-Improvement in Large Language Models:
  A Theoretical Framework"
 Proceedings of ICML 2024, pp. 1234-1248

 Verdict: ✗ NOT FOUND

 Extracted title:  "Recursive Self-Improvement in Large Language
                    Models: A Theoretical Framework"
 Extracted authors: J. Smith, A. Jones
 Extracted DOI:     none
 Extracted arXiv:   none

 DATABASE RESULTS
─────────────────────────────────────────────────────────────────────────
  Database       Result          Time     Notes
─────────────────────────────────────────────────────────────────────────
  CrossRef       no match        1.2s
  arXiv          no match        0.8s
  DBLP           no match        0.3s     (offline)
  Sem. Scholar   timeout         10.0s    retried: no match (12.4s)
  OpenAlex       no match        2.1s
  ACL            no match        0.4s
  NeurIPS        no match        0.6s
  Europe PMC     no match        1.8s
  PubMed         no match        0.9s
─────────────────────────────────────────────────────────────────────────

 No close matches found in any database.

 [r] retry  [c] copy ref text  [esc] back

For a verified reference, this screen would instead show:

 Verdict: ✓ VERIFIED (arXiv)

 Matched title:  "Attention Is All You Need"
 Match score:     98.2%
 Matched authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez,
                  Kaiser, Polosukhin
 Author overlap:  8/8

 DATABASE RESULTS
─────────────────────────────────────────────────────────────────────────
  Database       Result          Time     Notes
─────────────────────────────────────────────────────────────────────────
  arXiv          ✓ match         0.3s     ← verified (early exit)
  CrossRef       (skipped)        —       early exit
  DBLP           (skipped)        —       early exit
  ...

For an author mismatch:

 Verdict: ⚠ AUTHOR MISMATCH (DBLP)

 Matched title:   "Emergent Abilities of Large Language Models"
 Match score:      96.1%
 Expected authors: Zhang, Wei, Chen
 Found authors:    Wei, Tay, Bommasani, Raffel, Zoph, Borgeaud,
                   Yogatama, Bosma, Zhou, Metzler, Chi, Hashimoto,
                   Vinyals, Liang, Dean, Fedus
 Author overlap:   1/3 (Wei)

Why this screen matters: “Not found” doesn’t always mean hallucinated. Maybe the title extraction mangled something. Maybe the paper is too new. This screen lets a human make the judgment call by seeing exactly what was searched for, what came back, and how long each database took. The timing information helps distinguish “no match” from “everything timed out” — very different confidence levels.

Screen 4: Config

Accessible from any screen via , (comma). Not a modal — a full screen you navigate to and back from, same as the others. Esc returns to wherever you were.

The point: you’re mid-run, you realize you forgot to set your Semantic Scholar API key, or you want to bump concurrency, or you want to disable a database that’s down. You shouldn’t have to quit and relaunch. That’s hostile UX when someone has 30 papers already processed.

 HALLUCINATOR > Config
─────────────────────────────────────────────────────────────────────────

 API Keys
─────────────────────────────────────────────────────────────────────────
  Semantic Scholar    sk-••••••••••••7f2a               [enter] edit
  OpenAlex            (not set)                         [enter] set
─────────────────────────────────────────────────────────────────────────

 Databases
─────────────────────────────────────────────────────────────────────────
  CrossRef            ✓ enabled
  arXiv               ✓ enabled
  DBLP                ✓ enabled  (offline: ~/dblp.db)
  Sem. Scholar        ✓ enabled
  OpenAlex            ○ disabled  (no API key)
  ACL                 ✓ enabled
  NeurIPS             ✓ enabled
  Europe PMC          ✓ enabled
  PubMed              ✓ enabled
─────────────────────────────────────────────────────────────────────────

 Concurrency & Timeouts
─────────────────────────────────────────────────────────────────────────
  Parallel references         4          [enter] edit
  DB query timeout           10s         [enter] edit
  Retry timeout              45s         [enter] edit
  Request delay              1.0s        [enter] edit
─────────────────────────────────────────────────────────────────────────

 Display
─────────────────────────────────────────────────────────────────────────
  Theme                      green       [enter] toggle
  Notifications              bell        [enter] cycle
─────────────────────────────────────────────────────────────────────────

 [enter] edit  [space] toggle  [esc] back

Sections:

API Keys. Shows masked keys (last 4 chars visible) for any keys already set. Press Enter to edit — opens an inline text input. Keys entered here take effect immediately for subsequent queries. They override env vars / CLI flags for this session.

Databases. Toggle individual databases on/off with Space. If a database requires an API key that isn’t set, it shows as ○ disabled with the reason. If DBLP is in offline mode, show the DB path. Toggling a database off mid-run means it won’t be queried for remaining references (already-completed results are unaffected). Useful when a database is down and you don’t want to waste timeout budget on it.

Concurrency & Timeouts. Edit numeric values inline. Changing parallel references mid-run adjusts the worker pool for subsequent references. Changing timeouts affects subsequent queries. These are the knobs you reach for when the tool is going too slow (bump concurrency) or when a database is flaky (bump timeout).

Display. Theme toggle (green/modern) applies immediately — the screen redraws in the new palette. Notification mode cycles through off → bell → desktop → bell+desktop.

Behavior notes:

Changes take effect immediately for new work. They don’t retroactively affect completed results or in-flight queries.
Changes are session-scoped by default. They don’t persist to disk unless the user explicitly saves.
S (shift-s) on the config screen saves current settings to ~/.config/hallucinator/config.toml. This becomes the new default for future runs. A small confirmation appears: “Saved to ~/.config/hallucinator/config.toml”.
The config file is also loaded on startup if it exists, so CLI flags and env vars override the config file, and the TUI config screen overrides everything. Precedence: TUI edits > CLI flags > env vars > config file > defaults.

Why , as the keybind: It’s unused, easy to reach, and has precedent in tools like Neovim/Helix where , is a common leader key for settings. It doesn’t conflict with any navigation or action key.

Why a full screen and not a modal: The config has too many sections and options to fit comfortably in a modal overlay. A full screen gives room for the settings to breathe and be scannable. Also, settings aren’t something you adjust while simultaneously reading results — you go in, tweak, go back. Full screen matches that flow.

Adaptive layout

Narrow terminals (< 100 columns)

Queue screen: hide Refs column, truncate filenames more aggressively, collapse ✓ ⚠ ✗ ☠ into a single “problems” count.
Paper screen: hide Source column, truncate titles earlier.
Detail screen: wraps naturally since it’s mostly prose.

Wide terminals (140+ columns)

Queue screen: show full filenames, add an “elapsed time” column, add a “problems” column that sums ✗ + ☠ for quick scanning.
Paper screen: show full titles without truncation, add a “time” column showing how long validation took per reference.
Detail screen: split into two panes — reference info on the left, database results on the right (side-by-side instead of stacked).

Very wide terminals (200+ columns)

Queue screen: could show a mini-sparkline per paper showing distribution of verdicts as a tiny bar chart inline. Pure gravy.
Paper screen: show the raw reference text in a right-side pane alongside the parsed/structured view. Useful for debugging extraction issues.

Short terminals (< 30 rows)

Collapse the header to a single line.
Collapse the footer/keybinds bar to a single line.
Use available rows for data.

Live activity panel (overlay, not a screen)

Toggleable with Tab. This is the “flashy” part, but it earns its space.

When active, it takes the right 40-50 columns (or bottom third on narrow terminals) and shows:

 ACTIVITY
────────────────────────────────
 Database Health
  CrossRef    ●  142ms avg
  arXiv       ●  89ms avg
  DBLP        ●  12ms avg  (offline)
  Sem.Scholar ◐  1.2s avg  throttled
  OpenAlex    ●  203ms avg
  ACL         ●  67ms avg
  NeurIPS     ●  94ms avg
  Europe PMC  ●  312ms avg
  PubMed      ○  down

 Rate Limits
  CrossRef   ░░░░▓▓░░░░  12/50
  S2         ░▓░░░░░░░░   3/100

 Throughput
  refs/min  ▁▂▃▅▇█▇▅▃▄▆█  avg: 8.2

 Active Queries
  → CrossRef: "Recursive Self-Imp..."
  → arXiv: "Recursive Self-Imp..."
  → DBLP: "Recursive Self-Imp..."
  ← S2: 429 Too Many Requests

Database health indicators:

● — healthy (responding, <500ms average)
◐ — degraded (slow, rate limited, intermittent errors)
○ — down (repeated failures, all timeouts)

Why this panel exists: When you’re running 50 papers, this answers “why is it going slow” without you having to guess. If Semantic Scholar is throttling you, you can see it. If PubMed is down, you know those “not found” results are lower confidence. It converts backend infrastructure state into visible, actionable information.

Why it’s an overlay and not a screen: You want to see this while browsing results. It’s context, not content.

Keyboard model

Global (work on any screen)

Key	Action
`q`	Quit (confirms if analysis still running)
`,`	Open config screen
`Tab`	Toggle activity panel
`?`	Toggle keybind help overlay
`Ctrl+C`	Cancel current analysis / force quit

Key	Action
`↑/↓` `j/k`	Move cursor
`Enter`	Drill in (queue→paper→reference)
`Esc`	Back up one level
`g` / `G`	Jump to top / bottom of list
`Ctrl+D/U`	Page down / page up
`/`	Start text filter
`n` / `N`	Next / previous filter match

Actions

Key	Context	Action
`r`	Queue	Retry all failed refs in paper
`r`	Paper	Retry selected reference
`R`	Paper	Retry all failed refs in paper
`r`	Detail	Retry this reference
`e`	Queue	Export all results
`e`	Paper	Export this paper’s results
`s`	Queue / Paper	Cycle sort mode
`f`	Queue	Cycle status filter
`a`	Queue	Add more files
`d`	Queue	Remove paper from queue
`y`	Detail	Copy reference text to clipboard
`S`	Config	Save settings to config file
`Space`	Config	Toggle selected setting

Mouse

Click row to select.
Double-click row to drill in.
Click column header to sort.
Scroll wheel scrolls the list.
Click the Tab activity panel area to toggle it.

Not every action needs a mouse equivalent. Keyboard is the primary interface. Mouse is a convenience for people who reach for it instinctively.

Color

The palette should work on both dark and light terminal backgrounds but optimize for dark (that’s what the target audience uses).

Verdict colors

Verdict	Color	Rationale
Verified	Green (bold)	Universal “good”
Author mismatch	Yellow	Warning, needs human judgment
Not found	Red	Danger / suspicious
Retracted	Magenta (bold)	Alarming, distinct from not-found
Pending	Dim / grey	Not yet actionable
Checking	Cyan	Active, in-progress

UI chrome

Borders and separators: dim grey. Should recede.
Headers/labels: white, bold.
Selected row: reverse video (swap fg/bg). High contrast, works on any color scheme.
Active database queries in activity panel: cyan.
Rate limit bars: green → yellow → red gradient as capacity fills.

Emphasis principle

Color is never the only signal. Every verdict also has a text label and a distinct icon character (✓ ⚠ ✗ ☠). This matters for:

Accessibility (color vision deficiency).
Monochrome terminals / piped output.
Screenshots in papers or blog posts that may be printed B&W.

Startup sequence

$ hallucinator ~/papers/*.pdf

 ░█░█░█▀█░█░░░█░░░█░█░█▀▀░▀█▀░█▀█░█▀█░▀█▀░█▀█░█▀▄
 ░█▀█░█▀█░█░░░█░░░█░█░█░░░░█░░█░█░█▀█░░█░░█░█░█▀▄
 ░▀░▀░▀░▀░▀▀▀░▀▀▀░▀▀▀░▀▀▀░▀▀▀░▀░▀░▀░▀░░▀░░▀▀▀░▀░▀

 Loading 50 PDFs...
 Databases: CrossRef arXiv DBLP(offline) S2 OpenAlex ACL NeurIPS PMC PubMed

Brief. The banner renders instantly (no animation — animation on startup is a delay). The database line confirms which sources are enabled and shows if DBLP is running in offline mode. Then it transitions to the Queue screen within ~1 second as PDFs start parsing.

If the banner won’t fit (terminal < 70 columns), skip it and go straight to the Queue screen.

The “50 papers at 2am” workflow

This is the scenario that matters most. An area chair has a deadline. They run:

$ hallucinator ~/openreview-downloads/*.pdf

First 10 seconds: Queue screen populates with 50 filenames. First paper starts processing. Activity panel shows databases warming up.

Next few minutes: Papers process. The user watches for a bit, sees the system is working, then does something else in another terminal tab.

They come back: 35 papers done. They press s to sort by problems. The 3 papers with not-found references float to the top. They press Enter on the worst one, see 4 not-found references, drill into each to see what the tool searched for. Two look like genuine hallucinations (zero matches across all 9 databases). Two look like very recent preprints that just aren’t indexed yet (only in arXiv, which timed out).

They press Esc back to Queue, check the next problem paper. After 5 minutes of triage they have a clear picture: 2 submissions with probable fabricated references, 1 with a retracted citation the authors should have caught.

They press e, export a JSON report of all results, and attach it to their AC notes.

What mattered: Sort by problems. Fast drill-in/drill-out. Export. Not the sparklines or the database race visualization — those were nice for the first 30 seconds but the actual utility is in triage speed.

Non-goals for TUI

PDF viewing. Don’t try to render the paper. Users have their own PDF viewer open alongside.
Editing results. The TUI is read-only for results. No “mark as false positive” or annotation features. That’s a different tool.
Log viewer. The activity panel is not a log. Don’t show every HTTP request. Show state (database health, rate limits, throughput) not events.

Visual mockups: states and scenarios

Queue screen — mid-run, sorted by problems

The area chair has been running for a few minutes and just hit s to sort by descending problem count.

 HALLUCINATOR                                          ████████░░ 41/50
─────────────────────────────────────────────────────────────────────────
 #   Paper                          Refs  ✓   ⚠   ✗   ☠   Status
─────────────────────────────────────────────────────────────────────────
  7  sketchy_submission_v3.pdf        22  14   1   4   1   DONE
 31  workshop_paper_draft.pdf         18  11   2   3   0   DONE
 12  llm_alignment_study.pdf          35  28   3   2   0   DONE
  1  arxiv_2024_llm_survey.pdf        38  34   2   1   1   DONE
 19  multiagent_reasoning.pdf         29  27   1   1   0   DONE
  3  safety_evaluation.pdf            41  39   2   0   0   DONE
  5  federated_learning.pdf           33  33   0   0   0   DONE
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
 42  code_gen_benchmark.pdf           27  18   —   —   —   █████░ 18/27
 43  vision_transformer.pdf           19   —   —   —   —   PARSING
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
 44  diffusion_models.pdf              —   —   —   —   —   QUEUED
 45  robustness_theory.pdf             —   —   —   —   —   QUEUED
 ...8 more
─────────────────────────────────────────────────────────────────────────
 ✓ 412 verified  ⚠ 18 mismatch  ✗ 11 not found  ☠ 2 retracted  12:34

 sorted: problems ↓          [enter] open  [s] sort  [f] filter  [q] quit

Note the three visual zones separated by dashed rules: done (sorted), active (running now), and queued. The user’s eye goes straight to the top — paper #7 with 4 not-found and 1 retracted is the one to investigate.

Queue screen — narrow terminal (80 columns)

Same data, collapsed for a small terminal:

 HALLUCINATOR                        ████████░░ 41/50
────────────────────────────────────────────────────────
  #  Paper                     Probs  Status
────────────────────────────────────────────────────────
   7 sketchy_submission_v3…     5     DONE
  31 workshop_paper_draft…      3     DONE
  12 llm_alignment_study…       2     DONE
   1 arxiv_2024_llm_surv…       2     DONE
  19 multiagent_reasoning…      1     DONE
   3 safety_evaluation…         0     DONE
   5 federated_learning…        0     DONE
  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
  42 code_gen_benchmark…        —     █████░ 18/27
  43 vision_transformer…        —     PARSING
  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
  44 diffusion_models…          —     QUEUED
  ...
────────────────────────────────────────────────────────
 [enter] open  [s] sort  [f] filter  [q] quit

Individual verdict columns collapse into a single “Probs” count (✗ + ☠ + ⚠). Still scannable. The user loses per-type breakdown at a glance but gains it back by drilling in.

Queue screen — wide terminal with activity panel (180+ columns)

 HALLUCINATOR                                          ████████░░ 41/50         │ ACTIVITY
──────────────────────────────────────────────────────────────────────────────────┤────────────────────────────────────
 #   Paper                              Refs  ✓   ⚠   ✗   ☠   Time   Status   │ Database Health
──────────────────────────────────────────────────────────────────────────────────│  CrossRef    ●  142ms
  7  sketchy_submission_v3.pdf            22  14   1   4   1   0:48   DONE     │  arXiv       ●   89ms
 31  workshop_paper_draft.pdf             18  11   2   3   0   0:35   DONE     │  DBLP        ●   12ms  offline
 12  llm_alignment_study.pdf              35  28   3   2   0   1:12   DONE     │  Sem.Scholar ◐  1.2s   throttled
  1  arxiv_2024_llm_survey.pdf            38  34   2   1   1   1:31   DONE     │  OpenAlex    ●  203ms
 19  multiagent_reasoning.pdf             29  27   1   1   0   0:55   DONE     │  ACL         ●   67ms
  3  safety_evaluation.pdf                41  39   2   0   0   1:44   DONE     │  NeurIPS     ●   94ms
  5  federated_learning.pdf               33  33   0   0   0   1:22   DONE     │  Europe PMC  ●  312ms
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│  PubMed      ○  down
 42  code_gen_benchmark.pdf               27  18   —   —   —   0:22   ████░    │
 43  vision_transformer.pdf               19   —   —   —   —    —     PARSING  │ Rate Limits
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│  CrossRef  ░░░▓▓▓░░░░  18/50
 44  diffusion_models.pdf                  —   —   —   —   —    —     QUEUED   │  S2        ░▓▓▓▓▓▓▓░░  71/100
 45  robustness_theory.pdf                 —   —   —   —   —    —     QUEUED   │
 ...8 more                                                                      │ Throughput (refs/min)
──────────────────────────────────────────────────────────────────────────────────│  ▁▂▃▅▇█▇▅▃▁▂▅▇█▆▅  avg: 8.2
 ✓ 412  ⚠ 18  ✗ 11  ☠ 2                                           12:34       │
 sorted: problems ↓    [enter] open  [s] sort  [f] filter  [Tab] panel  [q] quit│

The activity panel earns its space here. You can see S2 is almost at its rate limit (71/100) — that’s why it shows as “throttled.” PubMed is straight up down. The throughput sparkline shows a dip about 2 minutes ago (probably when S2 started throttling) and a recovery.

Paper screen — actively checking, with activity panel

Drilled into paper #42 which is still running:

 HALLUCINATOR > code_gen_benchmark.pdf                   █████░░░░ 18/27        │ ACTIVITY
──────────────────────────────────────────────────────────────────────────────────┤──────────────────────────────
 #   Reference                                     Verdict      Source          │ Database Health
──────────────────────────────────────────────────────────────────────────────────│  CrossRef    ●  142ms
  1  Chen et al. "Evaluating Large Language..."    ✓ verified   arXiv           │  arXiv       ●   89ms
  2  Austin et al. "Program Synthesis with..."     ✓ verified   S2              │  DBLP        ●   12ms  offline
  3  Li et al. "Competition-Level Code..."         ✓ verified   CrossRef        │  Sem.Scholar ◐  1.2s
  4  Hendrycks et al. "Measuring Coding..."        ✓ verified   CrossRef        │
  5  Nijkamp et al. "CodeGen: An Open..."          ✓ verified   S2              │ Active Now
 ...                                                                            │  [19] → CrossRef  ⟳
 17  Fried et al. "InCoder: A Generative..."       ✓ verified   DBLP            │  [19] → arXiv     ⟳
 18  Allal et al. "SantaCoder: Don't..."           ✓ verified   S2              │  [19] → DBLP      ✓ 12ms
────────────────────────────────────────────────────────────────────────────────  │  [19] → S2        ⟳
 19  Wang et al. "Execution-Based Code..."         ⟳ checking   3/9             │  [20] → CrossRef  ⟳
 20  Fake et al. "An Invented Paper..."            ⟳ checking   1/9             │  [20] → arXiv     waiting
 21  Zhang et al. "RepoCoder: Repository..."       ⟳ checking   0/9             │  [21] → queued
 22  Liu et al. "Is Your Code Generated..."        ⏳ pending                    │  [22] → queued
 ...                                                                            │
 27  Peng et al. "The Impact of AI on..."          ⏳ pending                    │
──────────────────────────────────────────────────────────────────────────────────│
 ✓ 18 verified  so far                                                          │
 [enter] details  [r] retry  [esc] back  [s] sort                              │

The activity panel here shows per-query granularity: reference [19] has 3 databases done (DBLP already returned a match at 12ms but it’s still waiting on others — or maybe early exit will kick in and cancel the remaining). Reference [20] just started. [21] and [22] are queued waiting for a slot in the concurrency pool.

This is the “database race” made visible. You’re watching 4 references being checked concurrently, each with up to 9 databases racing. When a database returns a match, the whole reference can resolve instantly via early exit.

Paper screen — done, filtered to problems only

After analysis completes, the reviewer presses f to filter to problems:

 HALLUCINATOR > sketchy_submission_v3.pdf                  DONE ✓14 ⚠1 ✗4 ☠1
─────────────────────────────────────────────────────────────────────────
 #   Reference                                     Verdict      Source
─────────────────────────────────────────────────────────────────────────
  3  Zhang & Li "Self-Aware Neural..."             ✗ not found   —
  8  Johnson et al. "Recursive Prompt..."          ✗ not found   —
 11  Chen "Advanced Reasoning in..."               ✗ not found   —
 15  Park et al. "Constitutional Self-..."         ✗ not found   —
 19  Davis et al. "On the Emergent..."             ☠ retracted  CrossRef
  6  Williams et al. "Scaling Laws for..."         ⚠ mismatch   DBLP
─────────────────────────────────────────────────────────────────────────
 showing: problems only (6/22)

 [enter] details  [f] show all  [r] retry  [e] export  [esc] back

Six references instead of 22. The reviewer only needs to look at these. Four not-found references in a 22-reference paper is a strong signal.

Reference detail — close match found but authors wrong

 HALLUCINATOR > sketchy_submission_v3.pdf > [6]
─────────────────────────────────────────────────────────────────────────

 REFERENCE [6]
 Williams, R., Thompson, K., and Garcia, M. (2023)
 "Scaling Laws for Neural Language Models"
 In Proceedings of NeurIPS 2023

 Verdict: ⚠ AUTHOR MISMATCH

 Extracted title:   "Scaling Laws for Neural Language Models"
 Extracted authors:  R. Williams, K. Thompson, M. Garcia

 BEST MATCH (CrossRef)
─────────────────────────────────────────────────────────────────────────
  Matched title:    "Scaling Laws for Neural Language Models"
  Title score:       100.0%
  Found authors:     Jared Kaplan, Sam McCandlish, Tom Henighan,
                     Tom B. Brown, Benjamin Chess, Rewon Child,
                     Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
  Author overlap:    0/3 — no matching authors

  DOI:               10.48550/arXiv.2001.08361
  Source:            CrossRef (0.8s)
─────────────────────────────────────────────────────────────────────────

 This paper exists but the cited authors (Williams, Thompson, Garcia)
 don't match the actual authors (Kaplan, McCandlish, et al.).

 ALL DATABASE RESULTS
─────────────────────────────────────────────────────────────────────────
  CrossRef       ⚠ author mismatch  0.8s   (match shown above)
  arXiv          ⚠ author mismatch  0.4s
  DBLP           ⚠ author mismatch  0.1s   (offline)
  Sem. Scholar   timeout            10.0s
  OpenAlex       ⚠ author mismatch  1.2s
  ACL            no match           0.3s
  NeurIPS        no match           0.5s
  Europe PMC     no match           1.1s
  PubMed         no match           0.7s
─────────────────────────────────────────────────────────────────────────

 [r] retry  [y] copy ref text  [esc] back

This is a telltale sign: the paper “Scaling Laws for Neural Language Models” is real (Kaplan et al., 2020), but the submission attributed it to completely fabricated authors. Four databases independently confirmed the title exists with different authors. The detail screen makes this case unambiguous.

Reference detail — retracted paper

 HALLUCINATOR > sketchy_submission_v3.pdf > [19]
─────────────────────────────────────────────────────────────────────────

 REFERENCE [19]
 Davis, P., Reeves, L., and Kang, S. (2021)
 "On the Emergent Properties of Transformer Architectures
  in Low-Resource Settings"
 Journal of Machine Learning Research, 22(1), pp. 1-34

 Verdict: ☠ RETRACTED

 Extracted title:   "On the Emergent Properties of Transformer
                     Architectures in Low-Resource Settings"
 Extracted authors:  P. Davis, L. Reeves, S. Kang

 MATCH (CrossRef)
─────────────────────────────────────────────────────────────────────────
  Matched title:    "On the Emergent Properties of Transformer
                     Architectures in Low-Resource Settings"
  Title score:       100.0%
  Found authors:     P. Davis, L. Reeves, S. Kang
  Author overlap:    3/3 ✓

  DOI:               10.xxxx/jmlr.2021.xxxxx
  Source:            CrossRef (1.1s)

  ╔══════════════════════════════════════════════════════════════════╗
  ║  ☠ RETRACTION NOTICE                                           ║
  ║                                                                 ║
  ║  This paper was retracted on 2022-03-15.                       ║
  ║  Retraction DOI: 10.xxxx/jmlr.2022.retract.xxxxx              ║
  ║  Reason: "Results could not be reproduced; data fabrication     ║
  ║  suspected."                                                    ║
  ╚══════════════════════════════════════════════════════════════════╝
─────────────────────────────────────────────────────────────────────────

 [r] retry  [y] copy ref text  [esc] back

The retraction notice gets a heavy box border — it’s the most important piece of information on this screen and should be impossible to miss.

Single-paper mode — just started

When invoked with a single PDF:

 HALLUCINATOR
─────────────────────────────────────────────────────────────────────────
 arxiv_2024_llm_survey.pdf                         38 references found
─────────────────────────────────────────────────────────────────────────
  1  Vaswani et al. "Attention Is All You Need"    ✓ verified   arXiv
  2  Brown et al. "Language Models are Few-..."    ✓ verified   S2
  3  Wei et al. "Chain-of-Thought Prompting..."    ✓ verified   CrossRef
  4  Bubeck et al. "Sparks of Artificial..."       ✓ verified   S2
  5  Touvron et al. "LLaMA: Open and..."           ⟳ checking   4/9
  6  Chowdhery et al. "PaLM: Scaling..."           ⟳ checking   2/9
  7  Hoffmann et al. "Training Compute-..."        ⟳ checking   0/9
  8  Ouyang et al. "Training language..."          ⟳ checking   0/9
  9  Bai et al. "Constitutional AI:..."            ⏳ pending
 10  Raffel et al. "Exploring the Limits..."       ⏳ pending
 ...
 38  Kojima et al. "Large Language Models..."      ⏳ pending
─────────────────────────────────────────────────────────────────────────
 ████░░░░░░ 4/38    ✓ 4 verified                            0:12

 [enter] details  [Tab] activity  [q] quit

No queue screen — it drops you directly into the paper view. The progress bar and running counts update live. This feels immediate and purposeful. When it finishes, the status line updates and you can browse results or export.

Pressing e on any screen:

         ┌─ Export ──────────────────────────────┐
         │                                       │
         │  Format:  [JSON]  CSV  Markdown  Text │
         │  Scope:   This paper / All papers     │
         │  Output:  ~/hallucinator-results.json │
         │                                       │
         │          [Export]    [Cancel]          │
         └───────────────────────────────────────┘

Minimal modal. Arrow keys or tab to move between options. Enter to confirm. Esc to cancel. The output path has a sensible default and is editable.

Help overlay

Pressing ? on any screen:

 ┌─ Keybindings ─────────────────────────────────────────────────┐
 │                                                               │
 │  Navigation                     Actions                      │
 │  ↑↓ j/k    move cursor          r    retry reference/paper   │
 │  Enter      drill in             R    retry all failed        │
 │  Esc        back                 e    export results          │
 │  g/G        top/bottom           s    cycle sort mode         │
 │  Ctrl+D/U   page down/up         f    cycle filter            │
 │  /          search/filter        a    add files (queue)       │
 │  n/N        next/prev match      d    remove paper (queue)    │
 │                                  y    copy to clipboard       │
 │  Global                                                       │
 │  Tab        toggle activity      ?    this help               │
 │  ,          config               Ctrl+C  cancel/force quit    │
 │  q          quit                                              │
 │                                                               │
 │                                             [?/Esc] close     │
 └───────────────────────────────────────────────────────────────┘

Semi-transparent overlay on top of whatever screen is active. The underlying screen is still visible (dimmed) so you maintain spatial context.

Decisions (resolved)

1. Notification on completion

Terminal bell by default. Works everywhere, zero config. Desktop notification via notify-send / platform equivalent available as opt-in flag (--notify). Don’t overthink this.

2. Results persistence

Two distinct mechanisms:

Temp state (invisible infrastructure). In-progress and completed results write to ~/.cache/hallucinator/runs/<timestamp>/. This is crash safety — if the terminal dies, SSH drops, or the user hits Ctrl+C, the work isn’t lost. The TUI doesn’t expose this to the user. It just exists.

Export (deliberate user action). e key opens the export modal. User picks format (JSON, CSV, Markdown, plain text), scope (one paper or all), and destination. This produces the actual deliverable — the report they attach to AC notes or share with co-reviewers.

Resume (future). Not in v1. Eventually: hallucinator --resume reads from the temp state dir and picks up where it left off. The temp state format should be designed with this in mind even if we don’t build the resume path yet — don’t paint ourselves into a corner.

3. Reference text preview pane

Yes. Shown on the Paper screen (Screen 2) when terminal height >= 40 rows. Located below the reference list, separated by a horizontal rule. Shows the raw reference text as extracted from the PDF for the currently-selected reference.

Updates in real-time as the cursor moves through the reference list (file-manager-style preview). This is the expected behavior and the rendering cost is trivial — it’s just text reflow.

On terminals shorter than 40 rows, the preview is hidden. The user can still see the raw text by drilling into Screen 3.

 ...
  4  Smith & Jones "Recursive Self-Imp..."  ✗ not found   —
> 5  Zhang et al. "Emergent Abilities..."   ⚠ mismatch   DBLP
  6  Chen et al. "Evaluating Large..."      ✓ verified   arXiv
 ...
─────────────────────────────────────────────────────────────────
 [5] Zhang, W., Wei, J., and Chen, L., "Emergent Abilities of
 Large Language Models," in Proceedings of the International
 Conference on Machine Learning (ICML), 2023, pp. 4812-4830.
─────────────────────────────────────────────────────────────────

This earns its space. When the extracted title looks wrong (mangled by hyphenation, ligature issues, or a bad parse), you see it instantly without an extra keypress.

4. Color themes

Two themes, toggled via --theme=green or --theme=modern. No theming framework, no theme.toml. Just two palette structs.

green (default): Dark background, green/cyan primary text. Terminal hacker aesthetic. The one that makes people at poster sessions say “what is that.” Verdict colors as specified in the Color section above.

modern: Dark background, white primary text, electric blue accents. Cleaner, more subdued. For people who think the green is too much, or for screenshots in formal reports where neon green looks unserious.

Both palettes follow the same rules: verdict colors stay semantically consistent (green=verified, red=not found, etc.), only the chrome and accent colors differ.

5. Inline retry feedback

Both spinner and text. The verdict cell shows an animated spinner character cycling through frames (◜ ◝ ◞ ◟) followed by static “retrying” text:

  4  Smith & Jones "Recursive Self-..."    ◝ retrying    —

The spinner provides motion (“something is happening”) while the text provides meaning (“what is happening”). Consistent with how the ⟳ checking state already works during initial analysis — just a different animation to distinguish retry from first pass.

When the retry completes, the cell snaps to the new verdict. No transition animation — just the immediate update. The change in color (from cyan retrying to green/red/yellow result) is transition enough.

6. Phase 4 decisions

File picker screen. When launched with no PDF arguments, the banner dismisses into an interactive file picker instead of an empty queue. The file picker is a custom implementation using std::fs::read_dir — no external dependency (ratatui-explorer was considered but rejected to keep the dependency tree small). Directories are navigable, .pdf files are togglable with Space, Enter confirms selection and returns to the queue. The a key on the queue screen reopens the picker to add more files.

Manual start. PDFs load into the queue in Queued state but do not begin processing automatically. The user reviews the queue, adjusts configuration, then presses Space to start. A prominent [Space] Start indicator in the footer makes this discoverable. Processing is deferred via a BackendCommand channel — the backend listener receives a ProcessFiles command containing the file list, starting index, and a Config struct rebuilt from the current ConfigState. This means config edits made before pressing Space take effect.

Concurrency configurable from config screen. The config screen (, key) is now fully interactive. Tab cycles between sections (API Keys, Databases, Concurrency, Display). j/k navigates items within a section. Enter opens a text-editing mode for numeric and string fields. Space toggles database checkboxes. Cursor is clamped to valid bounds per section. Config values are populated from CLI flags, environment variables, and defaults — not hardcoded.

Activity panel shown by default. The activity panel (right sidebar) is now visible on launch instead of hidden. It shows database health with query counts and average response times, a throughput sparkline with refs/sec rate, active query list (which references are currently being checked), and a summary of total completed references. Throughput data is fed by a tick-based bucketing system (every ~1 second).

Mouse click support. Clicking a row in the queue or paper table selects it. The rendered table area is stored after each draw, and click coordinates are mapped to table rows accounting for borders and headers. Double-click (same row within 500ms) drills in.

False positive marking remains a non-goal. Per the original design (section “Explicit non-goals”), false-positive toggling is deferred. The TUI is an analysis tool, not an annotation tool.

Keyboard shortcuts

Hallucinator Documentation