hallx — Hallucination Risk Scoring for Production LLM Pipelines
LLMs don't fail loudly. When a model hallucinates, it returns a confident, well-formed response, with nothing in the output signalling that something went wrong. Existing eval frameworks (RAGAS, TruLens) are offline, batch-oriented tools built for pre-deployment testing, not for checking individual responses at runtime. Using a second LLM as a judge doubles latency and cost, and still relies on one model to verify another. hallx sits between the model and your downstream system and scores each response inline using three heuristic signals: no ground-truth labels, no secondary model calls.
How it scores:
- Schema — validates structure and flags null-injection, where the model fills a required field with nothing meaningful
- Consistency — re-runs generation 2–4 times and measures drift; an uncertain model doesn't produce stable outputs
- Grounding — checks whether response claims have any textual anchor in the provided context documents
These combine into a confidence score (0.0–1.0) and a `risk_level` of `high`, `medium`, or `low`. Skipped checks are penalised; partial analysis doesn't pass silently.
What you get back:
- A `proceed` or `retry` action with a suggested temperature and prompt-improvement hints
- A `HallxHighRiskError` in strict mode, for hard blocking on sensitive paths
- An `issues` list for traceability and auditing
Other things it includes:
- SQLite-backed feedback store to record reviewed outcomes and generate calibration reports over time
- Safety profiles — `fast`, `balanced`, `strict` — controlling how many consistency runs are made and how harshly incomplete checks are penalised
- Adapters for OpenAI, Anthropic, Gemini, Ollama, HuggingFace, and more; sync and async both supported
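A plausible shape for the profile table, going by the two knobs the list above names. The run counts and penalty values below are invented for illustration and are not hallx's actual defaults:

```python
# Hypothetical profile table; names match the README, the numbers are guesses.
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    consistency_runs: int         # extra generations for the drift check
    skipped_check_penalty: float  # confidence deducted per skipped check

PROFILES = {
    "fast":     Profile(consistency_runs=2, skipped_check_penalty=0.05),
    "balanced": Profile(consistency_runs=3, skipped_check_penalty=0.10),
    "strict":   Profile(consistency_runs=4, skipped_check_penalty=0.20),
}

# strict trades latency (more re-runs) for rigour (harsher penalties).
print(PROFILES["strict"])
```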
Available on PyPI (`pip install hallx`), MIT licensed, pure Python.