Methodology

No black boxes. Every step is auditable.

JurisLens exists to make claims about judicial bias defensible in court and in print. That means every number in every report can be traced back to a specific judgment, a specific model version, and a specific statistical test.

  1. Ingestion from CENDOJ

    We query the official CENDOJ search interface using the same filters a researcher would: órgano judicial (court), ponente (reporting judge), materia (subject matter), and date range. Each matching judgment is downloaded as a PDF and stored with its official ROJ identifier. We cache by ROJ ID, so the same judgment is never fetched twice, no matter which user requested it.

  2. Normalization & extraction

    PDFs are parsed into structured text. We extract metadata (court, date, ponente, parties, charges, ruling) using deterministic parsers tuned to the standard template of Spanish judgments, and fall back to LLM extraction only when a document deviates from that template. Every extracted field is shown beside the source PDF so you can verify it.

  3. NLP classification

    We use a constrained LLM pipeline to label each judgment along the dimensions you select: defendant/plaintiff gender, age bracket, nationality references, framing of testimony, mitigating vs. aggravating language, and outcome severity. Prompts and model versions are pinned per study so re-runs are reproducible.

  4. Statistical inference

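    We compute group means, medians, and dispersion for the outcome you chose (e.g. sentence length in months, acquittal rate, damages awarded). We then run the test that fits the outcome type: Welch's t-test for continuous outcomes, Mann–Whitney U when the distribution is skewed or ordinal, and chi-squared for categorical outcomes such as acquittal. Each p-value is reported alongside an effect size (Cohen's d for continuous outcomes, odds ratio for binary ones). Multiple-comparison correction (Benjamini–Hochberg) is applied automatically.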

  5. Outlier review & caveats

    Statistical signals don't equal bias. The report surfaces the top outlier cases driving each finding so a human can read them. We also automatically flag potential confounders and limitations (small sample size, charge mix, time trends) and downgrade conclusions accordingly. If n is too small for inference, we say so.

  6. Reproducible report

    The output is a versioned PDF plus a permanent URL. It includes every input parameter, the exact CENDOJ query, the list of ROJ IDs analyzed, all model versions, raw CSVs, and the statistical code. Anyone with the link can re-run the study and confirm or challenge the result.
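
To make the idea of a reproducible report concrete, here is a minimal sketch of a study manifest in Python. It is an illustration, not the JurisLens implementation: the class `StudyManifest`, its field names, and the SHA-256 fingerprint are all assumptions made for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StudyManifest:
    """Hypothetical record of everything needed to re-run a study."""

    cendoj_query: dict      # filters sent to the search interface
    roj_ids: list           # identifiers of the judgments analyzed
    model_versions: dict    # e.g. {"classifier": "<model id>"}
    package_versions: dict  # e.g. {"scipy": "<version>"}
    parameters: dict = field(default_factory=dict)

    def canonical_json(self) -> str:
        # Sort keys and ROJ IDs so the same study always serializes
        # to the same bytes, whatever order the inputs arrived in.
        data = asdict(self)
        data["roj_ids"] = sorted(data["roj_ids"])
        return json.dumps(data, sort_keys=True, ensure_ascii=False,
                          separators=(",", ":"))

    def fingerprint(self) -> str:
        # A short, comparable identifier for "the exact same study".
        return hashlib.sha256(self.canonical_json().encode("utf-8")).hexdigest()
```

Two runs that share a fingerprint used the same query, the same judgments, and the same pinned versions. Any change to one of those inputs yields a different fingerprint, which is what lets a reader check that a re-run is comparing like with like.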

Data sources & licensing

CENDOJ (Consejo General del Poder Judicial)

Public Spanish case law repository. Judgments are public records; JurisLens accesses them at human-rate intervals, respects robots.txt, and caches centrally so users do not re-scrape. We never republish judgment PDFs; we link back to the official CENDOJ URL instead.
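
The access pattern described here (check robots.txt, pace requests, cache by ROJ ID) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the class `PoliteFetcher`, the 5-second interval, the `JurisLensBot` user agent, and the in-memory dictionary standing in for a central cache are all hypothetical.

```python
import time
from urllib import robotparser


class PoliteFetcher:
    """Hypothetical fetcher: robots.txt check, rate limit, cache by ROJ ID."""

    def __init__(self, robots_lines, download, min_interval_s=5.0,
                 user_agent="JurisLensBot",
                 clock=time.monotonic, sleep=time.sleep):
        self._robots = robotparser.RobotFileParser()
        self._robots.parse(robots_lines)   # lines of an already-fetched robots.txt
        self._download = download          # callable: url -> bytes
        self._min_interval_s = min_interval_s  # assumed value, not from the source
        self._user_agent = user_agent
        self._clock = clock
        self._sleep = sleep
        self._cache = {}                   # stand-in for a shared, persistent cache
        self._last_request = None

    def fetch(self, roj_id, url):
        # 1. A cached judgment is never requested again.
        if roj_id in self._cache:
            return self._cache[roj_id]
        # 2. Refuse URLs that robots.txt disallows.
        if not self._robots.can_fetch(self._user_agent, url):
            raise PermissionError(f"robots.txt disallows {url}")
        # 3. Wait until the minimum interval since the last request has passed.
        if self._last_request is not None:
            wait = self._min_interval_s - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        pdf = self._download(url)
        self._cache[roj_id] = pdf
        return pdf
```

Passing `download`, `clock`, and `sleep` in as parameters keeps the sketch testable without network access. A real deployment would replace the dictionary with storage shared across users.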

Personal data & anonymization

CENDOJ already anonymizes parties. JurisLens does not re-identify, cross-reference with other databases, or surface names of private individuals. Only public officials acting in their public capacity (judges, prosecutors) are named.

Statistical packages

SciPy for inference, statsmodels for multiple-comparison correction. Versions pinned per study. All test choices are documented in the report.
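
For readers who want to see what these quantities are, the following Python sketch computes Cohen's d, Welch's t statistic with its degrees of freedom, and Benjamini–Hochberg adjusted p-values using only the standard library. It is an illustration of the formulas, not the code shipped in reports; the function names are assumptions. In practice the equivalent library calls would be `scipy.stats.ttest_ind(a, b, equal_var=False)` for Welch's test and `statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh")` for the correction.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    The p-value needs the t distribution, which this sketch leaves to SciPy.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A finding is then reported as significant only if its adjusted p-value falls below the chosen threshold, and the effect size says how large the difference is, independent of sample size.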

Models

Classification uses Gemini and GPT models via the Lovable AI Gateway. The exact model ID and prompt hash for each label are recorded in the report appendix.
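
As an illustration of what recording a model ID and prompt hash per label could look like, here is a minimal Python sketch. The names `prompt_hash`, `LabelRecord`, `ALLOWED_VALUES`, and `make_record` are hypothetical, as are the closed label set and the choice of SHA-256; the source does not specify how the hash is computed.

```python
import hashlib
from dataclasses import dataclass


def prompt_hash(prompt_template: str) -> str:
    """SHA-256 of the exact prompt text (assumed hashing scheme)."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class LabelRecord:
    """Hypothetical appendix row: one label with its provenance."""

    roj_id: str
    dimension: str
    value: str
    model_id: str
    prompt_sha256: str


# Assumed closed label set; the real dimensions and values may differ.
ALLOWED_VALUES = {
    "outcome_severity": {"low", "medium", "high"},
}


def make_record(roj_id, dimension, value, model_id, prompt_template):
    # Reject any model output that falls outside the allowed values.
    if value not in ALLOWED_VALUES.get(dimension, set()):
        raise ValueError(f"{value!r} is not a valid label for {dimension!r}")
    return LabelRecord(roj_id, dimension, value, model_id,
                       prompt_hash(prompt_template))
```

Because the hash changes whenever the prompt text changes, two labels with the same model ID and the same hash were produced under the same instructions.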

What JurisLens is not