
Refcheck

Refcheck verifies your bibliographic references against external academic databases to catch errors, find missing metadata, and ensure citation accuracy.

Overview

The refcheck command looks up your bibtex entries in external APIs and compares the metadata. For each entry, it:

  1. Searches the provider’s API using DOI or title
  2. Compares local metadata against the remote source
  3. Reports matches and mismatches at the field level
  4. Stores detailed results in the check_log frontmatter field
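
Conceptually, the per-entry flow looks something like the following TypeScript sketch (names and types here are illustrative, not the actual internals):

// Hypothetical sketch of the per-entry flow; names and types are illustrative.
interface Paper { title: string; year: number; doi?: string }
type Search = (q: { doi?: string; title?: string }) => Promise<Paper | null>;

async function refcheckEntry(entry: Paper, search: Search) {
  // 1. Search by DOI when present, otherwise by title
  const remote = await search(entry.doi ? { doi: entry.doi } : { title: entry.title });
  if (remote === null) return { status: "not_found" };

  // 2-3. Compare local metadata against the remote source, field by field
  // (the real comparison is normalized; see the Normalization section below)
  const fields = {
    title: {
      local: entry.title,
      remote: remote.title,
      match: entry.title.toLowerCase() === remote.title.toLowerCase(),
    },
    year: { local: entry.year, remote: remote.year, match: entry.year === remote.year },
  };
  const status = Object.values(fields).every((f) => f.match) ? "confirmed" : "mismatch";

  // 4. The real command persists this result into the check_log frontmatter field
  return { status, fields, remote };
}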

Refcheck Workflow (CLI + Clipper)

Refcheck runs in two places: the CLI and the browser clipper extension. A typical flow is:

  1. Clip a paper with the browser extension to create a bibtex_entry
  2. Pick the best source in the clipper (DBLP/Semantic Scholar by default; OpenAlex and Crossref optional)
  3. Validate with refcheck in the CLI or use the clipper’s validation mode
  4. Fix mismatches using values stored in check_log

If you’re using the clipper in download mode, move the saved markdown file into your vault before running refcheck. See extensions/clipper/README.md for installation and API mode setup.

Usage

# Check all bibtex entries (uses 'auto' provider by default)
bun run cli -- refcheck

# Check entries in a specific project
bun run cli -- refcheck shared-references

# Check a single file
bun run cli -- refcheck --file references/smith-2024.md

# Preview without updating files
bun run cli -- refcheck --dry-run

# Re-check entries that were already checked
bun run cli -- refcheck --force

# Use a specific provider
bun run cli -- refcheck --provider dblp
bun run cli -- refcheck --provider openalex

# Check only entries matching a path pattern
bun run cli -- refcheck --filter "references/2024/*"

# Limit the number of entries to check
bun run cli -- refcheck --limit 10

# Resume from a specific entry (if interrupted)
bun run cli -- refcheck --start-from smith2024

# Skip first N entries (alternative resume method)
bun run cli -- refcheck --skip 50

# List available providers
bun run cli -- refcheck --list-providers

Providers

CLI providers

The CLI refcheck command supports the four providers described below - dblp, s2, openalex, and crossref - plus auto (the default), which tries them in order until one succeeds (see Auto Mode Provider Selection).

Clipper extension sources

The browser clipper searches DBLP and Semantic Scholar by default, with optional OpenAlex and Crossref toggles. When multiple sources return results, the clipper auto-selects the best match based on completeness rather than a fixed order.

dblp (CLI + clipper)

DBLP is the computer science bibliography database, and is the best choice for computer science conference and journal papers.

DBLP provides high-quality BibTeX that can be used to correct your entries.

s2 (Semantic Scholar, CLI + clipper)

Semantic Scholar provides broad academic coverage with good abstracts, making it a strong fallback for papers outside DBLP’s computer science focus.

openalex (CLI + clipper)

OpenAlex is an open catalog of 200M+ scholarly works, useful for broad coverage across disciplines.

crossref (CLI + clipper)

Crossref is the official DOI metadata registry, best for DOI-based lookups of formally published works.

Refcheck Status

Each checked entry receives one of these statuses:

Status      Meaning
confirmed   All checked fields match the remote source
mismatch    One or more fields differ from the remote source
not_found   Paper was not found in the provider’s database
error       API error or processing error occurred
stale       Refcheck is older than 30 days (re-validate for freshness)
unchecked   Entry has never been validated
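
If you read check_log programmatically, these statuses form a small closed set. A TypeScript sketch (not the project’s actual type definitions):

// Sketch: the six refcheck statuses. "stale" and "unchecked" are derived
// at read time rather than written by the checker (see Stale Checks below).
type RefcheckStatus =
  | "confirmed"  // all checked fields match the remote source
  | "mismatch"   // one or more fields differ
  | "not_found"  // paper absent from the provider's database
  | "error"      // API or processing error
  | "stale"      // check_log is older than 30 days
  | "unchecked"; // entry has never been validated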

Mismatch Severity

When a mismatch is detected, refcheck also classifies its severity to help you prioritize review:

Severity   Meaning                                Examples
minor      Likely a false positive, probably OK   Venue: “NeurIPS 2023” vs “arXiv”; author initials differ; year off by 1
major      Needs human review                     Wrong authors (book review matched); DOI mismatch; title differs

Minor Mismatches (Likely OK)

These patterns are common and usually don’t indicate real errors:

  • Venue abbreviation vs full name (“NeurIPS” vs the full proceedings title)
  • Author middle initials, name order, or diacritics (“François” vs “Francois”)
  • Year off by one (online vs print publication date)
  • Trailing period on the title in the database

Major Mismatches (Need Review)

These patterns often indicate real issues:

  • Completely different authors (e.g., a book review matched instead of the book)
  • DOI mismatch
  • Title that differs beyond punctuation or casing
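
If you script against check_log, a rough classifier along these lines reproduces the split (illustrative only; the CLI’s actual rules may be more nuanced):

// Illustrative severity heuristic, not the CLI's exact logic.
interface FieldMismatch {
  field: "title" | "authors" | "year" | "venue" | "doi";
  yearDiff?: number;
  editDistance?: number;
}

function severity(m: FieldMismatch): "minor" | "major" {
  if (m.field === "venue") return "minor";                           // abbreviation vs full name
  if (m.field === "year" && (m.yearDiff ?? 99) <= 1) return "minor"; // online vs print date
  if ((m.editDistance ?? 99) <= 2) return "minor";                   // formatting noise
  return "major"; // DOI, author, or substantial title differences
}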

The CLI shows severity in its output:

⚠ [mismatch:minor] smith2023   # Yellow - probably OK
⚠ [mismatch:major] zuboff2019  # Red - needs review

The web app Review page has filters for “Mismatches (major)” and “Mismatches (minor)” to help you focus on entries that need attention.

Field Comparison

Refcheck compares these fields when available in both local and remote:

Field     Comparison Method
title     Normalized string match (ignores case, punctuation, diacritics)
authors   Count match + first/last name comparison for each author
year      Exact numeric match
venue     Normalized string match (journal/booktitle/conference)
doi       Normalized DOI match (strips https://doi.org/ prefix)

Normalization

String comparisons are normalized to smooth over minor differences:

  • Lowercasing
  • Stripping punctuation
  • Removing diacritics (“François” and “Francois” compare equal)

When strings don’t match, refcheck reports the edit distance (Levenshtein distance) so you can gauge how different they are.
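
As a sketch of the kind of normalization and distance measure involved (not the exact implementation):

// Sketch of string normalization: lowercase, strip diacritics and punctuation.
function normalize(s: string): string {
  return s
    .normalize("NFD")                 // decompose accented characters
    .replace(/\p{Diacritic}/gu, "")   // drop diacritic marks ("François" -> "Francois")
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, " ") // collapse punctuation/whitespace to single spaces
    .trim();
}

// DOI comparison strips the resolver prefix before comparing.
const normalizeDoi = (d: string) =>
  d.replace(/^https?:\/\/(dx\.)?doi\.org\//i, "").toLowerCase();

// Classic Levenshtein edit distance, reported when normalized strings still differ.
function editDistance(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => {
    const row = new Array<number>(b.length + 1).fill(0);
    row[0] = i;
    return row;
  });
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}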

The check_log Field

After refcheck runs, the entry’s frontmatter is updated with a check_log field containing:

check_log:
  checked_at: "2024-01-15T10:30:00.000Z"
  checked_with: dblp
  status: mismatch
  paper_id: conf/neurips/SmithJ23

  fields:
    title:
      local: "Deep Learing for NLP"
      remote: "Deep Learning for NLP"
      match: false
      edit_distance: 2

    authors:
      local_count: 2
      remote_count: 2
      count_match: true

    year:
      local: "2023"
      remote: "2024"
      match: false
      year_diff: 1

  remote:
    title: "Deep Learning for NLP"
    authors:
      - "John Smith"
      - "Jane A. Doe"
    year: 2024
    venue: "Advances in Neural Information Processing Systems"
    doi: "10.1234/example"

  external_bibtex:
    source: dblp
    bibtex: |
      @inproceedings{DBLP:conf/neurips/SmithJ23,
        author    = {John Smith and Jane A. Doe},
        title     = {Deep Learning for NLP},
        booktitle = {NeurIPS},
        year      = {2024}
      }
    fetched_at: "2024-01-15T10:30:00.000Z"

Using check_log to Fix Entries

The remote section contains the provider’s values, making it easy to copy the correct data:

# Before (with typo)
title: "Deep Learing for NLP"

# After (corrected from check_log.remote.title)
title: "Deep Learning for NLP"

The external_bibtex section (when available from DBLP) provides a complete BibTeX entry you can use as a reference.
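
To apply fixes in bulk, a small script over the frontmatter is enough. A hypothetical sketch using the gray-matter package (applyRemoteFixes is not part of Extenote; review diffs before committing):

// Hypothetical helper: copy check_log.remote values over the local fields.
import { readFileSync, writeFileSync } from "node:fs";
import matter from "gray-matter";

function applyRemoteFixes(path: string, fields: string[]): void {
  const file = matter(readFileSync(path, "utf8"));
  const remote = file.data.check_log?.remote;
  if (!remote) return;
  for (const f of fields) {
    if (remote[f] !== undefined) file.data[f] = remote[f]; // e.g. title, year, venue
  }
  writeFileSync(path, matter.stringify(file.content, file.data));
}

applyRemoteFixes("references/smith-2024.md", ["title", "year"]);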

Best Practices

  1. Run with --dry-run first to see what would be changed without modifying files

  2. Refcheck by project to focus on specific reference sets:

    bun run cli -- refcheck shared-references --dry-run
    
  3. Use --force sparingly - only re-check when you’ve made corrections or want fresh data

  4. Review mismatches carefully - not all differences are errors:

    • Venue names vary (abbreviations vs full names)
    • Author middle names may differ
    • Year might be publication vs online date
  5. Trust but verify - external APIs aren’t perfect:

    • DBLP focuses on CS, may not have other disciplines
    • OpenAlex has broader coverage but may have less accurate metadata

Troubleshooting

No bibtex entries found to refcheck

Entries must have type: bibtex_entry in their frontmatter to be checked.
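
A minimal entry that refcheck will pick up looks something like this (fields other than type are illustrative):

type: bibtex_entry
title: "Deep Learning for NLP"
year: 2024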

“Paper not found”

Refcheck uses DOI (if available) or title to search. If neither produces a match:

  • Double-check the title and DOI for typos
  • Try another provider with --provider
  • For books, technical reports, and websites, fall back to manual verification (see below)

Rate Limiting

Refcheck includes automatic rate limiting (250ms between API calls) to avoid overloading providers. For large batches, use --limit to refcheck in smaller groups, or use --skip/--start-from to resume interrupted runs.
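
Conceptually, the throttle is just a fixed delay between sequential calls. A sketch of the idea (not the actual implementation):

// Sketch: serialize lookups with a 250ms gap between API calls.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function throttled<T>(items: T[], lookup: (item: T) => Promise<void>): Promise<void> {
  for (const item of items) {
    await lookup(item); // one request at a time
    await sleep(250);   // 250ms between API calls
  }
}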

Re-checking and Staleness

How Re-checking Works

By default, the refcheck command skips entries that already have a check_log. This prevents unnecessary API calls and preserves your verification history.

To re-check entries:

# Re-check all entries, overwriting existing check_log
bun run cli -- refcheck --force

# Re-check a specific file
bun run cli -- refcheck --file references/smith-2024.md --force

When to Re-check

Re-check entries when:

  • You’ve corrected fields and want to confirm they now match
  • The existing check is stale (older than 30 days)
  • You want to compare against a different provider

Stale Checks

Checks older than 30 days are marked as stale. This doesn’t mean they’re wrong - external databases rarely change existing records. Staleness is a soft reminder that you might want fresh data.

# Example stale check_log
check_log:
  checked_at: "2024-10-15T10:30:00.000Z"  # More than 30 days ago
  status: stale  # Automatically computed, not stored
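
The staleness computation is simple date arithmetic (a sketch):

// Sketch: a check becomes "stale" 30 days after checked_at.
// The status is computed at read time; nothing is rewritten on disk.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function isStale(checkedAt: string, now = new Date()): boolean {
  return now.getTime() - new Date(checkedAt).getTime() > THIRTY_DAYS_MS;
}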

Auto Mode Provider Selection

The CLI auto provider (default) tries providers in order until one succeeds:

  1. DBLP - Tried first, best for computer science
  2. Crossref - DOI-based lookup, good for published works
  3. Semantic Scholar - Broad academic coverage
  4. OpenAlex - Broader coverage across fields

The clipper evaluates DBLP and Semantic Scholar (plus optional OpenAlex/Crossref) and picks the most complete result rather than following a fixed order.

“Succeeds” means the provider found a matching paper - even if fields mismatch. The CLI stops at the first provider that returns a result, so:

  • A DBLP match means the later providers are never consulted
  • To compare an entry against a specific database, pass --provider explicitly
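
In pseudocode, auto mode is a simple fallback chain (a sketch; the Provider shape is illustrative):

// Sketch of auto mode: stop at the first provider that finds the paper.
interface Provider {
  name: "dblp" | "crossref" | "s2" | "openalex";
  find(q: { doi?: string; title?: string }): Promise<object | null>;
}

async function autoLookup(providers: Provider[], q: { doi?: string; title?: string }) {
  for (const p of providers) { // tried in the order listed above
    const result = await p.find(q);
    // a found paper counts as success, even if its fields later mismatch
    if (result !== null) return { checkedWith: `auto:${p.name}`, result };
  }
  return null; // not found in any provider
}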

The checked_with field records which provider was used:

check_log:
  checked_with: "auto:dblp"      # Auto mode, matched in DBLP
  checked_with: "auto:crossref"  # Auto mode, matched in Crossref
  checked_with: "auto:s2"        # Auto mode, matched in Semantic Scholar
  checked_with: "auto:openalex"  # Auto mode, matched in OpenAlex
  checked_with: "dblp"           # Explicit provider selection

Viewing Refcheck Status

In the Web App

On each reference’s detail page, you’ll see a verification badge showing:

Badge                              Meaning
✓ Confirmed                        All checked fields match the external database
⚠ Needs review - N fields differ   Some fields differ (see list below)
? Not found                        Paper wasn’t in the database
✗ Refcheck failed                  API error occurred
↻ Stale                            Refcheck is older than 30 days
○ Not yet checked                  Entry hasn’t been verified

For mismatches, the badge shows which fields differ (title, authors, year, venue, doi). Scroll down to compare the “Local Entry” and “From [source]” BibTeX side-by-side.

In the Markdown Files

The full details are stored in the check_log frontmatter field:

check_log:
  checked_at: "2024-01-15T10:30:00.000Z"
  checked_with: "auto:dblp"
  status: mismatch
  paper_id: conf/neurips/SmithJ23
  fields:
    title:
      local: "Deep Learing for NLP"
      remote: "Deep Learning for NLP"
      match: false
      edit_distance: 2
    year:
      local: "2023"
      remote: "2024"
      match: false
  remote:
    title: "Deep Learning for NLP"
    authors: ["John Smith", "Jane Doe"]
    year: 2024

Interpreting Mismatches

Not all mismatches are errors. Common benign differences:

Field     Common Reason
venue     Abbreviation vs full name (“NeurIPS” vs “Advances in Neural Information Processing Systems”)
authors   Middle initials, name order, diacritics (“François” vs “Francois”)
year      Online publication date vs print date
title     Trailing period in database (“Title.” vs “Title”)

The edit_distance helps gauge severity - distance of 1-2 usually indicates minor formatting differences.

Manual Verification

For entries that can’t be automatically verified (books, technical reports, websites), add a manually_verified field:

manually_verified:
  verified_at: "2024-12-28T00:00:00.000Z"
  verified_by: human
  notes: "Verified against publisher website"
canonical_source:
  url: "https://example.com/book"
  title: "The Book Title"
  accessed_at: "2024-12-28"

This will display a “Manually verified” badge in the web app.