Refcheck
Refcheck verifies your bibliographic references against external academic databases to catch errors, find missing metadata, and ensure citation accuracy.
Overview
The refcheck command looks up your bibtex entries in external APIs and compares the metadata. For each entry, it:
- Searches the provider’s API using DOI or title
- Compares local metadata against the remote source
- Reports matches and mismatches at the field level
- Stores detailed results in the `check_log` frontmatter field
Refcheck Workflow (CLI + Clipper)
Refcheck runs in two places: the CLI and the browser clipper extension. A typical flow is:
- Clip a paper with the browser extension to create a `bibtex_entry`
- Pick the best source in the clipper (DBLP/Semantic Scholar by default; OpenAlex and Crossref optional)
- Validate with `refcheck` in the CLI or use the clipper’s validation mode
- Fix mismatches using values stored in `check_log`
If you’re using the clipper in download mode, move the saved markdown file into your vault before running refcheck. See extensions/clipper/README.md for installation and API mode setup.
Usage
# Check all bibtex entries (uses 'auto' provider by default)
bun run cli -- refcheck
# Check entries in a specific project
bun run cli -- refcheck shared-references
# Check a single file
bun run cli -- refcheck --file references/smith-2024.md
# Preview without updating files
bun run cli -- refcheck --dry-run
# Re-check entries that were already checked
bun run cli -- refcheck --force
# Use a specific provider
bun run cli -- refcheck --provider dblp
bun run cli -- refcheck --provider openalex
# Check only entries matching a path pattern
bun run cli -- refcheck --filter "references/2024/*"
# Limit the number of entries to check
bun run cli -- refcheck --limit 10
# Resume from a specific entry (if interrupted)
bun run cli -- refcheck --start-from smith2024
# Skip first N entries (alternative resume method)
bun run cli -- refcheck --skip 50
# List available providers
bun run cli -- refcheck --list-providers
Providers
CLI providers
The CLI refcheck command supports:
- `auto` (default): DBLP → Crossref → Semantic Scholar → OpenAlex
- `dblp`
- `crossref`
- `s2` (Semantic Scholar)
- `openalex`
Clipper extension sources
The browser clipper searches DBLP and Semantic Scholar by default, with optional OpenAlex and Crossref toggles. When multiple sources return results, the clipper auto-selects the best match based on completeness rather than a fixed order.
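As a rough illustration, picking the most complete result amounts to scoring each candidate by how many useful fields it fills. The sketch below is hypothetical (field names mirror the metadata refcheck compares; the weights and function names are illustrative, not the clipper’s actual code):

```typescript
// Hypothetical sketch of completeness-based source selection.
interface RemoteResult {
  source: string;
  title?: string;
  authors?: string[];
  year?: number;
  venue?: string;
  doi?: string;
  bibtex?: string;
}

function completeness(r: RemoteResult): number {
  let score = 0;
  if (r.title) score += 1;
  if (r.authors?.length) score += 1;
  if (r.year) score += 1;
  if (r.venue) score += 1;
  if (r.doi) score += 2;    // a DOI pins down the exact work
  if (r.bibtex) score += 2; // a full BibTeX entry is the most useful payload
  return score;
}

function pickBest(results: RemoteResult[]): RemoteResult | undefined {
  return results.reduce<RemoteResult | undefined>(
    (best, r) => (best && completeness(best) >= completeness(r) ? best : r),
    undefined,
  );
}
```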
dblp (CLI + clipper)
DBLP is the computer science bibliography database. Best for:
- Computer science conference papers
- CS journal articles
- Papers with accurate BibTeX entries
DBLP provides high-quality BibTeX that can be used to correct your entries.
s2 (Semantic Scholar, CLI + clipper)
Semantic Scholar provides broad academic coverage with good abstracts. Best for:
- Papers across all academic fields
- Works with arXiv IDs or DOIs
- Access to citation data and abstracts
openalex (CLI + clipper)
OpenAlex is an open catalog of 200M+ scholarly works. Best for:
- Papers across all academic disciplines
- Works with DOIs
- Comprehensive coverage with links to other databases
crossref (CLI + clipper)
Crossref is the official DOI metadata registry. Best for:
- Any DOI-registered work
- Authoritative DOI metadata
- Fallback when other providers don’t have the paper
Refcheck Status
Each checked entry receives one of these statuses:
| Status | Meaning |
|---|---|
| `confirmed` | All checked fields match the remote source |
| `mismatch` | One or more fields differ from the remote source |
| `not_found` | Paper was not found in the provider’s database |
| `error` | API error or processing error occurred |
| `stale` | Refcheck is older than 30 days (re-validate for freshness) |
| `unchecked` | Entry has never been validated |
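Expressed as a type, the status domain is a small union. A sketch (note that, as described under “Stale Checks” below, `stale` and `unchecked` are derived when an entry is read rather than written by a provider check):

```typescript
// Sketch of the status domain. The first four are written into check_log;
// 'stale' and 'unchecked' are computed at read time.
type StoredStatus = 'confirmed' | 'mismatch' | 'not_found' | 'error';
type RefcheckStatus = StoredStatus | 'stale' | 'unchecked';
```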
Mismatch Severity
When a mismatch is detected, refcheck also classifies its severity to help you prioritize review:
| Severity | Meaning | Examples |
|---|---|---|
| `minor` | Likely a false positive, probably OK | Venue: “NeurIPS 2023” vs “arXiv”; author initials differ; year off by 1 |
| `major` | Needs human review | Wrong authors (book review matched); DOI mismatch; title differs |
Minor Mismatches (Likely OK)
These patterns are common and usually don’t indicate real errors:
- Venue variations: Conference name vs arXiv preprint (e.g., “ICML 2023” vs “arXiv (Cornell University)”)
- Author initials: “John Smith” vs “J. Smith” or “John D. Smith”
- Year ±1: Preprint date vs publication date
Major Mismatches (Need Review)
These patterns often indicate real issues:
- Wrong authors: Last names don’t match (may have matched a book review instead of the book)
- DOI mismatch: Different DOIs mean different papers
- Title mismatch: Likely matched the wrong paper entirely
- Book → journal: Publisher venue matched to a journal (book review problem)
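These patterns lend themselves to simple rules. A hypothetical sketch of how such a classifier could be encoded (the field shapes loosely follow the `check_log.fields` structure shown later; the CLI’s actual heuristics may differ):

```typescript
// Hypothetical severity heuristics based on the patterns listed above.
type Severity = 'minor' | 'major';

interface FieldChecks {
  title?: { match: boolean };
  doi?: { match: boolean };
  authors?: { count_match: boolean; last_names_match?: boolean };
  year?: { match: boolean; year_diff?: number };
}

function classifySeverity(fields: FieldChecks): Severity {
  // Major: signs that the wrong paper was matched entirely.
  if (fields.doi && !fields.doi.match) return 'major';
  if (fields.title && !fields.title.match) return 'major';
  if (fields.authors?.last_names_match === false) return 'major';
  // A year gap beyond the typical preprint/publication offset.
  if (fields.year && !fields.year.match && (fields.year.year_diff ?? 0) > 1) {
    return 'major';
  }
  // Everything else (venue wording, initials, year ±1) is likely benign.
  return 'minor';
}
```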
The CLI shows severity in its output:
⚠ [mismatch:minor] smith2023 # Yellow - probably OK
⚠ [mismatch:major] zuboff2019 # Red - needs review
The web app Review page has filters for “Mismatches (major)” and “Mismatches (minor)” to help you focus on entries that need attention.
Field Comparison
Refcheck compares these fields when they are present in both the local entry and the remote record:
| Field | Comparison Method |
|---|---|
| `title` | Normalized string match (ignores case, punctuation, diacritics) |
| `authors` | Count match + first/last name comparison for each author |
| `year` | Exact numeric match |
| `venue` | Normalized string match (journal/booktitle/conference) |
| `doi` | Normalized DOI match (strips the https://doi.org/ prefix) |
Normalization
String comparisons use normalization to handle minor differences:
- Case insensitive (`Deep Learning` = `deep learning`)
- Ignores diacritics (`Müller` = `Muller`)
- Ignores punctuation (`What's Next?` = `Whats Next`)
- Collapses whitespace
When strings don’t match, refcheck reports the edit distance (Levenshtein distance) so you can gauge how different they are.
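The normalization steps amount to a small string pipeline. A minimal sketch, assuming Unicode NFD decomposition for the diacritics step (the real implementation may differ):

```typescript
// Sketch of the normalization described above: lowercase, strip diacritics,
// strip punctuation, collapse whitespace.
function normalize(s: string): string {
  return s
    .toLowerCase()
    .normalize('NFD')                 // split letters from combining accents
    .replace(/[\u0300-\u036f]/g, '')  // drop combining accents (Müller -> muller)
    .replace(/[^\p{L}\p{N}\s]/gu, '') // drop punctuation (What's Next? -> whats next)
    .replace(/\s+/g, ' ')             // collapse runs of whitespace
    .trim();
}

normalize('Deep  Learning') === normalize('deep learning'); // true
normalize("What's Next?") === normalize('Whats Next');      // true
```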
The check_log Field
After refcheck runs, the entry’s frontmatter is updated with a `check_log` field containing:
check_log:
checked_at: "2024-01-15T10:30:00.000Z"
checked_with: dblp
status: mismatch
paper_id: conf/neurips/SmithJ23
fields:
title:
local: "Deep Learing for NLP"
remote: "Deep Learning for NLP"
match: false
edit_distance: 2
authors:
local_count: 2
remote_count: 2
count_match: true
year:
local: "2023"
remote: "2024"
match: false
year_diff: 1
remote:
title: "Deep Learning for NLP"
authors:
- "John Smith"
- "Jane A. Doe"
year: 2024
venue: "Advances in Neural Information Processing Systems"
doi: "10.1234/example"
external_bibtex:
source: dblp
bibtex: |
@inproceedings{DBLP:conf/neurips/SmithJ23,
author = {John Smith and Jane A. Doe},
title = {Deep Learning for NLP},
booktitle = {NeurIPS},
year = {2024}
}
fetched_at: "2024-01-15T10:30:00.000Z"
Using check_log to Fix Entries
The `remote` section contains the provider’s values, making it easy to copy the correct data:
# Before (with typo)
title: "Deep Learing for NLP"
# After (corrected from check_log.remote.title)
title: "Deep Learning for NLP"
The `external_bibtex` section (when available from DBLP) provides a complete BibTeX entry you can use as a reference.
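If you prefer to script such fixes, `check_log` is ordinary YAML frontmatter. A minimal sketch, assuming the gray-matter package (any frontmatter parser works; the file path is hypothetical):

```typescript
// Sketch: copy the provider's title from check_log.remote back into the entry.
import matter from 'gray-matter';
import { readFileSync, writeFileSync } from 'fs';

const path = 'references/smith-2024.md'; // hypothetical entry
const file = matter(readFileSync(path, 'utf8'));

const remote = file.data.check_log?.remote;
if (remote?.title && remote.title !== file.data.title) {
  file.data.title = remote.title;
  writeFileSync(path, matter.stringify(file.content, file.data));
}
```

After applying a fix, re-run refcheck with `--force` so the `check_log` reflects the corrected entry.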
Best Practices
- Run with `--dry-run` first to see what would be changed without modifying files
- Refcheck by project to focus on specific reference sets: `bun run cli -- refcheck shared-references --dry-run`
- Use `--force` sparingly: only re-check when you’ve made corrections or want fresh data
- Review mismatches carefully; not all differences are errors:
  - Venue names vary (abbreviations vs full names)
  - Author middle names may differ
  - Year might be the publication date vs the online date
- Trust but verify; external APIs aren’t perfect:
  - DBLP focuses on CS and may not cover other disciplines
  - OpenAlex has broader coverage but may have less accurate metadata
Troubleshooting
“No bibtex entries found to refcheck”
Entries must have `type: bibtex_entry` in their frontmatter to be checked.
“Paper not found”
Refcheck uses DOI (if available) or title to search. If neither produces a match:
- Verify the DOI is correct
- Try a different provider
- The paper may not be in the database
Rate Limiting
Refcheck includes automatic rate limiting (250ms between API calls) to avoid overloading providers. For large batches, use `--limit` to refcheck in smaller groups, or use `--skip`/`--start-from` to resume interrupted runs.
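Conceptually, the throttle is just a fixed pause between consecutive provider calls. A minimal sketch of the idea (not the CLI’s actual implementation):

```typescript
// Sketch: serialize API calls with a fixed 250ms gap between them.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function throttled<T>(tasks: Array<() => Promise<T>>, gapMs = 250): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task());
    await sleep(gapMs); // keep at least gapMs between consecutive calls
  }
  return results;
}
```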
Re-checking and Staleness
How Re-checking Works
By default, the refcheck command skips entries that already have a `check_log`. This prevents unnecessary API calls and preserves your verification history.
To re-check entries:
# Re-check all entries, overwriting existing check_log
bun run cli -- refcheck --force
# Re-check a specific file
bun run cli -- refcheck --file references/smith-2024.md --force
When to Re-check
Re-check entries when:
- You’ve corrected metadata (fixed a typo, added DOI)
- The original refcheck failed (`error` status)
- The refcheck is stale (older than 30 days)
- You want to try a different provider
Stale Checks
Checks older than 30 days are marked as stale. This doesn’t mean they’re wrong - external databases rarely change existing records. Staleness is a soft reminder that you might want fresh data.
# Example stale check_log
check_log:
checked_at: "2024-10-15T10:30:00.000Z" # More than 30 days ago
status: stale # Automatically computed, not stored
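Because `stale` is derived rather than stored, it can be recomputed at any time from `checked_at`. A sketch of the 30-day rule:

```typescript
// Sketch: staleness is derived from checked_at at read time (30-day window).
const STALE_AFTER_MS = 30 * 24 * 60 * 60 * 1000;

function isStale(checkedAt: string): boolean {
  return Date.now() - new Date(checkedAt).getTime() > STALE_AFTER_MS;
}

isStale('2024-10-15T10:30:00.000Z'); // true once 30 days have passed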
Auto Mode Provider Selection
The CLI `auto` provider (default) tries providers in order until one succeeds:
- DBLP - Tried first, best for computer science
- Crossref - DOI-based lookup, good for published works
- Semantic Scholar - Broad academic coverage
- OpenAlex - Broader coverage across fields
The clipper evaluates DBLP and Semantic Scholar (plus optional OpenAlex/Crossref) and picks the most complete result rather than following a fixed order.
“Succeeds” means the provider found a matching paper, even if fields mismatch. The CLI stops at the first provider that returns a result (sketched after the list below), so:
- A CS paper will typically be checked against DBLP
- A paper with a DOI might match in Crossref
- Other papers might fall through to Semantic Scholar or OpenAlex
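A sketch of that first-match fallback chain (provider order from the list above; the function and parameter names are illustrative, not the CLI’s internals):

```typescript
// Hypothetical sketch of the auto-mode fallback described above.
// lookup() stands in for each provider's DOI/title search.
interface Entry { doi?: string; title: string }
type Provider = 'dblp' | 'crossref' | 's2' | 'openalex';

async function autoCheck(
  entry: Entry,
  lookup: (p: Provider, e: Entry) => Promise<object | null>,
): Promise<{ provider: string; result: object } | null> {
  for (const p of ['dblp', 'crossref', 's2', 'openalex'] as const) {
    const result = await lookup(p, entry);
    if (result) return { provider: `auto:${p}`, result }; // recorded as checked_with
  }
  return null; // no provider matched -> not_found
}
```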
The `checked_with` field records which provider was used:
check_log:
checked_with: "auto:dblp" # Auto mode, matched in DBLP
checked_with: "auto:crossref" # Auto mode, matched in Crossref
checked_with: "auto:s2" # Auto mode, matched in Semantic Scholar
checked_with: "auto:openalex" # Auto mode, matched in OpenAlex
checked_with: "dblp" # Explicit provider selection
Viewing Refcheck Status
In the Web App
On each reference’s detail page, you’ll see a verification badge showing:
| Badge | Meaning |
|---|---|
| ✓ Confirmed | All checked fields match the external database |
| ⚠ Needs review - N fields differ | Some fields differ (the badge lists which fields) |
| ? Not found | Paper wasn’t in the database |
| ✗ Refcheck failed | API error occurred |
| ↻ Stale | Refcheck is older than 30 days |
| ○ Not yet checked | Entry hasn’t been verified |
For mismatches, the badge shows which fields differ (title, authors, year, venue, doi). Scroll down to compare the “Local Entry” and “From [source]” BibTeX side-by-side.
In the Markdown Files
The full details are stored in the `check_log` frontmatter field:
check_log:
checked_at: "2024-01-15T10:30:00.000Z"
checked_with: "auto:dblp"
status: mismatch
paper_id: conf/neurips/SmithJ23
fields:
title:
local: "Deep Learing for NLP"
remote: "Deep Learning for NLP"
match: false
edit_distance: 2
year:
local: "2023"
remote: "2024"
match: false
remote:
title: "Deep Learning for NLP"
authors: ["John Smith", "Jane Doe"]
year: 2024
Interpreting Mismatches
Not all mismatches are errors. Common benign differences:
| Field | Common Reason |
|---|---|
| `venue` | Abbreviation vs full name (“NeurIPS” vs “Advances in Neural Information Processing Systems”) |
| `authors` | Middle initials, name order, diacritics (“François” vs “Francois”) |
| `year` | Online publication date vs print date |
| `title` | Trailing period in database (“Title.” vs “Title”) |
The `edit_distance` value helps gauge severity - a distance of 1-2 usually indicates minor formatting differences.
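For reference, `edit_distance` here is standard Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A compact implementation sketch:

```typescript
// Levenshtein distance via a single-row dynamic programming table.
function levenshtein(a: string, b: string): number {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // value of the previous row at j-1
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j];
      row[j] = Math.min(
        row[j] + 1,                             // delete a[i-1]
        row[j - 1] + 1,                         // insert b[j-1]
        diag + (a[i - 1] === b[j - 1] ? 0 : 1), // substitute
      );
      diag = tmp;
    }
  }
  return row[b.length];
}

levenshtein('kitten', 'sitting'); // 3
```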
Manual Verification
For entries that can’t be automatically verified (books, technical reports, websites), add a `manually_verified` field:
manually_verified:
verified_at: "2024-12-28T00:00:00.000Z"
verified_by: human
notes: "Verified against publisher website"
canonical_source:
url: "https://example.com/book"
title: "The Book Title"
accessed_at: "2024-12-28"
This will display a “Manually verified” badge in the web app.
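For tooling that reads this field, the shape from the example above can be captured as a type. A sketch (which sub-fields are optional is an assumption):

```typescript
// Sketch of the manually_verified frontmatter shape shown above.
// Optionality of notes/canonical_source is an assumption.
interface ManuallyVerified {
  verified_at: string; // ISO timestamp
  verified_by: 'human';
  notes?: string;
  canonical_source?: {
    url: string;
    title: string;
    accessed_at: string; // date string
  };
}
```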