String Normalization for Matching

When comparing academic metadata (titles, authors, venues), Extenote normalizes strings so that differences in letter case, diacritics, and whitespace do not prevent a match. After normalization, “Müller” matches “Muller” and “HELLO World” matches “hello world”.

Step 1: converts to lowercase

All comparisons are case-insensitive. “HELLO” matches “hello”.

Test: converts to lowercase File: packages/core/tests/check.test.ts:34

it("converts to lowercase", () => {
  expect(normalizeString("HELLO World")).toBe("hello world");
});

Step 2: removes diacritics

Diacritics (accents) are removed: “Müller” becomes “muller”, “café” becomes “cafe”. International author and venue names therefore match whether or not a source preserved the accents.

Test: removes diacritics File: packages/core/tests/check.test.ts:44

it("removes diacritics", () => {
  expect(normalizeString("Müller")).toBe("muller");
  expect(normalizeString("café")).toBe("cafe");
  expect(normalizeString("Dudík")).toBe("dudik");
  expect(normalizeString("naïve")).toBe("naive");
});
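
One common way to get this behavior in TypeScript is Unicode decomposition followed by removal of combining marks. The sketch below is an illustration only: stripDiacritics is a hypothetical helper name, and nothing here is confirmed to be how Extenote's normalizeString is implemented.

```typescript
// Hypothetical helper for illustration; not Extenote's actual code.
function stripDiacritics(input: string): string {
  // NFD splits a precomposed character such as "ü" into the base letter "u"
  // followed by a combining mark (U+0308). The regex then removes every
  // character in the Combining Diacritical Marks block (U+0300 to U+036F).
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(stripDiacritics("Müller")); // "Muller"
console.log(stripDiacritics("Dudík"));  // "Dudik"
console.log(stripDiacritics("naïve"));  // "naive"
```

This sketch preserves case, so lowercasing remains a separate step. Letters with no canonical decomposition, such as “ø” or “ł”, pass through this approach unchanged.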

Step 3: collapses whitespace

Runs of spaces, tabs, and newlines are collapsed to a single space, and leading and trailing whitespace is removed.

Test: collapses whitespace File: packages/core/tests/check.test.ts:55

it("collapses whitespace", () => {
  expect(normalizeString("hello   world")).toBe("hello world");
  expect(normalizeString("  hello  world  ")).toBe("hello world");
  expect(normalizeString("hello\t\nworld")).toBe("hello world");
});
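
The three steps can be combined into a single function. The following is a minimal sketch under stated assumptions: normalizeForMatching and sameAfterNormalization are hypothetical names, and the order of operations is a guess that happens to satisfy the assertions shown in the tests above. It is not taken from Extenote's source.

```typescript
// Hypothetical re-creation of the documented behavior; not Extenote's code.
function normalizeForMatching(input: string): string {
  return input
    .toLowerCase()                   // Step 1: case-insensitive comparison
    .normalize("NFD")                // Step 2a: split letters from their accents
    .replace(/[\u0300-\u036f]/g, "") // Step 2b: drop the combining marks
    .replace(/\s+/g, " ")            // Step 3a: collapse whitespace runs to one space
    .trim();                         // Step 3b: remove leading/trailing whitespace
}

// Hypothetical convenience wrapper: two strings match if their normalized forms are equal.
function sameAfterNormalization(a: string, b: string): boolean {
  return normalizeForMatching(a) === normalizeForMatching(b);
}

console.log(normalizeForMatching("  HELLO\t\nMüller  ")); // "hello muller"
console.log(sameAfterNormalization("Müller", "MULLER"));  // true
```

Comparing normalized forms rather than raw strings keeps the matching logic in one place, so every metadata field (title, author, venue) is treated the same way.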

This documentation is generated from test annotations. Edit the source test file to update.