
This post was autogenerated from my research notes and is a working draft.
There is no magic wand search that will generate a ranking of most-cited scholars based on citations to articles and citations to books.
— Fred Shapiro, "The Most-Cited Legal Scholars Revisited", University of Chicago Law Review (2021)
There is now.
In 2021, Fred Shapiro — the godfather of legal citation studies — wrote that no tool existed to automatically rank legal scholars by their citation counts across both articles and books.1 He was right, at the time. Legal scholarship is a bibliometric nightmare: no DOIs, no structured metadata, citations buried in footnotes inside PDFs and DOCX files, and repositories that can't agree on what they even contain.
But Shapiro wrote that before large language models got good at parsing messy, unstructured text. This series is about what happens when you point LLMs at the data problem he described — and start scraping, cleaning, and assembling the dataset that wasn't supposed to exist.
This is Part 1: the data problem, and how I'm building the pipeline to solve it.
The Question Behind the Data
I originally picked up this work because I was interested in testing a deceptively simple question about Internet Law scholarship: does it actually cite computer science? Do the citation networks show genuine interdisciplinary exchange, or do they confirm Judge Easterbrook's famous worry that the field talks mostly to itself?2
That question led me down the rabbit hole of trying to do large-scale bibliometrics on U.S. legal scholarship — and discovering just how broken the infrastructure is for anyone who tries.
Why Legal Scholarship Breaks Every Bibliometric Tool
If you've worked with citation data in the sciences, you might assume this is a solved problem. It is not. As Nunna, Price, and Tietz put it bluntly: "U.S. legal scholarship is weird."3
Here's why:
No identifiers. Scientific papers have DOIs. Legal articles generally don't.4 There's no universal identifier linking an article to its citations, which means you can't easily plug into the network science tools that resolve citation graphs in other fields.
No structured metadata. Law journal websites and third-party repositories share articles as PDFs and DOCX files. Citations live in footnotes — sometimes hundreds of them per article — formatted in Bluebook style, a citation system designed for human readers, not machines. Separating citations from prose is a parsing problem with no off-the-shelf solution.
Incomplete repositories. The major databases — Westlaw, HeinOnline, Web of Science, Google Scholar — are each incomplete in different ways. One study comparing them found that every repository was missing some subset of journals, treatises, or citation counts.5 Even HeinOnline, which maintains a complete archive of student-run law journals, has citation count issues due to how it scans footnotes.
Student-run journals. Unlike most academic disciplines, law reviews are edited by students, not peer-reviewed by scholars. This creates a publishing ecosystem with hundreds of journals, inconsistent formatting standards, and no centralized metadata infrastructure.
The result: computational network analysis is, as Hayashi writes, "only beginning to get a foothold in legal scholarship."6 The data problems alone are often fatal to a study before it starts.
Assembling the Corpus
Since the original study that grew out of that question, the dataset has grown by an order of magnitude. What started as one donated collection of 36,000 articles is now a corpus of over 205,000 articles drawn from three independent sources, each acquired differently.
Source 1: Web-Scraped PDFs (~122,000 files)
The largest source is also the messiest. I built automated scrapers for 160+ law review journal websites using a pipeline I call offprint. It discovers article URLs from journal sitemaps, downloads the PDFs directly, and extracts whatever HTML metadata is available at discovery time — title, authors, volume, issue, year, DOI, and abstract when the journal provides them.

Law reviews run on a patchwork of publishing platforms — Digital Commons/BePress, Open Journal Systems, Scholastica, WordPress, and dozens of custom sites — so each scraper has to handle a different structure. Of the 1,451 journals in the registry, 432 are actively scraped, 643 lack a usable sitemap, and the rest are paused due to WAFs, login walls, paywalls, or 404s. The Berkeley Technology Law Journal alone yielded over 6,000 PDFs; Harvard Law Review, about 3,000.
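
The core loop is simple even when the per-journal handling isn't. Below is a minimal sketch of the sitemap-discovery and download steps; it's illustrative only, not offprint's actual code, and it assumes the journal exposes a standard sitemap.xml.

```python
# Illustrative sketch only, not offprint's actual code. Assumes the journal
# publishes a standard sitemap.xml (or sitemap index) listing article pages.
import xml.etree.ElementTree as ET
import requests

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def discover_article_urls(sitemap_url: str) -> list[str]:
    """Collect every <loc> entry from a sitemap, recursing into sitemap indexes."""
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    urls = [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]
    if root.tag.endswith("sitemapindex"):      # index file: each <loc> is another sitemap
        nested: list[str] = []
        for child in urls:
            nested.extend(discover_article_urls(child))
        return nested
    return urls

def download_pdf(url: str, out_path: str) -> bool:
    """Save the response only if the server actually returned a PDF."""
    resp = requests.get(url, timeout=60)
    if resp.ok and "pdf" in resp.headers.get("Content-Type", "").lower():
        with open(out_path, "wb") as fh:
            fh.write(resp.content)
        return True
    return False
```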
Source 2: Data Donation (~62,000 articles)
The foundation of the earlier work was a LexisNexis law journal archive generously donated by Professor Nicholson Price at the University of Michigan.7 It consists of 821 DOCX files covering 252 journals from roughly 2000 to 2020. Each DOCX represents a full journal issue; I split them programmatically into individual articles using header boundaries and extracted footnotes directly from Word's footnotes.xml structure.
Price's team originally identified about 36,000 articles in this archive; my splitting pipeline finds roughly 62,000. The difference isn't a data quality issue on their end — their work focused on long-form articles and used splitting heuristics tuned for that purpose. My pipeline casts a wider net, picking up shorter pieces like book reviews, essays, commentaries, and symposium contributions that their boundaries didn't target. For a bibliometric study, these shorter forms are worth including: they cite and are cited, and they're part of the scholarly conversation even if they aren't full-length articles.
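
To make the footnotes.xml step concrete, here is a minimal sketch of pulling footnote text straight out of a DOCX's zip container. It's a simplified stand-in for the real pipeline, using only the Python standard library.

```python
# A minimal sketch (not the actual pipeline) of reading word/footnotes.xml
# straight out of a DOCX zip container.
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def extract_footnotes(docx_path: str) -> dict[str, str]:
    """Return {footnote id: footnote text} for one DOCX journal issue."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/footnotes.xml"))
    footnotes: dict[str, str] = {}
    for fn in root.iter(f"{{{W_NS}}}footnote"):
        # Skip the separator/continuation markers Word stores alongside real footnotes.
        if fn.attrib.get(f"{{{W_NS}}}type", "normal") != "normal":
            continue
        fn_id = fn.attrib.get(f"{{{W_NS}}}id", "")
        # Footnote text lives in <w:t> runs nested inside the footnote element.
        text = "".join(t.text or "" for t in fn.iter(f"{{{W_NS}}}t"))
        footnotes[fn_id] = text.strip()
    return footnotes
```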
This is the richest source for footnote text — it produced 11.6 million footnotes — and is the primary input for the labeling pipeline. The Free Law Project's eyecite library handles extracting and resolving legal citations within those footnotes.8
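
In practice that looks roughly like the snippet below: a small usage sketch of eyecite on one hand-written footnote. The real pipeline runs the same calls over all 11.6 million footnotes and stores the structured results.

```python
# A small usage sketch of eyecite on a single footnote.
from eyecite import get_citations, resolve_citations

footnote = (
    "See Brown v. Board of Education, 347 U.S. 483, 495 (1954); "
    "see also 42 U.S.C. § 1983."
)

citations = get_citations(footnote)       # find and classify citations in the text
resolved = resolve_citations(citations)   # group Id./supra/short forms with their antecedents

for cite in citations:
    print(type(cite).__name__, cite.matched_text())
# prints something like:
#   FullCaseCitation 347 U.S. 483
#   FullLawCitation 42 U.S.C. § 1983
```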
Source 3: Anna's Archive (~1,300 PDFs)
The third source extends the historical range. I scanned 12 Elasticsearch shards from Anna's Archive — 763 million records — against the journal registry using ISSN and phrase-based matching. This surfaced roughly 19,000 filtered matches across 236 journals, with dates ranging from 1888 to 2023. Most are full journal issues rather than individual articles, sourced from Internet Archive (63%) and HathiTrust (28%). So far I've downloaded about 1,300 PDFs, concentrated in Harvard Law Review, California Law Review, Michigan Law Review, and Chicago Law Review.
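
The matching pass itself is conceptually simple. Here's an illustrative sketch of the ISSN side of it; the record field names are assumptions rather than Anna's Archive's real export schema.

```python
# Illustrative only: the shape of the ISSN matching pass over exported records.
# The field name "issns" is an assumption, not the real schema.
import json
import re

def normalize_issn(raw: str) -> str:
    """'0017-811x' -> '0017811X' (strip punctuation, uppercase the check digit)."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()

def match_records(records_path: str, registry: dict[str, str]) -> list[dict]:
    """registry maps normalized ISSN -> journal name; returns records that hit it."""
    wanted = {normalize_issn(issn) for issn in registry}
    matches = []
    with open(records_path, encoding="utf-8") as fh:
        for line in fh:                          # assume one JSON record per line
            rec = json.loads(line)
            issns = {normalize_issn(i) for i in rec.get("issns", [])}
            if issns & wanted:
                matches.append(rec)
    return matches
```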
Putting It Together
A SQLite provenance catalog cross-indexes all three sources:
| Source | Articles | Footnotes |
|--------|----------|-----------|
| Data donation | 62,055 | 11.6M |
| Web-scraped PDFs | ~122,000 | extraction in progress |
| Anna's Archive | ~1,300 | not yet extracted |
| Combined | 205,768 | |
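
The catalog itself is a small relational schema. Here's a hedged sketch of what it might look like; the real schema differs, but the idea is one row per article tagged with the source it came from.

```python
# A hedged sketch of the provenance catalog; column names are illustrative.
import sqlite3

conn = sqlite3.connect("provenance.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS articles (
    article_id  INTEGER PRIMARY KEY,
    source      TEXT NOT NULL CHECK (source IN ('donation', 'scrape', 'annas_archive')),
    journal     TEXT,
    title       TEXT,
    year        INTEGER,
    file_path   TEXT,       -- original DOCX or PDF on disk
    footnotes_extracted INTEGER DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_articles_source ON articles(source);
""")

# Per-source counts (the table above) reduce to a single GROUP BY:
for source, n in conn.execute("SELECT source, COUNT(*) FROM articles GROUP BY source"):
    print(source, n)
```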
Here's how the full pipeline fits together, from raw sources to labeled data:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Web Scraping │ │ Data Donation │ │ Anna's Archive │
│ 122K PDFs from │ │ 821 DOCX files │ │ 763M records │
│ 160+ journals │ │ 252 journals │ │ scanned by │
│ (offprint) │ │ (LexisNexis) │ │ ISSN matching │
└────────┬─────────┘ └────────┬─────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Provenance Catalog (SQLite) │
│ 205,768 articles │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Footnote Extraction │
│ │
│ DOCX ──→ Word XML parsing ──→ clean footnote text │
│ │
│ PDF ──→ Docling / pdfplumber ──→ OCR fallback ──→ │
│ ──→ segmentation ──→ issue splitting ──→ QC │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Citation Parsing │
│ eyecite: case law, statutes, regulations │
│ ~14M structured citations │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LLM Labeling Pipeline │
│ teacher–student architecture on 2 GPU servers │
│ │
│ Per footnote: │
│ → footnote function (7 types) │
│ → citation type (17 types) │
│ → citation function (8 types) │
│ → claim type (5 types) │
│ │
│ 11.6M footnotes → structured JSON │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Analyses │
│ author rankings · citation networks · interdisciplinary │
│ trends · link rot · citation function breakdowns │
└─────────────────────────────────────────────────────────────┘
Extracting Footnotes at Scale
For the donated DOCX files, footnote extraction is straightforward — Word's XML structure gives you clean footnote text. The scraped PDFs are a different story. The pipeline works in stages:
- Text extraction via Docling, with fallback to pdfplumber for difficult files
- OCR fallback for scanned PDFs, routed through a vLLM-based model when structured extraction fails
- Footnote segmentation using heuristics that detect ordinal numbering patterns, handle page breaks, and merge cross-page continuations (a simplified sketch follows this list)
- Issue splitting to separate multi-article PDFs (full journal issues) into individual articles before footnote attribution
- QC filtering to reject non-article content — tables of contents, mastheads, editorial boards — and flag anomalies like ordinal gaps or suspiciously short footnote sections
- Citation enrichment via eyecite, adding structured citation metadata to each footnote
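
Here is the simplified segmentation sketch promised above. It isn't the production code: it assumes each footnote block starts with its ordinal, that ordinals increase by one, and that everything else is a continuation of the previous footnote.

```python
# Simplified ordinal-numbering heuristic for footnote segmentation (a sketch,
# not the production code).
import re

FOOTNOTE_START = re.compile(r"^\s*(\d{1,4})[.\s]\s*(\S.*)$")

def segment_footnotes(lines: list[str]) -> dict[int, str]:
    """Group raw footnote-area lines into {ordinal: text}, merging continuations."""
    footnotes: dict[int, str] = {}
    current: int | None = None
    for line in lines:
        m = FOOTNOTE_START.match(line)
        expected = (current or 0) + 1
        if m and int(m.group(1)) == expected:     # next ordinal in sequence
            current = expected
            footnotes[current] = m.group(2).strip()
        elif current is not None:                 # continuation, incl. across page breaks
            footnotes[current] += " " + line.strip()
    return footnotes
```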
The LLM Labeling Pipeline
With 11.6 million footnotes from the donation corpus alone, I needed more than simple citation extraction. I wanted to know how each citation is used — is it substantive discussion or just an authority string? Does it support the author's claim or argue against it? Is it citing a law review article, a book, a government report, or a newspaper?
To answer these questions at scale, I built a two-tier teacher–student labeling architecture. A fine-tuned LLM processes each footnote and returns a structured JSON label classifying it across multiple dimensions.9
The taxonomy covers:
- Footnote function (7 types): authority string, substantive discussion, cross-reference, source attribution, definitional, acknowledgment, methodological note
- Citation type (17 types): case law, statute, regulation, constitution, law review article, book/treatise, newspaper, website, government report, and more
- Citation function (8 types): supporting, see generally, contrary, comparing, defining, quoting, citing data, attribution
- Claim type (5 types): factual support, legal authority, idea attribution, empirical evidence, descriptive aside
As of this writing, 632,000 of the 11.6 million footnotes have been labeled — about 5.4% — with a 99.6% JSON validity rate. At the current throughput of roughly 17 footnotes per second across two GPU servers, the full run will take about another week.
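
For a sense of what one labeling call looks like, here's a hedged sketch against an OpenAI-compatible endpoint of the kind vLLM serves. The model name, prompt, and output keys are placeholders, not the fine-tuned model's real interface.

```python
# A hedged sketch of a single labeling call. The model name, prompt, and schema
# are placeholders; vLLM is assumed to be serving an OpenAI-compatible endpoint.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

PROMPT = """Classify this law review footnote. Reply with JSON containing the keys
footnote_function, citation_type, citation_function, and claim_type.

Footnote: {footnote}"""

def label_footnote(footnote_text: str) -> dict:
    resp = client.chat.completions.create(
        model="footnote-labeler",                     # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(footnote=footnote_text)}],
        temperature=0.0,
        response_format={"type": "json_object"},      # ask the server for strict JSON
    )
    return json.loads(resp.choices[0].message.content)

print(label_footnote("See Cass R. Sunstein, The Cost-Benefit State 12 (2002)."))
# e.g. {"footnote_function": "authority string", "citation_type": "book/treatise", ...}
```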
This is the "magic wand" that Shapiro said didn't exist. Not a single search, but a pipeline: scraping, parsing, LLM classification, citation extraction, and structured labeling, all stitched together. It lets you ask questions that were previously intractable — like ranking the most-cited scholars not just by count, but by how they're cited.
What the Data Already Shows
Even with only 5% of footnotes labeled, patterns are emerging.
The most-cited scholars. Cass Sunstein leads with 283 citing articles, followed by the American Law Institute (282) and Richard Posner (240). The pipeline tracks per-author citation trajectories by year, showing whose influence rises and falls over time.
The most-cited works. Black's Law Dictionary (2014 ed.) dominates with 332 citations. The pipeline also measures citation longevity — the span between a work's first and last citation in the corpus.
Interdisciplinary trends. Overall, 27.7% of citations are non-legal (other academic journals, news, websites, government reports). That share has been rising — from roughly 15% in 2015 to 35% by 2020.
How citations are used. 46.9% of footnotes are pure authority strings — just citations, no discussion. Only 3.5% contain substantive discussion. A mere 0.9% are adversarial (citing contrary authority or comparing positions). 3.3% cite empirical data. The picture is one of legal scholarship that overwhelmingly cites to support, rarely to engage.
Link rot. 17% of footnotes contain a URL. Perma.cc dominates at 26% of all URLs, indicating strong archival practice among law reviews. The most-cited news sources are the New York Times, Washington Post, and Wall Street Journal.
And on the original question that started this project: in my earlier analysis of the donation corpus, only 10% of articles classified as Science, Technology, and Computing Law cited a computer science publication. When they did, it was almost always authors with technical backgrounds — computer scientists and PhD holders who brought that training into law.10
What's Next
In Part 2, I'll cover the network analysis: training embedding models on citation pairs to map where Internet Law journals fall on a spectrum from pure law to computer science, and what that placement tells us about the field's intellectual community. I'm experimenting with SPECTER2 — purpose-built for scientific scholarship, with training data that includes 1.1 million law articles — to replace the Word2Vec approach from my original study.
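
For the curious, here's roughly what SPECTER2-style document embedding looks like with Hugging Face transformers. This is a sketch using the base model; the retrieval adapter described on the model card would normally be loaded on top via the adapters package.

```python
# A sketch of SPECTER2-style document embeddings via Hugging Face transformers.
# Loads the base model only; the retrieval ("proximity") adapter from the model
# card would normally be added via the `adapters` package.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoModel.from_pretrained("allenai/specter2_base")

papers = [
    {"title": "The Law of the Horse: What Cyberlaw Might Teach",
     "abstract": "Placeholder abstract text."},
]
# SPECTER-style input: title [SEP] abstract, embedded via the [CLS] token.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0, :]   # one [CLS] vector per paper
print(embeddings.shape)   # (num_papers, hidden_size)
```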
But the labeling pipeline still has a week to run. By Part 2, I'll have 11.6 million labeled footnotes to work with instead of 632,000. Shapiro was right that no one search could do it. But nobody said it had to be one search.