Epstein Files - Complete Dataset Audit Report

Arthas@lemmy.world · 4 days ago

Generated: 2026-02-16 | Scope: Datasets 1–12 (VOL00001–VOL00012) | Total Size: ~220 GB

Background

The Epstein Files consist of 12 datasets of court-released documents, each containing PDF files identified by EFTA document IDs. These datasets were collected from links shared throughout this Lemmy thread, with Dataset 9 cross-referenced against a partial copy we had downloaded independently.

Each dataset includes OPT/DAT index files, the Opticon (OPT) and Concordance-style (DAT) load files used in e-discovery, which serve as the authoritative manifest of what each dataset should contain. This audit was compiled to:

Verify completeness — compare every dataset against its OPT index to identify missing files
Validate file integrity — confirm that all files are genuinely the file types they claim to be, not just by extension but by parsing their internal structure
Detect duplicates — identify any byte-identical files within or across datasets
Generate checksums — produce SHA256 hashes for every file to enable downstream integrity verification

Executive Summary

Metric	Value
Total Unique Files	1,380,939
Total Document IDs (OPT)	2,731,789
Missing Files	25 (Dataset 9 only)
Corrupt PDFs	3 (Dataset 9 only)
Duplicates (intra + cross-dataset)	0
Mislabeled Files	0
Overall Completeness	99.998%

Dataset Overview

                      EPSTEIN FILES - DATASET SUMMARY
  ┌─────────┬──────────┬───────────┬──────────┬─────────┬─────────┬─────────┐
  │ Dataset │  Volume  │   Files   │ Expected │ Missing │ Corrupt │  Size   │
  ├─────────┼──────────┼───────────┼──────────┼─────────┼─────────┼─────────┤
  │    1    │ VOL00001 │    3,158  │   3,158  │    0    │    0    │  2.5 GB │
  │    2    │ VOL00002 │      574  │     574  │    0    │    0    │  633 MB │
  │    3    │ VOL00003 │       67  │      67  │    0    │    0    │  600 MB │
  │    4    │ VOL00004 │      152  │     152  │    0    │    0    │  359 MB │
  │    5    │ VOL00005 │      120  │     120  │    0    │    0    │   62 MB │
  │    6    │ VOL00006 │       13  │      13  │    0    │    0    │   53 MB │
  │    7    │ VOL00007 │       17  │      17  │    0    │    0    │   98 MB │
  │    8    │ VOL00008 │   10,595  │  10,595  │    0    │    0    │   11 GB │
  │    9    │ VOL00009 │  531,282  │ 531,307  │   25    │    3    │   96 GB │
  │   10    │ VOL00010 │  503,154  │ 503,154  │    0    │    0    │   82 GB │
  │   11    │ VOL00011 │  331,655  │ 331,655  │    0    │    0    │   27 GB │
  │   12    │ VOL00012 │      152  │     152  │    0    │    0    │  120 MB │
  ├─────────┼──────────┼───────────┼──────────┼─────────┼─────────┼─────────┤
  │  TOTAL  │          │1,380,939  │1,380,964 │   25    │    3    │ ~220 GB │
  └─────────┴──────────┴───────────┴──────────┴─────────┴─────────┴─────────┘

Notes

DS1: Two identical copies found (6,316 files on disk). Byte-for-byte identical via SHA256. Table above reflects one copy (3,158). One copy is redundant.
DS2: 699 document IDs map to 574 files (multi-page PDFs)
DS3: 1,847 document IDs across 67 files (~28 pages/doc avg)
DS5: 1:1 document-to-file ratio (single-page PDFs)
DS6: Smallest dataset by file count. ~37 pages/doc avg.
DS9: Largest dataset. 25 files listed in the OPT index are missing from disk, and 3 files are structurally corrupt.
DS10: Second largest. 950,101 document IDs across 503,154 files.
DS11: Third largest. 517,382 document IDs across 331,655 files.

Dataset 9 — Missing Files (25)

EFTA00709804    EFTA00823221    EFTA00932520
EFTA00709805    EFTA00823319    EFTA00932521
EFTA00709806    EFTA00877475    EFTA00932522
EFTA00709807    EFTA00892252    EFTA00932523
EFTA00770595    EFTA00901740    EFTA00984666
EFTA00774768    EFTA00912980    EFTA00984668
EFTA00823190    EFTA00919433    EFTA01135215
EFTA00823191    EFTA00919434    EFTA01135708
EFTA00823192

Dataset 9 — Corrupted Files (3)

File	Size	Error
`EFTA00645624.pdf`	35 KB	Missing trailer dictionary, broken xref table
`EFTA01175426.pdf`	827 KB	Invalid xref entries, no page tree (0 pages)
`EFTA01220934.pdf`	1.1 MB	Missing trailer dictionary, broken xref table

All three files have valid %PDF- headers but cannot be rendered because of structural corruption. They were likely corrupted during original document production or transfer.

File Type Verification

Two levels of verification performed on all 1,380,939 files:

Magic Byte Detection (file command) — All files contain valid %PDF- headers. 0 mislabeled.
Deep PDF Validation (pdfinfo, poppler 26.02.0) — Parsed xref tables, trailer dictionaries, and page trees. 3 structurally corrupt (Dataset 9 only).
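
For anyone who wants to repeat the Level 1 check without the file command, here is a minimal Python sketch. It is not the script used for this audit; the function names has_pdf_header and find_mislabeled are made up for illustration, and it assumes the %PDF- marker sits at byte 0 of the file. It only covers the header check, so the three structurally corrupt files would still pass it; the xref/trailer/page-tree checks still need pdfinfo or a similar parser.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF is expected to start with this marker


def has_pdf_header(path):
    """Return True if the file begins with the %PDF- magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


def find_mislabeled(root):
    """Return a sorted list of *.pdf files under root that lack the PDF header."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf" and not has_pdf_header(p)
    )
```

Calling find_mislabeled on a dataset folder should return an empty list if every .pdf file really starts with a PDF header.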

Duplicate Analysis

Within Datasets: 0 intra-dataset hash duplicates across all 12 datasets.
Cross-Dataset: All 1,380,939 SHA256 hashes compared. 0 cross-dataset duplicates — every file is unique.
Dataset 1 Two Copies: Both copies byte-for-byte identical (SHA256 verified). One is redundant (~2.5 GB).
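
The duplicate check boils down to grouping file names by hash. Below is a rough Python sketch of that idea, not the exact tooling used here. It assumes the checksum files use the usual shasum line layout (hex digest, a space, a one-character mode marker, then the path); parse_sums and find_duplicates are illustrative names.

```python
from collections import defaultdict


def parse_sums(lines):
    """Yield (digest, path) pairs from shasum-style lines.

    Assumed layout: '<hex digest>' + ' ' + mode char (' ' or '*') + '<path>'.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        digest, _, rest = line.partition(" ")
        yield digest.lower(), rest[1:]  # drop the one-character mode marker


def find_duplicates(lines):
    """Return {digest: [paths]} for every digest that appears more than once."""
    by_hash = defaultdict(list)
    for digest, path in parse_sums(lines):
        by_hash[digest].append(path)
    return {d: paths for d, paths in by_hash.items() if len(paths) > 1}
```

Feeding it the lines of one checksum file gives the intra-dataset check; feeding it the concatenated lines of all twelve files gives the cross-dataset check. An empty result means no two files share a hash.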

Integrity Verification

SHA256 checksums were generated for every file across all 12 datasets. Individual checksum files are available per dataset:

File	Hashes	Size
`dataset_1_SHA256SUMS.txt`	3,158	256 KB
`dataset_2_SHA256SUMS.txt`	574	47 KB
`dataset_3_SHA256SUMS.txt`	67	5.4 KB
`dataset_4_SHA256SUMS.txt`	152	12 KB
`dataset_5_SHA256SUMS.txt`	120	9.7 KB
`dataset_6_SHA256SUMS.txt`	13	1.1 KB
`dataset_7_SHA256SUMS.txt`	17	1.4 KB
`dataset_8_SHA256SUMS.txt`	10,595	859 KB
`dataset_9_SHA256SUMS.txt`	531,282	42 MB
`dataset_10_SHA256SUMS.txt`	503,154	40 MB
`dataset_11_SHA256SUMS.txt`	331,655	26 MB
`dataset_12_SHA256SUMS.txt`	152	12 KB

To verify any file, compute its hash and compare it against the matching entry in the dataset's checksum file:

shasum -a 256 <filename>
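
If you would rather script the comparison, here is a small Python sketch using only the standard library. It assumes the checksum files follow the usual shasum line layout (digest, space, one-character mode marker, path) and that the paths in them match however you name the file when looking it up; sha256_of, load_sums, and verify are illustrative names, not part of any published tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB blocks so large PDFs never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def load_sums(sums_path):
    """Read a shasum-style file into a {path: digest} dictionary."""
    expected = {}
    with open(sums_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            digest, _, rest = line.partition(" ")
            expected[rest[1:]] = digest.lower()  # rest[0] is the mode marker
    return expected


def verify(path, name, expected):
    """Return True/False for match/mismatch, or None if name is not listed."""
    if name not in expected:
        return None
    return sha256_of(path) == expected[name]
```

A result of None means the name was not found in the checksum file, which usually points to a path mismatch rather than a bad file.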

If you’d like access to the SHA256 checksum files or can help host them, send me a DM.

Methodology

Hash Generation: SHA256 checksums via shasum -a 256 with 8-thread parallel processing
OPT Index Comparison: Each dataset’s OPT load file parsed for expected file paths, compared against files on disk
Intra-Dataset Duplicate Detection: SHA256 hashes compared within each dataset
Cross-Dataset Duplicate Detection: All 1,380,939 hashes compared across all 12 datasets
File Type Verification (Level 1): Magic byte detection via file command
Deep PDF Validation (Level 2): Structure validation via pdfinfo (poppler 26.02.0) — xref tables, trailer dictionaries, page trees
Cross-Copy Comparison: Dataset 1’s two copies compared via full SHA256 diff
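
The OPT comparison step can be sketched roughly as follows. This is an illustration under assumptions, not the exact script behind this report: it assumes the common Opticon layout where each line is comma-separated and the third field is a Windows-style relative path to the file (for example under IMAGES\0001\), and that several document IDs may point at the same file. The function names are made up.

```python
import csv
from pathlib import Path, PureWindowsPath


def expected_paths(opt_lines):
    """Collect the unique relative file paths named in an OPT file.

    Assumes field 3 of each comma-separated row is the file path.
    """
    paths = set()
    for row in csv.reader(opt_lines):
        if len(row) < 3 or not row[2].strip():
            continue  # skip blank or malformed rows
        # Convert backslash-separated paths to forward slashes for comparison.
        paths.add(PureWindowsPath(row[2].strip()).as_posix())
    return paths


def files_on_disk(root):
    """Return the relative paths of all PDF files found under root."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def audit(opt_lines, root):
    """Return (missing, extra): in the OPT but not on disk, and the reverse."""
    expected = expected_paths(opt_lines)
    present = files_on_disk(root)
    return sorted(expected - present), sorted(present - expected)
```

The comparison is case-sensitive, so if a dataset's folder names differ in case from the OPT entries, both sides would need lowercasing first.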

Recommendations

Remove Dataset 1 duplicate copy — saves ~2.5 GB
Document the 25 missing Dataset 9 files — community assistance may help locate these
Preserve OPT/DAT index files — authoritative record of expected contents
Distribute SHA256SUMS.txt files — for downstream integrity verification

Report generated as part of the Epstein Files preservation and verification project.

Arthas@lemmy.world · 6 days ago

for DS9, does anyone have the following files:

  EFTA00709804
  EFTA00709805
  EFTA00709806
  EFTA00709807
  EFTA00770595
  EFTA00774768
  EFTA00823190
  EFTA00823191
  EFTA00823192
  EFTA00823221
  EFTA00823319
  EFTA00877475
  EFTA00892252
  EFTA00901740
  EFTA00912980
  EFTA00919433
  EFTA00919434
  EFTA00932520
  EFTA00932521
  EFTA00932522
  EFTA00932523
  EFTA00984666
  EFTA00984668
  EFTA01135215
  EFTA01135708

If so, please DM me them and then I can include them in my master archive.

Arthas@lemmy.world · 17 days ago

I used AI to analyze the ~36 GB that I was able to download before they removed the zip file from the server.

Complete Volume Analysis

  Based on the OPT metadata file, here's what VOL00009 was supposed to contain:

  Full Volume Specifications

  - Total Bates-numbered pages: 1,223,757 pages
  - Total unique PDF files: 531,307 individual PDFs
  - Bates number range: EFTA00039025 to EFTA01262781
  - Subdirectory structure: IMAGES\0001\ through IMAGES\0532\ (532 folders)
  - Expected size: ~180 GB (based on your download info)

  What You Actually Got

  - PDF files received: 90,982 files
  - Subdirectories: 91 folders (0001 through ~0091)
  - Current size: 37 GB
  - Percentage received: ~17% of the files (91 out of 532 folders)

  The Math

  Expected:  531,307 PDF files / 180 GB / 532 folders
  Received:   90,982 PDF files /  37 GB /  91 folders
  Missing:   440,325 PDF files / 143 GB / 441 folders

  ★ Insight ─────────────────────────────────────
  You got approximately the first 17% of the volume before the server deleted it. The good news is that the DAT/OPT index files are complete, so you have a full manifest of what should be there. This means:
  - You know exactly which documents are missing (folders 0092-0532)

I haven’t yet looked into downloading the partials from archive.org to see whether I have any Dataset 9 files that archive.org is missing.

Arthas@lemmy.world · 17 days ago

I was being cut off, but I managed it with chunking techniques. Unfortunately, they took down the file, so now I have no source to pull from.

Arthas@lemmy.world · 17 days ago

I was, and that is why the download was taking so long: I use my custom downloader, which uses various techniques to chunk the download. Unfortunately, it seems they’ve now removed the file completely, so my downloader has no source to pull from and is stopped at 36 GB.

Arthas@lemmy.world · 17 days ago

Some bad news: it looks like the Dataset 9 zip file link doesn’t work anymore. They appear to have removed the file, so my download stopped at 36 GB. I’m not familiar with their site, so is it normal for them to remove files and then put them back at the same link once they’ve reorganized them? Or do we have to scrape each PDF individually, like another user has been doing?

Arthas@lemmy.world · 18 days ago

Yeah, still chugging away slowly. It may actually take me a few days; it’s quite slow, but so far it appears to be working.

Arthas@lemmy.world · 18 days ago

I have various chunking techniques that I use. I adaptively modify the request size of the chunks as I’ve noticed at times the CDN will give large amounts then micro amounts. I haven’t figured out the exact backoff rate but I have retry mechanisms in place. The CDN is very annoying but so far my methods are working, just slow.
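
To illustrate the general idea for anyone curious, here is a simplified Python sketch of adaptive chunk sizing with retries. It is not my actual downloader: the fetch callable is a stand-in you would have to supply (for example a function that issues an HTTP Range request and returns the bytes it got), and the doubling/halving rule and backoff values are arbitrary choices for the example.

```python
import time


def download_adaptive(fetch, total_size, out, *, start_chunk=1 << 20,
                      min_chunk=64 << 10, max_chunk=64 << 20,
                      max_retries=5, sleep=time.sleep):
    """Copy total_size bytes to out, adapting the request size as it goes.

    fetch(start, end) must return bytes for the half-open range [start, end);
    it may return fewer bytes than requested or raise OSError.
    Returns the number of bytes written.
    """
    offset = 0
    chunk = start_chunk
    failures = 0
    while offset < total_size:
        end = min(offset + chunk, total_size)
        requested = end - offset
        try:
            part = fetch(offset, end)[:requested]
        except OSError:
            part = b""
        if not part:
            # Nothing arrived: back off, shrink the request, and retry.
            failures += 1
            if failures > max_retries:
                raise RuntimeError(f"giving up at byte {offset}")
            chunk = max(min_chunk, chunk // 2)
            sleep(min(60, 2 ** failures))
            continue
        failures = 0
        out.write(part)
        offset += len(part)
        if len(part) == requested:
            chunk = min(max_chunk, chunk * 2)   # full delivery: ask for more
        else:
            chunk = max(min_chunk, chunk // 2)  # short delivery: ask for less
    return offset
```

Because the loop always resumes from the last byte written, a short or failed response only costs a retry of that one range rather than the whole file.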

Arthas@lemmy.world · 18 days ago

OK, great. As for comparing files, I would likely do a hash check. It shouldn’t be difficult to identify truly unique files that way. It’ll take a few days for a decent computer to generate all the hashes, but it should be pretty automated. I’ll reach out once I have it completed.

Arthas@lemmy.world · 18 days ago

I am downloading Dataset 9 and should have the full 180 GB zip done in a day. To confirm: has the DOJ link to the Dataset 9 zip been updated to be clean of CSAM, or not? As much as I wish to help the cause, I do not want any of that type of material on my server unless permission has been given to host it solely for credible researchers who need access to all files for their investigation. I have no way of knowing what is legally permitted when it comes to redistributing the files to legitimate investigators, so my plans to help create a torrent may be squashed. Please let me know.