Tag: ocr

Converting Books to JSON: A Digital Humanities Project

Apr 6, 2024, by Nathan Graham, in Data

This post discusses a recent project in which I scanned PDF issues of the Council of Literary Magazines and Presses (CLMP) Directory of Literary Magazines from 1995 to 2005 and converted those 10 directories into clean, well-structured JSON. The process had several stages, including PDF-to-text conversion, data cleaning, and data extraction using Python scripts…
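
The cleaning and extraction stages might look roughly like the sketch below. It is not the project's actual script: the sample text, the "Key: value" layout, and the names `RAW_TEXT`, `clean_line`, and `parse_entries` are all assumptions made for illustration, and it starts from text that has already been pulled out of the PDF.

```python
import json
import re

# Hypothetical OCR output: the layout and field labels are invented for
# illustration and are not taken from the CLMP directories.
RAW_TEXT = """\
EXAMPLE  REVIEW
Editor: Jane Doe
Address: 1 Main St,  Anytown
Founded: 1990

SAMPLE QUARTERLY
Editor: John Roe
Founded: 1985
"""

# Matches lines such as "Editor: Jane Doe".
FIELD_RE = re.compile(r"^(?P<key>[A-Za-z ]+):\s*(?P<value>.+)$")


def clean_line(line: str) -> str:
    """Collapse runs of whitespace (a common OCR artifact) and trim the ends."""
    return re.sub(r"\s+", " ", line).strip()


def parse_entries(text: str) -> list[dict[str, str]]:
    """Treat each blank-line-separated block as one entry.

    The first line of a block is taken as the name; every later line that
    looks like "Key: value" becomes a lowercase field on the entry.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [clean_line(raw) for raw in block.splitlines() if raw.strip()]
        if not lines:
            continue
        entry = {"name": lines[0]}
        for line in lines[1:]:
            match = FIELD_RE.match(line)
            if match:
                entry[match["key"].lower()] = match["value"]
        entries.append(entry)
    return entries


entries = parse_entries(RAW_TEXT)
print(json.dumps(entries, indent=2))
```

Real directory pages would almost certainly need more rules than this (multi-line fields, hyphenated line breaks, OCR misreads), but the shape stays the same: normalize the text, split it into entries, then map each entry to a dictionary that `json.dumps` can serialize.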