Abstract image of data being extracted from a print book and flying into a mobile computer.

Converting Books to JSON: A Digital Humanities Project

Introduction

If you’re reading this, you are probably in a situation similar to the one I found myself in: I had a set of books that had not been digitized, and I wanted to perform document analysis across this dataset. Converting historical documents from their original formats into structured, machine-readable data is a crucial step in enabling that analysis. However, the path from print book to machine-readable data presents several challenges, including digitization, optical character recognition (OCR), data cleaning, structure and formatting, data extraction, and variations and special cases across issues or volumes. This was particularly tricky for this project because (1) each issue used a different set of standard listing fields, and (2) individual listings included or omitted different parts of the listing information. Capturing all of this information manually would have meant several weeks of painful data scraping and cleaning, and I couldn’t use my typical approach of defining the document structure up front because there was so much variation between issues and listings.

So… this seemed like a good scenario for combining traditional digitization and computer vision approaches with emergent generative AI models.

This post discusses a recent project where I scanned PDF issues of the Council of Literary Magazines and Presses (CLMP) Directory of Literary Magazines from 1995 to 2005 and converted those 10 directories into clean, well-structured JSON. The process encompassed several stages, including PDF to text conversion, data cleaning, and data extraction using Python scripts and AI-powered tools like GPT.

Step 1: Digitization

The initial step in the project was converting the print issues into digital format. Before I go through the steps of digitizing the directories, here is a view of the 1996 issue of Directory of Literary Magazines before digitization. All 10 issues of the directory looked very similar to this issue.

Photo of a 1996 issue of CLMP Directory of Literary Magazines. It has a purple cover, and light blue and green text.

To prepare the book for scanning, I first loosened the binding glue so the cover could be removed. I applied the hot metal tip of my soldering iron to the spine of the book, which let me remove the cover non-destructively. Next, I used a razor blade to scrape the remaining glue from the spine.

I then pulled the pages away from the binding in sections of 80 to 100 pages and used a paper trimmer to remove approximately 1/4 inch from the binding edge.

Photo of a CLMP directory on a paper cutter.

The trimmed pages had a consistent, clean edge that kept the document scanner from jamming, albeit at the expense of the periodical’s original margins.

After trimming the binding edge, I manually checked the pages to make sure none were still stuck to one another. I then fed the pages into the document scanner (in this case, a Raven Pro Document Scanner) and set it to scan two-sided, in color, at 300 dots per inch (dpi). I used 300 dpi because that is the scan resolution the Adobe Acrobat Pro documentation recommends for improved OCR accuracy (Adobe, 2024).

Step 2: PDF to Text Conversion

The second step in the project was to convert the scanned PDF files into plain text. Because the PDFs needed OCR and text extraction, I used Azure Vision image-to-text processing, which was much more accurate than Adobe Acrobat Pro OCR (in January 2024, anyway). If you do not want to spend money on Azure Vision or Adobe Acrobat Pro, I have had good success with pdf2go. This process laid the foundation for the subsequent data cleaning and extraction stages.
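For readers who want to try the Azure route, here is a minimal sketch of sending a scanned page image to the Azure AI Vision Read API (v3.2) over REST and collecting the recognized text. The endpoint, key, and file names are placeholders, and this is an illustration of the general approach rather than the exact script I used.

import time

import requests

# Placeholder values; substitute your own Azure resource endpoint and key.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-azure-vision-key>"

def ocr_image(image_path):
    # Submit the page image to the Read API; the service processes it asynchronously.
    with open(image_path, "rb") as f:
        resp = requests.post(
            f"{ENDPOINT}/vision/v3.2/read/analyze",
            headers={
                "Ocp-Apim-Subscription-Key": KEY,
                "Content-Type": "application/octet-stream",
            },
            data=f.read(),
        )
    resp.raise_for_status()
    operation_url = resp.headers["Operation-Location"]

    # Poll until the asynchronous read operation finishes.
    while True:
        result = requests.get(
            operation_url, headers={"Ocp-Apim-Subscription-Key": KEY}
        ).json()
        if result["status"] in ("succeeded", "failed"):
            break
        time.sleep(1)

    # Collect the recognized text line by line.
    lines = []
    for page in result.get("analyzeResult", {}).get("readResults", []):
        for line in page["lines"]:
            lines.append(line["text"])
    return "\n".join(lines)

# Example: OCR one scanned page and append the text to a running output file.
with open("clmp-1996-raw.txt", "a", encoding="utf8") as out:
    out.write(ocr_image("clmp-1996-page-001.png") + "\n")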

Step 3: Data Cleaning

Once the PDFs were converted to text files, the next step involved cleaning the data to ensure its usability. This process included:

  1. Extracting the main content by removing the foreword and indices of the original book.
  2. Eliminating page numbers and headers that were not part of the listing data.
  3. Manually identifying and correcting errors in the text resulting from the AI conversion process.
  4. Utilizing a short script to remove any extra spaces in the text files (a minimal sketch of this follows below).

These cleaning steps were crucial in preparing the data for the extraction phase.
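The whitespace cleanup mentioned in step 4 only takes a few lines of Python. Here is a minimal sketch, assuming the raw OCR text sits in a file named clmp-1996-raw.txt and the cleaned version is written to clmp-1996.txt (both placeholder names):

import re

# Placeholder file names for the raw OCR text and the cleaned output.
with open("clmp-1996-raw.txt", "r", encoding="utf8") as f:
    raw = f.read()

cleaned_lines = []
for line in raw.split("\n"):
    # Collapse runs of spaces and tabs into a single space and trim both ends.
    cleaned_lines.append(re.sub(r"[ \t]+", " ", line).strip())

with open("clmp-1996.txt", "w", encoding="utf8") as f:
    f.write("\n".join(cleaned_lines))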

Step 4: Data Extraction

The data extraction process was divided into two types: non-predefined keys and predefined keys.

4.1 Non-Predefined Keys

After the data cleaning stage, extracting data with non-predefined keys was relatively straightforward. A Python script processed the cleaned text files and extracted the relevant information. The script, shown in the code snippet below, performed the following tasks:

  1. Reading the cleaned text file and removing any unusable characters or short numeric strings.
  2. Defining exact match patterns to identify the beginning of each magazine entry (e.g., “Magazine:”, “Online:”, “Press:”).
  3. Splitting the text into individual magazine entries based on the match patterns.
  4. Extracting key-value pairs from each magazine entry, handling cases where values span multiple lines.
  5. Storing the extracted data in a dictionary format, with each magazine entry represented as a separate object.
  6. Converting the dictionary to a JSON string and writing it to an output file.

This process successfully extracted the non-predefined key data from the cleaned text files.

import json
import codecs

# Open the input text file containing the cleaned data from the OCR PDFs and the output file for the JSON.
IN = codecs.open("clmp-2005-2006.txt", "r", encoding="utf8")
OUT = codecs.open("clmp-2005-2006-submit.txt", "w", encoding="utf8")

# Read the input file, split it into lines, and filter out lines that are either too short (likely errors) or page numbers.
in_list = [
    s.strip()
    for s in IN.read().split("\n")
    if not (len(s.strip()) <= 1 or (len(s.strip()) <= 3 and s.strip().isnumeric()))
]

IN.close()

datas = {}

# Patterns to identify the beginning of new entries in the text file. Define the exact matches.
patterns = ["Magazine:", "Online:", "Press:"]
match_list = []  # This list will contain the indices of lines where patterns are found.

# Loop through each line in the input list to find matches to the patterns.
for index, st in enumerate(in_list):
    # A line starts a new entry if it begins with one of the exact-match patterns.
    if any(st.startswith(pattern) for pattern in patterns):
        match_list.append(index)

length = len(in_list)

# Append a sentinel index so the final entry, which has no following entry to bound it, is also processed.
match_list.append(length)
match_length = len(match_list)

print(length, match_length)  # Debugging print to check sizes (match_length includes the sentinel).

currentIndex = 0
for i in range(0, match_length - 1):
    st, en = match_list[i], match_list[i + 1]
    currentIndex += 1
    data = {}
    idx = st
    while idx < en:
        # Split each line into a key-value pair at the first colon.
        key, value = map(str.strip, in_list[idx].split(":", 1))
        idx += 1
        # If the following line doesn't contain a colon, it's part of the previous value.
        while idx < en and ":" not in in_list[idx]:
            if not value.endswith("-"):
                value += f" {in_list[idx]}"
            else:
                value = value[:-1] + in_list[idx]
            idx += 1
        data[key] = value
    datas[currentIndex] = data

print(currentIndex)  # Debugging print to ensure the expected number of entries were processed; with the sentinel above, this equals the number of pattern matches found.

# Convert the dictionary to a JSON string, ensuring pretty printing.
datas_string = json.dumps(datas, sort_keys=False, indent=4, ensure_ascii=False)

# Write the JSON string to the output file and close it.
OUT.write(datas_string)
OUT.close()

4.2 Predefined Keys

Extracting data with predefined keys proved to be the most challenging aspect of the project. The keys were sometimes embedded within the text, requiring human analysis rather than automated script processing. This led to numerous missing values and multi-line comments.

To tackle this challenge, I integrated the OpenAI API and GPT-4, an AI-powered language model, into the workflow. However, due to the large volume of data, direct processing was not feasible. Instead, the data was segmented into manageable chunks of around 70 lines each. These segments were then sent to GPT-4 along with a prompt to extract the predefined keys within the specified context.

The prompt provided to GPT-4 included instructions on the expected format of the extracted data (JSON) and the specific fields that might appear in the magazine listings. GPT-4 analyzed each chunk of data and returned the extracted information in the desired JSON format, handling missing information and anticipating the beginning of the next listing.

# Client setup (assumed; not shown in the original snippet): the OpenAI Python SDK reads the API key from the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Define a string 'to_system' that describes the role the model should take and the type of data to be processed. This string serves as a structured set of instructions for the AI model to understand the context and the required output format.
to_system = """
You are a programmer.
...
"""

# Define a function 'evaluate_JSON' that takes a text input and returns formatted JSON.
def evaluate_JSON(text):
    # Initialize the flag, return dictionary, and title holder.
    f, ret, next_title = False, {}, ""
    
    try:
        # Call the OpenAI API client to generate completions using the provided model. Here, the conversation messages are structured to pass the 'to_system' instructions and the given 'text' as input for the AI to process.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": to_system},
                {"role": "user", "content": f"'''\n{text}\n ... ...\n'''"},
            ],
        )

        # Process the response to extract the content, and format the JSON with newlines to make it more readable. This step assumes that the response from the AI will be in a string format that resembles JSON.
        rlt = response.choices[0].message.content
        rlt = rlt.replace("{", "\n{\n")
        rlt = rlt.replace("},", "\n},\n")

        # ... rest of the processing code should go here ...
        
    # Exception handling in case of an error during the API call or processing.
    except Exception as e:
        print(f"An error occurred: {e}")

    # The function would return the formatted JSON content.
    return ret
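The chunking loop itself is not shown above. Here is a minimal sketch of how the cleaned text could be split into segments of roughly 70 lines and run through evaluate_JSON, with the per-chunk results merged into one dictionary. The file names and the assumption that evaluate_JSON returns a dictionary of listings are mine, not the original code.

import codecs
import json

CHUNK_SIZE = 70  # Approximate number of lines per request, as described above.

with codecs.open("clmp-2005-2006.txt", "r", encoding="utf8") as f:
    lines = [s for s in f.read().split("\n") if s.strip()]

all_entries = {}
entry_count = 0

# Walk the cleaned text in ~70-line windows and let the model extract each window.
# A fixed window can split a listing across chunks; the prompt instructs the model
# to handle missing information and anticipate the start of the next listing.
for start in range(0, len(lines), CHUNK_SIZE):
    chunk = "\n".join(lines[start:start + CHUNK_SIZE])
    extracted = evaluate_JSON(chunk)  # Assumed to return a dict of listings for this chunk.
    for entry in extracted.values():
        entry_count += 1
        all_entries[entry_count] = entry

with codecs.open("clmp-2005-2006-predefined.json", "w", encoding="utf8") as out:
    out.write(json.dumps(all_entries, indent=4, ensure_ascii=False))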

Conclusion

Converting scanned PDFs of the CLMP Directory of Literary Magazines into clean, well-structured JSON was a multi-step process that involved digitization of the directories, PDF to text conversion, data cleaning, and data extraction. The project utilized Python scripts and AI-powered tools like GPT to overcome challenges such as embedded keys and large data volumes.

By segmenting the data into manageable chunks and providing clear instructions to GPT-4, I successfully extracted the desired information in a structured JSON format.

Result Example

I started with print issues containing thousands of magazine listings, each including different variations of the listing information, like this:

Screenshot of the listing information for a single literary magazine called ABACUS.

And I was able to turn all of the listing information across the issues into clean, structured JSON.
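As a purely illustrative stand-in for the actual output, a single extracted entry has roughly this shape. The numeric key and the "Magazine" field come from the extraction scripts above; the other field names and all of the values are made-up placeholders, not real directory data.

{
    "1": {
        "Magazine": "Example Literary Review",
        "Editors": "Jane Doe",
        "Address": "123 Example St., Anytown, NY 10000",
        "Circulation": "500"
    }
}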

This project demonstrates the potential of combining traditional programming techniques with AI-powered tools to tackle complex data extraction tasks in the digital humanities field. The resulting JSON data can now be used for further analysis and research, opening up new possibilities for exploring the historical content of the CLMP Directory of Literary Magazines.

