Abstract image of data being extracted from a print book and flying into a mobile computer.

Converting Books to JSON: A Digital Humanities Project

Note: This post was adapted on November 6, 2024, from my ACH2024 presentation titled "A hybrid approach to digitizing and structuring complex print texts using AI."

Introduction

If you’re reading this, you are probably in a situation similar to the one I found myself in: I had a set of books that had not been digitized, and I wanted to perform document analysis across this dataset. Converting historical documents from their original formats into structured, machine-readable data is a crucial step in enabling data analysis. However, the path from print book to machine-readable data presents several challenges, including digitization, optical character recognition (OCR), data cleaning, structure and formatting, data extraction, and variations and special cases across issues or volumes. This was particularly tricky for this project because each issue had (1) different standard listing fields, and (2) individual listings that included or omitted different parts of the listing information. I couldn’t manually capture all of this information without spending several weeks on painful data scraping and cleaning, and I couldn’t use my typical approach of defining the document structure up front because there was so much variation between issues and listings.

So… this seemed like a good scenario for combining traditional digitization and computer vision approaches with emergent generative AI models.

This post discusses a recent project where I scanned PDF issues of the Council of Literary Magazines and Presses (CLMP) Directory of Literary Magazines from 1995 to 2005 and converted those 10 directories into clean, well-structured JSON. The process encompassed several stages, including PDF to text conversion, data cleaning, and data extraction using Python scripts and AI-powered tools like GPT.

I cover the options for three different potential audiences: 1. Novice, 2. Intermediate, and 3. Developer.

Step 1: Planning

The initial step, regardless of your skill level, involves: 1) Physical Document Assessment, and 2) Digitization Strategy.

  1. Physical Document Assessment
    • Evaluate your document conditions and quality
    • Make note of recurring patterns and structures
    • Identify special formatting and tables
  2. Digitization Strategy
    • Define desired output format (e.g., JSON, CSV, XML, TXT)
    • Identify data you want to extract
    • Establish data validation requirements

In my case, my documents were in good condition but were complex because they lacked a uniform set of patterns and structures between issues. This prevented me from using a single schema or template across the issues.

Step 2: Digitization

This step in the project involves converting the print documents into digital format. For this step, you will need a scanner or a smartphone with a scanning app (e.g., Adobe Scan, Microsoft Lens).

Before I go through the steps of digitizing the directories, here is a view of the 1996 issue of Directory of Literary Magazines before digitization. All 10 issues of the directory looked very similar to this issue.

Photo of a 1996 issue of CLMP Directory of Literary Magazines. It has a purple cover, and light blue and green text.

In order to prepare the book for scanning, I first loosened the binding glue so I could remove the cover. I applied the hot tip of my soldering iron to the spine of the book, which allowed me to remove the cover non-destructively. Next, I used a razor blade to scrape the remaining glue from the spine.

I then pulled the pages away from the binding in sections of 80 to 100 pages and used a paper trimmer to remove approximately 1/4 inch from the binding edge.

Photo of a CLMP directory on a paper cutter.

The resulting trimmed pages gave me a consistent, clean page edge that prevented the document scanner from jamming, albeit at the expense of the periodical’s original margins.

After trimming the binding edge, I manually checked the pages to make sure none were still stuck together. I then loaded the pages into the document scanner (in this case, a Raven Pro Document Scanner) and set the scan to two-sided, color, and 300 dots per inch (dpi). I used 300 dpi because this is the scan resolution recommended for improved Optical Character Recognition (OCR) accuracy in the Adobe Acrobat Pro documentation (Adobe, 2024).

Step 3: PDF to Text Conversion

The next step in the project is to convert the scanned PDF files into plain text format. In the scenario we’re describing, the documents will typically be “flat,” image-based PDFs, which require some type of OCR to extract the text. It’s at this stage that the approach starts to diverge depending on your background and resources:

  • Novice
    Use a cloud service that allows you to upload files and define a document schema. The best option for this in my experience is Airparser, which provides multiple upload options, simple schema creation and validation, and multiple output options.
  • Intermediate
    You can use existing Python libraries such as Tesseract to OCR the files (see the sketch after this list). You can also use purpose-built software for this, such as Adobe Acrobat Pro or pdf2go.
  • Developer
    Claude recently released a beta feature called Claude PDF Support. Depending on the complexity of your documents, this API simplifies the process because you can send any PDF file to the endpoint and it will both extract the text and provide parsing. Another route is to convert the scans into base64 and send these to OpenAI API or Claude API for extraction and output.
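For the intermediate route, here is a minimal sketch of Tesseract-based OCR in Python. It assumes the pytesseract and pdf2image packages (plus the Tesseract and Poppler system binaries) are installed, and the filenames are placeholders.

# Minimal OCR sketch: render each scanned PDF page to an image, then run Tesseract on it.
import pytesseract
from pdf2image import convert_from_path

# Render the scanned PDF to one image per page (the filename is hypothetical).
pages = convert_from_path("clmp-scan.pdf", dpi=300)

with open("clmp-scan.txt", "w", encoding="utf8") as out:
    for page in pages:
        # OCR a single page image and append the recognized text to the output file.
        out.write(pytesseract.image_to_string(page) + "\n")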

In my case, I used Azure Vision for the OCR step. If I were to do this today, I would use Claude PDF Support. At the time I was choosing between Adobe Acrobat Pro and Azure Vision. I used Azure Vision image-to-text processing because it was much more accurate than Adobe Acrobat Pro OCR (in January 2024, anyway). However, if you do not want to spend money on Azure Vision or Adobe Acrobat Pro, I have had good success with pdf2go. This process laid the foundation for the subsequent data cleaning and extraction stages.
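If you go the developer route of converting scans to base64 and sending them to a vision-capable model, a minimal sketch might look like the following. The model name and filenames are assumptions, and in practice you would loop over every page image.

# Sketch: send one scanned page image to a vision-capable OpenAI model for text extraction.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
import base64
from openai import OpenAI

client = OpenAI()

# Read and encode a single page image (the filename is hypothetical).
with open("page-001.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all of the text from this scanned page."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)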

Step 4: Data Cleaning

Once the PDFs are converted to text files, the next step involves cleaning the data to ensure its usability. This step differs somewhat depending on experience and resources:

  • Novice
    If you use Airparser, you will need to manually spot check your output to make sure it is accurate. If you notice errors, adjust your document template in Airparser and iterate on your prompt.
  • Intermediate & Developer
    For this step, I would recommend sending the extracted text to either Claude or OpenAI API for review and cleaning. After that step, manually review the output to ensure it is accurate. If there are accuracy errors, revise your prompt.

Regardless of which approach you take, you should provide the expected output and validation for the model to review. For example, if you are using JSON, provide the objects and values you expect to be returned and examples of the type of information that would be included as values. Use this information in your prompt to validate and set requirements.
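As an illustration of that kind of prompt, here is a minimal sketch of a cleaning pass using the OpenAI API; the instructions, field names, and filenames are placeholders you would adapt to your own documents.

# Sketch: ask a model to review and clean a chunk of OCR text, spelling out the
# fields it should preserve so it can validate the output.
from openai import OpenAI

client = OpenAI()

cleaning_instructions = """
You are cleaning OCR text from a print directory of literary magazines.
Fix obvious OCR errors, remove page numbers and running headers, and keep
listing fields such as Magazine, Editors, Address, and Phone intact.
Return only the cleaned text.
"""

# Read a chunk of OCR output (the filename is hypothetical).
with open("clmp-raw-ocr.txt", "r", encoding="utf8") as f:
    chunk = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": cleaning_instructions},
        {"role": "user", "content": chunk},
    ],
)

cleaned = response.choices[0].message.content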

In my project, this process included:

  1. Extracting the main content by removing the foreword and indices of the original book.
  2. Eliminating page numbers and headers that were not part of the listing data.
  3. Manually identifying and correcting errors in the text resulting from the AI conversion process.
  4. Utilizing a code script to remove any extra spaces in the text files (a minimal version is sketched below).

These cleaning steps were crucial in preparing the data for the extraction phase.
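For the whitespace cleanup in step 4 above, a minimal version of that script might look like this (the filenames are placeholders):

# Sketch: collapse runs of spaces and tabs and drop blank lines from a cleaned text file.
import re

with open("clmp-1996-cleaned.txt", "r", encoding="utf8") as f:
    lines = f.read().split("\n")

# Normalize internal whitespace and strip each line, then drop lines that are now empty.
normalized = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
normalized = [line for line in normalized if line]

with open("clmp-1996-normalized.txt", "w", encoding="utf8") as f:
    f.write("\n".join(normalized))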

Step 5: Data Parsing

Now that you have clean and accurate text, you can perform parsing and logic on the information. For example, say you were extracting information from academic transcripts. Once you have extracted the data, cleaned it, and have a reliable JSON output, you could then perform calculations or analysis. In the academic transcript example, this could include something like calculating “last half GPA” and providing some logic in the prompt for determining how to add up the last half of credits from the transcript and averaging the GPA.
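To make the transcript example concrete, here is a hypothetical sketch of that kind of post-extraction logic in Python (rather than in the prompt itself), assuming the parsed transcript is a list of course records with credits and grade points:

# Hypothetical sketch: compute a "last half GPA" from already-parsed transcript JSON.
# Assumes each course record has "credits" and "grade_points" (e.g., A = 4.0) fields.
courses = [
    {"term": "2019-fall", "credits": 3, "grade_points": 3.7},
    {"term": "2020-spring", "credits": 4, "grade_points": 4.0},
    {"term": "2020-fall", "credits": 3, "grade_points": 3.3},
    {"term": "2021-spring", "credits": 3, "grade_points": 3.7},
]

# Walk backward from the most recent course until half of the total credits are covered.
total_credits = sum(c["credits"] for c in courses)
last_half, accumulated = [], 0
for course in reversed(courses):
    last_half.append(course)
    accumulated += course["credits"]
    if accumulated >= total_credits / 2:
        break

# Credit-weighted GPA over the selected courses.
gpa = sum(c["credits"] * c["grade_points"] for c in last_half) / sum(c["credits"] for c in last_half)
print(round(gpa, 2))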

In my project, the data parsing process was divided into two types: non-predefined keys and predefined keys.

5.1 Non-Predefined Keys

After the data cleaning stage, extracting data with non-predefined keys was relatively straightforward. A Python script was used to process the cleaned text files and extract the relevant information. The script, shown in the code snippet below, performed the following tasks:

  1. Reading the cleaned text file and removing any unusable characters or short numeric strings.
  2. Defining exact match patterns to identify the beginning of each magazine entry (e.g., “Magazine:”, “Online:”, “Press:”).
  3. Splitting the text into individual magazine entries based on the match patterns.
  4. Extracting key-value pairs from each magazine entry, handling cases where values span multiple lines.
  5. Storing the extracted data in a dictionary format, with each magazine entry represented as a separate object.
  6. Converting the dictionary to a JSON string and writing it to an output file.

This process successfully extracted the non-predefined key data from the cleaned text files.

import json
import codecs

# Open the input text file containing the cleaned data from the OCR PDFs and the output file for the JSON.
IN = codecs.open("clmp-2005-2006.txt", "r", encoding="utf8")
OUT = codecs.open("clmp-2005-2006-submit.txt", "w", encoding="utf8")

# Read the input file, split it into lines, and filter out lines that are either too short (likely errors) or page numbers.
in_list = [
    s.strip()
    for s in IN.read().split("\n")
    if not (len(s.strip()) <= 1 or (len(s.strip()) <= 3 and s.strip().isnumeric()))
]

IN.close()

datas = {}

# Patterns to identify the beginning of new entries in the text file. Define the exact matches.
patterns = ["Magazine:", "Online:", "Press:"]
match_list = []  # This list will contain the indices of lines where patterns are found.

# Loop through each line in the input list to find matches to the patterns.
for index, st in enumerate(in_list):
    match = [st.startswith(pattern) for pattern in patterns]
    if any(match):
        match_list.append(index)

length = len(in_list)
match_length = len(match_list)
match_list.append(length)  # Sentinel index so the final entry is also captured.

print(length, match_length)  # Debugging print to check sizes.

currentIndex = 0
for i in range(0, match_length):
    st, en = match_list[i], match_list[i + 1]
    currentIndex += 1
    data = {}
    idx = st
    while idx < en:
        # Split each line into a key-value pair at the first colon.
        key, value = map(str.strip, in_list[idx].split(":", 1))
        idx += 1
        # If the following line doesn't contain a colon, it's part of the previous value.
        while idx < en and ":" not in in_list[idx]:
            if not value.endswith("-"):
                value += f" {in_list[idx]}"
            else:
                value = value[:-1] + in_list[idx]
            idx += 1
        data[key] = value
    datas[currentIndex] = data

print(currentIndex)  # Debugging print to ensure the expected number of entries was processed; this should equal match_length above.

# Convert the dictionary to a JSON string, ensuring pretty printing.
datas_string = json.dumps(datas, sort_keys=False, indent=4, ensure_ascii=False)

# Write the JSON string to the output file and close it.
OUT.write(datas_string)
OUT.close()

5.2 Predefined Keys

Extracting data with predefined keys proved to be the most challenging aspect of the project. The keys were sometimes embedded within the text, requiring human analysis rather than automated script processing. This led to numerous missing values and multi-line comments.

To tackle this challenge, I integrated the OpenAI API and GPT-4, an AI-powered language model, into the workflow. However, due to the large volume of data, direct processing was not feasible. Instead, the data was segmented into manageable chunks of around 70 lines each. These segments were then sent to GPT-4 along with a prompt to extract the predefined keys within the specified context.

The prompt provided to GPT-4 included instructions on the expected format of the extracted data (JSON) and the specific fields that might appear in the magazine listings. GPT-4 analyzed each chunk of data and returned the extracted information in the desired JSON format, handling missing information and anticipating the beginning of the next listing.

# Import and initialize the OpenAI API client (assumes the OPENAI_API_KEY environment variable is set).
from openai import OpenAI

client = OpenAI()

# Define a string 'to_system' that describes the role of the user and the type of data to be processed. This string serves as a structured set of instructions for an AI model to understand the context and the required output format.
to_system = """
You are a programmer.
...
"""

# Define a function 'evaluate_JSON' that takes a text input and returns formatted JSON.
def evaluate_JSON(text):
    # Initialize the flag, return dictionary, and title holder.
    f, ret, next_title = False, {}, ""
    
    try:
        # Call the OpenAI API client to generate completions using the provided model. Here, the conversation messages are structured to pass the 'to_system' instructions and the given 'text' as input for the AI to process.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": to_system},
                {"role": "user", "content": f"'''\n{text}\n ... ...\n'''"},
            ],
        )

        # Process the response to extract the content, and format the JSON with newlines to make it more readable. This step assumes that the response from the AI will be in a string format that resembles JSON.
        rlt = response.choices[0].message.content
        rlt = rlt.replace("{", "\n{\n")
        rlt = rlt.replace("},", "\n},\n")

        # ... rest of the processing code should go here ...
        
    # Exception handling in case of an error during the API call or processing.
    except Exception as e:
        print(f"An error occurred: {e}")

    # The function would return the formatted JSON content.
    return ret
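
To show how the segmentation described above fits together with evaluate_JSON, here is a minimal sketch of a chunking loop. The 70-line chunk size comes from the description above; the filenames and the merge step are placeholder assumptions, and real code would need to handle listings that are split across chunk boundaries.

# Sketch: split the cleaned text into ~70-line chunks and send each to evaluate_JSON.
import json

with open("clmp-2005-2006-cleaned.txt", "r", encoding="utf8") as f:
    lines = f.read().split("\n")

chunk_size = 70
all_entries = {}

for start in range(0, len(lines), chunk_size):
    chunk = "\n".join(lines[start:start + chunk_size])
    result = evaluate_JSON(chunk)  # returns a dict of extracted listings
    all_entries.update(result)     # naive merge; listings split across chunks need extra handling

with open("clmp-2005-2006-predefined.json", "w", encoding="utf8") as f:
    json.dump(all_entries, f, indent=4, ensure_ascii=False)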

Conclusion

Converting scanned PDFs of the CLMP Directory of Literary Magazines into clean, well-structured JSON was a multi-step process that involved digitization of the directories, PDF to text conversion, data cleaning, and data extraction. The project utilized Python scripts and AI-powered tools like GPT to overcome challenges such as embedded keys and large data volumes.

By segmenting the data into manageable chunks and providing clear instructions to GPT-4, I successfully extracted the desired information in a structured JSON format.

Result Example

I started with print issues containing thousands of magazines, each with variations in its listing information, like this:

Screenshot of the listing information for a single literary magazine called ABACUS.

And I was able to turn all of the listing information across the issues into JSON like this:
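
As an illustration only (the field names and values here are hypothetical placeholders, not an actual directory entry), a single listing in the output might look something like this:

{
    "1": {
        "Magazine": "Example Review",
        "Editors": "Jane Doe",
        "Address": "123 Main Street, Anytown, NY 10001",
        "Phone": "555-555-0100",
        "Type of material published": "Poetry, fiction, essays",
        "Year founded": "1984",
        "Frequency": "Biannual"
    }
}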

This project demonstrates the potential of combining traditional programming techniques with AI-powered tools to tackle complex data extraction tasks in the digital humanities field. The resulting JSON data can now be used for further analysis and research, opening up new possibilities for exploring the historical content of the CLMP Directory of Literary Magazines.

