
Extraction Methodology

Below is a step-by-step account of our source material and the methods we deployed to extract entry data from the pages.

1– Understanding the Page Layout

[Figure: testPage1.jpg – sample directory page]

The page images for this project came from the New York Public Library archive, dated between 1849 and 1922. These images were digitized into TIFF files and shared with us for text extraction. The pages are structured similarly across the years: columns of printed entry data, where each entry comprises four primary elements – name, occupation, business address, and (inconsistently) home address. Early directories (the first 41) divide each page into two columns; later directories include three, four, or five columns of entries per page. Later directories also intersperse advertisements throughout the pages, both within and bordering the columns. Note the indented lines in the sample page above: these indentations indicate a continuation of the entry directly preceding, and they were our primary point of intervention for improving the extraction.

2– Isolating the Columns

[Figure: houghTransform-2.png – column separators detected via the Hough transform]

We deployed a Canny edge detector to identify the edges within the page image, followed by a Hough transform to find the straight lines separating the columns. We then used OCR to identify the outer margins of the page image so that it could be cropped, leaving a page image containing only the entry text we were interested in capturing. Using the lines identified via the Hough transform, we cropped out each column, allowing for isolated analysis with reduced noise.
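
In outline, the column-detection step might look like the sketch below, which pairs OpenCV's Canny and probabilistic Hough implementations. The file name, thresholds, and the handling of duplicate detections are our own illustrative assumptions, and the OCR-based margin cropping described above is omitted.

```python
import cv2
import numpy as np

# Illustrative file name and thresholds, not the project's actual parameters.
page = cv2.imread("testPage1.jpg", cv2.IMREAD_GRAYSCALE)

# Canny edge detection, then a probabilistic Hough transform to find
# long, near-vertical lines: the column separators.
edges = cv2.Canny(page, 50, 150)
lines = cv2.HoughLinesP(
    edges, rho=1, theta=np.pi / 180, threshold=200,
    minLineLength=int(page.shape[0] * 0.5),  # at least half the page tall
    maxLineGap=20,
)

# Keep only roughly vertical lines and record their x positions. In
# practice, detections a few pixels apart would be merged into one separator.
xs = []
if lines is not None:
    for x1, y1, x2, y2 in lines.reshape(-1, 4):
        if abs(int(x1) - int(x2)) < 10:  # roughly vertical
            xs.append((int(x1) + int(x2)) // 2)
xs = sorted(set(xs))

# Crop out each column between consecutive separator positions.
bounds = [0] + xs + [page.shape[1]]
columns = [page[:, left:right] for left, right in zip(bounds, bounds[1:])]
```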

3– Extracting & Tagging the Data

[Figure: half2sample.jpg – cropped column]

Running OCR over the cropped column allowed us to capture the pixel placement of each line – via the bounding boxes formed over the text – and, more importantly, to capture which lines in the column were indented (an indent indicates that the line is a continuation of the entry directly preceding). We then organized the resulting textual output into a dataframe, grouped the data by line, and affixed text identified as “indented” to the text of the previous line.
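
A minimal sketch of this step, assuming Tesseract (via pytesseract) as the OCR engine and pandas for the dataframe; the file name and the 15-pixel indent threshold are illustrative, not the project's actual values.

```python
import pandas as pd
import pytesseract
from PIL import Image

column = Image.open("half2sample.jpg")  # illustrative file name

# image_to_data returns one row per detected word, including its
# bounding box (left, top, width, height) and line numbering.
data = pytesseract.image_to_data(column, output_type=pytesseract.Output.DATAFRAME)
data = data[data["conf"] > 0].copy()  # drop non-word rows
data["text"] = data["text"].astype(str)

# Reassemble each physical line and record its left-most pixel.
lines = (
    data.groupby(["block_num", "par_num", "line_num"])
        .agg(text=("text", " ".join), left=("left", "min"))
        .reset_index()
)

# A line whose left edge sits noticeably inside the column margin is a
# continuation, so affix it to the entry directly preceding.
margin = lines["left"].min()
INDENT_PX = 15  # illustrative threshold

entries = []
for _, row in lines.iterrows():
    if entries and row["left"] > margin + INDENT_PX:
        entries[-1] += " " + row["text"]
    else:
        entries.append(row["text"])
```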

 

We used regular expressions to deploy a simple tagging methodology based on the syntax of each entry: words beginning with capital letters were tagged as “name”; the word immediately following the name was tagged as “occupation”; the remaining text was tagged as “work address,” unless an “h.” was present, in which case the text following the “h.” was tagged as “home address.”
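
The sketch below shows one way such a rule could be written as a single regular expression, assuming entries shaped like “Smith John. carpenter 14 Pearl h. 9 Cherry”; the pattern is illustrative and simpler than a production rule set would need to be.

```python
import re

ENTRY = re.compile(
    r"^(?P<name>(?:[A-Z][\w.']*\s+)+)"   # leading capitalized words: the name
    r"(?P<occupation>[a-z][\w.]*)\s+"    # first lower-case word: the occupation
    r"(?P<work>.*?)"                     # remainder up to "h." is the work address
    r"(?:\s+h\.\s+(?P<home>.*))?$"       # text after "h." is the home address
)

def tag(entry: str) -> dict:
    """Tag one directory entry; falls back to the raw text on no match."""
    match = ENTRY.match(entry)
    if match is None:
        return {"raw": entry}
    fields = match.groupdict()
    fields["name"] = fields["name"].strip()
    return fields

print(tag("Smith John. carpenter 14 Pearl h. 9 Cherry"))
# {'name': 'Smith John.', 'occupation': 'carpenter',
#  'work': '14 Pearl', 'home': '9 Cherry'}
```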

4– Scaling Up

[Figure: Extracted Strings.png – sample of extracted entries]

The final step – and the one that gave us trouble – was scaling this process up across all of the directories. Our attempt at scaling the methodology across the 41 two-column directories yielded less encouraging results than when it was deployed over a single page. The process extracted slightly over 125,000 entries; though we were excited to see that we had succeeded in deployment, this number was much lower than expected. Investigation into the output revealed that several pages were skipped over entirely, including five whole directories. However, the entries that were extracted were tagged consistently and accurately using our new tagging methodology.
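
One way to track down such silent failures is a batch driver that logs per-page outcomes. The sketch below assumes a hypothetical process_page() wrapper around the cropping, OCR, and tagging steps described above; it is not the pipeline we actually ran.

```python
import logging
from pathlib import Path

logging.basicConfig(filename="extraction.log", level=logging.INFO)

def process_page(page_path: Path) -> list:
    """Hypothetical wrapper around the cropping, OCR, and tagging steps above."""
    ...

def run_directory(directory: Path) -> list:
    entries = []
    for page_path in sorted(directory.glob("*.tif")):
        try:
            page_entries = process_page(page_path)
            if not page_entries:
                # Surfaces silently skipped pages instead of losing them.
                logging.warning("no entries extracted from %s", page_path)
                continue
            entries.extend(page_entries)
        except Exception:
            logging.exception("failed on page %s", page_path)
    return entries
```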

