
Recovering the Business and Economic History of New York City Through Large-Scale City Directory Data

Mission

The collection of business and residential data is a societal practice that long predates modern computational analysis. These historical records fall outside what one might typically consider “computer-friendly data”: before contemporary computing, printed records were trapped within physical books and could only be accessed one tome, one page, one line at a time. Through the application of modern machine learning techniques, these printed directory data can now be extracted and digitized, allowing for large-scale consumption and analysis.

The New York Public Library holds several directories of this type within its archives. In 2016, many of these records were digitized, with the end goal of creating a searchable database of historical information. The initial attempt at extracting and classifying the data (from directories dated between 1849 and 1922) using OCR methods achieved an estimated 90% accuracy rate. Our mission is to explore both the methods used and the data extracted, in hopes of finding ways to breathe life into this project once more.

Goals

Our capstone centers on the previously completed research, with three primary goals: exploring opportunities to increase the accuracy of the data extraction methodology, improving the reproducibility of the methods previously used, and exploring the extracted data to demonstrate their potential for expanding our understanding of 19th-century New York City.

1 – ACCURACY

We hope to explore novel methods for reshaping the extraction process in order to improve the accuracy of the data.

2 – REPRODUCIBILITY

We strive to complete our work using open-source and fully reproducible methods, so that anyone interested in this work can continue our efforts.

3 – EXPLORATION

We will conduct broad analyses of the extracted data in hopes of showcasing the possibilities for future analyses.

Scope

The scope of our extraction methods will begin with 2-column directory pages, which comprise the majority of the data. The scope of the analyses will cover the data previously extracted from directories dated between 1849 and 1879, as these entries have been geotagged with latitude and longitude values, allowing for geospatial analysis.
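
To illustrate why 2-column pages make a natural starting point, here is a minimal sketch of one common way to separate the two columns before OCR: finding the emptiest vertical strip near the centre of a binarized page. This is an assumption for illustration only, not the method used in the original extraction; the function names `find_gutter` and `split_columns`, the `margin_frac` parameter, and the list-of-lists page representation are all hypothetical.

```python
from typing import List, Tuple

Page = List[List[int]]  # binarized page image: 1 = ink, 0 = background


def find_gutter(page: Page, margin_frac: float = 0.25) -> int:
    """Return the x index of the emptiest pixel column in the central band.

    Hypothetical helper: assumes the page is already binarized and deskewed.
    """
    width = len(page[0])
    # Vertical projection profile: ink count for each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]
    # Search only the central band so the outer page margins
    # are not mistaken for the gutter between columns.
    lo = int(width * margin_frac)
    hi = int(width * (1 - margin_frac))
    return min(range(lo, hi), key=lambda x: profile[x])


def split_columns(page: Page) -> Tuple[Page, Page]:
    """Split a 2-column page into left and right halves at the gutter."""
    gutter = find_gutter(page)
    left = [row[:gutter] for row in page]
    right = [row[gutter:] for row in page]
    return left, right


# Toy page: two blocks of ink separated by a blank gutter.
toy_page = [[0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0] for _ in range(4)]
left_col, right_col = split_columns(toy_page)
```

Pages with three or more columns, or with advertisements that span the full width, would need a more general layout analysis, which is one reason to restrict the initial scope.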