The following project pitch was submitted to the National Science Foundation's (NSF) Seed Fund program on 25-Sep-2019, under the "SBIR: Artificial Intelligence (AI)" Topic Area. The project was subsequently accepted by the relevant NSF Program Director on 19-Oct-2019, resulting in an invitation to submit a full proposal for the NSF SBIR/STTR Phase I program. This invitation is valid for a period of one year, expiring on 19-Oct-2020.
Briefly Describe the Technology Innovation.
Up to 500 words describing the technical innovation that would be the focus of a Phase 1 project, including a brief discussion of the origins of the innovation as well as an explanation as to why it meets the program’s mandate to focus on supporting research and development (R&D) of unproven, high-impact innovations.
Significant resources are expended to digitize the physical collections of Cultural Heritage Organizations and to make these collections widely accessible and useful. The creation of associated metadata is crucial to each collection's usability and accounts for roughly 29% of total digitization costs, with the manual nature of descriptive metadata generation being a major cost driver.
Despite these expenditures, the metadata found in existing collections is often incomplete, inconsistent, and incorrect. This became apparent to me personally while attempting to advance a research project using several of these collections, only to be frustrated by sparse and non-standard metadata. Following this experience, I resolved to better understand, and attempt to address, the underlying problems.
This project aims to significantly reduce the digitization costs of cultural heritage collections while simultaneously improving their usefulness. I believe this is attainable via the novel application of soft computing techniques toward the following objectives:
- The automated generation of descriptive metadata aligned with widely used taxonomies.
- The temporal and spatial labeling of unidentified documents.
- The automated detection of errors in existing descriptive metadata.
At its core, the solution involves training a collection of classification models to recognize specific attributes of documents, covering both subject matter (e.g., “portrait”) and format (e.g., “daguerreotype”). Each classification model would be treated as a “membership function” defining the fuzzy set of documents containing that attribute.
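As a rough illustration of this fuzzy-set framing, the sketch below treats each attribute's scoring function as a membership function and collects the sets a document belongs to. The scorers here are toy stand-ins (keyword matching against an existing description); in practice each would be a trained image classifier emitting a confidence in [0, 1]. All function and variable names are illustrative, not part of the proposed system.

```python
from typing import Callable, Dict

def make_stub_scorer(keyword: str) -> Callable[[dict], float]:
    # Toy membership function: degree 1.0 if the keyword appears in the
    # document's description, else 0.0. A real implementation would run
    # an image classifier and return its confidence score.
    return lambda doc: 1.0 if keyword in doc.get("description", "") else 0.0

# One membership function per attribute of interest.
MEMBERSHIP_FUNCTIONS: Dict[str, Callable[[dict], float]] = {
    "portrait": make_stub_scorer("portrait"),
    "daguerreotype": make_stub_scorer("daguerreotype"),
}

def memberships(doc: dict, cutoff: float = 0.5) -> set:
    """Return the fuzzy sets this document belongs to (degree >= cutoff)."""
    return {attr for attr, fn in MEMBERSHIP_FUNCTIONS.items()
            if fn(doc) >= cutoff}

doc = {"description": "Cased daguerreotype portrait of an unknown man"}
print(sorted(memberships(doc)))  # ['daguerreotype', 'portrait']
```

The 0.5 cutoff is an arbitrary defuzzification threshold; a production system might instead carry the membership degrees forward.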
For metadata generation, we associate each membership function with the relevant node of a standard taxonomy, for example, the Library of Congress Subject Headings (LCSH). A given document would then receive the taxonomical tags associated with each set to which it belongs. As an example, any document found to be a member of the "daguerreotype" set would receive the LCSH tag "sh85035408", which covers daguerreotypes.
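A minimal sketch of that membership-to-taxonomy mapping follows. The "sh85035408" identifier (daguerreotypes) comes from the text above; the "portrait" entry deliberately uses a fake placeholder rather than a guessed LCSH ID.

```python
# Mapping from learned attributes to taxonomy nodes.
ATTRIBUTE_TO_LCSH = {
    "daguerreotype": "sh85035408",   # LCSH heading cited in the pitch
    "portrait": "sh-PLACEHOLDER",    # hypothetical, not a real LCSH ID
}

def tags_for(member_sets: set) -> list:
    """Map the fuzzy sets a document belongs to onto taxonomical tags."""
    return sorted(ATTRIBUTE_TO_LCSH[attr] for attr in member_sets
                  if attr in ATTRIBUTE_TO_LCSH)

print(tags_for({"daguerreotype"}))  # ['sh85035408']
```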
A subset of our membership functions will define “temporally differentiable” attributes. That is to say that a document’s membership in these sets tells us something about when the document was created. With these membership functions, and a large pre-labeled dataset, we can construct temporal histograms reflecting the historical distribution of each attribute. A similar approach can be taken to understand spatio-temporal distributions, as appropriate.
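The histogram construction described above could look roughly as follows, using a few invented sample records in place of a real labeled corpus. Decade-level binning is an assumption for illustration; the pitch does not specify a bin width.

```python
from collections import Counter

# Invented sample records standing in for a large pre-labeled dataset.
labeled = [
    {"year": 1845, "attributes": {"daguerreotype"}},
    {"year": 1851, "attributes": {"daguerreotype", "portrait"}},
    {"year": 1902, "attributes": {"portrait"}},
]

def decade(year: int) -> int:
    return (year // 10) * 10

def attribute_histograms(records: list) -> dict:
    """Build one temporal histogram (counts per decade) per attribute."""
    hists: dict = {}
    for rec in records:
        for attr in rec["attributes"]:
            hists.setdefault(attr, Counter())[decade(rec["year"])] += 1
    return hists

hists = attribute_histograms(labeled)
print(dict(hists["daguerreotype"]))  # {1840: 1, 1850: 1}
```

The same accumulation could be extended with a location key per record to produce the spatio-temporal distributions mentioned above.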
These distributions can be used both to make temporal predictions for unidentified documents (as documents will likely abide by the observed distributions), and to uncover potentially incorrect metadata via outlier identification. The specificity of our predictions can be increased by combining the distributions of multiple attributes, and by "breaking" attributes into more-specific categories.
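One way the combination and outlier-detection steps could work is sketched below. It multiplies normalized per-attribute histogram masses, an independence (naive-Bayes-style) assumption not stated in the pitch, and the histogram counts are invented for illustration.

```python
from collections import Counter

# Invented per-attribute temporal histograms (counts per decade).
HISTS = {
    "daguerreotype": Counter({1840: 40, 1850: 55, 1860: 5}),
    "portrait": Counter({1840: 10, 1850: 20, 1860: 30, 1900: 40}),
}

def predict_decade(attrs: set, hists: dict = HISTS) -> int:
    """Pick the decade with the highest combined probability mass,
    multiplying per-attribute masses (independence assumption)."""
    decades = set().union(*(hists[a] for a in attrs))
    def mass(d: int) -> float:
        p = 1.0
        for a in attrs:
            p *= hists[a][d] / sum(hists[a].values())
        return p
    return max(decades, key=mass)

def is_outlier(attrs: set, labeled_decade: int,
               hists: dict = HISTS, floor: float = 0.01) -> bool:
    """Flag a label if any attribute assigns < `floor` probability
    mass to the labeled decade."""
    return any(hists[a][labeled_decade] / sum(hists[a].values()) < floor
               for a in attrs)

print(predict_decade({"daguerreotype", "portrait"}))  # 1850
print(is_outlier({"daguerreotype"}, 1900))            # True
```

The `floor` threshold is arbitrary here; in practice it would be tuned against the labeled dataset.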
Though some relevant research has applied similar techniques to subsets of this problem (e.g., the temporal identification of color photographs), little effort appears to have gone into the practical, systematic application of these techniques to the specific requirements of this domain. Given this, and the potential value of a practical solution, this is believed to be a worthy area of investment.
Briefly Describe the Technical Objectives and Challenges.
Up to 500 words describing the R&D or technical work to be done in a Phase I project, including a discussion of how and why the proposed work will help prove that the product or service is technically feasible and/or significantly reduces technical risk. Discuss how, ultimately, this work could contribute to making the new product, service, or process commercially viable and impactful. This section should also convey that the proposed work meets definition of R&D, rather than straightforward engineering or incremental product development tasks.
It is believed that a successful Phase I implementation can be limited in scope to processing photographs in batch, but it should demonstrate an ability to:
- Automatically generate taxonomical tags for a representative set of descriptive metadata attributes.
- Make reasonable, evidence-based temporal predictions for items.
- Identify items which may have incorrect temporal metadata labels.
By proving these capabilities, we can demonstrate an approach to reducing the costs of manually creating new metadata, as well as to improving the quality of existing metadata. With this promise, we believe we can encourage the participation of a number of target customers, whose engagement will be vital in guiding the solution toward product-market fit.
The work involved in demonstrating these capabilities can coarsely be categorized as follows:
1. Acquire a relatively large labeled dataset of historic photographs.
2. Select a representative set of metadata attributes.
3. Train membership functions for each of the selected attributes.
4. Develop a process for automatically generating a list of taxonomical metadata labels based on set membership.
5. Develop a process for automatically predicting temporal labels based on set membership.
6. Develop a process for discovering erroneous labels by identifying temporal outliers.
It is generally understood that progress in soft computing is constrained by the availability of labeled data. Though some relevant research appears to have been affected by this constraint, it is not believed to be a significant impediment to our progress. This is due to the efforts of a handful of organizations focused on the aggregation and standardization of the metadata associated with many relevant digital collections, for example, the Digital Public Library of America (DPLA). Their work largely eliminates the risk that would otherwise be inherent in #1 (above).
Therefore, based on current understanding, it is believed that most of our technical risk lies in items #3, #5, and #6.
Regarding the training of membership functions, it is unknown to what extent this will require the use of fine-grained classification techniques. Given the relative immaturity of proven fine-grained techniques (as compared to coarse-grained approaches), this may require additional work to assess and implement an effective solution.
Regarding the prediction of temporal labels, there exists limited relevant research and no known practical implementations. Though we believe the use of observed distributions, in combination with approaches outlined by prior research efforts, will result in a viable solution, a rigorous empirical assessment of the approach is warranted. Given the availability of an existing labeled dataset, such an assessment appears feasible.
Technically, the automatic identification of erroneous labels is closely tied to the work of predicting temporal labels, and shares much of the same technical risk. The quantitative assessment of the approach, though, appears less obvious, and may require coordination with the publishing organizations for confirmation.
Given the study required, and the novel application of emerging techniques within this domain, the described work appears well aligned with the NSF’s stated R&D requirements.
Briefly Describe the Market Opportunity.
Up to 250 words describing the customer profile and pain point(s) that will be the near-term commercial focus related to this technical project.
The initial customer focus includes organizations involved in the digitization or aggregation of cultural heritage collections:
- Cultural Heritage Organizations (CHOs), including libraries, museums, archives, and historical societies.
- Aggregation Organizations, e.g., the Digital Public Library of America (DPLA).
- Digitization Service Organizations, e.g., Everpresent, and Backstage Library Works.
Digitization projects include significant manual effort related to the curation of descriptive metadata. Requirements for this curation process are typically defined by the publishing organization, though work can fall to any of the listed organization types, depending on the structure of the project.
The inherent manual effort of this process leads to higher costs and potential errors. Given the constrained budgets of these projects, this represents the significant pain point we hope to address.
In estimating the number of CHOs in the US, we find there are roughly:
- 9,000 public libraries.
- 3,000 academic libraries.
- 5,000 historical societies and preservation offices.
Though a full picture of digitization project expenditures is unavailable, the grants awarded by the National Endowment for the Humanities (NEH) and the Council on Library and Information Resources (CLIR) are instructive:
- NEH awarded $230,000,000 from 2008-2017 through its related divisions (the offices of Preservation and Access, and Digital Humanities).
- CLIR has awarded $4,000,000 annually since 2015, as part of their Digitizing Hidden Collections program.
Though not part of our initial focus, we believe that opportunity may also exist with international CHOs, collection management software (CMS) providers, and non-CHO archives.