The SCAPE (Scalable Preservation Environments) Project and the OPF (Open Planets Foundation) ran a hackathon at the Austrian National Library in Vienna on 2-4 December. The hackathon was aimed at developers and practitioners and focused on Hadoop, an open-source software framework for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale out from single servers to thousands of machines.
The hackathon investigated two digital preservation scenarios:
- Web-Archiving: File Format Identification/Characterisation
A web archive usually contains a wide range of different file types. From a curatorial perspective the question is: “Do I need to be worried? Is there a risk that means I should take adequate measures right now?”
The first step is to reliably identify and characterise the content of a web archive. Linguistic analysis can help categorise “text/plain” content into more precise content types, and a detailed analysis of “application/pdf” content can help cluster file properties and identify characteristics of special interest. Using the Hadoop framework, together with prepared sample projects for processing web archive content, enables us to perform any kind of processing or analysis we come up with at large scale on a Hadoop cluster. Participants discussed the requirements for enabling this and identified what still needs to be optimized.
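To illustrate the shape of such a job, here is a minimal Hadoop Streaming-style sketch in Python that counts MIME types across a web archive. It is not SCAPE's actual code: the tab-separated "url, mime-type" record layout is a hypothetical stand-in for output from an upstream identification tool (e.g. Apache Tika or DROID), and the `__main__` block only simulates the map, shuffle/sort, and reduce phases locally that Hadoop would distribute across a cluster.

```python
"""Sketch: count MIME types in a web archive, Hadoop Streaming style.

Assumes (hypothetically) each input line is a record "url<TAB>mime-type"
produced by an upstream identification tool; the layout is illustrative.
"""
from itertools import groupby


def mapper(lines):
    """Emit (mime-type, 1) for every well-formed record, as a Hadoop mapper would."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            yield parts[1], 1


def reducer(pairs):
    """Sum the counts per MIME type; input must be sorted by key (the shuffle phase)."""
    for mime, group in groupby(pairs, key=lambda kv: kv[0]):
        yield mime, sum(count for _, count in group)


if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce; on a cluster,
    # Hadoop Streaming would run the two phases on separate nodes.
    records = [
        "http://example.org/a\ttext/plain",
        "http://example.org/b\tapplication/pdf",
        "http://example.org/c\ttext/plain",
    ]
    for mime, total in reducer(sorted(mapper(records))):
        print(f"{mime}\t{total}")
```

The same mapper and reducer could be handed to `hadoop jar hadoop-streaming.jar` unchanged, which is what makes this pattern attractive for scaling an analysis from a laptop sample to a full archive.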
- Digital Books: Quality Assurance, text mining (OCR Quality)
The digital objects in the Austrian National Library’s digital book collection consist of the aggregated book object, with technical and descriptive metadata, and the images, layout and text content for the book pages. Because of the massive scale of digitization in a relatively short time period, and because the digitized books date from the 18th century and earlier, there are different types of quality issues. Using the Hadoop framework provides the means to perform any kind of large-scale book processing at book or page level. Linguistic analysis and language detection, for example, can help us determine the quality of the OCR (Optical Character Recognition), and image analysis can help detect technical or content-related issues with the book page images.
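One simple linguistic check of the kind described, sketched here under assumptions rather than taken from the project, is the fraction of a page's tokens that appear in a reference wordlist: garbled OCR output scores low. The tiny German wordlist below is purely illustrative; a real run would load a full dictionary and could score each page in parallel as a Hadoop task.

```python
import re

# Illustrative stand-in for a real dictionary; a production job would load
# a full wordlist for the detected language of the book.
SAMPLE_WORDLIST = {"die", "und", "der", "das", "ist", "ein", "buch", "seite"}


def ocr_quality_score(page_text, wordlist=SAMPLE_WORDLIST):
    """Return the share of alphabetic tokens found in the wordlist (0.0 to 1.0)."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", page_text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in wordlist)
    return hits / len(tokens)
```

Pages scoring below some threshold could then be flagged for manual inspection or re-OCR, turning a vague sense of "quality issues" into a rankable, per-page metric.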
Highlights of this hackathon included:
- Talks from the guest speaker, Jimmy Lin, University of Maryland. Jimmy has been working with Big Data and Hadoop for many years, with a focus on natural language processing and information retrieval;
- Taking part in the competition for the best idea and visualization;
- A chance to gain hands-on experience carrying out identification and characterisation experiments;
- Practitioners and developers working together to address digital preservation challenges;
- The opportunity to share experiences and knowledge about implementing Hadoop.
Who attended?
Practitioners (digital librarians and archivists, digital curators, repository managers, or anyone responsible for managing digital collections): they had the opportunity to learn how Hadoop might fit their organization, how to write requirements to guide development, and to gain hands-on experience using the tools themselves and finding out how they work;
Developers of all experience levels could participate: from writing their first Hadoop jobs to working on scalable solutions for the issues identified in the scenarios.
Practitioners and developers worked together in groups to address digital preservation challenges using Hadoop. Practitioners took the role of issue champion, articulating their requirements to the developers and documenting them on the wiki. Developers brainstormed ideas and worked on solutions to the issues. There were regular check-in points to gather feedback and refine requirements. The best issue champion and the best development solution each received a prize.
All the participants gained practical experience of using digital preservation tools in characterization and quality assurance processes. Step-by-step worksheets were provided for those who were less familiar with using the command line and experts were on hand to help them.
There were plenty of opportunities for discussion, a session for sharing experiences, reports on research projects, and a breakout space for lightning talks.
View the hackathon Agenda
Visit the event wiki page