SCAPE and OPF’s hackathon on Hadoop


The SCAPE (Scalable Preservation Environments) project and the Open Planets Foundation (OPF) ran a hackathon at the Austrian National Library in Vienna on 2-4 December. The hackathon was aimed at developers and practitioners and focused on Hadoop, an open source software framework for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale out from single servers to thousands of machines.
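
As a flavour of that “simple programming model”, here is the canonical word-count job written against the Hadoop MapReduce API: a mapper emits a (word, 1) pair for every token, and a reducer sums the counts per word. This is a minimal sketch for orientation only, not code from the hackathon materials.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer (also used as combiner): sum the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Hadoop distributes the map tasks across the cluster and handles the shuffle and failure recovery itself, which is what lets the same code scale from a single server to thousands of machines.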

The hackathon investigated two digital preservation scenarios:

  • Web-Archiving: File Format Identification/Characterisation

A web archive usually contains a wide range of different file types. From a curatorial perspective, the question is: “Do I need to be worried? Is there a risk that means I should take adequate measures right now?”

The first step is to reliably identify and characterise the content of a web archive. Linguistic analysis can help categorise “text/plain” content into more precise content types, and a detailed analysis of “application/pdf” content can help cluster file properties and identify characteristics of special interest. Using the Hadoop framework and the prepared sample projects for processing web archive content, participants could run any kind of processing or analysis they came up with at large scale on a Hadoop cluster. The group discussed the requirements for enabling this and identified what still needs to be optimized. A minimal sketch of one such identification job follows below.
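
As a rough illustration of the identification scenario, the sketch below tallies how often the MIME type reported by the web server disagrees with the type detected by a characterisation tool. It assumes a hypothetical earlier pass has already reduced the archive to plain text lines of the form URL <TAB> reported type <TAB> detected type; the class names and input layout are illustrative, not part of the SCAPE sample projects.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Counts how often the server-reported MIME type disagrees with the
    // tool-detected type, producing one output line per conflicting pair.
    public class MimeConflictCount {

      public static class ConflictMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Expected (assumed) input line: URL <TAB> reported type <TAB> detected type
          String[] fields = value.toString().split("\t");
          if (fields.length == 3 && !fields[1].equals(fields[2])) {
            pair.set(fields[1] + " -> " + fields[2]);
            context.write(pair, ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mime conflict count");
        job.setJarByClass(MimeConflictCount.class);
        job.setMapperClass(ConflictMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The output is one line per conflicting type pair with its frequency, giving curators a quick overview of where identification in the archive is unreliable.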

  • Digital Books: Quality Assurance, Text Mining (OCR Quality)

The digital objects in the Austrian National Library’s digital book collection each consist of the aggregated book object, with technical and descriptive metadata, and the images, layout and text content for the book pages. Due to the massive scale of digitization in a relatively short time period, and because the digitized books date from the 18th century and earlier, there are various types of quality issues. The Hadoop framework provides the means to perform any kind of large-scale book processing at the book or page level. Linguistic analysis and language detection, for example, can help us determine the quality of the OCR (Optical Character Recognition), while image analysis can help detect technical or content-related issues with the book page images. A rough sketch of such an OCR quality job appears below.
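
As a sketch of the OCR quality idea, the job below computes a crude per-page quality score (the fraction of tokens that are purely alphabetic) and averages it per book. The input layout (one line per page: bookId <TAB> page number <TAB> OCR text) and the scoring heuristic are assumptions made for illustration; a real pipeline would use proper language detection or dictionary lookups.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Scores OCR output per page and averages the score per book, so that
    // books with suspiciously garbled text can be flagged for inspection.
    public class OcrQuality {

      public static class PageScoreMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text bookId = new Text();
        private final DoubleWritable score = new DoubleWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Expected (assumed) input line: bookId <TAB> page number <TAB> OCR text
          String[] fields = value.toString().split("\t", 3);
          if (fields.length < 3) return;
          String[] tokens = fields[2].split("\\s+");
          if (tokens.length == 0) return;
          int plausible = 0;
          for (String t : tokens) {
            // Crude proxy for OCR quality: the token is purely alphabetic.
            if (t.matches("\\p{L}+")) plausible++;
          }
          bookId.set(fields[0]);
          score.set((double) plausible / tokens.length);
          context.write(bookId, score);
        }
      }

      public static class AvgReducer
          extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
          double sum = 0;
          long n = 0;
          for (DoubleWritable v : values) { sum += v.get(); n++; }
          context.write(key, new DoubleWritable(sum / n));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ocr quality");
        job.setJarByClass(OcrQuality.class);
        job.setMapperClass(PageScoreMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Books whose average score falls below a chosen threshold can then be routed to manual inspection or re-OCR.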

Highlights of this hackathon included:

  • Talks from guest speaker Jimmy Lin of the University of Maryland, who has worked with big data and Hadoop for many years, with a focus on natural language processing and information retrieval;
  • Taking part in the competition for the best idea and visualization;
  • A chance to gain hands-on experience carrying out identification and characterisation experiments;
  • Practitioners and developers working together to address digital preservation challenges;
  • The opportunity to share experiences and knowledge about implementing Hadoop.

 

Who attended?

Practitioners (digital librarians and archivists, digital curators, repository managers, or anyone responsible for managing digital collections) had the opportunity to learn how Hadoop might fit their organization, how to write requirements to guide development, and to gain hands-on experience using the tools themselves and finding out how they work.

Developers of all experience levels could participate: from writing their first Hadoop jobs to working on scalable solutions for the issues identified in the scenarios.

Competition

Practitioners and developers worked together in groups to address digital preservation challenges using Hadoop. Practitioners took the role of issue champions, articulating their requirements to the developers and documenting them on the wiki. Developers brainstormed ideas and worked on solutions to the issues. There were regular check-in points to get feedback and refine requirements. The best issue champion and the best development solution each won a prize.

All the participants gained practical experience of using digital preservation tools in characterization and quality assurance processes. Step-by-step worksheets were provided for those who were less familiar with using the command line and experts were on hand to help them.

There were plenty of opportunities for discussion, a session for sharing experiences, reports on research projects, and a break-out space for lightning talks.

 

View the hackathon agenda

Visit the event wiki page

 
