Open Source Tools for Validation in the Digital Archive Workflow


By Merle Friedrichsen from the German National Library of Science and Technology (view original post)

 

On 9 and 10 November I attended “No Time to Wait 2”, a conference on open source software in (film) archives held at the Austrian Film Museum in Vienna. Compared with the first “No Time to Wait” conference, the scope was widened to include audiovisual preservation, open formats, and the standardization of formats. As part of the program, I had the chance to present on “Open Source Tools in the Digital Archive Workflow” from our perspective.

 

Requirements for Open Source Tools for Validation

When we receive a file for our collection, we test whether it meets the requirements of its format standard (e.g. does a PDF file conform to the PDF standard?). We use open source software for validation in several scenarios, and each scenario gives rise to different requirements.

One requirement is that the tool is easy to integrate into existing software and workflows, because we want to use the validation software within our archive framework. Once the tool is integrated, we can validate files upon ingest into our archive. In another business case, we require a simple graphical user interface (GUI). With such a tool the acquisitions team can test files as soon as they are received and, where possible, ask for a replacement (valid) file. When we work together with other institutions or act as a service provider, we typically receive a large number of files, which we validate before we ingest them into our archive. For this reason we require a command line interface (CLI), so that we can automate the validation of many files recursively across different folders.
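To illustrate the batch scenario, here is a minimal Python sketch of how such a recursive run could be scripted around any validator that offers a CLI. It is not taken from our workflow: the function names are made up for this example, the validator command is passed in as a parameter rather than hard-coded, and treating exit code 0 as "valid" is an assumption that has to be checked against the documentation of the tool in use.

```python
import subprocess
import sys
from pathlib import Path


def find_files(root, extensions):
    """Return all files below root whose suffix is in extensions, sorted."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def validate_tree(root, validator_cmd, extensions):
    """Run validator_cmd once per matching file and record the outcome.

    validator_cmd is a list of strings (program plus options); the file
    path is appended as the last argument. Assumption: the validator
    signals a valid file with exit code 0.
    """
    results = {}
    for path in find_files(root, extensions):
        completed = subprocess.run(
            validator_cmd + [str(path)],
            capture_output=True,
            text=True,
        )
        results[str(path)] = completed.returncode == 0
    return results


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python batch_validate.py <folder> <validator> [options...]
    for name, ok in validate_tree(sys.argv[1], sys.argv[2:], [".pdf", ".mkv"]).items():
        print("valid  " if ok else "INVALID", name)
```

The result is a plain dictionary of path to True/False, which can then be written to a log or fed into the next workflow step.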

Of course the validation software should fulfill its purpose: checking against the standard of the format. But apart from the standard, an institution might have further requirements, for example a specific resolution for TIFF files in a digitization project, or a rule against embedded objects other than images and AV in PDF/A files. To enforce these easily, the tool should be able to check files against a custom policy. As we want to store the validation report (success or failure) as preservation metadata, the report should be available in XML or another structured format, so that the output is easy to integrate or process further. A report that follows an established metadata schema is preferable, as it is more readable and easier to map to other metadata schemas (or such a mapping may already exist). Performance is another requirement: the tool has to cope with a large number of files in reasonable time.
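As a rough illustration of what "custom policy plus structured report" means, the following Python sketch checks already extracted file properties against minimum values and writes the outcome as XML. Everything in it is hypothetical: the property names, the policy format and the report elements are invented for this example and do not correspond to the policy or report formats of veraPDF, MediaConch or any metadata standard.

```python
import xml.etree.ElementTree as ET


def check_policy(properties, policy):
    """Compare extracted file properties with a policy of minimum values.

    Returns a list of (rule_name, passed) tuples, one per policy rule.
    A property that is missing from the file counts as a failure.
    """
    outcomes = []
    for name, minimum in policy.items():
        value = properties.get(name)
        passed = value is not None and value >= minimum
        outcomes.append((name, passed))
    return outcomes


def build_report(filename, outcomes):
    """Serialize the outcomes as a small XML report (hypothetical schema)."""
    overall = all(passed for _, passed in outcomes)
    report = ET.Element(
        "validationReport",
        {"file": filename, "outcome": "pass" if overall else "fail"},
    )
    for name, passed in outcomes:
        ET.SubElement(
            report,
            "rule",
            {"name": name, "outcome": "pass" if passed else "fail"},
        )
    return ET.tostring(report, encoding="unicode")


if __name__ == "__main__":
    # Example: a digitization project that requires at least 300 dpi.
    policy = {"x_resolution_dpi": 300, "y_resolution_dpi": 300}
    properties = {"x_resolution_dpi": 300, "y_resolution_dpi": 200}
    print(build_report("scan_0001.tif", check_policy(properties, policy)))
```

Because the report is XML, it can be stored alongside other preservation metadata and queried later, which is the reason for asking for a structured format in the first place.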

In an ideal world, we would only receive valid files from data producers and service partners. One step towards this goal would be an easy way for external providers to validate their files before they hand them over to the library. A web service where anybody could upload files and check them against the standard (and a custom policy) would serve this purpose. If a file is not valid, a repair option for the error encountered would help the data producer or provider to hand over only valid files.

So, in no particular order, here is a short overview of the requirements we have for open source validation software (besides being open source):

  1. easy implementation into existing software and workflows
  2. simple Graphical User Interface
  3. Command Line Interface
  4. checking against file format standards
  5. checking against custom policy
  6. use standards for reports
  7. perform (fast) on a large number of files
  8. web service (standards and policies)
  9. repair option

We have tested two open source tools for different file formats: veraPDF (for PDF/A) and MediaConch (for Matroska/FFV1, i.e. film). I am very happy to state that these tools fulfill all of our requirements, or can be made to fulfill them.

 

veraPDF

We have tested most of the requirements, and the tool performs well. As we receive a lot of PDFs that do not need to conform to the PDF/A standard but must nevertheless not be password protected, we created our own policy that checks whether a PDF is password encrypted. What we have not tested yet is the integration into our existing software, but this is being discussed in the Rosetta Format Library Working Group (FLWG), a user group responsible for, amongst other things, deciding which tools should be rolled out within the Rosetta archive framework. So far I have not seen a web service based on veraPDF, but as the software is open source and well documented, it should be feasible to build one. Because we could not find a file that veraPDF could repair, we have not tested the repair functionality. It is, however, possible to fix the PDF document metadata: for example, if a file does not conform to the standard, the PDF/A flag can be removed automatically.
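To give an idea of what an encryption check looks for, here is a small Python heuristic. It is explicitly not our veraPDF policy and does not use veraPDF at all. It relies on the fact that an encrypted PDF references its encryption dictionary through an /Encrypt entry in the trailer, and simply scans the raw bytes for that key. The function name is made up, and the check is only a rough pre-filter: it can produce false positives, and it also flags files that open without a password but carry permission restrictions.

```python
from pathlib import Path


def looks_password_protected(path):
    """Heuristic: return True if the file contains an /Encrypt key.

    This scans the raw bytes instead of parsing the PDF, so it may
    report True for an unencrypted file that merely contains the string
    "/Encrypt" somewhere, and it cannot tell a password needed for
    opening from one that only restricts permissions.
    """
    data = Path(path).read_bytes()
    # PDF readers tolerate some leading bytes before the header.
    if b"%PDF-" not in data[:1024]:
        raise ValueError(f"{path} does not look like a PDF file")
    return b"/Encrypt" in data


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        flag = "encrypted?" if looks_password_protected(name) else "ok"
        print(f"{flag:10} {name}")
```

For anything beyond a quick triage, a real PDF parser or the validator's own policy mechanism is the better choice.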

 

MediaConch

Most of our requirements were tested and are fulfilled by MediaConch. There are only a few requirements that we have not tested yet. One of these is the integration into existing software: as with veraPDF, the FLWG is discussing the integration of MediaConch. In addition, we do not have special requirements for Matroska/FFV1 files yet, so we could not test a custom policy, simply because we currently do not have one. Thanks to the checksums built into FFV1 and Matroska, there are several repair options for a file with a bit flip. But as we have not encountered a corrupted file (yet), we have not tested this.
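Why embedded checksums matter can be shown with a few lines of Python. The sketch below is a simplified stand-in and does not read real Matroska or FFV1 structures: it splits a byte string into fixed-size chunks (an assumption made for the example), stores a CRC-32 per chunk, flips a single bit, and then finds the damaged chunk by recomputing the checksums. Knowing which part of a file is damaged is the precondition for any targeted repair.

```python
import zlib


def checksum_chunks(data, chunk_size):
    """Split data into fixed-size chunks and return one CRC-32 per chunk."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def damaged_chunks(data, stored_checksums, chunk_size):
    """Return the indices of chunks whose CRC-32 no longer matches."""
    current = checksum_chunks(data, chunk_size)
    return [
        index
        for index, (now, stored) in enumerate(zip(current, stored_checksums))
        if now != stored
    ]


def flip_bit(data, byte_index, bit):
    """Return a copy of data with one bit inverted (simulated bit flip)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


if __name__ == "__main__":
    original = bytes(range(128))
    stored = checksum_chunks(original, 32)   # written when the file is created
    corrupted = flip_bit(original, 70, 3)    # byte 70 lies in the third chunk
    print("damaged chunks:", damaged_chunks(corrupted, stored, 32))
```

A CRC-32 detects every single-bit error, so the damaged chunk is always found in this scenario; the checksum alone does not restore the data, it only tells you where to look.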

 

Conclusion

Both veraPDF and MediaConch are suitable for our needs regarding digital object validation. I have to admit that I had a lot of fun testing these new tools, looking into the reports and figuring out how to write my own policy. It is worthwhile to work with open source tools, especially when they are designed to fit the needs of a (digital) archive. And it is worth investing (money and/or time) to foster or enhance these tools!

If you are interested in the conference, you will find the recordings on the conference's YouTube channel.

Thanks to the organizers of this conference – it was inspiring and encouraging!

 

 
