Digital preservation in a webinar, part four

Finally! The recap of the fourth Introduction to Digital Preservation webinar, hosted by ASERL.

[To listen to the recordings and view the PowerPoint presentations, see ASERL’s archive]

The title of this webinar: “Using FITS to Identify File Formats and Extract Metadata.” It was presented by Andrea Goethals, of Harvard University.

The highlights:

What is FITS?

  • “File Information Tool Set”

Some complications

  • Format specifications often have different versions.
  • Specifications are things that file formats conform to.
  • Authoritative specification information does not always exist for a format. When it does, it can be unclear, complex, or long, and it can reference or depend on other file formats and specifications.

Further complications for tool builders and users

  • OpenDoc formats are packaged as ZIP files, so a tool may report only “ZIP,” which is not enough information for preservation.
  • Many formats (e.g., XML) are text formats.
  • Some formats lack obvious identifying features.


  • File formats can be difficult to accurately identify.
  • Some tools report formats more specifically than others, so results are inconsistent.
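To make the ZIP complication concrete, here is a minimal sketch of my own (it is not FITS code). It assumes "OpenDoc" means OpenDocument, whose packages carry a `mimetype` entry naming the real format. The function names and sample files are hypothetical.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # signature at the start of a non-empty ZIP archive


def identify(data: bytes) -> str:
    """Return a coarse or more specific format guess for the given bytes."""
    if not data.startswith(ZIP_MAGIC):
        return "unknown"
    # The signature alone only says "ZIP"; look inside for a more specific answer.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if "mimetype" in archive.namelist():
            return archive.read("mimetype").decode("ascii").strip()
    return "application/zip"


def make_zip(entries: dict) -> bytes:
    """Build a small in-memory ZIP file for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in entries.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Hypothetical sample files: a plain ZIP and an OpenDocument-style package.
plain_zip = make_zip({"notes.txt": "hello"})
odt_like = make_zip({
    "mimetype": "application/vnd.oasis.opendocument.text",
    "content.xml": "<x/>",
})

print(identify(plain_zip))
print(identify(odt_like))
print(identify(b"just some text"))
```

A tool that stops at the signature would call both sample files "ZIP," which is exactly the inconsistency described above.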

How does FITS help?

  • Combines the functionality of different file format identification tools.

Why build FITS?

  • The motivation at Harvard was to offset the risk of accepting any format (including web archives, email attachments, donated external hard drives).
  • Additionally, to integrate into existing preservation workflows.
  • Strategy: develop a tool manager instead of a single tool, and account for tool inaccuracy by checking tools against each other and verifying results.

What is required?

  • XML, for tools without a graphical interface

What does FITS do?

  • Identifies many file formats
  • Validates a few file formats
  • Extracts metadata
  • Calculates basic file information
  • Outputs technical metadata
  • Identifies problem files (e.g., conflicting opinions on format, metadata values; unidentifiable formats)
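The last point, flagging problem files, can be pictured with a small sketch. This is only an illustration of the idea of comparing tool opinions, not how FITS actually implements it. The tool names and the `assess` function are hypothetical.

```python
def assess(opinions: dict) -> dict:
    """Summarize what several tools said about one file.

    `opinions` maps a tool name to the format it reported, or None if the
    tool could not identify the file.
    """
    named = {tool: fmt for tool, fmt in opinions.items() if fmt is not None}
    if not named:
        # No tool could name a format: an unidentifiable file.
        return {"status": "unknown", "format": None}
    formats = set(named.values())
    if len(formats) == 1:
        # Every tool that answered gave the same answer.
        return {"status": "agreed", "format": formats.pop()}
    # Tools disagree: flag the file for a human to review.
    return {"status": "conflict", "format": None, "candidates": sorted(formats)}


print(assess({"tool_a": "PDF", "tool_b": "PDF"}))
print(assess({"tool_a": "TIFF", "tool_b": "JPEG", "tool_c": None}))
print(assess({"tool_a": None, "tool_b": None}))
```

Files that come back as "conflict" or "unknown" are the ones worth a closer look.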

The Process

  • FITS translates each tool’s output to a common XML format, consolidates the results into one FITS XML file, and can then translate that file to standard XML formats.
  • You can store the FITS XML files wherever you store metadata in the repository.
  • The file is not modified during the process.
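As a rough sketch of the translate-then-consolidate step, the example below converts two differently shaped tool reports into one combined XML document. The element names (`report`, `identity`), the two tools, and their report shapes are all made up for illustration. This is not the real FITS schema.

```python
import xml.etree.ElementTree as ET


def from_tool_a(report: dict) -> ET.Element:
    """Hypothetical tool A reports a plain dictionary."""
    return ET.Element("identity", {"tool": "tool_a", "format": report["fmt"]})


def from_tool_b(report_xml: str) -> ET.Element:
    """Hypothetical tool B reports its own small XML document."""
    parsed = ET.fromstring(report_xml)
    fmt = parsed.findtext("type", default="")
    return ET.Element("identity", {"tool": "tool_b", "format": fmt})


def consolidate(filename: str, identities: list) -> str:
    """Merge the translated reports under one root element."""
    root = ET.Element("report", {"file": filename})
    for identity in identities:
        root.append(identity)
    return ET.tostring(root, encoding="unicode")


# Only the tool reports are read here; the file being described is never touched.
combined = consolidate(
    "scan001.tif",
    [
        from_tool_a({"fmt": "TIFF"}),
        from_tool_b("<result><type>TIFF</type></result>"),
    ],
)
print(combined)
```

Once every report has the same shape, comparing or storing them becomes a much simpler problem.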

Normalization (translation)

  • The key to using multiple tools
  • Assists with tools that provide different names for the same format
  • Assists with tools that provide different values for the same metadata
  • Assists with tools that provide different ways of saying when they can’t identify the format of a file
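Here is a minimal sketch of what normalization might look like, assuming a simple lookup table. The table entries and the `normalize_format` function are illustrative assumptions of mine, not FITS's actual mappings.

```python
# Illustrative mapping from tool-specific names to one agreed name.
CANONICAL_NAMES = {
    "tif": "TIFF",
    "tiff": "TIFF",
    "tagged image file format": "TIFF",
}

# Illustrative ways a tool might say "I could not identify this file".
UNKNOWN_MARKERS = {"", "unknown", "unidentified", "n/a"}


def normalize_format(raw):
    """Return the agreed format name, or None if the tool had no answer."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if key in UNKNOWN_MARKERS:
        return None
    # Names missing from the table pass through unchanged (minus whitespace).
    return CANONICAL_NAMES.get(key, raw.strip())


for reported in ["TIF", "Tagged Image File Format", " Unknown ", "n/a", "WAVE"]:
    print(repr(reported), "->", normalize_format(reported))
```

Only after this step does it make sense to compare tools against each other, since "TIF" and "Tagged Image File Format" would otherwise look like a disagreement.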

[Then we watched nifty demonstrations in Windows as Andrea Goethals took us through what FITS does and how it does it. I discovered I can read basic XML.]

At Harvard University Libraries

  • They store metadata in XML form in a metaschema
  • Output is parsed and packaged
  • Some of the FITS data fits well into PREMIS
  • Standard metadata block is added
  • Other information is included with administrative metadata

Questions & Answers

Q: Are there plans to integrate FITS into large systems/repositories?

A: Archivematica uses it. DuraCloud looked into it, but it is mostly used in individual repositories.

Q: Do you need to have the individual File Format Identification tools loaded locally?

A: All necessary tools are downloaded with FITS.

Q: When FITS notes conflicts between tools’ results, how do you know which one is right?

A: Conflicts often occur with rarely used formats. FITS includes an XML file that you can consult to determine whether one reported format is really a more specific version of a broader one. (It provides a format tree.)

Where to find FITS

  • FITS is available to download.
  • It is open source software (OSS).
  • The mailing list is good for news about new versions and other updates.

My Take-aways

This whole series has been incredibly informative. Having listened to these experts talk about common/important tools that they use for digital preservation, I now have a better idea of not only the processes involved in digital preservation, but also how the different pieces fit together. The project planning information was pretty straightforward, and generally, not very different from many projects I’ve worked on in the past or learned about in library school.

Now that I know that some FITS information fits well into PREMIS, and that other information from FITS fits into administrative metadata sections, and that XML can carry them all, I have a better idea of how to use the metadata categories described in the second webinar.

I know which kinds of tools are meant to be used for various tasks in digital preservation projects, and I know what I need to learn (and what I don’t) in order to use them. I can point to FITS and PREMIS and say that they may be used in the implementation stage.

Lastly, I know so much more about where to go to find out more about the tools, processes, best practices, and current projects.

