Digital Humanities and Digital Preservation: a new series

Last weekend, I attended the educational, inspiring, and fun Data Driven: Digital Humanities in the Library conference at the College of Charleston. I have a lot of information to digest, so over the next few posts I will write up some of my notes, along with their implications for my projects at the DC/SLA.

In this post, I begin with my notes from the pre-workshop readings for the “From Theory to Action: A Pragmatic Approach to Digital Preservation Strategies and Tools” workshop at the conference in Charleston, SC, June 20–22, 2014.


Pre-workshop readings:

NDSA Levels of Preservation (an assessment tool for institutions and organizations)

You’ve Got To Walk Before You Can Run (high-level view of the basic requirements to make digital preservation operational)

Walk This Way (detailed steps for implementing digital preservation – the introductions to each section were the recommended reading)

Library of Congress DPOE (optional)

POWRR website (optional – POWRR is the group that taught the workshop; lots of good resources here)


NDSA Levels of Preservation – Where I see the DC/SLA Archives Committee:

  1. Storage and Geographic Location
    1. Level 0 – still determining where things are and how they have been stored
  2. File Fixity and Data Integrity
    1. What is fixity? (I learned that fixity checking means, for example, running checksums to determine whether materials/digital objects have changed or been corrupted over time. Checksums are algorithm-produced unique identifiers that correspond to the contents of a file and are tied to a specific version of a file or item. See the sketch after this list.)
  3. Information Security
    1. Level 0-1 – We have determined in policy documents who *should* have read authorization (the general public, in most cases, with some redactions/delays in dissemination for PII and financials)
    2. The Archives Committee will be the only ones, aside from possibly a Board liaison, to have other authorizations (edit, delete, etc.)
  4. Metadata
    1. Level 0 – We will soon be conducting an inventory of content, which will include an investigation into what metadata has been included
  5. File formats
    1. Level 0 – We will soon determine what formats have been and should be used
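Since fixity came up first for me, here is a minimal sketch in Python of what a checksum-based fixity check looks like in practice (the file name is hypothetical, and SHA-256 is just one common algorithm choice):

```python
import hashlib

def checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum by reading the file in chunks (safe for large files)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# At ingest: record the checksum alongside the file.
original = checksum("minutes-2014-05.pdf")  # hypothetical file

# Later: recompute and compare. A mismatch means the file changed or was corrupted.
if checksum("minutes-2014-05.pdf") != original:
    print("Fixity failure: the file has changed since ingest.")
```

If the two values match, the file is bit-for-bit the same as when the first checksum was recorded; that is all fixity means.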

So, clearly, we still have a lot of work to do.

“You’ve got to walk before you can run: first steps for managing born-digital content received on physical media” (OCLC/Ricky Erway, 2012)

  • Audience: those who hold born-digital materials or are currently acquiring them, but have not yet begun to manage them
  • Focus: identifying and stabilizing holdings
  • Four Essential Principles
    • Do no harm (to physical media or content)
    • Don’t do anything that unnecessarily precludes future action and use
    • Don’t let the first two principles be obstacles to action
    • Document what you do!!
  • Survey and Inventory Materials in your Current Holdings
    1. Locate existing holdings
      1. Gather info about digital media already in collections
      2. Do collections inventory to locate computer media in any physical form
    2. Count and describe all identified media (NOT mounting or viewing content on media)
      1. Gather info from donor files, acquisition records, collections, etc.
      2. Remove media but retain order by photographing digital media and storing printouts in physical collection
        1. Alternative: place separator sheets in physical collection
      3. Assign appropriate inventory # / barcode to each physical piece
      4. Record location, inventory #, type of physical medium, any identifying info found on labels /media, e.g., Creator, Title, etc.
      5. Record anything known about hardware, operating system, software; use consistent terms
      6. Count the number of each media type, note the maximum capacity of each media type and the maximum amount of data stored, then calculate an overall total for the collection (see the sketch after these steps)
      7. Return physical media to suitable storage
      8. Add a summary description of the digital media to any existing accession record, collection-level record, or finding aid
    3. Prioritize collections for further treatment, based on:
      1. value, importance, needs of collection as a whole and level of use (anticipated use) of collection
      2. whether there is danger of loss of content
      3. whether appears to be significant digital content not replicated among analog materials
      4. whether use of digital content that is replicated in analog form would add measurably to users’ ability to analyze or study content
      5. whether printouts might suffice, when just a few files can be represented on a page
    4. Repeat these steps every time you receive new media.
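To make the counting-and-describing step concrete for myself, here is a minimal sketch of a media inventory as a CSV file, with the capacity totals from step 6; the fields and entries are my own hypothetical examples, not something prescribed by the report.

```python
import csv

# Hypothetical inventory entries: inventory #, location, media type,
# label info, maximum capacity (MB), and estimated data stored (MB).
media = [
    ("DCSLA-0001", "Box 3", "3.5-inch floppy", "Board minutes 1998", 1.44, 1.2),
    ("DCSLA-0002", "Box 3", "CD-R", "Conference photos 2004", 700, 480),
    ("DCSLA-0003", "Box 5", "Zip disk", "Newsletter drafts", 100, 75),
]

with open("media_inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["inventory_no", "location", "media_type",
                     "label_info", "capacity_mb", "data_stored_mb"])
    writer.writerows(media)

# Step 6: overall totals for the collection.
print("Total capacity (MB):", sum(m[4] for m in media))
print("Total data stored (MB):", sum(m[5] for m in media))
```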


Walk This Way (OCLC/Julianna Barrera-Gomez and Ricky Erway, 2013)

  • Draft a workflow before beginning? Revise during execution?
  • Existing digital preservation policies may include donor agreements (which can explain what info may be transferred from digital media) and policies on accessioning or de-accessioning records or physical media
  • Consult policies (IT?) on software use or server backups
  • The AIMS project’s 2012 report on digital collection stewardship provides objectives for informing policy and a glossary for non-archivists
  • Documenting the project
    • What info about the process will be needed in future to understand scope, steps taken, and why?
    • provides context for the process; forms a key part of the evidence of provenance; indicates the authenticity of the material
    • manage associated metadata (auto-generated or manually created)
    • content management systems (e.g., Archon, Archivist’s Toolkit) can be used to create accession records that link the project’s documentation to other holdings
    • Create a physical project directory with folders
      • subfolders:
        • Master Folder (Preservation Copy, Archival Copy Folder) – holds master copies of files
        • Working Folder – holds working copies of master files
        • Documentation Folder – to hold metadata and other information associated with the project
  • Preparing the Workstation (Mandatory) – this may be a problem, unless we find a way around having a physical workstation for preservation work.
    • dedicated workstation to connect to source media
    • start with a single type of media from a collection, to aid efficiency and make it easier to keep track of materials and metadata
    • What are the alternatives to this? Physical space and finances are obstacles for the DC/SLA
    • Use a computer that is regularly scanned for viruses
    • consider keeping it non-networked until a connection is needed (e.g., for file transfers, software/virus definition updates)
    • DO NOT open files on source media!
  • Connect the source media
    • Examine media for cracks/breaks/defects
    • Consider removing sticky notes or other ephemera (take digital photo first)
    • DO NOT attempt to open files yet!
  • Transfer Data
    • Copy files or create a disk image
      • Copy files individually or in groups – a practical way for new archivists to get started
      • Disk image – captures more information and makes it easier to ensure authenticity. An exact, sector-by-sector bit-stream copy of a disk’s contents, retaining original metadata, in a single file containing an authentic copy of the files and file system structure on the disk.
        • Forensic images capture everything, including deleted files and unallocated space. Logical copies omit deleted files and unallocated space.
  • Check for viruses
  • Record the file directory
    • Make a copy of the directory tree
  • Run Checksums or Hashes
    • a unique value, based on the contents of a file, generated by a specific algorithm (there are different algorithms – consistency is important)
    • identifies whether/when a file has changed
    • regularly hashing a file or image you have copied, and checking those new hashes against the hashes made at the time of the transfer, should be part of your digital curation workflow (see the sketch after these notes)
  • Securing project files
    • consolidate documentation
  • Prepare for Storage
    • arrange for space on a backed-up network server that is secure
  • Transfer to a secure location
    • additional copies – preservation master copies that must be kept safe from unintentional alteration
  • Store or de-accession source media
    • if destroying media, use a secure method, in accordance with the donor agreement and policies
  • Validate file types
    • determine whether you can open and read the contents of digital files (from the working copies!)
    • use working copies
    • hex editors – show a file’s contents at the byte level
  • Assess Content (optional)
    • use working copies
  • Reviewing files 
    • only working copies
  • Finding duplicate files
    • if you delete duplicates, you will also need to delete them from the Master Folder already moved to secure storage
  • Dealing with Personally Identifying or Sensitive information
    • sensitive information must be kept restricted and secure on workstations, file servers, backup or transfer copies
    • Redact or anonymize before making available to users 
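Pulling a few of these steps together – the project directory with Master/Working/Documentation folders, recording the file directory, and running checksums – here is a minimal sketch of what that could look like; the folder and file names are my own assumptions, not the report’s.

```python
import hashlib
import os
from pathlib import Path

# Project directory with the three subfolders described above (names assumed).
project = Path("project_smith_papers")
folders = {name: project / name for name in ("master", "working", "documentation")}
for folder in folders.values():
    folder.mkdir(parents=True, exist_ok=True)

def sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# After master copies are transferred in: record the directory tree and a
# checksum manifest, both stored in the Documentation folder.
tree_lines, manifest_lines = [], []
for root, dirs, files in os.walk(folders["master"]):
    tree_lines.append(root)
    for name in sorted(files):
        path = Path(root) / name
        tree_lines.append("    " + name)
        manifest_lines.append(f"{sha256(path)}  {path}")

(folders["documentation"] / "directory_tree.txt").write_text("\n".join(tree_lines))
(folders["documentation"] / "checksum_manifest.txt").write_text("\n".join(manifest_lines))
```

Re-running the manifest later and comparing it line by line against this one is the regular fixity check described above.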

Book blogging and HTML: Linking Images

I’ve recently started a book blog (check it out), one feature of which is a set of “read-alikes” placed at the end of each post, in the form of cover images. I really like the idea of linking those images to lead readers to more information about these books (from Goodreads, at the moment). With a link to a description and social media/community reviews, readers “don’t have to take my word for it!”

Although I haven’t memorized the HTML code yet, I have re-figured out how to do this a few times (precisely because I haven’t been able to memorize it). So in this post, I’m going to go through the steps I use to link a cover image to a Goodreads book description page.

The challenges of personal digital archiving

Personal digital archiving is a topic everyone seems to be talking about everywhere lately. How do you preserve the things worth saving: your financial, personal, and business documents, your emails and messages, pictures, videos, movies, social media interactions/posts, blogs, and internet files?

Today’s infographic comes from Doghouse Diaries, discovered via the Library of Congress’ digital preservation blog, The Signal. (By the way, if you’re interested in what’s happening in the world of digital preservation, follow The Signal.)

[Doghouse Diaries infographic on personal digital archiving]

To help you resolve your personal digital archiving woes, the Library of Congress has a great set of guidelines.

Digital preservation in a webinar, part four

Finally! The recap of the fourth Introduction to Digital Preservation webinar, hosted by ASERL.

[To listen to the recordings and view the PowerPoint presentations, see ASERL’s archive]

The title of this webinar: “Using FITS to Identify File Formats and Extract Metadata.” It was presented by Andrea Goethals, of Harvard University.

The highlights:

What is FITS?

  • “File Information Tool Set”

Some complications

  • Specifications are the documents that file formats conform to.
  • Format specifications often have different versions.
  • Authoritative specification information does not always exist for a format. When it does exist, it can be unclear, complex, or long; it can reference other file formats; and it can depend on other specifications.

Further complications for tool builders and users

  • OpenDoc formats are packaged as ZIP files, so identifying them simply as ZIP is not sufficient for preservation.
  • Many formats (e.g., XML) are text formats.
  • Some formats lack obvious identifying features.

Implications

  • File formats can be difficult to identify accurately.
  • Some identifications are more specific than others (results are inconsistent).

How does FITS help?

  • Combines the functionality of different file format identification tools.

Why build FITS?

  • The motivation at Harvard was to offset the risk of accepting any format (including web archives, email attachments, donated external hard drives).
  • Additionally, to integrate into existing preservation workflows.
  • Strategy: develop a tool manager instead of yet another tool, and account for tool inaccuracy by checking tools against each other and verifying results.

What is required?

  • Familiarity with XML, since FITS is a tool without a graphical interface

What does FITS do?

  • Identifies many file formats
  • Validates a few file formats
  • Extracts metadata
  • Calculates basic file information
  • Outputs technical metadata
  • Identifies problem files (e.g., conflicting opinions on format, metadata values; unidentifiable formats)

The Process

  • FITS translates each tool’s output into a common XML format, consolidates those outputs into one FITS XML file, and can then translate the FITS XML into standard metadata XML.
  • You can store the FITS XML files wherever you store metadata in the repository.
  • The file being examined is not modified during the process. (A minimal sketch of running FITS and reading its output follows this list.)
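As a sketch of what this looks like from the outside, here is how I imagine calling FITS on one file and pulling the identifications out of its XML. This assumes FITS is installed locally with its fits.sh script on the PATH, and the namespace and element names are my best understanding of the FITS output schema, so treat them as assumptions.

```python
import subprocess
import xml.etree.ElementTree as ET

# Namespace of FITS output XML (assumed).
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def identify(path):
    """Run FITS on one file and return its (format, mimetype) identifications."""
    # With no -o flag, FITS prints its XML report to standard output.
    result = subprocess.run(
        ["fits.sh", "-i", path],
        capture_output=True, text=True, check=True,
    )
    root = ET.fromstring(result.stdout)
    return [
        (identity.get("format"), identity.get("mimetype"))
        for identity in root.findall(".//fits:identification/fits:identity", FITS_NS)
    ]

# Hypothetical file; more than one tuple back would mean the tools disagreed.
print(identify("report.pdf"))  # e.g., [('Portable Document Format', 'application/pdf')]
```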

Normalization (translation)

  • The key to using multiple tools (a toy sketch follows this list)
  • Assists with tools that provide different names for the same format
  • Assists with tools that provide different values for the same metadata
  • Assists with tools that provide different ways of saying when they can’t identify the format of a file
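To see why normalization is the key, here is a toy sketch; the tool names, format strings, and mapping table are all made up for illustration and have nothing to do with FITS’s actual internal tables.

```python
# Different tools report the same format under different names.
RAW_TO_CANONICAL = {
    ("toolA", "JPEG File Interchange Format"): "JPEG",
    ("toolB", "jpg"): "JPEG",
    ("toolA", "Portable Document Format"): "PDF",
    ("toolB", "application/pdf"): "PDF",
}

# ...and different ways of saying they can't identify a file.
UNKNOWN_VALUES = {"unknown", "n/a", "", "unidentified"}

def normalize(tool, reported):
    """Map a tool-specific format name to a canonical one (None = unidentified)."""
    if reported.strip().lower() in UNKNOWN_VALUES:
        return None
    return RAW_TO_CANONICAL.get((tool, reported), reported)

print(normalize("toolA", "Portable Document Format"))  # PDF
print(normalize("toolB", "application/pdf"))           # PDF – the tools now agree
print(normalize("toolB", "unknown"))                   # None – an explicit abstention
```

Once everything is in canonical terms, comparing the tools’ answers (and spotting real conflicts) becomes a simple equality check.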

[Then we watched nifty demonstrations in Windows as Andrea Goethals took us through what FITS does and how it does it. I discovered I can read basic XML.]

At Harvard University Libraries

  • They store metadata in XML form in a metaschema
  • Output is parsed and packaged
  • Some of FITS data fits well into PREMIS
  • Standard metadata block is added
  • Other information is included with administrative metadata

Questions & Answers

Q: Are there plans to integrate FITS into large systems/repositories?

A: Archivematica uses it. DuraCloud looked into it, but it is mostly used in individual repositories.

Q: Do you need to have the individual File Format Identification tools loaded locally?

A: All necessary tools are downloaded with FITS

Q: When FITS notes conflicts between tools’ results, how do you know which one is right?

A: Conflicts often occur in relatively little-used formats. An XML file is included (it provides a format tree) that you can use to educate yourself and to determine whether one identification is really just a more specific version of a broader format.

Where to find FITS

fits.googlecode.com

  • Download link available there
  • Open-source software (OSS)
  • The mailing list is good for news of new versions and other updates.

My Take-aways

This whole series has been incredibly informative. Having listened to these experts talk about common and important tools that they use for digital preservation, I now have a better idea not only of the processes involved in digital preservation, but also of how the different pieces fit together. The project planning information was pretty straightforward and, generally, not very different from many projects I’ve worked on in the past or learned about in library school.

Now that I know that some FITS information fits well into PREMIS, and that other information from FITS fits into administrative metadata sections, and that XML can carry them all, I have a better idea of how to use the metadata categories described in the second webinar.

I know which kinds of tools are meant to be used for various tasks in digital preservation projects, and I know what I need to learn (and what I don’t) in order to use them. I can point to FITS and PREMIS and say that they may be used in the implementation stage.

Lastly, I know so much more about where to go to find out more about the tools, processes, best practices, and current projects.

Digital preservation in a webinar, part three

I not-so-recently “went to” the third Introduction to Digital Preservation webinar hosted by ASERL (Association of Southeastern Research Libraries).

[To listen to the recordings and view the PowerPoint presentations, see ASERL’s archive]

This webinar was titled “Management of Incoming Born-Digital Special Collections.” It was presented by Gretchen Gueguen of the University of Virginia.

Without further ado, my notes:

What is born-digital?

There are two layers:

  1. Content
  2. Supporting software and operating system(s) (OS)

The same software/OS can be used for multiple files.

The Crucial Dependency

  • Hardware. Including (but not limited to) ports, wires, ribbons, drives, connectors
  • Translation between older and newer hardware can be achieved by write-blockers

The process

Imagine a doughnut.

“Preserve” is positioned in the doughnut hole, smack dab in the middle. Around the edges of the donut (starting on the left and moving clockwise, if you’re curious) reside:

  • “Provide Access”
  • “Appraise”
  • “Accession”
  • “Arrange/Describe”

Appraisal

This includes both old and new collections, as well as legacy material (digital media that has already been collected).

Do you further process these legacy collections, or deaccession them?

Appraisal Phase 1: Inventory

(the following list contains information/data you may want to collect in the inventory phase)

  • Disk #
  • ID #
  • Collection name/title
  • Record # (MARC or EAD)
  • Media type
  • Manufacturer
  • Capacity
  • Date (from label info)
  • Color
  • Damage
  • Label info

The above information can be used in cataloging, and can help future identification/location of items in the collection.

It may be necessary to:

  • Research accession records
  • Search the stacks
  • Conduct a physical survey of a statistically significant sample of disks

Appraisal Phase 2: Evaluate

Legacy collections

  • Available resources (work, costs)
  • File types and formats present in materials
  • Volume of data vs. capacity to take it
  • Condition of content: changed? corrupted?
  • Dependencies on software, hardware
  • Institution’s commitment to the content
  • Migration or transformation required?
  • Can you appraise/view the intellectual content?

New acquisitions

  • Policy framework (update it frequently and proactively)
  • What capacity do you have for acquiring new born-digital collections?
  • How will you deal with certain scenarios?
  • Do you need special hardware to read the content?
  • Do you have that hardware?
  • Does someone else have it (e.g., via eBay)? Given the scarcity of obsolete hardware, there is a growing interest in sharing equipment
  • Is the disk/drive natively Read Only?

Accessioning

Hardware types

  • Zip
  • DVD/BluRay
  • JAZ
  • Others (e.g., floppies) are difficult
  • Write blockers/forensic bridges: Hardware devices or software that block any writing onto disks (e.g., Tableau, Wiebe Tech; SAFE Block XP, MacForensicLab)

Software barriers

How to transfer the data to a new medium?

1. Disk imaging – one file, bit-level copy

  • Captures unused space, sometimes called “file slack,” which may be mostly binary zeros and can take up a lot of space
  • Benefits: compact, single file, intact, complete
  • Drawbacks: can capture unwanted data; requires specialized technology; transfer across a write-blocker is possible only if the disk is still readable

2. Logical imaging – select what you want and create an image

Transfer using (examples)

  • Windows: NTFS (New Technology File System)
  • Mac: HFS (Hierarchical File System)

Rendering files

  • Transfer methods: over a network, using Duke Data Accessioner or BagIt, or an FTP transfer tool such as FileZilla or Cyberduck (how about these names?)
  • Web harvesting (e.g., Internet Archive)
  • Save to modern media (CD, external hard drive)
  • Image the hard drive in person

Management

Is the file corrupted, lost, or changed?

  • Checksums. If these haven’t changed, the file hasn’t changed.
  • Check for viruses (stabilizing material): Do this in an un-networked space BEFORE uploading the files to a network!
  • Search for Personally Identifiable Information (PII)
  • Search for duplicate files using checksums (see the sketch below).
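Since checksums do double duty here (fixity and de-duplication), here is a minimal sketch of finding duplicates by grouping files on their checksum; the folder name is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Group files by checksum; any group with more than one member is a duplicate set.
by_checksum = defaultdict(list)
for path in Path("accession_2014_001").rglob("*"):  # hypothetical accession folder
    if path.is_file():
        with open(path, "rb") as f:
            digest = hashlib.file_digest(f, "sha256").hexdigest()  # Python 3.11+
        by_checksum[digest].append(path)

for digest, paths in by_checksum.items():
    if len(paths) > 1:
        print("Duplicate set", digest[:12], ":", [str(p) for p in paths])
```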

Arrange/Describe

Metadata

  • Use media inventory
  • File inventory of contents (e.g., date, size, file name, type)
  • Extract technical, forensic, and preservation metadata (using PREMIS, PBCore, for examples)
  • Use a spreadsheet to record this information if you don’t have fancier infrastructure (a minimal sketch follows this list)
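If a spreadsheet really is all you have, something like this minimal sketch (my own construction, not from the webinar) can generate a file-level inventory with name, size, date, and a guessed type:

```python
import csv
import mimetypes
from datetime import datetime, timezone
from pathlib import Path

fieldnames = ["file_name", "path", "size_bytes", "modified_utc", "type_guess"]
rows = []
for path in Path("working_copies").rglob("*"):  # hypothetical working folder
    if path.is_file():
        stat = path.stat()
        rows.append({
            "file_name": path.name,
            "path": str(path),
            "size_bytes": stat.st_size,
            "modified_utc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
            "type_guess": mimetypes.guess_type(path.name)[0] or "unknown",
        })

with open("file_inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```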

Storage

  • Make multiple copies! (Lots of Copies Keep Stuff Safe, heh heh)
  • Use repositories or a managed service system for metadata and storage
  • If you don’t have one, how will you store and track content? (Spreadsheet and storage database)

Questions (a selection)

Q: Do you have any rules of thumb for materials NOT to accession?

A: The folks at the University of Virginia have not seen anything that they have decided not to take – nothing too unusual. Make sure you have access to the hardware needed to read the data/content. For some formats UVA didn’t have the software for, they obtained copies of the software from the donor.

Q: Do you manage the bit-stream or the physical object for commercially produced materials, such as DVDs, related to other materials?

A: Only physical management at the moment.

Q: Does UVA’s gift agreement contain language for digital preservation?

A: The agreement does state that the donor agrees not to offer the same content to other sources or institutions, and it provides information about intellectual property rights. UVA reserves the right to do whatever is needed to preserve the content, and donors may ask for access restrictions. It does not contain any statement to the effect that UVA agrees to preserve content via a particular medium or for a specific length of time.

Final words

Appraisal and accession are CRUCIAL.

Metadata is important – use checksums, spreadsheets.

Consider consortia – have someone else read the disks you can’t, and vice versa.